To lemmatize a word means to reduce it to its base (dictionary) form, called the lemma. For example, ‘go’ is the lemma of ‘went’. Counting words in a text is a common task in language processing, and usually we want to count ‘go’ and ‘went’ as one and the same word. That’s why we lemmatize the words before counting them.
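To see the effect on counting, here is a minimal sketch that uses a tiny hand-written lemma table instead of a real lemmatizer. The names TOY_LEMMAS and toy_lemmatize are made up for this illustration and are not part of NLTK.

```python
from collections import Counter

# Hypothetical toy lemma table, for illustration only; a real lemmatizer
# such as WordNetLemmatizer covers far more words.
TOY_LEMMAS = {"went": "go", "goes": "go", "gone": "go", "said": "say"}

def toy_lemmatize(word):
    """Return the lemma from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

words = ["go", "went", "said", "goes", "say"]

raw_counts = Counter(words)
lemma_counts = Counter(toy_lemmatize(w) for w in words)

print(raw_counts["go"])    # counts only the exact form 'go'
print(lemma_counts["go"])  # counts 'go', 'went' and 'goes' together
```

Without lemmatization each surface form gets its own count; with it, all forms of the same verb are added up under one lemma.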
Let's learn how to pick out only the verbs from a text and lemmatize them using the NLTK library. It’s quite simple: tokenize the text with word_tokenize, then tag all the words with nltk.pos_tag. After that, keep only the verbs (their tags start with ‘VB’) and lemmatize them with WordNetLemmatizer(), passing pos='v' so that each word is treated as a verb (by default the lemmatizer treats words as nouns).
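# the tokenizer, tagger and WordNet data must be downloaded once with nltk.download();
# if something is missing, NLTK raises a LookupError that names the resource
import nltk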
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
# read the whole book into a string; the with-block closes the file afterwards
with open("books/carrol/through_the_looking-glass.txt", "r", encoding="utf-8") as f:
    text = f.read()
# tokenize the text
text_tokens = word_tokenize(text)
# tag all the words
tagged_tokens = nltk.pos_tag(text_tokens)
# filter all the verbs
verbs = [word for word, tag in tagged_tokens if tag.startswith('VB')]
# lemmatize the verbs; pos='v' tells the lemmatizer to treat each word as a verb
lemmatizer = WordNetLemmatizer()
lemmatized_verbs = [lemmatizer.lemmatize(verb, pos='v') for verb in verbs]
# wrap the lemmas in an nltk.Text object (note: this rebinds the name text)
text = nltk.Text(lemmatized_verbs)
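Once the verbs are lemmatized, counting them is a short step. The sketch below uses a small hand-written list in place of lemmatized_verbs so that it runs without the book file or any NLTK data. The names top_verbs and sample_lemmas are illustrative and not part of the script above.

```python
from collections import Counter

def top_verbs(lemmas, n=3):
    """Return the n most frequent lemmas as (lemma, count) pairs."""
    # lower-case first so that 'Say' and 'say' are counted together
    return Counter(lemma.lower() for lemma in lemmas).most_common(n)

# Hypothetical stand-in for lemmatized_verbs from the script above.
sample_lemmas = ["say", "go", "Say", "be", "go", "say", "be", "be", "be"]

print(top_verbs(sample_lemmas, n=2))  # [('be', 4), ('say', 3)]
```

With the real data you would call something like top_verbs(lemmatized_verbs, n=10). NLTK's own nltk.FreqDist class is a Counter subclass and offers the same most_common method.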