Here, we will begin exploring natural language processing. We will count words in a text and learn how to create a word cloud.
Let's read the file and check that everything works:
with open('books/carrol/alice_in_wonderland.txt', "r", encoding="utf-8") as f:
    text = f.read()
print(text[:150])
The output will be something like:
Alice's Adventures in Wonderland
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
Let’s clean the text up a little:
# import built-in Python modules
import string
import re
# lowercase text
text = text.lower()
# get punctuation chars
spec_chars = string.punctuation
# remove punctuation characters from the text
text = "".join([ch for ch in text if ch not in spec_chars])
# replace line breaks with spaces
text = re.sub('\n', ' ', text)
print(text[:100])
The output:
alices adventures in wonderland alices adventures in wonderland
The text still contains runs of extra spaces as well as prepositions and other common function words, and we want to get rid of them.
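If you want to collapse the repeated spaces right away, here is a minimal sketch using the re module we already imported (the tokenizer in the next step would ignore the extra spaces anyway, so this is optional):

import re

# collapse runs of whitespace into a single space and trim the ends
text = re.sub(r'\s+', ' ', text).strip()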
Next, we'll need NLTK — a library for text processing.
Tokenization is segmentation, or splitting text into individual components, and tokens are those components themselves. Since we're looking for the most popular words, we need to tokenize at the word level. For example, if we tokenize the sentence "The quick brown fox jumps over the lazy dog" by words, the tokens will be: 'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'.
So we need to install the library:
pip install nltk
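NLTK's tokenizer and stopword list rely on data packages that are downloaded separately from the library itself. A minimal sketch of the setup (the exact package names can differ slightly between NLTK versions, e.g. newer releases may also ask for 'punkt_tab'):

import nltk
from nltk import word_tokenize

# download the tokenizer models and the stopword lists once
nltk.download('punkt')
nltk.download('stopwords')

# quick sanity check on the example sentence from above
print(word_tokenize("The quick brown fox jumps over the lazy dog"))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']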
Let’s write some code:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
# tokenize text
text_tokens = word_tokenize(text)
# wrap the tokens in an nltk.Text object
text = nltk.Text(text_tokens)
# get the built-in English stopwords (prepositions, articles, etc.)
english_stopwords = stopwords.words("english")
# add arbitrary stopwords
english_stopwords.extend(['one', 'im', 'said'])
# exclude the stopwords from tokens
text_tokens = [token.strip() for token in text_tokens if token not in english_stopwords]
# rebuild the nltk.Text object from the filtered tokens
text = nltk.Text(text_tokens)
# FreqDist calculates the frequency of each element in the dataset.
fdist_sw = FreqDist(text)
# print the most common words
print(fdist_sw.most_common(5))
Output: