Topic Modeling is a technique you have probably heard of many times if you are into Natural Language Processing (NLP). It is commonly used for document clustering, not only in text analysis but also in search and recommendation engines.
This tutorial will guide you through implementing its most popular algorithm, Latent Dirichlet Allocation (LDA), step by step in the context of a complete pipeline. First, we will learn about the inner workings of LDA. Then, we will use scikit-learn for data preprocessing and model implementation, and pyLDAvis for visualization. As a little extra, we will also do our own data collection with newspaper3k.
Sounds good? Let’s start!
Latent Dirichlet Allocation (LDA) is an unsupervised algorithm that assigns each document a value for each defined topic (let’s say, we decide to look for 5 different topics in our corpus). Latent is another word for hidden (i.e., features that cannot be directly measured), while Dirichlet is a type of probability distribution.
LDA considers each document as a mix of topics and each topic as a mix of words. It starts by randomly assigning each word in each document to a topic, then repeatedly re-evaluates those assignments based on how often the word occurs in that topic and which other words it occurs with.
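The two "mixtures" can be sketched numerically. The numbers below are made up purely for illustration; the shapes and the row-sums-to-one property are what matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of LDA's two building blocks (hypothetical sizes):
# each document is a mixture of topics, each topic a mixture of words.
n_docs, n_topics, n_words = 4, 2, 6

# Document-topic distribution: one row per document, rows sum to 1.
doc_topic = rng.dirichlet(alpha=np.ones(n_topics), size=n_docs)

# Topic-word distribution: one row per topic, rows sum to 1.
topic_word = rng.dirichlet(alpha=np.ones(n_words), size=n_topics)

# A document's expected word distribution is the mixture of its
# topics' word distributions.
doc_word = doc_topic @ topic_word
print(doc_word.sum(axis=1))  # each document's word probabilities sum to 1
```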
This approach follows a similar way of thought as we humans would. This makes LDA easier to interpret and one of the most popular methods out there. The trickiest part of it though is to figure out the optimal number of topics and iterations.
Latent Dirichlet Allocation is not to be confused with Linear Discriminant Analysis (also abbreviated as LDA). Linear Discriminant Analysis is a supervised dimensionality reduction technique used for the classification or preprocessing of high-dimensional data.
Now, let’s see LDA in action to make some sense out of this introduction.
To spice things up, let’s use our own dataset! For this, we will use the newspaper3k library, a wonderful tool for easy article scraping.
!pip install newspaper3k

import newspaper
from newspaper import Article
We will be using the build functionality to collect the URLs on our chosen news website’s main page.
# Save URLs from main page.
news = newspaper.build("https://www.theguardian.com/us", memoize_articles=False)
By passing the memoize_articles argument as False, we ensure that, if we call the function a second time, all the URLs will be collected again. Otherwise, only the new URLs would be returned. We can check news.size() to get the number of collected news URLs. In our case, 143.
Next, we simply need to pass each URL through Article(), call download() and parse(), and finally, we can get the article's text. We also add a length condition to avoid storing some previously spotted exceptions. That way, we ensure adding only long texts to our dataset.
texts = []

# For each URL, get the corresponding article.
for item in news.articles:
    article = Article(item.url)
    try:
        article.download()
        article.parse()
    except Exception:
        # Skip articles that fail to download or parse.
        continue
    # Keep the text only if it has more than 60 characters -- to avoid undesired exceptions.
    if len(article.text) > 60:
        texts.append(article.text)
After running these lines, the total number of news articles is 132.
The next step is to prepare the input data for the LDA model. LDA takes as input a document-term matrix.
We will be using Bag of Words, specifically the CountVectorizer implementation from scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(stop_words=stopwords,
                                 lowercase=True,
                                 max_df=0.5,
                                 min_df=10)
bow_matrix = bow_vectorizer.fit_transform(texts)
There are a couple of things to mention here. First, it is essential not to forget to remove stopwords, passed via the stop_words parameter. We set lowercase=True for increased normalization, and we use max_df and min_df to filter out high-frequency words (common words not in the stopwords list that do not add much meaning) as well as very low-frequency terms.
Our resulting Bag of Words has a shape of (132, 438).
With that in place, it is time to use the LDA algorithm.
Using scikit-learn’s implementation of this algorithm is really easy. However, this abstraction can make it really difficult to understand what is going on behind the scenes. It is important to have at least some intuition on how the algorithms we use actually work, so let’s recap a bit on the explanations from the introduction.
from sklearn.decomposition import LatentDirichletAllocation as LDA

lda_bow = LDA(n_components=5, random_state=42)
lda_bow.fit(bow_matrix)
LDA needs three inputs: a document-term matrix, the number of topics we estimate the documents should have, and the number of iterations for the model to figure out the optimal words-per-topic combinations.
n_components corresponds to the number of topics, here, 5 as a first guess.
The number of iterations (max_iter) is 10 by default, so we can omit that parameter.
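For reference, this is the same configuration with the iteration count written out explicitly (a sketch; max_iter is scikit-learn's name for this parameter):

```python
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Equivalent to relying on the default: max_iter=10 controls how many
# passes the model makes over the data while fitting.
lda_bow = LDA(n_components=5, max_iter=10, random_state=42)
```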
Having the configuration of our LDA model set up under the lda_bow variable, we fit (train) it on the BOW matrix.
lda_bow.transform(bow_matrix[:2])
By calling transform, we get to see the results of the trained model. This gives us a good picture of how it actually works. We pass only the first two rows of our BOW matrix as an example.
array([[0.76662544, 0.01858679, 0.0183296 , 0.17813906, 0.01831911],
[0.00103261, 0.00102449, 0.001021 , 0.00102753, 0.99589436]])
As you can see, we have 5 values in each of the two vectors. Each value represents a topic (remember we told the model to find 5 different topics). Specifically, it illustrates how much of that topic is covered in that document (vector). This makes sense since a document is usually made up of several (sub)topics.
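Since each vector is a probability distribution over topics, its values sum to (approximately) 1. We can check this directly on the two rows above:

```python
import numpy as np

# The two document-topic vectors returned by transform above.
doc_topics = np.array([
    [0.76662544, 0.01858679, 0.0183296, 0.17813906, 0.01831911],
    [0.00103261, 0.00102449, 0.001021, 0.00102753, 0.99589436],
])

# Each row is a distribution over the 5 topics, so it sums to ~1.
print(doc_topics.sum(axis=1))
```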
Let’s now print the most common words for each topic:
for idx, topic in enumerate(lda_bow.components_):
    print(f"Top 5 words in Topic #{idx}:")
    # In scikit-learn < 1.0, use get_feature_names() instead.
    print([bow_vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-5:]])
    print('')
The output looks like this:
Top 5 words in Topic #0:
['time', 'years', 'life', 'says', 'like']
Top 5 words in Topic #1:
['public', 'york', 'new', 'police', 'trump']
Top 5 words in Topic #2:
['white', 'decision', 'international', 'black', 'uk']
Top 5 words in Topic #3:
['like', 'year', 'food', 'police', 'city']
Top 5 words in Topic #4:
['bill', 'democrats', 'rights', 'voting', 'biden']
This type of visualization is actually an excellent indicator of how well our topic model is performing. Words such as “like” or “says” do not provide much meaning. One way around this is to apply lemmatization and add these undesired words to our stopwords list. Let’s improve our current model next.
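As a sketch of the stopword fix, assuming we start from scikit-learn's built-in English stopword list, the extra low-signal words can be folded in like this (the exact words added are hypothetical choices based on the topics above):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical low-signal words spotted in the topic listings.
extra_stopwords = {"like", "says", "said", "just"}

# Combined list, suitable to pass as stop_words to CountVectorizer.
stopwords = list(ENGLISH_STOP_WORDS.union(extra_stopwords))
```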
Coming back to the preprocessing step is very common and often necessary; after all, Machine Learning is an iterative process. In our case, we need to improve our Bag of Words so that it does not take into account some very frequent words that could not be filtered out with the previous approach.
Furthermore, it would be good to add a lemmatizer to avoid repeated words under different forms. For the first case, we just need to add our new list of stopwords to the already defined set of stopwords. For the second step though, CountVectorizer does not integrate a lemmatizer, so we have to create our own lemmatizer class and pass it to the tokenizer parameter. No need to worry much here: scikit-learn has you covered with their documentation on how to customize your vectorizer in this particular case.
import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        # Keep only alphabetic tokens with at least two characters.
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)
                if t.isalpha() and len(t) >= 2]
We download first some necessary packages and import the corresponding dependencies. The LemmaTokenizer class is the same as in the documentation except for two extra conditions we add to account only for tokens with alphabetic characters and with more than one letter. Otherwise, your topics will be flooded with punctuation and other undesired tokens. Now, we only have to pass our new parameter to the vectorizer. The rest remains as before.
bow_vectorizer = CountVectorizer(stop_words=stopwords,
                                 tokenizer=LemmaTokenizer(),
                                 lowercase=True,
                                 max_df=0.5,
                                 min_df=10)
bow_matrix = bow_vectorizer.fit_transform(texts)
If we run all again, we see that indeed the most common words for our topics do change.
Top 5 words in Topic #0:
['experience', 'event', 'life', 'year', 'city']
Top 5 words in Topic #1:
['republican', 'voting', 'right', 'trump', 'biden']
Top 5 words in Topic #2:
['film', 'life', 'new', 'time', 'year']
Top 5 words in Topic #3:
['year', 'vaccine', 'food', 'city', 'police']
Top 5 words in Topic #4:
['week', 'governor', 'new', 'state', 'woman']
That is looking good, well done!
One last step in our Topic Modeling analysis has to be visualization. One popular tool for interactive plotting of Latent Dirichlet Allocation results is pyLDAvis.
!pip install pyldavis

import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

# Build the interactive panel from the fitted model, the BOW matrix,
# and the vectorizer.
panel = pyLDAvis.sklearn.prepare(lda_bow, bow_matrix, bow_vectorizer)
panel
Make sure to import the corresponding module to the main library you are using for Topic Modeling (in our case, scikit-learn).
Again, this step will help us determine how well our model is performing. Let’s take a look at the visualizations as they were before improving our vectorizer with the lemmatizer.
NLP Topic modeling – Source: Omdena
There are two main parts to pyLDAvis. On the left side, the Intertopic Distance Map shows each topic as a bubble. The bigger the bubble, the higher the number of documents in our corpus belonging to that topic. The more distanced the bubbles are from each other, the more different their topics are. On the right side, the Top-30 Most Relevant Terms for Topic N panel is a barplot with two indicators: in blue, the overall frequency of that term in the corpus, and in red, its estimated frequency within the selected topic.
NLP Topic modeling – Source: Omdena
It seems we did not have a bad result after all! Now let’s see how the plot looks after lemmatization. The bubble sizes are more irregular, and Topic 1 has a very large bubble that overlaps in great part with Topic 5. One thing we could explore further is the number of topics: it may be that five topics are too many for our limited dataset. After some tweaking, we conclude that 3 topics without the lemmatizer gives the best results in our case. The topics may still not make complete sense, or may sound repetitive or weak; there is nothing wrong with that. Gathering more data can add variety to our results and solidify the output. Feel free to experiment with a larger number of news articles or with your previously scraped tweets from Part 1.
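One hypothetical way to guide that tweaking is to compare candidate topic counts by model perplexity, where lower generally indicates a better fit (though it should never be the only criterion). The sketch below uses a synthetic count matrix standing in for our BOW matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Synthetic document-term counts, standing in for the real BOW matrix.
rng = np.random.default_rng(42)
counts = rng.integers(0, 5, size=(40, 30))

# Fit one model per candidate topic count and compare perplexities.
for k in (3, 5, 7):
    lda_k = LDA(n_components=k, random_state=42).fit(counts)
    print(k, lda_k.perplexity(counts))
```

On real data you would run this on bow_matrix and also inspect the topics qualitatively before settling on a number.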
In this tutorial, we learned about Latent Dirichlet Allocation. We built some intuition of the whole process and are ready to improve our first outputs by observing the performance of several parameters in our LDA implementation with the help of pyLDAvis. Now it’s time to put this into practice! Happy coding!
This article is written by Jessica Becerra Formoso.
If you’re interested in collaborating, apply to join an Omdena project at: https://www.omdena.com/projects