Topic Modeling with Python

Lab 4: Code-Along

Agenda

Topic Modeling and LDA Model
Python gensim Library for LDA
Perform Topic Modeling with gensim Library
- Import and Preprocess Data (Lab 1.2; Lab 1.3)
- Preprocess Data for LDA Modeling
- Fit an LDA Model
- Find optimal # of topics - K
- Topic Summaries and Visualization
  - pyLDAvis

Topic Modeling and LDA Model

What is topic modeling?

Topic modeling is a family of unsupervised statistical methods for uncovering the abstract “topics” that occur in a collection of documents. Its primary goal is to discover the hidden thematic structure in large archives of text, which helps in organizing, understanding, and summarizing them.

What is LDA?

Latent Dirichlet Allocation (LDA) is one of the most popular algorithms for topic modeling. LDA assumes that each document is a mixture of topics and that each topic is a probability distribution over words. By analyzing patterns of word co-occurrence across documents, it identifies groups of words that frequently appear together and assigns them to topics.
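The "mixture" assumption can be illustrated by running LDA's generative story forward. The sketch below is illustrative only: the two topics, their word probabilities, the `alpha` value, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical. Fitting an LDA model works in the opposite direction, estimating the mixtures and topics from observed documents.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words
TOPICS = {
    "drinks": {"coke": 0.4, "pepsi": 0.3, "soda": 0.2, "juice": 0.1},
    "statistics": {"sample": 0.4, "population": 0.3, "variable": 0.2, "statistic": 0.1},
}

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Generate one document from fixed topics, following LDA's generative story."""
    rng = random.Random(seed)
    topic_names = list(TOPICS)
    # 1. Draw this document's topic mixture
    theta = sample_dirichlet([alpha] * len(topic_names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # 3. Pick a word from that topic's word distribution
        vocab = list(TOPICS[topic])
        weights = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return dict(zip(topic_names, theta)), words

mixture, document = generate_document(n_words=8)
print(mixture)   # topic proportions for this document (sum to 1)
print(document)  # words drawn from the mixed topics
```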

Python LDA Library - Gensim

Gensim

Gensim is an open-source Python library designed for natural language processing (NLP) tasks such as topic modeling, document indexing, and similarity retrieval, particularly with large text corpora. It provides efficient implementations of popular topic modeling algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).

Perform LDA Modeling with Gensim Library

LDA Topic Modeling with gensim

!pip install nltk gensim matplotlib pyLDAvis
import pandas as pd
import nltk
import gensim
import matplotlib.pyplot as plt
import pyLDAvis

from gensim import corpora, models
from gensim.models.ldamodel import LdaModel

# Example corpus
corpus = [
    ["coke", "pepsi", "comparison", "hypothesis", "differences"],
    ["sample", "population", "variable", "statistic"],
    ["data", "water", "juice", "soda"]
]

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(corpus)

# Convert the documents to a bag-of-words corpus
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus]

# Train the LDA model
lda_model = models.LdaModel(corpus_bow, num_topics=2, id2word=dictionary, passes=10)

# Print the topics found
for topic_id, topic in lda_model.print_topics():
    print(f"Topic {topic_id}: {topic}")

Topic 0: 0.141*"variable" + 0.141*"population" + 0.141*"sample" + 0.141*"statistic" + 0.049*"water" + 0.049*"comparison" + 0.048*"hypothesis" + 0.048*"coke" + 0.048*"data" + 0.048*"juice"
Topic 1: 0.097*"differences" + 0.097*"pepsi" + 0.097*"soda" + 0.096*"juice" + 0.096*"data" + 0.096*"coke" + 0.096*"hypothesis" + 0.096*"comparison" + 0.096*"water" + 0.033*"statistic"

Find optimal # of topics - K

CoherenceModel Approach

Coherence measures how interpretable and semantically consistent a model's topics are. Higher scores are better: they indicate that the top words within a topic are more closely related, so the topic is more meaningful and distinct. To choose K, fit models over a range of topic counts and compare their coherence scores.
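To build intuition for what a coherence score rewards, here is a minimal sketch of a simplified UMass-style measure based on document co-occurrence counts. This is not the `c_v` measure used in this lab and not gensim's implementation. The function name `umass_style_coherence` and the toy documents are hypothetical, and the sketch assumes every top word appears in at least one document.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_i)) over ordered pairs of top words.

    D(...) counts the documents that contain all the given words.
    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = [
        math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
        for higher, lower in combinations(top_words, 2)  # higher-ranked word first
    ]
    return sum(scores) / len(scores)

# Hypothetical toy documents
documents = [
    ["coke", "pepsi", "soda"],
    ["coke", "soda", "juice"],
    ["sample", "population", "statistic"],
    ["sample", "statistic", "variable"],
]

coherent_topic = ["coke", "soda", "pepsi"]    # words that share documents
mixed_topic = ["coke", "sample", "variable"]  # words that rarely co-occur

print(umass_style_coherence(coherent_topic, documents))
print(umass_style_coherence(mixed_topic, documents))
```

The topic whose words appear together in the same documents scores higher than the mixed one, which is the behavior a coherence measure is meant to capture.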

How to implement CoherenceModel approach?

# Import needed libraries
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
from gensim import corpora, models
from gensim.models.ldamodel import LdaModel

# Example corpus
corpus = [
    ["coke", "pepsi", "comparison", "hypothesis", "differences"],
    ["sample", "population", "variable", "statistic"], 
    ["data", "water", "juice", "soda"]
]

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(corpus)

# Convert the documents to a bag-of-words corpus 
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus]

# Define a function to compute coherence values
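def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    """Fit one LDA model per topic count in range(start, limit, step) and score each with c_v coherence."""
    coherence_values = []
    model_list = []

    for num_topics in range(start, limit, step):
        # Fix random_state so that scores are comparable across repeated runs
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
        model_list.append(model)
        # c_v coherence is computed from the tokenized texts, not the bag-of-words corpus
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v', processes=1)
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values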

# Set parameters
start = 2  # Minimum number of topics
limit = 5  # Upper bound (exclusive): range(start, limit, step) tries K = 2, 3, 4
step = 1   # Step size for the number of topics

# Compute coherence values
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus_bow, texts=corpus, start=start, limit=limit, step=step)

# Plot the result
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.title("Optimal Number of Topics")
plt.xticks(x)
plt.grid()
plt.show()
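Reading the best K off the plot can also be done programmatically. The helper below is a minimal sketch: the name `best_num_topics` and the example scores are hypothetical and are not results from this lab.

```python
def best_num_topics(coherence_values, start, step):
    """Return (index, K) of the highest coherence score in the sweep."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return best_index, start + best_index * step

# Hypothetical scores for K = 2, 3, 4 (illustration only, not real results)
example_scores = [0.41, 0.47, 0.39]
best_index, best_k = best_num_topics(example_scores, start=2, step=1)
print(f"Best K = {best_k} (coherence = {example_scores[best_index]:.2f})")
# prints: Best K = 3 (coherence = 0.47)
```

With the variables from this lab, the call would be `best_num_topics(coherence_values, start, step)`, and `model_list[best_index]` is the matching fitted model. Unless a `random_state` is fixed when fitting, the scores and therefore the selected K can change from run to run.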

Your Turn

Fit an LDA model with the optimal number of topics

# YOUR CODE IS HERE

Topic Summaries and Visualization

pyLDAvis

What is pyLDAvis?

pyLDAvis is a library for visualizing the results of LDA topic modeling. It provides an interactive dashboard that helps users interpret the topics discovered by LDA.

How to create a visualization with pyLDAvis?

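# Import needed libraries
import pyLDAvis
# Since pyLDAvis 3.3 the gensim helper lives in pyLDAvis.gensim_models (formerly pyLDAvis.gensim)
import pyLDAvis.gensim_models

# Prepare the visualization
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus_bow, dictionary)

# Display the visualization
pyLDAvis.display(vis)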

Your Turn

Create a pyLDAvis visualization with the optimal model that you found.

# YOUR CODE IS HERE