INM-EXPLAIN

TOPIC MODELING

The main goal of this analysis is to use topic modeling to explore and understand the controversial topics discussed in a set of tweets. Topic modeling identifies clusters of words that tend to occur together, and these clusters reveal the key themes of the tweets. We use the LDA (Latent Dirichlet Allocation) model, introduced by David Blei, Andrew Ng, and Michael Jordan, to extract word clusters that can be interpreted as the main themes of textual documents. By applying it to our tweet corpus, we aim to uncover and explore the key themes in these online discussions.

First, we combined three tweet datasets on non-medical interventions for cancer treatment (sports practice, cannabis use, and intermittent fasting) to build an LDA model. After text extraction, the following steps were performed:
- Removal of stop words, such as "the," and of words with fewer than three letters, such as "a."
- Lemmatization and stemming to reduce words to their canonical forms and roots, so that, e.g., "happiness" and "happy" share the same root.
- Conversion of the text into a dictionary and a bag-of-words frequency representation.

We first built the LDA models with the Gensim library, adjusting the parameters that affect the distribution of topics and words: the number of iterations, the number of topics, and the alpha and beta values. Alpha is the Dirichlet concentration parameter for the distribution of topics per document, and beta is the one for the distribution of words per topic. Higher values make these distributions more uniform (each document mixes more topics, and each topic spreads over more words), while lower values make them sparser, which affects the coherence and diversity of topics across documents.

Finally, we chose to use the scikit-learn library, adjusting the number of topics and the learning method. This resulted in a coherence score of 0.53, which we consider a good result, since a model is generally deemed appropriate when its score falls between 0.4 and 0.7. This allowed us to define clear and distinct themes, which we develop in the next section of this analysis.
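The preprocessing steps and the effect of the Dirichlet concentration can be illustrated with a minimal, standard-library-only sketch. It is not the code used in this analysis: the stop-word list, the sample tweets, and every function name (`toy_stem`, `preprocess`, `build_dictionary`, `doc2bow`, `sample_dirichlet`, `mean_peak`) are hypothetical, and the crude suffix stripper only stands in for a real lemmatizer and stemmer. The bag-of-words output is a list of (token id, count) pairs, loosely similar in shape to what Gensim produces.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}

def toy_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ness", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words and words under three letters, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS and len(t) >= 3]

def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Return the document as sorted (token id, count) pairs."""
    return sorted(Counter(token2id[t] for t in doc).items())

def sample_dirichlet(concentration, size, rng):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_peak(concentration, rng, n_topics=5, n_samples=200):
    """Average weight of the largest topic; values near 1 mean sparse mixtures."""
    peaks = [max(sample_dirichlet(concentration, n_topics, rng))
             for _ in range(n_samples)]
    return sum(peaks) / n_samples

# Made-up example tweets, for illustration only.
tweets = [
    "Fasting and the benefits of fasting for patients",
    "Running helps patients with fatigue",
]
docs = [preprocess(t) for t in tweets]
token2id = build_dictionary(docs)
bows = [doc2bow(doc, token2id) for doc in docs]
print(docs[0])  # ['fast', 'benefit', 'fast', 'patient']
print(bows[0])  # [(0, 2), (1, 1), (2, 1)]

rng = random.Random(0)
sparse_peak = mean_peak(0.1, rng)    # low alpha: one topic tends to dominate
uniform_peak = mean_peak(10.0, rng)  # high alpha: topics are mixed more evenly
print(sparse_peak > uniform_peak)
```

In a real pipeline, the bag-of-words vectors would then be passed to the LDA implementation, and the concentration values would be set through that library's own alpha and beta parameters rather than sampled by hand as here.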

Top Words by Label