INM-EXPLAIN

Image 1

COMMUNITY DETECTION

The main objective was to identify patterns and relationships between hashtags in tweets, and to quantify them, in order to understand the discussions and perceptions around these themes. We began by loading data from three distinct sources, all related to cancer. Because the sources differed greatly in size, we drew a 1.5% random sample from the largest dataset, Cancer and Cannabis, which gave a more balanced representation with a total of 2,391 tweets.

To analyze the tweets we used the K-Means clustering algorithm, a popular choice for grouping data by similarity. The goal was to cluster the tweets so that those within the same cluster are more similar to each other than to those in other clusters. We first cleaned the tweets by converting the hashtags to character strings, which simplifies the analysis by eliminating unnecessary variations. Next, we applied TF-IDF vectorization to the hashtags. TF-IDF weights each hashtag by its importance within the tweet set while reducing the impact of hashtags that appear very frequently.

Determining the optimal number of clusters is crucial. We used the elbow method: we calculated the within-cluster sum of squares (WSS) for a range of cluster counts and identified the point beyond which adding more clusters no longer reduces the WSS significantly. With the number of clusters determined, we applied the K-Means algorithm, fixing the random state parameter to ensure reproducibility. The algorithm assigns each tweet to a cluster so as to minimize intra-cluster variance.
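
The steps described above (hashtag cleaning, TF-IDF vectorization, the elbow method, and K-Means with a fixed random state) can be sketched as follows. This is only a minimal illustration, not the code used in the analysis: it assumes scikit-learn, which the text does not name, and the toy hashtags, the helper names hashtags_to_text and elbow_curve, the range of cluster counts, the seed 42, and the chosen number of clusters are all hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one list of hashtags per tweet (not the study's data).
tweet_hashtags = [
    ["Cancer", "Cannabis"],
    ["cannabis", "CBD", "cancer"],
    ["cbd", "cannabis"],
    ["BreastCancer", "awareness"],
    ["awareness", "breastcancer", "pinkribbon"],
    ["pinkribbon", "awareness"],
    ["chemo", "treatment"],
    ["treatment", "chemo", "oncology"],
    ["oncology", "treatment"],
]


def hashtags_to_text(tags):
    """Join a tweet's hashtags into one lowercase, space-separated string."""
    return " ".join(tag.lower().lstrip("#") for tag in tags)


def elbow_curve(matrix, k_values, random_state=42):
    """Fit K-Means for each k and return the WSS (scikit-learn's inertia_)."""
    wss = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(matrix)
        wss.append(model.inertia_)
    return wss


# 1. Clean: turn each tweet's hashtags into a single string.
docs = [hashtags_to_text(tags) for tags in tweet_hashtags]

# 2. Vectorize: TF-IDF down-weights hashtags that occur in many tweets.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# 3. Elbow method: inspect how WSS falls as the number of clusters grows.
k_values = list(range(1, 7))
wss = elbow_curve(X, k_values)
for k, value in zip(k_values, wss):
    print(f"k={k}: WSS={value:.3f}")

# 4. Final clustering with a fixed random state for reproducibility.
chosen_k = 3  # assumption: in practice, read this off the elbow in the curve
final_model = KMeans(n_clusters=chosen_k, n_init=10, random_state=42)
labels = final_model.fit_predict(X)
print(labels)
```

In practice the WSS values would be plotted against the number of clusters and the elbow chosen by eye, so chosen_k here stands in for that manual step. Fixing random_state makes the centroid initialization, and therefore the cluster assignments, repeatable across runs.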