The Communalytic Topic Analyzer Module can automatically identify and group together social media posts that are semantically similar (i.e., similar in their meaning). It can spot latent topics in a dataset (i.e., abstract topics that may not be directly observable from just reading the posts).
The Topic Analyzer Module is designed to help researchers make sense of their social media dataset without having to scroll through endless Excel files, read and review every post, or even have prior knowledge about the dataset’s content.
Watch a recent webinar on using the Topic Analysis module in Communalytic #
Note: If you plan to follow the hands-on exercise presented in the webinar, you can download the provided sample dataset from https://bit.ly/hicss25-dataset.
Creating and Clustering Embeddings #
The Module uses a sentence-embedding model to represent posts as vector embeddings. These embeddings capture the semantic meaning of the text and can be used for various natural language processing tasks, such as sentence similarity, clustering, and retrieval. For more information on embeddings, see here and here.
Communalytic analyzes posts using Voyage AI multilingual embedding models (Voyage-3-lite in Communalytic EDU and Voyage-3 in Communalytic PRO). These are general-purpose models optimized for multilingual retrieval. They have been shown to outperform similar models on texts in 27 languages: Arabic, Bengali, Czech, Danish, Dutch, English, French, Georgian, German, Greek, Hungarian, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Urdu, Vietnamese. Communalytic explicitly opted out of allowing Voyage AI to use submitted content for training and requested its immediate deletion after processing.
Since embeddings are vectors in a multidimensional space, Communalytic uses a dimension reduction technique called UMAP to reduce the dimensions of embeddings to 3 for visualization purposes. The final step is to group posts/embeddings located close to each other in the 3D space using a clustering algorithm. Communalytic currently supports the following clustering options: Fast HDBSCAN, HDBSCAN, KMeans, and Gaussian Mixture. See more details in the “How To” section below.
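The pipeline above can be sketched in a few lines. This is not Communalytic's code: random vectors stand in for the Voyage AI embeddings, PCA stands in for UMAP (both expose the same `fit_transform` API and project to 3 components), and KMeans is used as one of the supported clustering options.

```python
# Sketch of the Topic Analyzer pipeline: embed -> reduce to 3 dimensions -> cluster.
# Random vectors stand in for Voyage AI embeddings; PCA stands in for UMAP.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))  # 200 posts, 512-dim embedding vectors

coords_3d = PCA(n_components=3).fit_transform(embeddings)  # for the 3D map
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords_3d)
print(coords_3d.shape)  # (200, 3): one 3D point per post
```

Each row of `coords_3d` becomes one dot in the 3D Semantic Similarity Map, and `labels` determines its color.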
Visualizing Embeddings #
Once embeddings are projected into a 3D space and grouped based on semantic similarity, they are visualized using a built-in visualizer – the 3D Semantic Similarity Map. Below is a sample visualization. Each dot in the Map represents a post, and colors are automatically assigned based on the selected clustering algorithm.
To examine latent topics within your dataset, you can manually review sample posts in each cluster and assign a descriptive label. To help you with this process, the 3D Semantic Similarity Map has a feature to automatically suggest a label based on one of the available LLMs (such as gemma or mistral).
How To #
1) Log in to your Communalytic account and click on the “Topic Analysis” icon located next to the dataset you want to analyze (found under “My Datasets”):

2) Click the “Visualize Embeddings” button to generate and visualize embeddings using the default clustering settings. If you are dissatisfied with the resulting visualization based on the default settings (e.g., getting too few or too many clusters), you can change the clustering algorithm or adjust its settings as outlined below.

Clustering Algorithms #
Communalytic currently supports the following clustering algorithms:
- Fast HDBSCAN (the default): best for data with varying densities and outliers, but requires parameter tuning (see more details below).
- HDBSCAN: similar to Fast HDBSCAN but slower, as it automatically estimates Epsilon and Minimum Sample Size.
- KMeans: efficient and easy to use, but assumes spherical clusters and is sensitive to outliers.
- Gaussian Mixture: offers flexible cluster shapes with probabilistic memberships, but assumes Gaussian distributions.
Choosing the “right” clustering algorithm depends on the nature of your data and the specific requirements of your analysis. If you expect an approximate number of clusters based on your familiarity with the dataset content, we suggest using Gaussian Mixture. Otherwise, use Fast HDBSCAN or HDBSCAN to determine the “optimal” number of clusters automatically.
When using Fast HDBSCAN for clustering, you can adjust the following parameters to control the number of resulting clusters:
- Epsilon controls the distance for neighborhood consideration, impacting how close points need to be to form a cluster. Hint: Values closer to 1 will produce fewer clusters. We suggest starting with 0.1 (the default) and then gradually increasing it by 0.1 until satisfied with the number of resulting clusters. (You will know that the value is too high if you end up with a single cluster.)
- Minimum Cluster Size sets a threshold for the minimum number of posts required to form a cluster. This helps to avoid creating clusters with too few posts, capturing overly granular topics that are time-consuming to review and label manually. Hint: This parameter largely depends on the size of your dataset. For smaller datasets (fewer than 1k posts), use the default value of 10. However, if you are getting too many clusters (over 100), consider increasing this value.
- Minimum Sample Size determines the density requirement for a point to be considered a core point and thus part of a cluster. Hint: Setting higher values will result in more outliers (posts that are not assigned to any cluster), but this is not a problem if the goal is to identify groupings of strongly similar posts. However, if the aim is to reduce the number of isolates, lower this value to 10 (the default) or below. This will assign ‘borderline’ posts (semantically speaking) to the most relevant cluster.
Below are some additional considerations when setting Minimum Cluster Size and Minimum Sample Size for HDBSCAN:
| Desired Cluster Configuration | Minimum Cluster Size | Minimum Sample Size |
|---|---|---|
| More clusters (highly specific) | Small (10-50) | Small (1-10) |
| Fewer clusters (generalized clusters with some specificity) | Large (>50) | Small (1-10) |
| Very general clusters (more posts labelled as ‘outliers’) | Large (>50) | Large (>10) |
Davies-Bouldin Index (DBI) #
Choosing and iteratively adjusting parameters is crucial for guiding the clustering algorithm toward the optimal number of clusters, striking a balance between too many clusters representing overly granular topics and too few representing overly abstract ones.
To help you determine whether changes to clustering parameters improve cluster quality, refer to the Davies-Bouldin Index (DBI), displayed in the title of the 3D visualization (see below).

The DBI is a metric used to assess the structure of clusters and provides an overall “quality” score. It evaluates two features: i) the density of each cluster, or how close points are within a cluster, and ii) the separation between clusters, or how far apart clusters are from each other. DBI’s “ideal” score is 0, representing perfectly compact and distinct clusters. However, achieving a score of 0 is often impractical. Instead, aim to reduce the DBI score as much as possible while keeping the number of clusters manageable for human review and maintaining a reasonable number of “outliers” (up to 10% of the total dataset).
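The DBI can be computed directly with scikit-learn's `davies_bouldin_score`. The sketch below (synthetic data, not Communalytic's code) compares DBI scores across different cluster counts, mirroring the tuning loop described above: try settings, check the score, keep the lower one if the cluster count stays manageable.

```python
# DBI rewards compact, well-separated clusters: lower is better, 0 is ideal.
# Sketch comparing cluster counts on synthetic 3D points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

coords, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    scores[k] = davies_bouldin_score(coords, labels)

print({k: round(v, 3) for k, v in scores.items()})  # lower score = better structure
```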
3) Once the process starts, you will see a progress bar. Since generating embeddings and projecting them in a 3D space is computationally intensive, Communalytic supports the analysis of only three datasets in parallel. If you are a fourth user starting a new analysis, your request will be placed in a queue and automatically begin when it is your turn. You can close the browser and check on its progress later.

4) When the data processing is done, you will see a screen with three buttons: “Open Visualization”, “Change Clustering Parameters”, and “Download Embeddings & Clusters”.

- The “Open Visualization” button will open the 3D Semantic Similarity Map within your browser, allowing you to visualize embeddings in a 3D space.
- The “Change Clustering Parameters” button allows you to adjust the clustering parameters to fine-tune the current visualization.
- The “Download Embeddings & Clusters” button exports the embeddings and cluster labels for the complete dataset as a CSV file. Note: Due to the potentially large size of the output CSV file, it will be delivered as a ZIP file.
5) The 3D Semantic Similarity Map is an interactive visualization that lets you examine the resulting map from different angles and at different zoom levels within your browser, without the need to download any additional software. For the best user experience, we recommend using an up-to-date browser (e.g., Chrome, Edge, or Firefox).
Any posts not containing text (such as those featuring only photos or videos) will be excluded from the map. This means that the number of records visualized in the map may be smaller than the total dataset. Retweets are also excluded from Twitter data to focus on unique topics.
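The filtering described above is why the map may show fewer dots than your dataset has rows. A minimal sketch of the idea, assuming a hypothetical dataset layout (the column names `text` and `is_retweet` are illustrative, not Communalytic's actual schema):

```python
# Sketch: posts with no text, and retweets in Twitter data, are excluded from the map.
# Column names here are hypothetical, not Communalytic's real export schema.
import pandas as pd

posts = pd.DataFrame({
    "text": ["great thread", "", None, "RT @user: original post", "photo caption"],
    "is_retweet": [False, False, False, True, False],
})

has_text = posts["text"].fillna("").str.strip().ne("")
visualized = posts[has_text & ~posts["is_retweet"]]
print(len(posts), len(visualized))  # 5 posts total, 3 pass the text filter, 2 mapped
```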

The 3D view is controlled with a mouse or touch screen. Below is the list of navigation options:
- Left click mouse and move to rotate the camera.
- Right click mouse and move to pan the camera.
- Scroll mouse wheel to zoom in and out.
- Hover on a dot to preview the corresponding post.

To increase or decrease the number of clusters, go to the “Adjust Clustering Parameters” section in the right-side panel, change the clustering parameters, and click the “Apply Changes” button.

6) In the 3D Semantic Similarity Map, manually review sample posts in each cluster and assign a descriptive label following the steps below:
- Use the drop-down menu in the “Examine & Label Clusters” panel to select and zoom in on a specific cluster.
- Under the “Posts in the Selected Cluster” panel, preview up to 1k posts within a given cluster using the “<” and “>” buttons.
- In the text box under “Label Cluster”, enter a short description of the selected cluster (up to 100 characters), then click “Save Cluster Label” to associate the label with all posts in the selected cluster. This label will be stored as part of your dataset and can be renamed later.

7) To help you with this process, you can use one of the available LLMs (such as gemma or mistral) to automatically suggest a label. To use this feature, select a cluster and click the “Suggest a label” button. Communalytic will generate a summary description using a sample of posts from the selected cluster. The sample size is set to 10% of the cluster, with a minimum of 10 posts and a maximum of 100 posts.
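The sampling rule stated above (10% of the cluster, clamped between 10 and 100 posts) can be sketched as a small helper; the function name is illustrative, not Communalytic's API:

```python
# Sketch of the stated sampling rule: 10% of the cluster, min 10, max 100 posts.
# A sample can never exceed the cluster itself, hence the final clamp.
def llm_sample_size(cluster_size: int) -> int:
    return min(max(cluster_size // 10, 10), 100, cluster_size)

print(llm_sample_size(50))    # 10  (10% = 5, raised to the minimum of 10)
print(llm_sample_size(400))   # 40  (10% falls inside the [10, 100] range)
print(llm_sample_size(5000))  # 100 (10% = 500, capped at the maximum of 100)
```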

The current usage cap is set at up to 30 calls per day in Communalytic EDU and up to 100 calls per day in Communalytic PRO. The cap may be adjusted based on actual resource utilization and the cost of using an external LLM service for data processing.
LLMs are provided as a service by Cloudflare. Their availability is not guaranteed, and the selection may change from time to time. Cloudflare does not use submitted content to train AI models or improve its services.
In addition to using Communalytic’s built-in 3D visualization, you can explore your dataset and visualize it with external visualization tools such as Nomic Atlas. This third-party tool allows users to represent and explore embeddings in a 2D space, with features complementary to Communalytic’s built-in visualization, such as semantic search and automatic topic labelling at multiple levels of granularity. You can learn more about this option here.