The Topic Analyzer Module automatically identifies and groups together posts in your dataset that are semantically similar (i.e., similar in their meaning). It can spot latent topics (i.e., abstract topics that may not be directly observable from just reading the posts).
Quick Start: How it Works #
- Semantic Embeddings: The module uses Voyage AI multilingual models (the Voyage-4 family) to represent text as vector embeddings. These embeddings capture the semantic meaning of the text and can be used for various natural language processing tasks, such as sentence similarity, clustering, and retrieval. For more information on embedding, see here and here.
- Voyage AI multilingual embedding models are general-purpose models optimized for multilingual retrieval. They have been shown to outperform similar models on texts in 27 languages: Arabic, Bengali, Czech, Danish, Dutch, English, French, Georgian, German, Greek, Hungarian, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Urdu, Vietnamese. Communalytic instructed Voyage AI not to use submitted content for training and requested its immediate deletion after processing.
- Dimension Reduction: Since embeddings are high-dimensional, Communalytic uses UMAP to project them into a 3D space while preserving the semantic relationships between posts.
- Clustering: The module uses the HDBSCAN algorithm to identify groups of posts that are close to each other in that 3D space. These groups represent the “topics” within your dataset.
- Visualization: Results are displayed in an interactive 3D Semantic Similarity Map. Below is a sample visualization. Each dot in the Map represents a post, and colors are automatically assigned based on the cluster each post belongs to. To examine latent topics within your dataset, you can manually review sample posts in each cluster and assign a descriptive label. To help with this process, the 3D Semantic Similarity Map can automatically suggest a label using an LLM.
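To make “semantically similar” more concrete, the sketch below compares toy vectors with cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors and the function name `cosine_similarity` are invented for illustration; they are not Voyage AI output, and the source does not state which similarity measure Communalytic uses internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up for this example). Real embeddings
# have many more dimensions and are produced by the embedding model.
toy_embeddings = {
    "I love live music festivals": [0.9, 0.1, 0.2],
    "Great concert last night": [0.8, 0.2, 0.1],
    "Interest rates went up again": [0.1, 0.9, 0.3],
}

posts = list(toy_embeddings)
sim_music = cosine_similarity(toy_embeddings[posts[0]], toy_embeddings[posts[1]])
sim_mixed = cosine_similarity(toy_embeddings[posts[0]], toy_embeddings[posts[2]])

print(f"music vs. music:   {sim_music:.2f}")
print(f"music vs. finance: {sim_mixed:.2f}")
```

In this toy setup the two music posts score much higher than the music/finance pair, which is the kind of closeness the clustering step relies on.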
Step-by-Step Instructions #
1) Click on the “Topic Analysis” icon located next to the dataset you want to analyze (found under “My Datasets”):

2) Click the “Start Topic Analysis” button to generate and visualize embeddings using the default clustering settings. If you are dissatisfied with the resulting visualization based on the default settings (e.g., getting too few or too many clusters), you can change the clustering algorithm or adjust its settings as outlined below.


Topic Analysis Workflow #
Once the results are ready, follow the Topic Analysis workflow as listed below:
Step 1: Get Started – Auto-Label Clusters #
Click the green Auto-Label Top 10 Clusters button. This will use AI to automatically assign initial topic labels to your clusters based on a random sample of your posts (typically about 5-10% of them). Auto-labelling requires keeping your browser window open. You’ll see a notification about your daily AI call limit, which resets at midnight (UTC).

Step 2: Quick Review – Preview Your Topics #
Look at the Topic Cluster Distribution bar chart. Click on any bar (e.g., Cluster 1 or Cluster 2) to instantly see a preview of the actual posts belonging to that topic. This helps you validate whether the AI’s initial labels make sense.


Step 3: Explore & Validate – Deep Dive into Clusters #
Click the blue Visualize 3D Similarity Map button to open an interactive visualization in your browser. It lets you examine the resulting clusters from different angles and at different zoom levels, without downloading any additional software. The visualization will also help you confirm and refine the AI’s suggested labels.
Each dot in the visualization is a post from the selected dataset. However, posts not containing text (such as those featuring only photos or videos) will not appear in the visualization. As a result, the number of dots/posts visualized in the map may be smaller than the total dataset size. Retweets from X are also excluded from the map to focus on unique topics.

To review posts grouped under each cluster, use one of the two options:
- Select one of the clusters from the dropdown menu in the left-side panel; or
- Click on any of the dots (=posts) in the visualization.
Either of these actions will open a popup panel with the options to: a) rename the selected cluster; b) search for posts within the selected cluster based on a given keyword(s); or c) preview all posts assigned to the selected cluster. The example below shows the list of posts assigned to the cluster labelled “Canadian Music Celebration [AI label]”.
To help users remember which labels were recommended by AI and which were assigned by the user, Communalytic automatically adds the “[AI label]” suffix to the end of each auto-generated label. Once you review the label and a sample of the corresponding posts, you can manually remove “[AI label]” from the label by renaming it.
If you are collaboratively annotating clusters with another Communalytic user via the Folders or Team account feature, Communalytic will record and display the email address of the user who made the latest revisions to the cluster labels.

Step 4: Refine – Improve Clustering Quality #
Check the Clustering Quality Score (Silhouette Score). In the above example, it’s 0.696, which is marked as “Good Quality.” If the score were below 0.5, you’d click Change Parameters to adjust the clustering settings and re-run the analysis.
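For readers curious about what a silhouette score measures, here is a minimal sketch of the standard definition: for each point, compare its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), and average (b - a) / max(a, b) over all points. The function name, the use of Euclidean distance, and the toy data are assumptions made for illustration; the source does not say how Communalytic computes the score or how it treats outliers.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        largest = max(a, b)
        scores.append(0.0 if largest == 0 else (b - a) / largest)
    return sum(scores) / len(scores)


# Two tight, well-separated toy clusters in 2D (invented data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(f"silhouette: {score:.3f}")
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 suggest overlapping ones, which is why a low score is a cue to adjust the parameters.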




Communalytic currently supports only one clustering algorithm, HDBSCAN, which is best for data with varying densities and outliers but requires tuning the following parameters:
- Epsilon controls the distance for neighborhood consideration, impacting how close points need to be to form a cluster. Hint: Values closer to 1 will produce fewer clusters. We suggest starting with 0.1 (the default) and then gradually increasing it by 0.1 until satisfied with the number of resulting clusters. (You will know that the value is too high if you end up with a single cluster.)
- Minimum Cluster Size sets a threshold for the minimum number of posts required to form a cluster. This helps to avoid creating clusters with too few posts, which capture overly granular topics that are time-consuming to review and label manually. Hint: This parameter largely depends on the size of your dataset. For smaller datasets (fewer than 1k posts), use the default value of 10. However, if you are getting too many clusters (over 100), consider increasing this value.
- Minimum Sample Size determines the density requirement for a point to be considered a core point and thus part of a cluster. Hint: Setting higher values will result in more outliers (posts that are not assigned to any cluster), but this is not a problem if the goal is to identify groupings of strongly similar posts. However, if the aim is to reduce the number of isolates, lower this value to 10 (the default) or below. This will assign “borderline” posts (semantically speaking) to the most relevant cluster.
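As a rough illustration of the density idea behind Minimum Sample Size, the sketch below computes a “core distance” for each point: the distance to its k-th nearest neighbor, which HDBSCAN uses as a local density estimate. It assumes that Minimum Sample Size corresponds to HDBSCAN's `min_samples` parameter; the helper name `core_distances` and the one-dimensional toy data are invented, and implementations differ on whether a point counts itself as a neighbor.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Sorted distances to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[min_samples - 1])
    return result


# Four posts packed closely together and one isolated post (toy 1D positions).
points = [(0.0,), (1.0,), (2.0,), (3.0,), (20.0,)]
for k in (1, 3):
    print(f"min_samples={k}: {core_distances(points, k)}")
```

A larger value makes every point look at a wider neighborhood, so points in sparse regions end up with large core distances and are more likely to be left as outliers.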
Below are some additional considerations when setting Minimum Cluster Size and Minimum Sample Size for HDBSCAN:
| Desired Cluster Configuration | Minimum Cluster Size | Minimum Sample Size |
|---|---|---|
| More clusters (highly specific) | Small (10-50) | Small (1-10) |
| Fewer clusters (generalized clusters with some specificity) | Large (>50) | Small (1-10) |
| Very general clusters (more posts labelled as ‘outliers’) | Large (>50) | Large (>10) |
Hint: You can also adjust the clustering parameters inside the 3D Similarity Map by clicking on the Edit Icon (next to "HDBSCAN") in the left-side panel, changing one or more of the clustering parameters, and then clicking on the "Apply Changes" button.

Step 5: Download Results #
Once you’re satisfied with your topic clusters and labels, click the green Download Results button to export your dataset with the final, assigned cluster labels for use in your projects.
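After downloading, a quick tally of posts per label can serve as a sanity check. The sketch below assumes the export is a CSV file with a column holding the cluster label; the column names `text` and `cluster_label` and the sample rows are hypothetical, so adjust them to match the headers in your actual file.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded file. The column names and rows are
# hypothetical; check the header row of your own export.
sample_export = io.StringIO(
    "text,cluster_label\n"
    "Great concert last night,Canadian Music Celebration\n"
    "Loved the festival lineup,Canadian Music Celebration\n"
    "Interest rates went up again,Interest Rates\n"
)


def count_posts_per_label(csv_file, label_column="cluster_label"):
    """Count how many rows carry each cluster label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


counts = count_posts_per_label(sample_export)
for label, n in counts.most_common():
    print(f"{n:>4}  {label}")
```

To run this on a real export, replace `sample_export` with a file opened via `open(path, newline="", encoding="utf-8")`.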

