Thanks for a great look into some of the details behind this approach to topic modelling. For a long time LDA and its variants have dominated the topic modelling field, but more recently the embedding and clustering based approach, pioneered by Top2Vec, has been growing rapidly, with BERTopic an excellent example. I do think this is just the beginning, however.
The same approach allows for topic modelling of images via CLIP; multi-lingual embeddings allow for topic modelling of corpora in multiple languages; other rich multi-modal embeddings allow for even more diverse approaches.
There is a lot of scope for richer interactive maps of corpora, with overlaid hierarchical topic annotations – something like OpenSyllabus Galaxy, but with Top2Vec/BERTopic topic labels. There is also a lot of scope for ANN search to enrich those interactive exploration experiences by linking to a larger corpus than can be worked with in an interactive plot. I think there are exciting times ahead in this space.
That’s great to hear, and much of that is thanks to UMAP and your implementation of HDBSCAN; both are incredible. It’s awesome to see tools like BERTopic being built on top of your work in these fields. Concept looks very interesting too.
Although BERTopic and Top2Vec have found a lot of popularity already, as you said, it is just the start and the potential is much greater. It’s great to get your insight on corpora maps and coupling all of this with ANN. There seems to be a lot of progress, and it’s exciting to see where this goes!