Based on your experience, when creating a document embedding from a list of sentence embeddings, what is the best way to combine them? I.e., are there better approaches than average pooling? The goal is to use similarity search at the document level.
Unfortunately I’m not aware of anything. There were methods in the past, like doc2vec, that dealt with this, but I’m not sure how effective they were. The same goes for average pooling of many sentences over a document: I don’t believe it would be hugely effective, particularly for longer documents.
doc2vec doesn’t work. Not even the co-authors were able to reproduce the results presented in the doc2vec paper.
Do you really need a single embedding per document? This is usually quite problematic. You can see it as compression: when you embed to 512 dimensions at 2 bytes per dimension, you compress to 512 * 2 = 1024 bytes. It doesn’t matter whether your document has 10k, 100k, or 1M characters, you always encode to 1024 bytes, so the longer your document, the more information is lost.
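To make the compression argument concrete, here is a minimal sketch (the 512-dimension / fp16 figures are the ones from the example above; they vary by model):

```python
import numpy as np

# A fixed-size embedding occupies the same number of bytes
# no matter how long the source document was.
dims = 512
embedding = np.zeros(dims, dtype=np.float16)  # 2 bytes per dimension

print(embedding.nbytes)  # 1024 bytes, whether the doc had 10k or 1M chars
```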
In most cases, the better solution is to split your document into paragraphs and encode the paragraphs individually. This gives you a flexible number of embeddings per document.
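A minimal sketch of that split-then-encode step. The `embed` function here is a toy stand-in (random 8-dimensional vectors) for a real sentence-embedding model such as `model.encode` from sentence-transformers:

```python
import numpy as np

def embed(texts):
    # Placeholder embedder: random vectors, just to make the sketch runnable.
    # In practice, replace with your sentence-embedding model's encode call.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), 8))

doc = "First paragraph about cats.\n\nSecond paragraph about dogs."

# Split on blank lines; each non-empty paragraph gets its own embedding.
paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
para_embeddings = embed(paragraphs)

print(para_embeddings.shape)  # (2, 8): one vector per paragraph
```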
It then depends on your task how you work with this setup. For retrieval, the setup is quite simple: retrieve the top-k paragraphs matching your query, and map the paragraphs back to document ids.
You might need to retrieve more paragraphs than the number of documents you want. E.g., if you want the top-10 documents, it makes sense to retrieve the top-20 or top-50 paragraphs, as several paragraphs from the same document might be returned.
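The retrieve-and-map-back step could be sketched like this (cosine similarity over a toy set of 4 paragraphs from 2 documents; the function and variable names are my own, not from any particular library):

```python
import numpy as np

def top_documents(query_vec, para_vecs, para_to_doc, k_paras=50, k_docs=10):
    # Cosine similarity between the query and every paragraph embedding.
    q = query_vec / np.linalg.norm(query_vec)
    p = para_vecs / np.linalg.norm(para_vecs, axis=1, keepdims=True)
    sims = p @ q

    ranked = np.argsort(-sims)[:k_paras]  # best-matching paragraphs first
    docs = []
    for idx in ranked:
        doc_id = para_to_doc[idx]
        if doc_id not in docs:  # keep only the first (best) hit per document
            docs.append(doc_id)
        if len(docs) == k_docs:
            break
    return docs

# Toy example: 4 paragraph embeddings belonging to 2 documents.
para_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
para_to_doc = ["doc_a", "doc_a", "doc_b", "doc_b"]

print(top_documents(np.array([1.0, 0.0]), para_vecs, para_to_doc,
                    k_paras=4, k_docs=2))
# ['doc_a', 'doc_b']
```

Retrieving more paragraphs than documents (k_paras > k_docs) is exactly the over-fetching described above: two paragraphs of `doc_a` both match well, and the deduplication collapses them into one result.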
For clustering or document similarity you have to use a slightly different strategy.