Semantic search on long documents

Building semantic search over documents can be tricky. What has worked for you in the past?

I laid out some ideas here, but I'm looking for more.


You can always use a tf-idf weighted combination of word embeddings. If your corpus is big enough, train your own word2vec/GloVe embeddings and compute an idf value for each word. I do find that the idf weights for extremely rare terms tend to be too large, especially for document-query similarity, where the rare terms probably don’t appear in the query. An approach like the one described here has worked well for me:

(I used the log information gain as a weight.)
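
For illustration, here is a minimal sketch of idf-weighted averaging of word vectors. It is not the linked approach: the smoothed idf formula, the `max_idf` cap (one simple way to damp very rare terms), the function names, and the toy 2-d vectors are all assumptions made for the example. In practice the vectors would come from a trained word2vec/GloVe model.

```python
import math
from collections import Counter


def compute_idf(tokenized_docs):
    """Smoothed idf per word: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in df.items()}


def embed(tokens, vectors, idf, max_idf=None):
    """Idf-weighted average of word vectors; words without a vector or idf are skipped."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors or tok not in idf:
            continue
        w = idf[tok]
        if max_idf is not None:
            w = min(w, max_idf)  # hypothetical cap to limit rare-term weights
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be embedded
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy corpus and made-up 2-d vectors, for demonstration only.
docs = [
    ["neural", "search", "engine"],
    ["search", "index", "update"],
    ["neural", "network", "training"],
]
vectors = {
    "neural": [1.0, 0.0], "search": [0.0, 1.0], "engine": [1.0, 1.0],
    "index": [0.0, 0.5], "update": [0.5, 0.0],
    "network": [1.0, 0.2], "training": [0.8, 0.1],
}

idf = compute_idf(docs)
doc_vecs = [embed(d, vectors, idf, max_idf=1.5) for d in docs]
query_vec = embed(["search", "index"], vectors, idf, max_idf=1.5)
scores = [cosine(query_vec, dv) for dv in doc_vecs]
```

Ranking the documents by `scores` gives the search result. Changing or removing `max_idf` shows how much the rare terms dominate the averaged vector.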



Thanks for the answer! One challenge is maintaining these embeddings when the index is updated: updates change the IDF values of words, and there is no IDF value for unseen words.
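
One possible way to soften this, sketched below as an assumption rather than a tested solution: keep raw document frequencies instead of precomputed IDF values, update the counts as documents arrive, and compute a smoothed IDF on demand so that unseen words still get a finite value. The class name `IncrementalIdf` and the smoothing formula are hypothetical choices for the example.

```python
import math
from collections import Counter


class IncrementalIdf:
    """Keeps document frequencies so IDF can be recomputed after index updates."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add_document(self, tokens):
        """Register one new document (a list of tokens)."""
        self.n_docs += 1
        self.df.update(set(tokens))  # count each word once per document

    def idf(self, word):
        """Smoothed IDF; an unseen word has df = 0 and gets the largest value."""
        return math.log((1 + self.n_docs) / (1 + self.df[word])) + 1.0


stats = IncrementalIdf()
stats.add_document(["neural", "search", "engine"])
stats.add_document(["search", "index", "update"])

idf_before = stats.idf("index")
stats.add_document(["index", "refresh"])  # index update changes the counts
idf_after = stats.idf("index")
idf_unseen = stats.idf("transformer")     # never seen, still well defined
```

Note that this only keeps the weights current. Document vectors that were computed with older IDF values are still stale and would need to be re-embedded, for example in periodic batches.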

