Building semantic search over long documents

nlpguy · February 4, 2022, 5:03pm

Building semantic search over long documents can be tricky. What has worked for you in the past?

I laid out some ideas here, but I'm looking for more.

cschmidt · February 5, 2022, 6:18pm

You can always use a tf-idf weighted combination of word embeddings. If your corpus is big enough, train your own word2vec/GloVe embeddings and compute idf values for each word. I do find that the idf weights for extremely rare terms tend to be too large, especially for document-query similarity, where the rare terms probably don’t appear in a query. An approach like the one described here has worked well for me:

(I used the log information gain as a weight.)
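
For readers who want something concrete, here is a minimal sketch of a tf-idf weighted average of word vectors. It is not the linked approach and does not use the log information gain weight; it assumes a smoothed idf formula and a hypothetical `max_idf` cap to keep very rare terms from dominating. The toy vectors and function names are made up for illustration; in practice the vectors would come from a trained word2vec/GloVe model.

```python
import math
from collections import Counter

def compute_idf(docs, max_idf=None):
    """Smoothed idf per word: log((1 + N) / (1 + df)) + 1, optionally capped."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in df.items()}
    if max_idf is not None:
        # Assumption: a simple cap to limit the weight of extremely rare terms.
        idf = {w: min(v, max_idf) for w, v in idf.items()}
    return idf

def embed(tokens, vectors, idf):
    """tf-idf weighted average of word vectors; unknown words are skipped."""
    tf = Counter(t for t in tokens if t in vectors and t in idf)
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    total = 0.0
    for word, count in tf.items():
        weight = count * idf[word]
        total += weight
        for i, x in enumerate(vectors[word]):
            out[i] += weight * x
    if total == 0.0:
        return out  # no known words: return the zero vector
    return [x / total for x in out]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-d vectors standing in for trained embeddings (illustrative only).
vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
           "stock": [0.0, 1.0], "the": [0.5, 0.5]}
docs = [["the", "cat", "the", "dog"], ["the", "stock"]]

idf = compute_idf(docs, max_idf=3.0)
doc_vecs = [embed(d, vectors, idf) for d in docs]
query_vec = embed(["cat"], vectors, idf)
scores = [cosine(query_vec, v) for v in doc_vecs]
```

With this toy data the query "cat" scores higher against the first document than the second. The cap value would need tuning on a real corpus.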

-Craig

nlpguy · February 9, 2022, 3:23pm

Thanks for the answer! One challenge is maintaining these embeddings when the index is updated: each update changes the IDF values of existing words, and words that have not been seen before have no IDF value at all.
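
To make the problem concrete, here is a rough sketch of one way to keep IDF values current: store raw document-frequency counts and compute IDF on demand. The `RunningIdf` class and the smoothed formula are assumptions for illustration, not something proposed in this thread. Under this formula an unseen word gets the largest IDF, which may or may not be what you want.

```python
import math
from collections import Counter

class RunningIdf:
    """Keeps document-frequency counts so IDF reflects every update."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add_document(self, tokens):
        self.n_docs += 1
        self.df.update(set(tokens))  # count each word once per document

    def idf(self, word):
        # Smoothed IDF; an unseen word has df == 0 and gets the largest value.
        return math.log((1 + self.n_docs) / (1 + self.df[word])) + 1.0

stats = RunningIdf()
stats.add_document(["semantic", "search"])
stats.add_document(["semantic", "index"])

seen = stats.idf("semantic")    # appears in every document so far
unseen = stats.idf("pinecone")  # never indexed, falls back to df == 0
```

This only keeps the weights fresh. Document vectors that were already computed with older weights still go stale and would need re-embedding once the weights drift far enough.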

sophiem · April 15, 2022, 4:58pm

If you have found any new solutions to this and have code/an app you’d like showcased, we are happy to showcase your work on our Community page! Feel free to share with sophie@pinecone.io if interested

sophiem · April 15, 2022, 4:59pm

This is fantastic Craig! We’d love to showcase any of your work on our Community page, if you want to share