Semantic search on long documents

Building semantic search over long documents can be tricky. What has worked for you in the past?

I laid out some ideas here, but I'm looking for more.


You can always use a TF-IDF-weighted combination of word embeddings. If your corpus is big enough, train your own word2vec/GloVe embeddings and compute IDF values for each word. I do find that the IDF weight for extremely rare terms tends to be too large, especially for document-query similarity, where the rare terms probably don't appear in the query. An approach like the one described here has worked well for me:

(I used the log information gain as a weight.)
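For anyone who wants to try the basic idea, here is a minimal sketch of a TF-IDF-weighted average of word vectors. It uses plain smoothed IDF as the weight, not the log information gain mentioned above, and the toy corpus, the random `vectors` table, and the helper names `compute_idf` and `embed` are all made up for illustration. In practice the vectors would come from a trained word2vec/GloVe model.

```python
import math
from collections import Counter

import numpy as np


def compute_idf(tokenized_docs):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in df.items()}


def embed(tokens, vectors, idf, dim):
    """TF-IDF-weighted average of word vectors, L2-normalized."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for word, tf in Counter(tokens).items():
        # Words without a vector or an IDF value are skipped.
        if word in vectors and word in idf:
            w = tf * idf[word]
            total += w * vectors[word]
            weight_sum += w
    if weight_sum == 0.0:
        return total
    avg = total / weight_sum
    norm = np.linalg.norm(avg)
    return avg / norm if norm > 0 else avg


# Toy corpus and random stand-in vectors (assumptions for illustration only).
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "chased", "dog"]]
dim = 8
rng = np.random.default_rng(0)
vocab = sorted({w for d in docs for w in d})
vectors = {w: rng.normal(size=dim) for w in vocab}

idf = compute_idf(docs)
doc_vecs = np.stack([embed(d, vectors, idf, dim) for d in docs])

# Vectors are unit length, so the dot product is the cosine similarity.
query_vec = embed(["cat", "mat"], vectors, idf, dim)
scores = doc_vecs @ query_vec
best_doc = int(np.argmax(scores))
```

Swapping in a different weight (for example, a capped IDF) only means changing the `w = tf * idf[word]` line.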

-Craig


Thanks for the answer! One challenge is maintaining these embeddings when the index is updated: new documents change the IDF values of existing words, and words not seen before have no IDF value at all.
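To make the concern concrete, here is a small sketch that keeps running document-frequency counts and recomputes a smoothed IDF on demand. The `RunningIdf` class and the example tokens are hypothetical, and this only shows the drift, it does not solve it.

```python
import math
from collections import Counter


class RunningIdf:
    """Keeps document frequencies so IDF can be recomputed as the index grows."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add_document(self, tokens):
        self.n_docs += 1
        self.df.update(set(tokens))  # count each word once per document

    def idf(self, word):
        # Smoothed IDF: an unseen word has df = 0, so it still gets a value
        # (the largest possible one) instead of raising an error.
        return math.log((1 + self.n_docs) / (1 + self.df[word])) + 1.0


stats = RunningIdf()
stats.add_document(["vector", "search", "index"])
stats.add_document(["vector", "database"])
before = stats.idf("search")

stats.add_document(["semantic", "search"])
after = stats.idf("search")      # differs from `before` once the index changes
unseen = stats.idf("pinecone")   # word that appears in no indexed document
```

Any document vectors built with the old weights are stale after the update, so they would have to be recomputed at some point. Smoothing does give unseen words a defined weight, but it is the maximum weight, which is the same over-weighting of rare terms mentioned earlier in the thread.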

If you have found any new solutions to this and have code or an app you'd like showcased, we'd be happy to feature your work on our Community page! Feel free to share with sophie@pinecone.io if interested 🙂

This is fantastic, Craig! We'd love to showcase any of your work on our Community page, if you want to share 🙂