Can I share vectors of the same words from different documents?

tom.sykora · July 19, 2023, 10:44am

Hi, we’re trying to come up with some cost optimizations, because the common approach leaves us with too many embeddings. We came up with this idea:

We need to embed every word separately; that’s a hard requirement. If the same word appears in different documents, it gets the same vector. Could we store only one shared vector for each distinct word and keep its original locations in some metadata attributes to filter on? That would drastically reduce the number of pods we need.
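
To make the idea concrete, here is a minimal sketch of the bookkeeping in plain Python. It assumes the embedding depends on the word alone (no surrounding context), and the function name, record layout and sample documents are made up for illustration; no vector database is involved.

```python
from collections import defaultdict


def build_shared_word_records(documents):
    """Collapse repeated words into one record per distinct word.

    `documents` maps a document id to its text. Each record keeps the word
    once and lists every (doc_id, token_position) where it occurred.
    """
    locations = defaultdict(list)
    for doc_id, text in documents.items():
        # Naive whitespace tokenization, purely for illustration.
        for position, token in enumerate(text.lower().split()):
            locations[token].append((doc_id, position))
    # One record per distinct word; its vector would be computed only once.
    return [{"id": word, "locations": locs} for word, locs in locations.items()]


docs = {"doc-a": "the cat sat", "doc-b": "the dog sat down"}
records = build_shared_word_records(docs)
total_tokens = sum(len(text.split()) for text in docs.values())
print(f"{total_tokens} tokens -> {len(records)} shared vectors")
```

The saving grows with how often words repeat across documents, and the `locations` list grows at the same rate.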

Theoretically it makes sense. Is that a reasonable thought or an “abuse” of a vector database?

Thank you!

Jasper · July 21, 2023, 6:55am

Hi @tom.sykora

I can’t see a reason to have each word as a separate vector, but if that is your use case, that is fine. What embedding method are you using? The methods in common use today (OpenAI, BERT, etc.) take a word’s surrounding context into account when encoding its meaning, so the same word can end up with different vectors depending on where it appears. That can cause problems with your idea of having one vector for the same word in multiple documents.

Other than that, you can certainly do this. I would be careful about how you record where and in which document each word occurs. If that is just a path plus the character offset where the word starts… yeah, that could work.

Hope this helps

tom.sykora · July 21, 2023, 8:28am

The reason we’re playing around with this is that we’re trying to combine exact search, synonym search, semantic search and NER search into one feature, which is unexpectedly quite an interesting challenge.

The problem I’d be worried about is that if one word is used in too many documents, this “location” attribute would get too big. I’m not sure that wouldn’t introduce a new speed bottleneck, since it would basically mean performing another search, not among the vector database rows but inside this one attribute, which isn’t optimized for searching.

We could create a separate database/table of “locations” with a foreign key to the shared vectors DB, but that’s still “another search operation”, so possibly a slowdown.
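
A rough sketch of what that separate table could look like, using SQLite from the Python standard library as a stand-in for whatever database would really be used. The table names, column names and sample rows are hypothetical. The composite index on `(word, doc_id)` is what keeps the second lookup from becoming a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE word_vectors (word TEXT PRIMARY KEY);  -- ids of the shared vectors
    CREATE TABLE locations (
        word        TEXT NOT NULL REFERENCES word_vectors(word),
        doc_id      TEXT NOT NULL,
        char_offset INTEGER NOT NULL
    );
    CREATE INDEX idx_locations_word_doc ON locations (word, doc_id);
    """
)
conn.executemany("INSERT INTO word_vectors VALUES (?)", [("cat",), ("sat",)])
conn.executemany(
    "INSERT INTO locations VALUES (?, ?, ?)",
    [("cat", "doc-a", 4), ("sat", "doc-a", 8), ("sat", "doc-b", 8)],
)


def find_locations(conn, words, doc_ids=None):
    """Return (word, doc_id, char_offset) rows for the matched words."""
    sql = (
        "SELECT word, doc_id, char_offset FROM locations "
        f"WHERE word IN ({','.join('?' * len(words))})"
    )
    params = list(words)
    if doc_ids:
        # Optional restriction to a subset of documents.
        sql += f" AND doc_id IN ({','.join('?' * len(doc_ids))})"
        params += list(doc_ids)
    sql += " ORDER BY word, doc_id, char_offset"
    return conn.execute(sql, params).fetchall()


# `words` would be the ids returned by the vector search.
print(find_locations(conn, ["sat"]))
print(find_locations(conn, ["sat"], doc_ids=["doc-b"]))
```

Whether the extra round trip is acceptable depends on how many words a query matches, so it is something to measure rather than assume.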

But we’re gonna try to see how it works. It may not be that bad after all.

Jasper · July 21, 2023, 9:54am

Dang, that IS an interesting challenge!

I see. Well, do you need this info (the location metadata) for search, i.e. will you be filtering by it? If not, you can create a new index with metadata_config where you set this metadata field to NOT be indexed, so the search won’t be affected. If you do want to use this field for search… yeah, you will have to find some other workaround; as you said, a separate database would be interesting to try.
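
For reference, a sketch of how such settings might be put together. It assumes the pod-based Python client from around the time of this thread, where `create_index` accepts a `metadata_config` of the form `{"indexed": [...]}` and any metadata field not listed there is stored but not indexed for filtering. The index name, dimension and the `language` field are hypothetical, and the actual API call is left as a comment because it needs credentials.

```python
def index_settings(name, dimension, filterable_fields):
    """Build keyword arguments for index creation.

    Only the fields in `filterable_fields` are indexed for filtering; other
    metadata (for example a large `locations` field) is stored and returned
    with results but cannot be used in a filter.
    """
    return {
        "name": name,
        "dimension": dimension,
        "metric": "cosine",
        "metadata_config": {"indexed": list(filterable_fields)},
    }


settings = index_settings("shared-words", 768, ["language"])
print(settings)

# Assumed usage with the pod-based client (not executed here):
#   import pinecone
#   pinecone.init(api_key="...", environment="...")
#   pinecone.create_index(**settings)
```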

But then again, you have very specific task(s) at hand, and I think you will have to experiment to see what is possible within your response-time limits.

Maybe you can identify the problematic words with too many locations and work from there. Can you omit some? Can you group them in some way? Hm! Maybe… when you create a vector out of a word, check whether the one you already have stored is very different in meaning.

Dumb example: the word “and” (if you aren’t removing such common words) will be used. A lot. But it will have the same meaning in almost all cases. Can you omit those locations? Maybe store only one location, or let’s say stop at 10, 100, etc.

Or maybe you can do the reverse: store the same word again only if it has a totally different meaning/vector. So before you store a word in Pinecone, query its vector and check the top 1 result; if the similarity is > 0.9 (or some other cutoff threshold), do not store it. This way, you will remove a lot of the words that are causing you problems. This only works if you do NOT need all of the locations.
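
A small sketch of that check, with an in-memory list standing in for the Pinecone index and a brute-force cosine similarity standing in for the top 1 query. The helper names and the toy 2-dimensional vectors are invented for illustration; 0.9 is just the example cutoff from above.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def store_if_novel(store, word, vector, threshold=0.9):
    """Append (word, vector) unless its nearest stored neighbour is too similar.

    Returns True if the vector was stored, False if it was skipped.
    """
    # Brute-force "top 1" similarity over everything stored so far.
    best = max((cosine_similarity(vector, v) for _, v in store), default=-1.0)
    if best > threshold:
        return False
    store.append((word, vector))
    return True


store = []
print(store_if_novel(store, "bank", [1.0, 0.0]))   # first occurrence: stored
print(store_if_novel(store, "bank", [0.98, 0.2]))  # near-duplicate meaning: skipped
print(store_if_novel(store, "bank", [0.0, 1.0]))   # very different vector: stored
```

Skipped occurrences lose their location, which is why this only fits when not every location has to be recoverable.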

Sorry for the long answer with some brainfarts

Good luck and hope you find a solution!