Averaging vector dimensions

jan.vansteenlandt · January 5, 2024, 8:34am

Hi @Cory_Pinecone, I was googling whether it was possible to make multi-vector query requests and stumbled upon this thread. I was not aware that you could just average the dimensions… I have a use case where a user can mark several blog posts as interesting, and our job is then to recommend similar-looking posts based on that set. Would averaging still be an OK solution there? We found ourselves dealing with long loading times when running parallel requests, even on P2 pods, so maybe averaging the vectors into a single query would help.

I was also thinking about pre-filtering the vectors, maybe only averaging the vectors that are very similar, so that the average over, for example, 20 vectors doesn’t end up as a nonsensical vector.

Cory_Pinecone · January 5, 2024, 5:12pm

Hi @jan.vansteenlandt, welcome to the Pinecone community!

Yes, averaging the vectors you’re searching with is the usual approach for use cases like this. Unless the end-user has a way to set different weights for the blogs, it’s a pretty straightforward solution. Presumably, you already have the vectors for the blogs stored in Pinecone and can either fetch them by their ID or filter for them with their metadata. Retrieve the values of each, average them together into a new vector, and use that as your new query.
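A minimal sketch of the averaging step, in plain Python. The `liked` list is a hypothetical stand-in for the values you would get back from Pinecone for the selected posts; fetching them and running the query is left out here.

```python
from typing import Sequence


def average_vectors(vectors: Sequence[Sequence[float]]) -> list[float]:
    """Return the element-wise mean of a set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Hypothetical values for three posts the user marked as interesting.
liked = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
query_vector = average_vectors(liked)
print(query_vector)  # [2.0, 2.0, 1.0]
```

Note that the mean of unit-length vectors is generally not unit length. That makes no difference for a cosine index, but with a dot-product index you may want to re-normalize the result before querying.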

If you have options for weighting one blog post over another (for instance, if you value more recent posts over older ones for relevance), you’ll have to adjust the averaging algorithm a bit.
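One way that adjustment could look is a weighted mean. The sketch below assumes an exponential recency weight with a 30-day half-life; both the `recency_weight` helper and the half-life are illustrative choices, not something Pinecone prescribes.

```python
from typing import Sequence


def weighted_average(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> list[float]:
    """Element-wise weighted mean; weights need not sum to 1."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed weighting scheme: a post's weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


# Hypothetical data: one post from today, one from 30 days ago.
vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [recency_weight(0), recency_weight(30)]
weighted_query = weighted_average(vectors, weights)
print(weighted_query)
```

With these weights the newer post counts twice as much as the older one, so the query vector leans toward it.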

“I was also thinking about pre-filtering the vectors, maybe only averaging the vectors that are very similar, so that the average over, for example, 20 vectors doesn’t end up as a nonsensical vector.”

This is an interesting idea. Again, presuming you’re fetching the existing vectors for the associated blogs, you could compare the values of the existing vectors to each other and cluster them based on similarity, then use those averages for the next set of queries. This would have to be done outside of Pinecone in the app itself, though, and would likely slow things down considerably.
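As a rough illustration of that in-app clustering, here is a greedy sketch: each vector joins the first cluster whose running mean is similar enough, otherwise it starts a new cluster. The 0.8 cosine threshold is an arbitrary assumption, and a real application might reach for a proper clustering library instead.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors: Sequence[Sequence[float]]) -> list[float]:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def greedy_clusters(vectors, threshold: float = 0.8) -> list[list]:
    """Assign each vector to the first cluster whose mean is close enough."""
    clusters: list[list] = []
    for v in vectors:
        for cluster in clusters:
            if cosine(v, mean(cluster)) >= threshold:
                cluster.append(v)
                break
        else:  # no existing cluster was similar enough
            clusters.append([v])
    return clusters


# Hypothetical 2-D vectors forming two obvious groups.
selected = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = greedy_clusters(selected)
query_vectors = [mean(c) for c in clusters]  # one query per cluster
print(len(clusters), query_vectors)
```

The result depends on the order of the input and on the threshold, and every call to `mean` rescans the cluster, so this is only meant for the small sets (a few dozen vectors) discussed here.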

A simpler technique would be to use tags on the blogs stored as metadata with their vectors and then cluster selected blogs based on those tags to get averages for each tag. So if someone chooses four blogs that are tagged “politics,” three that are tagged “wine,” and two more that are tagged [“politics”,“wine”] you’d end up with two sets of recommendations: politics blogs and wine blogs. Those last two blogs would be included in the average of each.
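The politics/wine example could be sketched like this. The `(tags, vector)` tuples are a hypothetical shape for the fetched records, and the 2-D vectors are made up for illustration.

```python
from collections import defaultdict


def group_by_tag(posts):
    """Map each tag to the vectors of all posts carrying that tag."""
    groups = defaultdict(list)
    for tags, vector in posts:
        for tag in tags:  # a post with two tags lands in both groups
            groups[tag].append(vector)
    return dict(groups)


def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


# Four "politics" posts, three "wine" posts, two tagged with both.
posts = (
    [(["politics"], [1.0, 0.0])] * 4
    + [(["wine"], [0.0, 1.0])] * 3
    + [(["politics", "wine"], [0.5, 0.5])] * 2
)

groups = group_by_tag(posts)
tag_queries = {tag: mean(vectors) for tag, vectors in groups.items()}
print({tag: len(vectors) for tag, vectors in groups.items()})
# {'politics': 6, 'wine': 5}
```

Each entry in `tag_queries` would then be used as its own query, giving one set of recommendations per tag.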

jan.vansteenlandt · January 6, 2024, 7:21am

Thanks for the reply @Cory_Pinecone! I’ve read that averaging vectors can produce a vector that no longer makes much sense, so I’m a bit reluctant to just average 20-ish vectors and assume the result would be a good fit for fetching similar posts. That’s why I took the approach of removing very similar-looking vectors from the set, in order to reduce the number of queries to be made.
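For reference, a sketch of what such a de-duplication step might look like. This is a guess at the approach rather than the actual implementation, and the 0.95 cosine threshold is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def drop_near_duplicates(vectors, threshold=0.95):
    """Keep a vector only if it is not too similar to any vector kept so far."""
    kept = []
    for v in vectors:
        if all(cosine(v, k) < threshold for k in kept):
            kept.append(v)
    return kept


# Hypothetical vectors: the second is almost identical to the first.
candidates = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
to_query = drop_near_duplicates(candidates)
print(to_query)  # [[1.0, 0.0], [0.0, 1.0]]
```

Each vector left in `to_query` would still need its own request, so the saving depends on how many of the selected posts are near-duplicates.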

I’ll do a comparison between the similarity results I get back based on a single averaged vector versus not averaging them out and running parallel requests.
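A sketch of how that comparison could be scored: merge the parallel result lists by best score per ID, then measure the overlap with the averaged-vector results. The IDs and scores below are made up, and the code assumes a higher score means more similar (as with cosine or dot product).

```python
def merge_results(result_lists, top_k):
    """Combine several (id, score) result lists, keeping each id's best score."""
    best = {}
    for results in result_lists:
        for post_id, score in results:
            if post_id not in best or score > best[post_id]:
                best[post_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [post_id for post_id, _ in ranked[:top_k]]


def jaccard(ids_a, ids_b):
    """Share of ids common to both result sets (1.0 means identical sets)."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Hypothetical results from two parallel queries and one averaged query.
parallel = [
    [("p1", 0.90), ("p2", 0.80)],
    [("p2", 0.85), ("p3", 0.70)],
]
averaged_ids = ["p2", "p3", "p4"]

merged_ids = merge_results(parallel, top_k=3)
print(merged_ids)                         # ['p1', 'p2', 'p3']
print(jaccard(merged_ids, averaged_ids))  # 0.5
```

A low overlap would not by itself show that the averaged query is worse, only that the two approaches surface different posts; judging quality would still need a look at the actual recommendations.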

+1 for moving this to a separate discussion!