I’m trying to (for fun) build a matching platform where I create an embedding for each user from their full profile, index it in a vector database, and recommend the closest user every day. This is fairly easy, and I was able to implement it (cosine similarity).
However, my question is this: if I want to factor in more features before sorting to find the best match (e.g. lower the score/ranking of inactive users by some coefficient), I will have to set topK to n (n = number of existing users) and run that topK query n times. Is this the only approach, or is there a more efficient way to do it? Thank you!
Hi @gongchen.liu, thanks for the post. I’m unsure why you would need to both set “topK as n” and run the query “n times”. If I’m misunderstanding the proposed approach, please let me know.
Overall, changing the actual similarity score based on whether a user is active is not achievable within the vector database itself unless you change the underlying embedded information.
Instead, you can accomplish this using metadata filtering, post-processing steps, or a combination. For example, you could have a metadata field that denotes whether or not a user is active and then filter your queries to only return active users.
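As a rough sketch of the filtering option, the snippet below only builds the query arguments rather than calling a live index. The field name `is_active` and the helper `build_active_only_query` are hypothetical, and the argument names (`vector`, `top_k`, `filter`, `include_metadata`) and the `$eq` operator are written as I understand the Pinecone Python client, so please check them against the client version you use.

```python
from typing import Any, Sequence


def build_active_only_query(
    query_vector: Sequence[float],
    top_k: int = 10,
    active_field: str = "is_active",  # hypothetical metadata field name
) -> dict[str, Any]:
    """Build query arguments that restrict results to active users."""
    return {
        "vector": list(query_vector),
        "top_k": top_k,
        # Metadata filter: keep only records whose flag equals True.
        "filter": {active_field: {"$eq": True}},
        "include_metadata": True,
    }


query_args = build_active_only_query([0.1, 0.2, 0.3], top_k=5)
# With a real index object this could be passed along as:
#   index.query(**query_args)
```

Note that a filter removes inactive users from the results entirely; it does not down-weight them.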
Alternatively, you could skip the filter and instead return the metadata by setting include_metadata=True in the query, then re-rank your similarity search results based on whether the user is active, altering each score by a factor of your choosing. These are just a few of many possible approaches. Best of luck!
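Here is a minimal sketch of that re-ranking step in plain Python. It assumes each match is a dict with `id`, `score`, and `metadata` keys, similar in shape to what a query returns when metadata is included. The `is_active` field, the `inactive_penalty` value of 0.5, and the function name are all made up for illustration.

```python
from typing import Any


def rerank_by_activity(
    matches: list[dict[str, Any]],
    inactive_penalty: float = 0.5,    # assumed coefficient, tune to taste
    active_field: str = "is_active",  # hypothetical metadata field name
) -> list[dict[str, Any]]:
    """Scale down the scores of inactive users, then sort best-first.

    Assumes non-negative similarity scores; multiplying a negative
    score by a factor below 1 would raise it instead of lowering it.
    """
    rescored = []
    for match in matches:
        metadata = match.get("metadata") or {}
        # Users without the flag are treated as inactive.
        factor = 1.0 if metadata.get(active_field, False) else inactive_penalty
        rescored.append({**match, "adjusted_score": match["score"] * factor})
    return sorted(rescored, key=lambda m: m["adjusted_score"], reverse=True)


# Example matches, as they might come back from a single top-K query.
matches = [
    {"id": "user-a", "score": 0.92, "metadata": {"is_active": False}},
    {"id": "user-b", "score": 0.85, "metadata": {"is_active": True}},
    {"id": "user-c", "score": 0.60, "metadata": {"is_active": True}},
]
ranked = rerank_by_activity(matches)
best_match = ranked[0]["id"]
```

One way to avoid setting topK to n is to request a moderately larger topK than you need for each user (say a few dozen candidates), re-rank those, and keep the best one. Whether that candidate pool is large enough depends on how strong the penalty is.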