I’m using OpenAI’s text-embedding-ada-002 model to create embeddings for queries and for stored documents, like so:
const embedding = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: query.slice(0, MaxQueryCharacters)
})
Suppose I create embeddings for the following lines (each a different document):
I think we are going to use pinecone for vector search
the reason is because it has a free tier
and also it’s nice to use, and scalable
e
e
e
some crap
blah
e
e
e
eifwoijf
eiriewioe
Then I perform a query for “pinecone”, generating an embedding for it and then using the query call:
const search = await pinecone.Index(process.env.PINECONE_INDEX).query({
  queryRequest: {
    vector: embedding.data.data[0].embedding,
    includeMetadata: true,
    topK: 4,
    namespace: "test"
  }
})
The results I get back are 4 messages of “e”. I would expect the first message not only to show up but to be the first result. Why might this be?
For my index I am using 1536 dimensions and the cosine distance function.
@danthegoodman It’s a bit hard to tell given the code you provided. Could you show examples of what you’re getting back in embedding.data.data[0].embedding vs what you indexed? Depending on what you’re passing as your query, it might be reasonable that you’re getting the “e” documents as the result.
It’s the vector from the OpenAI API. The embedding.data.data[0].embedding is a 1536-length array of numbers. I can grab an example if you think that would be valuable, but it’d be pretty massive; not sure if that’s appropriate to post here.
In some cases, a short document may actually rank higher in a vector search for a given query, even if it is not as relevant as a longer document. This is because short documents typically have fewer words, which means their embedding vectors are more likely to be closer to the query vector in the high-dimensional space. As a result, they may have a higher cosine similarity score than longer documents, even if they do not contain as much information or context. This phenomenon is related to the “curse of dimensionality” and it can hurt the quality of vector semantic similarity search in certain scenarios.
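For reference, the cosine similarity score that drives this ranking is straightforward to compute yourself. This is a plain-JavaScript sketch (not from the thread) that can be handy for sanity-checking a stored vector against a query vector locally:

```javascript
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|), ranging from -1 to 1 (1 = same direction).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical directions score 1, orthogonal vectors score 0.
console.log(cosineSimilarity([1, 0], [1, 0])); // → 1
console.log(cosineSimilarity([1, 0], [0, 1])); // → 0
```

Running this on the actual query embedding versus each stored document embedding would show whether the “e” documents really do score higher than the relevant sentence.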
Thanks, is there a recommended minimum character length below which I should ignore messages? For context, these are Slack messages.
There is no universally recommended minimum character length to ignore, as it depends on the nature of the queries and the content of the messages in your specific Slack workspace. However, you can experiment with different thresholds to find an appropriate balance between excluding very short messages (which may be less relevant) and preserving meaningful content.
For example, you can start with a minimum character length of 20 or 30 characters and evaluate the search results. If you find that many irrelevant short messages are still showing up, consider increasing the threshold. On the other hand, if you notice that some useful information is being excluded, you might want to lower the threshold.
Thanks, will play around with this and report back with findings!
Still seems to be missing a bit, but the results are more consistent now. Perhaps the search term is still too short, but the correct result appearing second was pretty consistent for queries around IaC. I also had it summarize threads (type: summary), which were well into the hundreds of characters.