I just tested the new OpenAI model text-embedding-3-large for generating embeddings. To avoid extra refactoring, I took advantage of the new “dimensions” parameter to keep the embeddings at 1536 dimensions instead of the native size of 3072.
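For reference, my understanding of OpenAI's description is that shortening amounts to truncating the tail of the vector and re-normalizing it back to unit length. Here is a minimal sketch of doing that manually in NumPy (the `shorten` helper is hypothetical, and it assumes you already have the full 3072-dim vector):

```python
import numpy as np

def shorten(embedding, dims=1536):
    # Truncate a full-length embedding (e.g. 3072 dims from
    # text-embedding-3-large) to the first `dims` components,
    # then re-normalize so it sits back on the unit sphere,
    # keeping cosine/dot-product scores on a comparable scale.
    v = np.asarray(embedding, dtype=np.float64)[:dims]
    return v / np.linalg.norm(v)
```

Presumably the “dimensions” parameter does something equivalent server-side, so the returned 1536-dim vector should already be unit-normalized.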
OpenAI claims
“…developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties.”
So great. I tried it. What I am finding is that when I pass one of these new text-embedding-3-large vectors, truncated to 1536 dimensions, to the Pinecone query API:
https://sanitized-zmb21lt.svc.apw5-4e34-81fa.pinecone.io/query
the search results are wonky. Note that the vectors being searched were all generated with OpenAI text-embedding-ada-002.
By “wonky” I mean that the “score” returned is abnormally low, possibly 10x too low. I normally set the threshold at about 0.88. In my first test there were zero hits, which surprised me greatly. I then tried 0.08, and that returned plenty of results, but plenty of garbage as well. I also noticed that some of the scores were negative numbers; I’m not sure whether that is supposed to be possible. I have not found any definitive documentation on the exact semantics of this number; I have just been using it to rank the hits.
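For what it’s worth, if the index metric is cosine similarity, negative scores are at least mathematically possible, since cosine similarity ranges over [-1, 1]. A quick self-contained check in plain NumPy (no Pinecone involved; whether Pinecone’s “score” is exactly this value is an assumption on my part):

```python
import numpy as np

def cosine(a, b):
    # Standard cosine similarity: dot product of the vectors
    # divided by the product of their Euclidean norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Opposed vectors score -1; orthogonal vectors score 0.
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # → -1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))   # → 0.0
```

So a negative number isn’t necessarily a bug by itself; the larger puzzle for me is the overall scale being so far below my usual 0.88 threshold.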
I have a few questions:
- Maybe it is working “correctly” and I just have to adjust my expectations about the similarity score?
- Maybe I just got unlucky? I was not directly comparing identical texts.
- Should I be able to pass a “new style” 3072 vector to the /query API and have it successfully compare it to “old” 1536 data?
- What is the coexistence or migration strategy? Regenerating all the old 1536 vectors might be acceptable for my use case, but for many Pinecone customers I suspect it’s a no-go.
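In case it helps frame that last question, here is a minimal sketch of what wholesale regeneration might look like on my side. The `embed_batch` placeholder stands in for the real OpenAI embeddings call, and the payloads would go to the Pinecone client's upsert; the names here are hypothetical, not actual API:

```python
import numpy as np

def embed_batch(texts, model="text-embedding-3-large", dimensions=1536):
    # Placeholder for the real OpenAI embeddings call; returns random
    # unit vectors of the requested size so this sketch runs offline.
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(texts), dimensions))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def migrate(ids_and_texts, batch_size=100):
    # Re-embed every (id, text) record with the new model and collect
    # (id, vector) payloads ready for a Pinecone upsert, in batches.
    payloads = []
    for i in range(0, len(ids_and_texts), batch_size):
        batch = ids_and_texts[i:i + batch_size]
        vecs = embed_batch([text for _, text in batch])
        payloads.extend((rec_id, v.tolist())
                        for (rec_id, _), v in zip(batch, vecs))
    return payloads
```

This is feasible for my corpus, but it requires keeping the original source texts around and paying to re-embed everything, which is the part I suspect won't fly for larger deployments.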