Python Client Upsert Issue - mismatch in values once upserted

Upserting a list[tuple[str, list[float]]] via the Python client. For example, an original vector of length 1536 like so:
[('1', [0.0031443795, 0.012362102, -0.01929016, …])]

Once upserted, it appears as follows in the Pinecone web UI (also confirmed when querying the vector store via the Python client):
id=1 | values=[0.00314437947, 0.012362102, -0.0192901604,…]

Please observe the differences between the original vector and the one stored in Pinecone.

Could it be an implicit data type conversion error?

Hi @lewis.wooster, could you please specify what aspect of the upsert result you find to be unexpected?

In other words, what data is being improperly converted to a different type? Furthermore, please share where you are viewing the results of the upsert (for example, in the Console UI or via a Fetch operation).

Hi @zeke_pinecone, please see edited post.

Thanks for the additional information, @lewis.wooster. In the future, please try to refrain from editing the initial post when you are adding new details, especially after a response has been made, as this will alter the relevance of other contributions.

With that said, can you please share exactly how you are collecting the vector's values prior to upserting to Pinecone? For example, after creating the embedding, how are you printing out [('1', [0.0031443795, 0.012362102, -0.01929016, …])]?

Pinecone will never alter the values of the upserted vectors, so I am curious to determine if there is some logging procedure here that may be rounding the actual values of the vector.
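As an illustration of the kind of logging-time rounding Zeke is describing (a hypothetical sketch, not the original poster's actual code), Python's formatted printing will happily round a float for display while leaving the underlying value untouched:

```python
# Hypothetical example: formatted printing rounds the DISPLAYED value,
# even though the float in memory is unchanged.
x = 0.0031443795

# Fixed-width formatting rounds to the requested number of decimal places.
print(f"{x:.8f}")   # prints 0.00314438 -- fewer digits than x carries

# repr() shows the shortest decimal string that round-trips to the same float.
print(repr(x))      # prints 0.0031443795
```

So two different print statements can make the very same value look like two different numbers.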


The pre-upsert values initially referenced are the actual values per the source JSON, viewed in a VS Code terminal.

Source JSON

  {
    "id": "1",
    "embedding": [ ... ],
    "file_name": "...",
    "text_chunk": "..."
  }

Something appears to be going on, so I am taking a look at the Python client Index class source code to see what type conversions might be occurring.

@lewis.wooster could you share the next five or more numbers in that embedding list? It's odd that the precision in your VS Code window goes from 10 digits to 9 to 8, while the precision shown in the Pinecone output is up to 11 in the first instance (one decimal place greater), the same in the second, and 10 in the last (two decimal places greater than the VS Code output).

I want to eliminate something odd with how VS Code is displaying the numbers. As @zeke_pinecone said, Pinecone won't change the values you upsert, but the way those values are presented on your screen could be affected by something else.

And are you using the stock Python JSON library for loading the data, or something else? Again, we want to eliminate any other factors that could affect the precision of your embeddings.

I am just using the json Python package to import the JSON file.
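For what it's worth, a minimal sketch of that load (field names follow the JSON shown elsewhere in this thread). Python's json module parses numeric literals with float(), i.e. into IEEE 754 double precision, so the in-memory values carry more precision than a 32-bit store can later hold:

```python
import json

# Parse a small JSON document shaped like the one described in this thread.
doc = json.loads('{"_id": "1", "embedding": [0.0031443795, 0.012362102]}')

# json.loads turns each numeric literal into a Python float (a 64-bit double),
# so the parsed value compares equal to the same decimal literal in Python.
emb = doc["embedding"]
print(emb[0])   # prints 0.0031443795
```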

Raw embedding from the JSON file, as it is also presented in a VS Code terminal:

  {
    "_id": "1",
    "embedding": [0.0031443795, 0.012362102, -0.01929016, -0.01580181, -0.029213198, 0.02329273, 0.0075673577, 0.0066848467, -0.032576468, -0.024029315, ...],
    "file_name": "...",
    "text_chunk": "..."
  }

Upserted embedding in pinecone:
[0.00314437947, 0.012362102, -0.0192901604, -0.0158018097, -0.0292131975, 0.0232927296, 0.00756735774, 0.00668484671, -0.0325764678, -0.0240293145, …]
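One quick way to check whether 32-bit storage alone explains the difference is to round-trip a value through single precision; here is a sketch using only Python's standard library (numpy's float32 would work equally well):

```python
import struct

def to_f32(x: float) -> float:
    """Round-trip a Python float (a 64-bit double) through IEEE 754 single precision."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

original = 0.0031443795      # value as written in the source JSON (double precision)
stored = to_f32(original)    # the nearest value a 32-bit float can actually hold

# The f32 value, rendered back as a double, shows extra trailing digits --
# the same effect seen in the Pinecone UI output above.
print(stored)

# Further 32-bit round-trips change nothing: the loss happens exactly once.
print(to_f32(stored) == stored)
```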

Hey @Cory_Pinecone, it’s worth pointing out that you can reproduce this issue directly in the Pinecone portal.

  1. Create a new index with dimension=10

  2. Click Add a record, and copy in the embeddings: 0.0031443795, 0.012362102, -0.01929016,-0.01580181,-0.029213198,0.02329273,0.0075673577,0.0066848467,-0.032576468,-0.024029315

  3. Click Query to find the embedding, and observe that it is rendered in the portal as 0.00314437947, 0.012362102, -0.0192901604, -0.0158018097, -0.0292131975,... etc.

  4. You can see the same (incorrect) values coming back via the API (the below is for Fetch, but Query returns the same):

$ curl '' -H 'Api-Key: xxxxxxxxxxx' | jq .
  {
    "vectors": {
      "1": {
        "id": "1",
        "values": [ ... ]
      }
    },
    "namespace": ""
  }

Hope this helps!

Thanks, Silas, for confirming that you can reproduce the error. The original index dimension where this issue was first observed was 1536.

It was just easier to reproduce with a small number.

@Cory_Pinecone, to add to @silas's findings:

This has got to be a floating-point conversion issue in transit to Pinecone. I can 100% confirm that when I post data to an index, it leaves my network with the full precision as defined (I used your example).

But sure enough, when I request it back I get a 100% match score, yet the raw values that come back are incorrect.

Here is a summary for your eng team (although I'm sure they already know about this issue):

  • You can send up to around 8 significant digits of precision. Beyond that, the value is rounded to the nearest representable number (a 9th digit >= 5 rounds up):
    0.00314437 => 0.00314437
    0.00314437000 => 0.00314437
    0.003144371 => 0.00314437109
    0.0031443711 => 0.00314437109
    0.00314437111 => 0.00314437109
    0.00314437411 => 0.00314437412
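The collapsing behavior in the table above can be checked locally by comparing raw IEEE 754 bit patterns; a sketch using Python's standard library:

```python
import struct

def f32_bits(x: float) -> str:
    """Hex string of the IEEE 754 single-precision encoding of x."""
    return struct.pack('<f', x).hex()

# Decimal inputs that agree in the first ~8 significant digits collapse to
# the same 32-bit pattern, which is why they read back identically.
print(f32_bits(0.00314437), f32_bits(0.00314437000))   # identical patterns
print(f32_bits(0.003144371), f32_bits(0.0031443711))   # also expected identical
```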

I suppose the question to ask now is: does it matter? If you send the near-same vector you get a full match, so I'm going to guess it doesn't, since trying to maintain some arbitrary level of precision would be annoying given that the rounding happens on both client and server.

Worth noting that metadata is also limited to 17 significant figures. This makes sense, because even when storing geo data like

  "metadata": {
    "lat": 319.16490707397463
  }

that is like a micrometer of precision.


Hi @lewis.wooster, @silas, and @tim!

Thanks a bunch for the continued investigation and discussion on this post. We’re currently looking into this and will report back with clarification on the expected behavior.

After discussing internally, we can confirm that the behavior we’ve observed and shared on this thread is expected.

What's happening here is a result of the fact that most decimal fractions simply can't be represented exactly in a fixed amount of memory: different decimal numbers can map to the same bit pattern, according to a standard known as IEEE 754. More specifically, at Pinecone we use Rust's f32 type, more commonly known as a float in Java, C, and C++. (Python's built-in float is 64-bit, which is why the values printed from your JSON carry more precision than a 32-bit store can hold.)

If you take these numbers and look at their physical memory representation, you’ll see that each float maps to the same representation before and after we upsert the vector to Pinecone. Our team built a small demonstration in Rust here that you can use to explore some examples.
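For readers without a Rust toolchain, the same idea can be sketched with Python's standard library (this mirrors the spirit of the demonstration rather than reproducing it; the "after" value is the one Pinecone rendered back earlier in this thread):

```python
import struct

def f32_bits(x: float) -> str:
    """IEEE 754 single-precision bit pattern of x, as a hex string."""
    return struct.pack('<f', x).hex()

before = 0.0031443795    # first value sent in the upsert
after = 0.00314437947    # first value rendered back by Pinecone

# Both decimal strings decode to the same 32-bit float, so the stored
# vector is bit-for-bit identical to what was upserted -- only the
# decimal rendering differs.
print(f32_bits(before) == f32_bits(after))   # True
```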

This behavior is common across every system in the world, and the general trend in machine learning has been to reduce accuracy even more (see Google’s bfloat16, which is now also standardized). Thank you again for the engagement on this thread, and please let us know if you have any questions!