Do not understand results of Euclidean metric search

john_beale · August 26, 2023, 12:44am

I uploaded a dataset of 150 vectors, each with 16 elements (setDimension=16), to an index created with metric="euclidean":

indexName = "c1"
setDimension=16
pinecone.create_index(indexName, dimension=setDimension, metric="euclidean")

I confirmed in the Pinecone Console that the index holds 150 vectors, that Dimensions=16, and that the metric is indeed set to euclidean.

Now when I do a query asking for 3 matches,

res = index.query(
  vector=testVec,
  top_k=3,
  include_values=True
)

I do get three items back, but with “score” values I do not understand. I thought the score would simply be the Euclidean distance, i.e. the L2 norm of (v1 - v2), between my test vector and each returned vector, so I calculated the norm myself using this code:

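import numpy as np

# Get the Euclidean distance between the vectors at rows i1 and i2.
# Column 0 of emVectors is assumed to hold the row index, so the vector starts at column 1.
# Note: 1:cols covers every component only if cols == emVectors.shape[1].
def getDist(i1, i2):
    tvec1 = emVectors[i1, 1:cols]
    tvec2 = emVectors[i2, 1:cols]
    dist = np.linalg.norm(tvec2 - tvec1)
    return dist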

and found that this is not true. I expected the matches to be listed from best to worst, and they are in fact ordered from low ‘score’ to high ‘score’, but they are not in order of the Euclidean distance I calculated.

Also my test vector is actually a member of the dataset, but that element is not one of those returned, despite it being obviously the best match as the (vTest - vResult) distance would be zero. Am I misunderstanding how this is supposed to work?
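
For a dataset this small, one way to sanity-check the query results is a brute-force search in NumPy. This is only a sketch: it assumes emVectors is laid out with the row index in column 0 and the vector components in all remaining columns, and both the random stand-in data and the helper name `brute_force_top_k` are made up for illustration.

```python
import numpy as np

def brute_force_top_k(em_vectors, query, k):
    """Return (row_index, distance) pairs for the k rows closest to query.

    Assumes column 0 of em_vectors is the row index and columns 1..end
    are the vector components (hypothetical layout).
    """
    ids = em_vectors[:, 0].astype(int)
    vectors = em_vectors[:, 1:]                # every component, not a partial slice
    dists = np.linalg.norm(vectors - np.asarray(query), axis=1)
    order = np.argsort(dists)[:k]              # smallest distance first
    return [(int(ids[i]), float(dists[i])) for i in order]

# Illustrative stand-in data: 150 random vectors of dimension 16
rng = np.random.default_rng(0)
data = rng.random((150, 16))
emVectors = np.column_stack([np.arange(150), data])

testVec = emVectors[42, 1:]                    # a vector that is in the dataset
top3 = brute_force_top_k(emVectors, testVec, 3)
for idx, dist in top3:
    print(idx, round(dist, 4))
```

If the query vector is a member of the dataset, the first pair should be that member with distance 0, which gives a reference ranking to compare against what the index returns.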

chastaineric6 · August 26, 2023, 11:11pm

Clarify Score Interpretation:

Refer to Pinecone’s documentation or support to understand the returned score.

Improve the Query Process:

Ensure the exact search (if available in Pinecone) is enabled to guarantee precision.

# Assuming Pinecone has an 'exact_search' parameter (this is hypothetical; consult the documentation)
res = index.query(
  vector=testVec,
  top_k=3,
  include_values=True,
  exact_search=True
)

Verification Utility:

Let’s create a utility to verify the results from Pinecone against the manual calculation:

import numpy as np

def calculate_euclidean_distance(vec1, vec2):
    """Calculate the Euclidean distance between two vectors."""
    # Convert to arrays so that plain Python lists can be subtracted
    return np.linalg.norm(np.asarray(vec1) - np.asarray(vec2))

def verify_pinecone_results(testVec, results, emVectors):
    """
    Verify the returned results from Pinecone against manual calculation.
    Assumes each result carries its vector under 'values' and a 'score'.
    """
    for item in results:
        # Extract the vector from the result (key name assumed; structure might vary)
        result_vector = item['values']
        manual_distance = calculate_euclidean_distance(testVec, result_vector)

        # Compare manual distance with the returned score
        print(f"Manual Distance: {manual_distance}, Pinecone Score: {item['score']}")

# Call the verification function (assumes the matches are under res['matches'])
verify_pinecone_results(testVec, res['matches'], emVectors)

Address Missing Test Vector:

If the test vector itself isn’t being returned and you expect it to be the closest match, ensure it’s in the dataset and the exact match isn’t being omitted by Pinecone.

Regular Maintenance:

Depending on how dynamic your vector data is, consider periodic re-indexing or cleanup. Some vector databases tend to degrade in performance or accuracy as data is inserted or removed. This step is probably more relevant for very large datasets. I don’t know how much quantization, if any, is involved here. I could suggest a separate scheme, but I can’t say for sure how to get this to match your dataset, as I haven’t had much experience with Pinecone.

chastaineric6 · August 26, 2023, 11:13pm

If Pinecone (or whatever system is in use) employs quantization techniques, that could explain some of the discrepancies in distances or scores. Quantized vectors will not yield exactly the same distances as the original vectors, but the results should still be reasonably accurate for most applications.
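
To illustrate the general point (this says nothing about what Pinecone actually does internally, which is not described in this thread), here is a sketch of uniform 8-bit scalar quantization on values in [0, 1] and its effect on a Euclidean distance. The function name `quantize_8bit` and the random vectors are made up for the example.

```python
import numpy as np

def quantize_8bit(vec):
    """Round each component in [0, 1] to the nearest of 256 evenly spaced levels."""
    return np.round(np.asarray(vec) * 255.0) / 255.0

rng = np.random.default_rng(1)
a = rng.random(16)
b = rng.random(16)

exact = float(np.linalg.norm(a - b))
approx = float(np.linalg.norm(quantize_8bit(a) - quantize_8bit(b)))
error = abs(exact - approx)

# Each component moves by at most 0.5/255, so by the triangle inequality
# the distance can change by at most sqrt(d)/255.
bound = float(np.sqrt(a.size) / 255.0)
print(f"exact={exact:.6f} quantized={approx:.6f} error={error:.6f} bound={bound:.6f}")
```

With this particular scheme the distance error is small and bounded, so it would perturb scores slightly rather than reshuffle vectors that are far apart.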

john_beale · August 28, 2023, 3:52pm

Updated information: here is another test with a ten-vector, dimension=2 test dataset (the first column is the vector index):

[[0.         0.89596237 0.70978513]
 [1.         0.70082368 0.83607366]
 [2.         0.49262817 0.95895796]
 [3.         0.2060746  0.01474949]
 [4.         0.85170152 0.28698518]
 [5.         0.12843261 0.58501157]
 [6.         0.2578227  0.11497755]
 [7.         0.58453423 0.47747947]
 [8.         0.86946807 0.02560292]
 [9.         0.54156107 0.70378163]]

Here is the same dataset sorted by Euclidean distance from element 2 (printed with 4 decimals just to save space; I didn’t truncate the underlying values). If I were to request the four closest vectors to [2], I would expect to get [2,9,7,1], in that order.

Idx  (   x0 ,    x1 )   dist
2    (0.4926, 0.9590) 0.0000
9    (0.5416, 0.7038) 0.0489
7    (0.5845, 0.4775) 0.0919
1    (0.7008, 0.8361) 0.2082
6    (0.2578, 0.1150) 0.2348
3    (0.2061, 0.0147) 0.2866
4    (0.8517, 0.2870) 0.3591
5    (0.1284, 0.5850) 0.3642
8    (0.8695, 0.0256) 0.3768
0    (0.8960, 0.7098) 0.4033

Here is what I sent (the exact value of vector [2]) and what I got back:

testVec = [0.4926281651664598, 0.9589579552076507]

res = index.query(
  vector=testVec,
  top_k=4,
  include_values=True,
  exact_search=True
)
print(res)

{'matches': [{'id': '2',
              'score': 0.00358533859,
              'values': [0.492628157, 0.95895797]},
             {'id': '1',
              'score': 0.0639438629,
              'values': [0.700823665, 0.836073637]},
             {'id': '9',
              'score': 0.0679972172,
              'values': [0.541561067, 0.703781605]},
             {'id': '0',
              'score': 0.22275424,
              'values': [0.895962358, 0.709785104]}],
 'namespace': ''}

As you can see, I got back [2,1,9,0]. It did return [2] as the closest, but going by my sorted table, vector [0] is the farthest of all ten from my test vector. It is hard for me to explain this result. Pinecone apparently stores the vectors with single-precision float accuracy (~7 decimal digits), but does it do the actual calculation with far less accuracy?
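
One cross-check that may help here is to recompute the distances from vector [2] over both components of the ten-vector dataset posted above, next to the difference in the first component alone. The sketch below does that with plain NumPy. On the posted numbers, the ranking over both components begins 2, 1, 9, 0, which is the order the query returned, while the first-component-only values reproduce the dist column of the sorted table (0.0489 for [9], 0.0919 for [7], 0.2082 for [1], and so on). The squared distances are printed too, on the assumption (not confirmed anywhere in this thread, so check Pinecone's documentation) that the euclidean score may be a squared distance.

```python
import numpy as np

# The ten-vector dataset posted above: column 0 is the index, columns 1-2 the vector
emVectors = np.array([
    [0, 0.89596237, 0.70978513],
    [1, 0.70082368, 0.83607366],
    [2, 0.49262817, 0.95895796],
    [3, 0.2060746,  0.01474949],
    [4, 0.85170152, 0.28698518],
    [5, 0.12843261, 0.58501157],
    [6, 0.2578227,  0.11497755],
    [7, 0.58453423, 0.47747947],
    [8, 0.86946807, 0.02560292],
    [9, 0.54156107, 0.70378163],
])

ids = emVectors[:, 0].astype(int)
vectors = emVectors[:, 1:]                             # both components
query = vectors[2]

full_dist = np.linalg.norm(vectors - query, axis=1)    # Euclidean distance in 2-D
x0_only = np.abs(vectors[:, 0] - query[0])             # first component alone
squared = full_dist ** 2                               # squared Euclidean distance

rank_full = [int(i) for i in ids[np.argsort(full_dist)]]
rank_x0 = [int(i) for i in ids[np.argsort(x0_only)]]

print("ranked by full distance:  ", rank_full)
print("ranked by first component:", rank_x0)
for i in rank_full[:4]:
    print(i, f"dist={full_dist[i]:.4f}", f"squared={squared[i]:.4f}")
```

The squared distances for [1], [9] and [0] come out near 0.058, 0.068 and 0.225, which is close to, though not identical to, the returned scores of 0.0639, 0.0680 and 0.2228. That would be consistent with the table having been built from a slice that picked up only the first component, but only the original script can confirm it.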