The presence of metadata affects vector retrieval negatively

Hey, all! This is my first post to the Pinecone community. I posted this on the LangChain Discord first, but it doesn’t seem to be a LangChain issue.

Using Pinecone to store my vector embeddings goes perfectly well until I add metadata to the vectors. The upsert itself works fine; it’s when I then query the vector store that things go weird. The similarity search is suddenly terrible at retrieving vectors similar to the query. Sometimes it returns four copies of the same one! And in general it does a poor job. As soon as I index my data without the metadata it works perfectly.

I have two indexes set up in Pinecone. They’re identical except that one has metadata and the other doesn’t. When used in my LangChain-based chatbot, the one without metadata behaves exactly as I would expect, answering questions correctly. The one with the metadata seems to be brain dead. Instead of answering, the chatbot says “I’m sorry, but the document parts provided do not contain information about…”.

In my debug output I can see the vectors returned from the similarity search and the metadata associated with each. But as described, the vectors are very badly chosen, sometimes returning the same one multiple times.

This doesn’t make any sense to me. Does adding metadata to vectors really affect the results of a similarity search? I’m not the only one with this problem.


Hi @zigguratt. This is odd behavior; metadata should improve your search results by filtering out the irrelevant parts of the corpus and only querying against vectors that match the filter.

Can you share some examples of the results you’re seeing? Please include the queries themselves as well as the results. If your metadata contains sensitive information, you can obfuscate it; as long as the schema matches what you’re using that should be enough.

Also, did you set the metadata_config property when you created your index?
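
For reference, metadata_config is set when the index is created and controls which metadata fields are indexed for filtering. A minimal sketch of both halves (the index name, dimension, and field names here are just illustrative):

import pinecone

pinecone.init(api_key=YOUR_PINECONE_API_KEY, environment=YOUR_PINECONE_ENV)

# Only fields listed under "indexed" can be used in query filters;
# any other metadata is stored with the vector but is not filterable.
pinecone.create_index(
    "my-index",
    dimension=1536,  # match your embedding model's output size
    metadata_config={"indexed": ["org", "date"]},
)

index = pinecone.Index("my-index")
# query_vector: your embedded query, a plain list of floats
results = index.query(query_vector, top_k=4, include_metadata=True,
                      filter={"org": {"$eq": "Some Org"}})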

@zigguratt - have you tried using a simple Python script to execute Pinecone queries against your two indexes with the vector alone, and reviewing the results (with scores)? I have difficulty believing that metadata would skew the results, but if you are correct, this is really important to know. My thought is that it might be post-processing in LangChain.

Thank you for the response, @Cory_Pinecone. I created my indexes in the web interface. I don’t recall seeing any setting for metadata_config there.

As far as improving results by filtering, I haven’t even got to the point of using the metadata yet. But I’ll be using it to provide links to the source material, not for filtering results (yet).

I’ll put together examples of this behaviour and come back here.

Thanks, I’ll try that. It’s always important to isolate behaviour before coming to any conclusions.

Yes, just something simple like this:

import pinecone

pinecone.init(
    api_key=YOUR_PINECONE_API_KEY,
    environment=YOUR_PINECONE_ENV,
)

question = "some meaningful query against your vector"
query = your_model.encode(question).tolist()

indexes = ['index1', 'index2']
for index_name in indexes:
    index = pinecone.Index(index_name)
    xc = index.query(query, top_k=4, include_metadata=True)
    for result in xc['matches']:
        print(f"{round(result['score'], 2)} (id: {result['id']}):\n"
              f" {result['metadata']['some_content']}\n"
              f" {result['metadata']['other_content']}")

Here are two appropriately-named text files that capture the output of an interaction with my chatbot. Hopefully it will be clear from these files what is happening.

http://syrinx.net/no-metadata.txt
http://syrinx.net/with-metadata.txt

I created two indexes in a new project in Pinecone. They’re the same except for the presence or absence of metadata. Here’s the code I ended up using. Please excuse any naiveté. Long time developer, new to vectors and LLMs.

from keys import PINECONE_API_KEY, PINECONE_ENV
import openai, pinecone, json

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
query = openai.Embedding.create(input="What is BC's ESG rating?",
    model="text-embedding-ada-002")["data"][0]["embedding"]
indexes = ['test-with-metadata', 'test-no-metadata']

for index_name in indexes:
    index = pinecone.Index(index_name)
    xc = index.query(query, top_k=4, include_metadata=True)
    for result in xc['matches']:
        print(f"{'='*80}\nIndex: {index_name}")
        print(f"{round(result['score'], 2)} (id: {result['id']}):")
        print(json.dumps(result['metadata'], indent=4))

Here’s a link to the output of that code. Note the repetition in the “text” field with metadata and the variety without metadata.

Well, that is a little concerning if everything else is status quo, as it implies metadata may have an impact on the results of the vector matching. The scores look the same (.91, .90, .90, .89); however, the text is not the same at each rank (I could understand the two .90s potentially swapping positions due to storage).

Is it at all possible you included the metadata fields within your text encoding when you upserted the data? That would obviously influence the results.

If not, I’d tend to agree the database shouldn’t behave like this.
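
To make the distinction concrete, here’s a contrived snippet (the model and strings are just for illustration):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk = "Some passage from the report."
metadata = {"name": "Report One", "url": "https://example.com/report.pdf"}

# Right: embed only the text; the metadata rides along separately in the upsert.
good_vector = model.encode(chunk).tolist()

# Wrong: folding the metadata into the string you embed shifts the vector,
# which would skew the similarity matching exactly as you describe.
bad_vector = model.encode(f"{metadata} {chunk}").tolist()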

This incomplete code extract is how I added the embeddings:

import os, json
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

with open("metadata.json") as file:
    json_data = json.load(file)

files = os.listdir("data")

for filename in files:
    metadata = json_data[filename]
    # attach this file's metadata to every one of its chunks
    metadatas = []
    for i in range(len(texts)):
        metadatas.append(metadata)

    Pinecone.from_texts(texts, embeddings, metadatas=metadatas,
        index_name=cfg.index_name)

texts are the chunks of data after splitting the files (there’s a sketch of that step after the metadata example below). The metadata looks like this (but longer in the real file, of course):

{
    "2022-arc-esg-report.pdf": {
        "pdf": "2022-arc-esg-report.pdf",
        "name": "ARC Resources Ltd 2022 ESG Report",
        "org": "ARC Resources Ltd",
        "date": "2022",
        "url": "https://www.arcresources.com/wp-content/uploads/2022/09/2022-ARC-ESG-Report.pdf"
    },
    "2022-bc-esg-report.pdf": {
        "pdf": "2022-bc-esg-report.pdf",
        "name": "B.C. Environmental, Social and Governance (ESG) Summary Report",
        "org": "British Columbia Ministry of Finance",
        "date": "2022",
        "url": "https://www2.gov.bc.ca/assets/gov/british-columbians-our-governments/government-finances/debt-management/bc-esg-report.pdf"
    }
}
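
For completeness, texts comes from splitting each file, roughly like this, though my real loader and splitter settings differ (the chunk sizes here are made up):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF and split its text into chunks; one metadata dict per chunk
# gets attached in the loop above.
pages = PyPDFLoader(os.path.join("data", filename)).load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = splitter.split_text("\n".join(page.page_content for page in pages))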

I hope it’s just something I’ve done wrong!

I’m not sure how from_texts works, but in the example on Pinecone’s site using upsert, you need to 1) get your model and 2) embed your text (OpenAI):

    # create embeddings
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # prep metadata and upsert batch
    meta = [{'text': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))

Perhaps try using Index.upsert() directly.

Here is an example I used:

import torch
from sentence_transformers import SentenceTransformer

def create_pinecone_vector(vectors, filename, fragment, id):
    # Not good form to load the model on every call; it's just to illustrate
    # that you need to 1) get the model, then 2) encode the embeddings.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
    # This is where I encode the embedding.
    vector = model.encode(fragment).tolist()
    _id = str(id)
    # While I have my text included as metadata, the vector is what is
    # matched against.
    metadata = {'text': fragment, 'url': filename}
    # Batch the vectors, then do an upsert:
    vectors.append((_id, vector, metadata))

...
index.upsert(vectors=vectors)

Try that out - keep the embedding (vector) separate from the metadata.

I’m not familiar enough with using Pinecone natively yet to get your code sample going. But this is how LangChain does the upsert behind the scenes. I simplified it and gave it contrived data for this demonstration but it works like this:

from keys import OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_ENV
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone, uuid

embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

index = pinecone.Index("contrived")

batch_size = 32
texts = ["hey", "ho", "let's go"]
ids = [str(uuid.uuid4()) for n in range(3)]
metadatas = [{"name": "Report One"},
             {"name": "Report Two"},
             {"name": "Report Three"}]

for i in range(0, len(texts), batch_size):
    i_end = min(i + batch_size, len(texts))
    lines_batch = texts[i:i_end]
    ids_batch = ids[i:i_end]
    embeds = embedding.embed_documents(lines_batch)
    metadata = metadatas[i:i_end]

    for j, line in enumerate(lines_batch):
        metadata[j]["text"] = line

    to_upsert = zip(ids_batch, embeds, metadata)
    index.upsert(vectors=list(to_upsert))
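
To get the dump below, I swapped the tail of the loop for something like this (import json goes at the top; zip is an iterator, so it has to be materialized once and reused, or the upsert would see nothing after the print consumed it):

import json

to_upsert = list(zip(ids_batch, embeds, metadata))
print(json.dumps(to_upsert, indent=4))
index.upsert(vectors=to_upsert)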

The dump (with the long embedding vectors truncated) looks like this:

[
    [
        "fb31d36e-eae6-4a3e-94ed-a75999760303",
        [
            -0.02625627155432552,
            -0.012488057308738962,
            -0.015884390613657165,
            .
            .
            .
        ],
        {
            "name": "Report One",
            "text": "hey"
        }
    ],
    [
        "8f5c44ad-f33f-4cab-9980-f53ef333951d",
        [
            -0.002752859861633589,
            -0.007076400377751422,
            -0.000402674547197368,
            .
            .
            .
        ],
        {
            "name": "Report Two",
            "text": "ho"
        }
    ],
    [
        "4ba31bc3-a37d-4df6-86c8-f365aa0c19e8",
        [
            -0.007139169753224881,
            -0.015229348396588342,
            -0.010289386119732305,
            .
            .
            .
        ],
        {
            "name": "Report Three",
            "text": "let's go"
        }
    ]
]

This goes into the contrived index with no problems.

It looks like whoever wrote the LangChain vectorstore/pinecone.py code followed the example at the OpenAI link you provided. So it’s probably doing things correctly.

Any thoughts on this, @Cory_Pinecone? It’s really holding me up.

I haven’t heard from you or anyone from Pinecone in five days, @Cory_Pinecone. This seems to be a rather important flaw, either in my code or in Pinecone itself. The metadata is very important to what I’m building. I can’t wait this long to get a response from Pinecone so I’m going to have to look at alternative vector stores. I was encouraged by your quick initial response but disappointed by the lack of response since.

Hi,
I have the same issue. To me it looks like the issue is with loading into Pinecone from LangChain, especially how the text part is handled when loading. I am looking into it. Let me know if you found something in the meantime.


I really would have liked to use Pinecone for my project. But this metadata issue is a showstopper for me. I’ve given all of the information requested and more. It has now been two weeks with no response from @Cory_Pinecone or anyone else at Pinecone. It’s a shame. I guess I’ll try out Weaviate.

Hey, did you ever figure this out? We ran into the same issue and spent hours banging our heads against our desks.

What db did you switch to?

The problem was that LangChain’s vectorstores/pinecone.py was modifying the metadata objects passed to it: from_texts writes each chunk’s text into the metadata dict in place, so chunks that share a single dict all end up with the same “text”. Doing a deepcopy before passing the metadata to LangChain preserves the original objects and their data.
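
Applied to my earlier from_texts loop, the fix looks roughly like this:

import copy

# Hand LangChain deep copies so its in-place edits can't clobber the
# originals (or alias every chunk to one shared dict).
metadatas_copy = [copy.deepcopy(m) for m in metadatas]
Pinecone.from_texts(texts, embeddings, metadatas=metadatas_copy,
    index_name=cfg.index_name)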


Can you provide sample code to illustrate the fix?
