Pinecone results are wrong

roberto.sanchez · July 5, 2023, 1:45am

Hi everyone, lately I’ve been having trouble getting good results. I am developing an assistant based on Pinecone and OpenAI that answers questions about documents. The text is extracted from the documents, split into chunks, converted to vectors with OpenAI embeddings, and then inserted into a Pinecone index. I have made sure that no pieces of text are lost during this process. There is a specific case, when my query is about the alarm codes of a tool (the documents are manuals), where I do not get the expected response even when I use the same words as the text fragment I want to retrieve. I have noticed that the vectors I want start to appear in the query results when I increase the top_k parameter, but I don’t understand why their score is so low when the query uses similar or identical words. I tried another metric but got the same result.

Here is some of the configuration I use:

Index:
metric: cosine
dimension: 1536

query:
top_k: 20
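
To make the top_k behaviour concrete, here is a minimal sketch of what a cosine-metric query does: score every stored vector against the query and keep the k best. The three-dimensional vectors and their ids are made up for illustration (real embeddings here have 1536 dimensions), and this is plain Python, not Pinecone's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Return the k (id, score) pairs with the highest cosine score."""
    scored = [(vec_id, cosine(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors standing in for embedded chunks.
vectors = {
    "intro": [0.9, 0.1, 0.0],
    "safety": [0.8, 0.3, 0.1],
    "alarm-codes": [0.2, 0.9, 0.3],
}
query = [0.7, 0.5, 0.1]

for k in (2, 3):
    print(k, [vec_id for vec_id, _ in top_k(query, vectors, k)])
```

With these toy numbers the "alarm-codes" vector is missing at k=2 and only shows up at k=3, which is the same pattern I see in my index with larger values of top_k.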

dukemagus · July 5, 2023, 2:30pm

When splitting the base documents, try to keep an overlap of 200 characters or 100 tokens (depending on how you measure it), or break chunks at the end of a paragraph.

This keeps a sequence of related statements linked across chunk boundaries.
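
A rough sketch of that splitting idea in plain Python, measured in characters. The function name `split_with_overlap` and the default `chunk_size` of 1000 are assumptions for illustration; the paragraph detection simply looks for a blank line and is not tied to any particular splitter library.

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ended.
    When a blank line (paragraph end) falls inside the window, the chunk is
    cut there instead of at the hard character limit.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for a paragraph break far enough in that the next chunk
            # still starts after the current one.
            cut = text.rfind("\n\n", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share text
    return chunks

# Small values so the overlap is easy to see.
for chunk in split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3):
    print(repr(chunk))
```

A token-based version would work the same way, with the slicing done over a list of tokens instead of characters.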

You can also use metadata to label each part of your document, so that you can prioritize querying a category such as “alarm codes”, for example.
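
Something along these lines, as a sketch only. The helpers `build_record` and `build_query` and the metadata fields `document` and `category` are hypothetical names. The dictionaries follow the record and filter shapes I believe Pinecone's Python client accepts (`id`/`values`/`metadata`, and a `$eq` filter), so check them against the client version you have installed.

```python
def build_record(chunk_id, embedding, document, category):
    """Build one record to upsert, labelled with the section it came from."""
    return {
        "id": chunk_id,
        "values": embedding,
        "metadata": {"document": document, "category": category},
    }

def build_query(embedding, top_k=20, category=None):
    """Build query parameters, optionally restricted to one category."""
    params = {"vector": embedding, "top_k": top_k, "include_metadata": True}
    if category is not None:
        # Only vectors whose metadata category matches are considered.
        params["filter"] = {"category": {"$eq": category}}
    return params

# Placeholder embedding; a real one would come from the embedding model.
fake_embedding = [0.0] * 1536

record = build_record("manual-1-chunk-42", fake_embedding, "manual-1", "alarm codes")
params = build_query(fake_embedding, category="alarm codes")
print(record["metadata"], params["filter"])

# Assumed usage with an existing index object (not executed here):
# index.upsert(vectors=[record])
# index.query(**params)
```

Filtering first narrows the candidates to the labelled section, so the top_k slots are not taken by chunks from unrelated parts of the manual.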

roberto.sanchez · July 5, 2023, 6:56pm

Thanks for answering. I’m using a chunk size of 1000 and an overlap of 200 to split the text. I tried other values for these parameters, but that didn’t help. In each vector’s metadata I’m saving the name of the document it belongs to. Is there anything else I could try?