Decent success with sentence embedding/querying using all-MiniLM-L6-v2 and a rolling window

Hope this information is useful for anyone looking to use Pinecone for text chat/NLP:

For context, I’m using the PDF documents from https://github.com/Azure-Samples/azure-search-openai-demo/data as my data repository.

I used PDFPlumber to extract text from each PDF page (code can be found in the above repo). Unfortunately, the initial results were less than optimal: queries often returned passages with no relevance to the question. For instance, against the demo “Northwind Health Plus Benefits Details” document, asking “are glasses covered under my health plan?” would return a paragraph about prosthetics.
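
The extraction step looks roughly like this (a minimal sketch using pdfplumber directly; the repo’s own code differs in the details):

```python
import pdfplumber

def extract_pages(pdf_path: str) -> list[str]:
    """Return the extracted text of each page of a PDF."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for image-only pages
            pages.append(page.extract_text() or "")
    return pages
```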

I tried several approaches (all with the all-MiniLM-L6-v2 model; see the sketch after this list for how max_seq_length is set):

  1. whole page with 256 max_seq_length
  2. whole page with 512 max_seq_length
  3. chunks of 4 sentences per page, with 256 or 512 max_seq_length
  4. single-sentence embeddings
  5. a rolling-window paragraph
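
For anyone reproducing these variations: with sentence-transformers, the truncation length is controlled by the model’s max_seq_length attribute (a minimal sketch; all-MiniLM-L6-v2 defaults to 256, and the underlying transformer tops out at 512):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
model.max_seq_length = 512  # default is 256; longer inputs are truncated to this

page_text = "..."  # text of one PDF page, e.g. from the pdfplumber sketch above
page_embedding = model.encode(page_text)
```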

I finally settled on a hybrid: embedding each sentence on its own, as well as embedding a rolling window of that sentence plus the 4 subsequent sentences. Using this method, I was able to achieve reasonable results.
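
Concretely, the chunking looks something like this (a sketch; I’m assuming NLTK for sentence splitting, but any sentence tokenizer works):

```python
import nltk

nltk.download("punkt", quiet=True)

def sentence_and_window_chunks(text: str, window: int = 4):
    """Yield (sentence, rolling_window) pairs: each sentence alone,
    plus that sentence joined with up to `window` subsequent sentences."""
    sentences = nltk.sent_tokenize(text)
    for i, sentence in enumerate(sentences):
        yield sentence, " ".join(sentences[i : i + 1 + window])
```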

I chose this approach because occasionally the initial sentence has a high cosine similarity on its own, but the additional data in the following sentences improves the result the query was actually after. I upserted these in batches of 700 vectors (Pinecone has a limit of 1000 vectors or 2MB per upsert request).
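
The batching itself is straightforward (a sketch against the classic pinecone-client upsert API; the index name and metadata layout here are made up):

```python
import pinecone
from sentence_transformers import SentenceTransformer

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("pdf-sentences")  # hypothetical index name
model = SentenceTransformer("all-MiniLM-L6-v2")

BATCH_SIZE = 700  # comfortably under Pinecone's 1000-vector / 2MB per-request limit

def upsert_chunks(chunks):
    """chunks: iterable of (id, text, metadata) tuples."""
    batch = []
    for chunk_id, text, metadata in chunks:
        batch.append((chunk_id, model.encode(text).tolist(), metadata))
        if len(batch) >= BATCH_SIZE:
            index.upsert(vectors=batch)
            batch = []
    if batch:  # flush the final partial batch
        index.upsert(vectors=batch)
```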

This hybrid seems to strike a good balance of relevant information to then push into a ChatGPT prompt for summarization of the top XX results.
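
For the retrieval side, querying the top matches and stuffing them into a prompt looks roughly like this (a sketch reusing model and index from the upsert snippet above; the top_k of 5, the “text” metadata field, and the prompt wording are all assumptions):

```python
def build_summary_prompt(question: str, top_k: int = 5) -> str:
    """Query Pinecone and assemble the top matches into a summarization prompt."""
    query_vector = model.encode(question).tolist()
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    # Assumes the chunk text was stored in a "text" metadata field at upsert time
    context = "\n\n".join(match["metadata"]["text"] for match in results["matches"])
    return (
        "Summarize the sources below to answer the question.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```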

Please comment if you’ve found other approaches!