Our evals show that accuracy for certain questions doesn't meet our threshold. The data required to answer them is represented as tables in PDFs, and the semantic meaning is lost in our RAG pipeline. Are there any techniques for preserving that semantic meaning within a vector record?
We evaluated ColPali and CLIP early on, but their output dimensions differ from those of the text embedding models we use. We are also investigating vision models and frameworks, but we need techniques to effectively vectorize these tables for reuse in Pinecone.
Yes, there are documented examples and strategies for representing tables (whether originally in HTML or JSON format) as vectors for use in Pinecone. The recommended approach is to first extract the table data and convert it into a structured format such as a dataframe. From there, several vectorization strategies are worth trying:
- Concatenate all values in each row into a single string, and embed that string.
- Concatenate row values and column headers to give more context to each row before embedding.
- Add table descriptions to the concatenated row and header data for richer context.
- Convert each row into a natural language phrase that describes the data in that row for embedding.
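The four strategies above can be sketched with pandas. This is a minimal illustration: the table, column names, and sentence template are invented examples, not taken from your data, and the resulting strings would then be passed to whatever embedding model you choose.

```python
import pandas as pd

# Toy table of the kind you might extract from a PDF (illustrative data only).
df = pd.DataFrame(
    {"Region": ["EMEA", "APAC"], "Revenue": [1.2, 0.8], "Year": [2023, 2023]}
)

TABLE_DESCRIPTION = "Annual revenue (in $M) by sales region."


def rows_as_strings(df):
    """Strategy 1: concatenate all values in each row into one string."""
    return df.astype(str).agg(" ".join, axis=1).tolist()


def rows_with_headers(df):
    """Strategy 2: pair each value with its column header for context."""
    return [
        ", ".join(f"{col}: {val}" for col, val in row.items())
        for _, row in df.astype(str).iterrows()
    ]


def rows_with_description(df, description):
    """Strategy 3: prepend a table description to the header-qualified row."""
    return [f"{description} {text}" for text in rows_with_headers(df)]


def rows_as_sentences(df):
    """Strategy 4: render each row as a natural-language phrase.

    The sentence template is hand-written per table; this one is hypothetical.
    """
    return [
        f"In {row.Year}, the {row.Region} region generated ${row.Revenue}M in revenue."
        for row in df.itertuples()
    ]


for text in rows_as_sentences(df):
    print(text)
```

Each function returns one string per row, so the table embeds into one vector record per row; strategies 2 to 4 generally retrieve better than raw concatenation because the header and description text restores the semantic context the PDF layout carried visually.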
You can find a complete demo notebook that illustrates these approaches for tabular data here: vectorizing-structured-data.ipynb
ETA: For more advanced experimentation, you can also systematically test different embedding models, chunking strategies, and retrieval settings to find the best fit for your data and use case.
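One way to run that kind of systematic test is a small recall@k harness you can sweep over embedding models and chunking strategies before committing vectors to Pinecone. Everything below is a self-contained sketch: the bag-of-words embedder is a stand-in for a real model (swap in your actual `encode` call), and the corpus and eval set are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_at_k(embed_fn, corpus, eval_set, k=3):
    """Fraction of queries whose relevant doc appears in the top-k results.

    corpus:   {doc_id: text} — e.g. the row strings produced by one strategy.
    eval_set: [(query, relevant_doc_id)] — hand-labelled gold answers.
    """
    vectors = {doc_id: embed_fn(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in eval_set:
        q = embed_fn(query)
        top = sorted(vectors, key=lambda d: cosine(q, vectors[d]), reverse=True)[:k]
        if relevant_id in top:
            hits += 1
    return hits / len(eval_set)


def bow_embed(text, vocab=("revenue", "emea", "apac", "2023", "region")):
    """Toy bag-of-words embedder; replace with your real embedding model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]
```

Usage: run `recall_at_k` once per (embedding model, table-vectorization strategy) pair on the same eval set, and keep the combination with the highest score. For example, `recall_at_k(bow_embed, {"r1": "EMEA revenue 2023", "r2": "APAC revenue 2023"}, [("emea revenue", "r1")], k=1)` ranks `r1` first for the query.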