Choosing the Optimal Embedding Strategy for CSV Columns: Question, Category, and Answer

albint3r · January 16, 2024, 6:36pm

Good day, everyone,

I’m reaching out today with a straightforward query. Currently, I’m working on generating a CSV file with the following columns: question, category, answer. My dilemma revolves around the embedding process, and I’d appreciate some guidance on the best approach:

Embedding on All Three Columns: Should I create embeddings based on all three columns (e.g., question + category + answer)?
Embedding on Question Only: Alternatively, I’m contemplating embedding only the question and utilizing metadata for the category and answer (e.g., question).
Embedding on Answer Only: Lastly, another option is to focus solely on embedding the answer to the question (e.g., answer).
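
To make the three options concrete, here is a minimal sketch of the text each one would send to an embedding model. The sample row and the helper name `build_embedding_inputs` are hypothetical; only the column names (question, category, answer) come from the CSV described above.

```python
import csv
import io

# Hypothetical sample data; a real file would have the same three columns.
SAMPLE_CSV = """question,category,answer
How do I reset my password?,account,Use the 'Forgot password' link on the login page.
"""


def build_embedding_inputs(row):
    """Return the text that would be embedded under each of the three options."""
    return {
        # Option 1: question + category + answer in one string
        "all_columns": " ".join([row["question"], row["category"], row["answer"]]),
        # Option 2: question only (category and answer would be kept as metadata)
        "question_only": row["question"],
        # Option 3: answer only (question and category would be kept as metadata)
        "answer_only": row["answer"],
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
inputs = build_embedding_inputs(rows[0])
for option, text in inputs.items():
    print(f"{option}: {text}")
```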

Your insights and recommendations on the most effective strategy would be highly valuable. Thanks in advance!

Cory_Pinecone · January 31, 2024, 8:45pm

Hi @albint3r. It really depends on what you want to match on. If you’re matching on questions, you’ll return results that are similar to previous questions people had but that might not contain the context for an actual answer.

I think, in this case, you would be better served by embedding answers only. You could embed both questions and answers and use metadata to tie them together. But that would be considerably more complex, and I’m not actually sure how much more accurate your answers would end up being.
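
As a rough illustration of the answers-only approach, the sketch below embeds each answer and carries the question and category along as metadata. Everything here is an assumption made for the example: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the rows are made up, and the in-memory `search` replaces a real vector database query. The `id` / `values` / `metadata` record shape is simply one common way to lay out such records.

```python
import hashlib
import math
import re

DIM = 64  # arbitrary size for the toy vectors


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity; both inputs are already unit length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical rows with the three CSV columns.
ROWS = [
    {"question": "How do I reset my password?", "category": "account",
     "answer": "Use the 'Forgot password' link on the login page."},
    {"question": "Can I get my money back?", "category": "billing",
     "answer": "Refunds are available within 30 days of purchase."},
]


def build_records(rows):
    """Embed only the answer; keep question, category, and answer text as metadata."""
    return [
        {
            "id": str(i),
            "values": toy_embed(row["answer"]),
            "metadata": dict(row),
        }
        for i, row in enumerate(rows)
    ]


def search(records, query, top_k=1, category=None):
    """Rank records by similarity to the query, optionally filtering on category."""
    q = toy_embed(query)
    candidates = [
        r for r in records
        if category is None or r["metadata"]["category"] == category
    ]
    ranked = sorted(candidates, key=lambda r: cosine(q, r["values"]), reverse=True)
    return ranked[:top_k]


records = build_records(ROWS)
for hit in search(records, "how long do I have to ask for a refund", top_k=1):
    print(hit["metadata"]["answer"])
```

Because the original question travels in the metadata, a result can still show which question the matched answer belongs to without a second set of embeddings.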

What is it you’re building? Is this something that Canopy might help with, so you’re not reinventing the wheel?