Embedding dimensionality

Does the embedding size influence the quality of the results?

2 Likes

It can. Generally speaking, the larger the embedding size, the more information you can encode into your embeddings. But that doesn’t necessarily mean bigger is better. There are excellent SOTA semantic search models that embed vectors at a dimensionality of 768, and others (e.g., OpenAI’s Davinci) that use 12288. Both work, and larger dimensionality can bring an accuracy benefit, but not always. Another point to consider is the trade-off between accuracy (larger dims) and speed plus storage (smaller dims); your use case may require that you prioritize one or the other.
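To put rough numbers on the storage side of that trade-off, here’s a quick back-of-the-envelope calculation in Python. The vector count is an illustrative assumption, and float32 storage is assumed:

```python
# Back-of-the-envelope storage cost for float32 vectors at two
# dimensionalities; the vector count is an illustrative assumption.
BYTES_PER_FLOAT32 = 4
num_vectors = 1_000_000

for dim in (768, 12288):
    gigabytes = num_vectors * dim * BYTES_PER_FLOAT32 / 1e9
    print(f"{num_vectors:,} vectors at dim {dim}: ~{gigabytes:.1f} GB")
```

At a million vectors that’s roughly 3 GB at 768 dims versus roughly 49 GB at 12288 dims, before any index overhead, which is why lower dims also tend to mean faster search.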

1 Like

I am new to vector databases, and I have to say this didn’t make much sense to me. Is there a simpler explanation of how dimensionality works, and how tweaking it can get me to my desired outcome? (high-quality semantic search capabilities)

Hi @simistern, the dimensionality is usually determined by the embedding model you use. For example, OpenAI’s ada-002 model outputs embeddings with 1536 dimensions. For someone just getting started with this, there is no need — and in some cases there is no way — to change the dimensions of those embeddings.
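If you want to see that for yourself, here is a minimal sketch using the pre-1.0 `openai` Python client (assuming `OPENAI_API_KEY` is set in your environment); the model is the ada-002 one mentioned above:

```python
import openai  # pre-1.0 openai client assumed; OPENAI_API_KEY set in the environment

# Embed a short string and check the dimensionality ada-002 returns.
resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="How many dimensions does this embedding have?",
)
print(len(resp["data"][0]["embedding"]))  # 1536 -- fixed by the model, not tunable
```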

1 Like

Hi @simistern, it might help if you think of a real-world example of using vectors to identify a point. Say you’re tracking an airplane, where you need at least four dimensions of data: latitude, longitude, altitude, and time. Using those four, you can determine the plane’s location at a given point in time.

But say your original model used hours as the atomic unit of time, and now you need more accuracy. You could refine your model to track time at a finer granularity, or add dimensions for minutes and seconds. So now you have a vector with six dimensions, telling you more precisely where the plane was.
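In code terms, the refinement just appends components to the vector. These are purely illustrative numbers, not real embeddings:

```python
# Purely illustrative "plane position" vectors, not real embeddings.
# lat/lon in degrees, altitude in feet, then time components.
plane_4d = [40.64, -73.78, 35000, 14]           # time tracked to the hour
plane_6d = [40.64, -73.78, 35000, 14, 32, 7]    # + minutes and seconds

print(len(plane_4d), len(plane_6d))  # 4 6 -- more dims, more precision
```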

As your need for accuracy increases, you can track more dimensions to yield more precise measurements. So if your original lat/long coordinates were accurate to degrees, you could add dimensions to record the arcminutes and arcseconds of each degree. You could even add dimensions representing the aircraft’s length, width, and height, so you can model where one of its wings or the flight deck was in space at a particular time.

The more dimensions of data you add, the more precise your measurements of all these details become.

But there comes a point of diminishing returns. If you add dimensions to account for every seat on the plane down to a millisecond of time, for instance, the storage required to track all of that data can exceed what you have available. And that level of minute detail may not yield any more meaningful conclusions. So, as @jamesbriggs said, you have to balance the trade-offs between high accuracy on one hand and speed of response and storage of your vectors on the other.

I am working with the LayoutLM model and I want to create my vector DB with the embeddings of the SER model. This model’s outputs combined with the embeddings would include:
```
[
  ID: 22
  question_tokens: [2534, 4234, 3423]  # the tokenized question
  question_box: [x, y, x, y]
  answer_tokens: [4322, 5233, 234]     # the tokenized answer
  answer_box: [x, y, x, y]
]
```
Is the correct process to flatten these into one long vector:
```
[22, 2534, 4234, 3423, 4322, 5233, 234, x, y, x, y, x, y, x, y]
```
If so, does Pinecone do this for me?
Also, the ID should probably be in the metadata and not in the vector, right?

Does anyone have any thoughts on this question?

Hi @Gregg,

Pinecone stores the vectors for you but doesn’t flatten or manipulate them. We can only store what you send us.

As for the ID, that’s best stored in the vector-id field directly rather than in metadata. This will allow you to fetch the specific vector later and more easily map your vectors to other data sources.

I’m not familiar with the LayoutLM model, so I can’t offer any constructive advice about it specifically.
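That said, purely to illustrate the mechanics of flattening client-side and upserting with the ID in the vector-id field, here’s a rough sketch using the classic `pinecone-client`. The index name, box coordinates, and metadata are hypothetical:

```python
import pinecone  # classic pinecone-client assumed

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("layoutlm-demo")  # hypothetical index name

# Hypothetical SER output mirroring the structure in the question.
record = {
    "id": "22",
    "question_tokens": [2534, 4234, 3423],
    "question_box": [0.10, 0.20, 0.40, 0.25],  # made-up coordinates
    "answer_tokens": [4322, 5233, 234],
    "answer_box": [0.10, 0.30, 0.50, 0.35],
}

# Flatten everything except the ID into one vector of floats.
values = [
    float(v)
    for v in record["question_tokens"]
    + record["question_box"]
    + record["answer_tokens"]
    + record["answer_box"]
]

# The ID goes in the vector-id field; metadata holds anything else.
index.upsert(vectors=[(record["id"], values, {"model": "layoutlm-ser"})])
```

One caveat worth knowing: every vector in a Pinecone index must have the same dimensionality, so if your token lists vary in length you’d need to pad or truncate them to a fixed size before flattening.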

1 Like