What is a Vector Database?

Complex data is growing at breakneck speed. This is unstructured data such as documents, images, videos, and plain text on the web. Many organizations would benefit from storing and analyzing complex data, but it can be difficult to handle in traditional databases built with structured data in mind. Classifying complex data with keywords and metadata alone may be insufficient to represent all of its characteristics.


This is a companion discussion topic for the original entry at https://www.pinecone.io/learn/vector-database/

Thanks for the informative article. I’m a bit curious as to why the Pinecone vector indexes do not seem to use data structures that are common in the field of study called metric indexing.

For example, several years ago I had to work with a large set of vectors, each vector with 10,000 coordinates (Hypervectors - Pentti Kanerva et al.), and so I decided to just stick them in a simple BS-Tree (Kalantari and McDonald, 1983). The kNN searches worked well at discarding tree branches that were “too far away” and we could do exact or approximate cosine distance searches with error guarantees. There are more recent metric indexes that are probably better, but I did not see examples of any of them here.
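
To illustrate the kind of branch pruning described above, here is a minimal sketch of a simplified two-center (bisector-style) tree with covering radii. It is not the original BS-Tree code and not anything Pinecone ships; the names `Node`, `knn`, and `random_vectors` are made up for the example. It assumes angular distance (the angle between vectors) rather than 1 - cosine similarity, because the angle satisfies the triangle inequality that the pruning test needs, while giving the same neighbor ranking as cosine similarity.

```python
import heapq
import math
import random


def angular_distance(a, b):
    """Angle between a and b in radians. Unlike 1 - cosine similarity, this
    satisfies the triangle inequality, which the pruning below relies on."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def random_vectors(n, dim, seed=0):
    """Hypothetical helper: n Gaussian vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


class Node:
    def __init__(self, data, ids, center, leaf_size=8):
        self.center, self.ids, self.children = center, ids, []
        dists = [angular_distance(center, data[i]) for i in ids]
        self.radius = max(dists)  # covering radius of this subtree
        if len(ids) <= leaf_size:
            return
        # Second center: the point farthest from the first center.
        far = data[ids[dists.index(self.radius)]]
        near_ids, far_ids = [], []
        for i, d in zip(ids, dists):
            closer_to_first = d <= angular_distance(far, data[i])
            (near_ids if closer_to_first else far_ids).append(i)
        if near_ids and far_ids:  # a degenerate split stays a leaf
            self.children = [Node(data, near_ids, center, leaf_size),
                             Node(data, far_ids, far, leaf_size)]


def knn(root, data, query, k):
    """Exact k nearest neighbors as [(distance, id)], closest first."""
    best = []  # max-heap of the k best so far, stored as (-distance, id)

    def visit(node):
        d_center = angular_distance(query, node.center)
        # Triangle inequality: every point p in this subtree satisfies
        # d(query, p) >= d_center - radius, so the branch can be skipped.
        if len(best) == k and d_center - node.radius > -best[0][0]:
            return
        if not node.children:
            for i in node.ids:
                d = angular_distance(query, data[i])
                if len(best) < k:
                    heapq.heappush(best, (-d, i))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, i))
            return
        # Visit the closer child first so the bound tightens early.
        for child in sorted(node.children,
                            key=lambda c: angular_distance(query, c.center)):
            visit(child)

    visit(root)
    return sorted((-neg_d, i) for neg_d, i in best)
```

With `data = random_vectors(300, 8)`, the tree is built with `Node(data, list(range(len(data))), data[0])` and queried with `knn(root, data, query, 5)`. An approximate variant could loosen the pruning test by a chosen slack factor, trading exactness for fewer distance computations.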

Cheers,
John

I find the article overgeneralized. It uses words like “robust”: what does this mean in this context?

It uses words like “fast”: what does this mean? 1K, 4K, or 100K transactions per second on a Core i7 with 32 GB of RAM? Such word use is an indication that this article was at least edited by a marketing or sales person, not an engineer. The information about similarity indices is useful as far as it goes, but again no performance data is provided, and all managers and engineers need that data to make comparisons.
In this post you talk about real time. In video that is 30 frames per second; in medical imaging the term has come to mean 60 frames per second or 60 fields per second. But there is nothing to indicate what real time means in this context, nor is there any indication of latency.
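
As an illustration of the kind of numbers being asked for, here is a minimal sketch of a harness that reports throughput and latency percentiles for any query function. The name `benchmark` and its parameters are hypothetical, it measures single-threaded, in-process calls only, and results are meaningful only alongside the hardware and dataset they were measured on.

```python
import time


def benchmark(query_fn, queries, warmup=10):
    """Run query_fn over queries; report queries/second and latency in ms."""
    for q in queries[:warmup]:  # warm caches before timing
        query_fn(q)
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        query_fn(q)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()

    def percentile(p):
        # Nearest-rank style lookup into the sorted latencies.
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    return {
        "qps": len(queries) / elapsed,
        "p50_ms": percentile(0.50) * 1e3,
        "p99_ms": percentile(0.99) * 1e3,
    }
```

Calling `benchmark(search, queries)` with a real search function and a representative query set would give figures that can be compared across databases on the same machine.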
Overall, reading this post provided only a modest amount of new information and nothing that allowed me to make a decision for or against this database. So my default position is not to pursue this database but to move on to better-known databases like PostgreSQL, MongoDB, MariaDB, etc., which would allow me to provide measurements to my managers.
Time and again I exhort my padawans, my younglings, the people I am mentoring: “Engineers quantify, marketers generally qualify.” Just read an ISP brochure with words like “up to” and “best” and other words that have no meaning and provide no assurance of quality of service, uptime, or percentage downtime.
My best to you and your loved ones