How much space to leave for metadata?

cschmidt · March 11, 2022, 8:15pm

I’m about to upsert 66.1 million vectors of dimension 100 into a new index. That works out to about 25GB of raw vector data, which should need a minimum of 2 S1 pods (20GB of SSD each).
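The sizing arithmetic above can be checked with a short sketch. It assumes each vector component is a 4-byte float32 (the post does not state the element type) and counts raw vector values only, ignoring any index overhead or metadata. Under those assumptions the raw data comes to about 26.4 GB (about 24.6 GiB), in line with the 25GB quoted.

```python
import math

NUM_VECTORS = 66_100_000          # from the post
DIMENSION = 100                   # from the post
BYTES_PER_VALUE = 4               # assumption: float32 components
POD_CAPACITY_BYTES = 20 * 10**9   # 20GB SSD per S1 pod, as quoted in the post

# Raw size of the vector values alone (no index structures, no metadata).
raw_bytes = NUM_VECTORS * DIMENSION * BYTES_PER_VALUE

# Lower bound on pods if only raw vector bytes had to fit on disk.
min_pods_by_raw_size = math.ceil(raw_bytes / POD_CAPACITY_BYTES)

print(f"raw vector data: {raw_bytes / 10**9:.2f} GB ({raw_bytes / 2**30:.2f} GiB)")
print(f"pods needed for raw data alone: {min_pods_by_raw_size}")
```

This is only a lower bound: it says nothing about how much room the index itself or the metadata needs.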

I would also like to add two pieces of metadata to use in filtering. One is just a string with two possible values: “products” and “homepage”. The second is a list of country codes, which has from 1 to 245 values for a given vector. There are 1,981,977,906 country entries overall, for an average of 30 countries per vector. Each country code is a two-letter string. So a worst-case vector might have an entry like this:

[“AD”, “AE”, “AF”, “AG”, “AI”, “AL”, “AM”, “AN”, “AO”, “AR”, “AT”, “AU”, “AW”, “AX”, “AZ”, “BA”, “BB”, “BD”, “BE”, “BF”, “BG”, “BH”, “BI”, “BJ”, “BL”, “BM”, “BN”, “BO”, “BQ”, “BR”, “BS”, “BT”, “BV”, “BW”, “BY”, “BZ”, “CA”, “CC”, “CD”, “CF”, “CG”, “CH”, “CI”, “CK”, “CL”, “CM”, “CN”, “CO”, “CR”, “CU”, “CV”, “CW”, “CX”, “CY”, “CZ”, “DE”, “DJ”, “DK”, “DM”, “DO”, “DZ”, “EC”, “EE”, “EG”, “EH”, “ER”, “ES”, “ET”, “FI”, “FJ”, “FK”, “FO”, “FR”, “GA”, “GB”, “GD”, “GE”, “GF”, “GG”, “GH”, “GI”, “GL”, “GM”, “GN”, “GP”, “GQ”, “GR”, “GS”, “GT”, “GW”, “GY”, “HK”, “HM”, “HN”, “HR”, “HT”, “HU”, “ID”, “IE”, “IL”, “IM”, “IN”, “IO”, “IQ”, “IR”, “IS”, “IT”, “JE”, “JM”, “JO”, “JP”, “KE”, “KG”, “KH”, “KI”, “KM”, “KN”, “KP”, “KR”, “KW”, “KY”, “KZ”, “LA”, “LB”, “LC”, “LI”, “LK”, “LR”, “LS”, “LT”, “LU”, “LV”, “LY”, “MA”, “MC”, “MD”, “ME”, “MF”, “MG”, “MK”, “ML”, “MM”, “MN”, “MO”, “MQ”, “MR”, “MS”, “MT”, “MU”, “MV”, “MW”, “MX”, “MY”, “MZ”, “NA”, “NC”, “NE”, “NF”, “NG”, “NI”, “NL”, “NO”, “NP”, “NR”, “NU”, “NZ”, “OM”, “PA”, “PE”, “PF”, “PG”, “PH”, “PK”, “PL”, “PM”, “PN”, “PS”, “PT”, “PY”, “QA”, “RE”, “RO”, “RS”, “RU”, “RW”, “SA”, “SB”, “SC”, “SD”, “SE”, “SG”, “SH”, “SI”, “SJ”, “SK”, “SL”, “SM”, “SN”, “SO”, “SR”, “SS”, “ST”, “SV”, “SX”, “SY”, “SZ”, “TC”, “TD”, “TF”, “TG”, “TH”, “TJ”, “TK”, “TL”, “TM”, “TN”, “TO”, “TR”, “TT”, “TV”, “TW”, “TZ”, “UA”, “UG”, “UM”, “US”, “UY”, “UZ”, “VA”, “VC”, “VE”, “VG”, “VN”, “VU”, “WF”, “WS”, “XK”, “YE”, “YT”, “ZA”, “ZM”, “ZW”]

That string is 1452 bytes, but obviously it is stored in a different form internally.
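As a rough check on these metadata figures, the sketch below computes the average list length and the size of the country codes under two simple representations: a JSON-style string (as in the example above) and the bare two-letter codes. The helper `json_list_size` is hypothetical, and neither representation is claimed to be how the data is stored internally.

```python
import json

NUM_VECTORS = 66_100_000               # from the post
TOTAL_COUNTRY_ENTRIES = 1_981_977_906  # from the post


def json_list_size(num_codes: int) -> int:
    """Byte length of a JSON list holding `num_codes` two-letter ASCII codes."""
    codes = ["XX"] * num_codes  # placeholder codes; only the length matters
    return len(json.dumps(codes).encode("utf-8"))


# Average number of country codes attached to each vector.
avg_codes = TOTAL_COUNTRY_ENTRIES / NUM_VECTORS

# Bare payload: two ASCII bytes per code, with no delimiters or indexing.
raw_code_bytes = TOTAL_COUNTRY_ENTRIES * 2

print(f"average codes per vector: {avg_codes:.2f}")
print(f"JSON size for 242 codes: {json_list_size(242)} bytes")
print(f"JSON size for 245 codes: {json_list_size(245)} bytes")
print(f"bare country-code payload: {raw_code_bytes / 10**9:.2f} GB")
```

With this formatting each code costs 6 bytes, so 1452 bytes corresponds to a list of 242 codes, and a full 245-code list would be 1470 bytes. The bare payload across all vectors is just under 4 GB.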

Questions:

1. Is this too many categories to have for the list metadata? Will a filter looking for a particular country be efficient in this case?
2. How much extra storage will this require? It depends a lot on how this data is indexed.
3. How full should I make my pods? I believe latency will go down the more pods we have. Are there any numbers about how much performance varies with the number of pods? (For example, does doubling the number of pods cut the latency in half? More? Less?)

Best,
Craig

dave · March 14, 2022, 5:30pm

Craig,

Sorry to hear you ran into this. This is Dave from Product at Pinecone. I believe you’re running into two issues:

1. Our online pricing calculator is most accurate for vectors with roughly 500 to 2000 dimensions. We will update it to handle larger and smaller dimensions soon.
2. Our metadata is indexed to allow filtering. A large number of unique field values creates a large index that consumes a large amount of memory.

CORRECTION: #2 should not be a problem in your case since you only have 100+ distinct countries. The fact that each vector has a variable-length list of up to 245 such values is less of an issue.
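To illustrate the distinction between the number of unique values and the length of each list, here is a toy inverted index in Python. This is a generic sketch and not Pinecone's implementation; the function `build_inverted_index` and the sample records are made up for illustration. The number of index keys is bounded by the number of distinct country codes, however many codes any one vector carries.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each metadata value to the set of vector ids that carry it."""
    index = defaultdict(set)
    for vector_id, countries in records.items():
        for code in countries:
            index[code].add(vector_id)
    return index


# Hypothetical sample data: vector id -> list of country codes.
records = {
    "vec-1": ["US", "CA"],
    "vec-2": ["US", "DE", "FR"],
    "vec-3": ["DE"],
}

index = build_inverted_index(records)

num_keys = len(index)                                   # distinct values
num_postings = sum(len(ids) for ids in index.values())  # total list entries
us_matches = sorted(index["US"])                        # filter: country == "US"

print(num_keys, num_postings, us_matches)
```

In a structure like this, a filter on one country is a single key lookup, and the key count stays small when the field has few distinct values.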

So I suspect it’s the number of vectors you’re storing per pod. I suggest trying 14 S1 pods.
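For a sense of scale, this short sketch compares vectors per pod at the two pod counts mentioned in this thread. It assumes vectors are spread evenly across pods, which is an assumption made here for illustration; the helper `vectors_per_pod` is hypothetical.

```python
NUM_VECTORS = 66_100_000  # from the original post


def vectors_per_pod(num_pods: int) -> float:
    """Vectors per pod, assuming an even split across pods."""
    return NUM_VECTORS / num_pods


for pods in (2, 14):
    print(f"{pods} pods: about {vectors_per_pod(pods):,.0f} vectors per pod")
```

Under that assumption, 2 pods would hold about 33 million vectors each, while 14 pods would hold about 4.7 million each.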