I’m about to upsert 66.1 million vectors of dimension 100 into a new index. That works out to roughly 25GB of vector data, so I expect to need at least 2 S1 pods (20GB of SSD each).
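For reference, here is the back-of-the-envelope arithmetic behind that estimate, as a small sketch. It assumes each vector component is a 4-byte float32 and counts only the raw vector data, with no metadata or index overhead, so it is a lower bound. The function names are just for illustration.

```python
import math

NUM_VECTORS = 66_100_000
DIMENSION = 100
BYTES_PER_VALUE = 4       # assumption: float32 components
POD_CAPACITY_GB = 20      # S1 pod SSD size quoted in the post


def raw_vector_gb(num_vectors, dimension, bytes_per_value=BYTES_PER_VALUE):
    """Raw vector payload in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dimension * bytes_per_value / 1e9


def min_pods(data_gb, pod_capacity_gb=POD_CAPACITY_GB):
    """Smallest pod count whose combined capacity covers data_gb."""
    return math.ceil(data_gb / pod_capacity_gb)


data_gb = raw_vector_gb(NUM_VECTORS, DIMENSION)
data_gib = data_gb * 1e9 / 2**30

print(f"raw vectors: {data_gb:.2f} GB ({data_gib:.2f} GiB)")
print(f"minimum S1 pods for raw vectors alone: {min_pods(data_gb)}")
```

This gives about 26.4 GB (about 24.6 GiB), which is where the roughly 25GB figure and the 2-pod minimum come from.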
I would also like to add two metadata fields to use in filtering. The first is a string with two possible values: “products” and “homepage”. The second is a list of country codes, containing anywhere from 1 to 245 values per vector. There are 1,981,977,906 country entries overall, for an average of about 30 countries per vector. Each country code is a two-letter string. A worst-case vector might have an entry like this:
[“AD”, “AE”, “AF”, “AG”, “AI”, “AL”, “AM”, “AN”, “AO”, “AR”, “AT”, “AU”, “AW”, “AX”, “AZ”, “BA”, “BB”, “BD”, “BE”, “BF”, “BG”, “BH”, “BI”, “BJ”, “BL”, “BM”, “BN”, “BO”, “BQ”, “BR”, “BS”, “BT”, “BV”, “BW”, “BY”, “BZ”, “CA”, “CC”, “CD”, “CF”, “CG”, “CH”, “CI”, “CK”, “CL”, “CM”, “CN”, “CO”, “CR”, “CU”, “CV”, “CW”, “CX”, “CY”, “CZ”, “DE”, “DJ”, “DK”, “DM”, “DO”, “DZ”, “EC”, “EE”, “EG”, “EH”, “ER”, “ES”, “ET”, “FI”, “FJ”, “FK”, “FO”, “FR”, “GA”, “GB”, “GD”, “GE”, “GF”, “GG”, “GH”, “GI”, “GL”, “GM”, “GN”, “GP”, “GQ”, “GR”, “GS”, “GT”, “GW”, “GY”, “HK”, “HM”, “HN”, “HR”, “HT”, “HU”, “ID”, “IE”, “IL”, “IM”, “IN”, “IO”, “IQ”, “IR”, “IS”, “IT”, “JE”, “JM”, “JO”, “JP”, “KE”, “KG”, “KH”, “KI”, “KM”, “KN”, “KP”, “KR”, “KW”, “KY”, “KZ”, “LA”, “LB”, “LC”, “LI”, “LK”, “LR”, “LS”, “LT”, “LU”, “LV”, “LY”, “MA”, “MC”, “MD”, “ME”, “MF”, “MG”, “MK”, “ML”, “MM”, “MN”, “MO”, “MQ”, “MR”, “MS”, “MT”, “MU”, “MV”, “MW”, “MX”, “MY”, “MZ”, “NA”, “NC”, “NE”, “NF”, “NG”, “NI”, “NL”, “NO”, “NP”, “NR”, “NU”, “NZ”, “OM”, “PA”, “PE”, “PF”, “PG”, “PH”, “PK”, “PL”, “PM”, “PN”, “PS”, “PT”, “PY”, “QA”, “RE”, “RO”, “RS”, “RU”, “RW”, “SA”, “SB”, “SC”, “SD”, “SE”, “SG”, “SH”, “SI”, “SJ”, “SK”, “SL”, “SM”, “SN”, “SO”, “SR”, “SS”, “ST”, “SV”, “SX”, “SY”, “SZ”, “TC”, “TD”, “TF”, “TG”, “TH”, “TJ”, “TK”, “TL”, “TM”, “TN”, “TO”, “TR”, “TT”, “TV”, “TW”, “TZ”, “UA”, “UG”, “UM”, “US”, “UY”, “UZ”, “VA”, “VC”, “VE”, “VG”, “VN”, “VU”, “WF”, “WS”, “XK”, “YE”, “YT”, “ZA”, “ZM”, “ZW”]
That string is 1452 bytes as written, but presumably it is stored in a different form internally.
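To put rough bounds on the country metadata, here is a quick sketch of the arithmetic. It assumes 2 bytes per code for the raw payload and uses Python's default `json.dumps` formatting for the serialized form. Neither is necessarily how the index stores the data, so treat the results as estimates only.

```python
import json

NUM_VECTORS = 66_100_000
TOTAL_COUNTRY_ENTRIES = 1_981_977_906
BYTES_PER_CODE = 2        # two ASCII letters per country code

# Average list length per vector.
avg_codes = TOTAL_COUNTRY_ENTRIES / NUM_VECTORS

# Raw payload if every code were stored as 2 bytes with no overhead.
raw_code_gb = TOTAL_COUNTRY_ENTRIES * BYTES_PER_CODE / 1e9


def json_list_bytes(codes):
    """Size in bytes of a code list serialized as JSON text (UTF-8)."""
    return len(json.dumps(codes).encode("utf-8"))


# With default separators, each code costs 6 bytes: 2 letters, 2 quotes,
# and either ", " between items or the two brackets at the ends.
json_gb = TOTAL_COUNTRY_ENTRIES * 6 / 1e9

print(f"average codes per vector: {avg_codes:.2f}")
print(f"raw codes only: {raw_code_gb:.2f} GB")
print(f"stored as JSON text: {json_gb:.2f} GB")
```

Under those assumptions the country lists add somewhere between about 4 GB (bare codes) and about 12 GB (JSON text) on top of the vectors. At 6 bytes per code, the 1452-byte string above corresponds to 1452 / 6 = 242 codes.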
Questions:
- Is this too many categories to have for the list metadata? Will a filter looking for a particular country be efficient in this case?
- How much extra storage will this require? It depends a lot on how this data is indexed.
- How full should I make my pods? I believe latency will go down the more pods we have. Are there any numbers about how much performance varies with the number of pods? (For example, does doubling the number of pods cut the latency in half? More? Less?)
Best,
Craig