Fetching/listing millions of records with metadata filters

Hey, I have a namespace with a few million vectors with inconsistent metadata. For instance, about a third of the records don’t have a certain field (say, ‘collection’).

Here’s what I want to do:
I want to get the IDs of all records where the collection field exists.

So far I’ve tried this workaround snippet, based on one of the answers: tylers get all records answer
(which either takes too long or quite possibly doesn’t work):

# Get all the Ids present in the index with collections
def get_ids_from_query(index,input_vector, all_ids):
  print("searching pinecone...")
  results = index.query(vector=input_vector, namespace = 'allproducts', top_k=10000,include_values=False, filter = {"collection": { "$in": collection_list }})
  ids = set()
  print(type(results))
  for result in results['matches']:
    ids.add(result['id'])
  return ids

def get_all_ids_from_index(index, num_dimensions, namespace=""):
  all_ids = set()
  while True:
      print("\n Length of ids list is shorter than the number of total vectors...")
      input_vector = np.random.rand(num_dimensions).tolist()
      print("creating random vector...")
      ids = get_ids_from_query(index, input_vector, all_ids)
      print("getting ids from a vector query...")
      all_ids.update(ids)
      print("updating ids set...")
      print(f"Collected {len(all_ids)} ids.")
      if not ids:  # If no IDs are returned, we've collected all vectors with a collection field
          break

  return all_ids

all_ids = get_all_ids_from_index(index, num_dimensions=512, namespace="allproducts")
print(all_ids)

Is there any way for me to achieve this?

I looked at the Index.list() method. It would be awesome if we could add filters there too, or alternatively add an “$exists” clause similar to what’s present in MongoDB, as suggested here.

Hi @AusafMo, thanks for the post and your feedback. I encourage you to add a post to the Feature Requests community section so that other users can add votes. In the meantime, there’s a potential workaround we can apply to retrieve all the records with a given metadata field.

Assuming you are using a serverless index and thus can leverage the list operation, I encourage you to try the following approach:

from pinecone import Pinecone

# get all the ids in an index
def get_all_ids(index):
    # index.list() yields the ids one page (a list of ids) at a time
    ids_list = []
    for page in index.list():
        ids_list.extend(page)
    return ids_list

# get all the records from a list of ids
def get_all_records(index, ids):
    records = []
    for i in range(0, len(ids), 1000):
        res = index.fetch(ids[i:i+1000])
        for record in res['vectors'].values():
            records.append(record)
    return records

# check records for existence of a certain metadata field
def check_for_metadata_field(records, field_name):
    matching_ids = []
    for rec in records:
        try:
            if field_name in rec['metadata']:
                matching_ids.append(rec['id'])
        except Exception:
            # record has no metadata at all, so the field cannot be present
            pass
    return matching_ids

# putting it all together
def scan_index_for_metadata_field(index, field_name):
    ids = get_all_ids(index)
    records = get_all_records(index, ids)
    matching_ids = check_for_metadata_field(records, field_name)
    return matching_ids

# execution
pc = Pinecone(api_key='YOUR_API_KEY')

index = pc.Index('<INDEX_NAME>')

matching_ids = scan_index_for_metadata_field(index, '<FIELD_NAME>')

Please let me know if you have any questions or issues!

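Regarding the approach you shared in your post: that code will run forever. The loop only exits when a query returns no matches, but every query with a random vector returns up to top_k matches as long as the index contains records that pass the filter, so the exit condition is never met.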
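
If you do want to keep a random-query loop, one possible stopping rule is to quit once several consecutive rounds add no new IDs. This is only a rough sketch: `query_ids` is a hypothetical stand-in for a function that runs one random-vector query and returns the set of matching IDs, and `patience` and `max_rounds` are illustrative parameters, not Pinecone settings. Random sampling still cannot guarantee that every record is found.

```python
from typing import Callable, Set


def collect_ids(query_ids: Callable[[], Set[str]],
                patience: int = 5,
                max_rounds: int = 1000) -> Set[str]:
    """Call query_ids repeatedly until `patience` rounds in a row add nothing new."""
    all_ids: Set[str] = set()
    stale_rounds = 0
    for _ in range(max_rounds):
        # Keep only the IDs we have not seen in earlier rounds.
        new_ids = query_ids() - all_ids
        if new_ids:
            all_ids |= new_ids
            stale_rounds = 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                break
    return all_ids
```

The key difference from the original loop is that the exit test looks at whether a round found anything new, not at whether the query returned anything at all.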


Hey @zeke_pinecone, thanks for the reply.

I had already tried something similar as soon as I found out about index.list(). The issue is that it caps out at 100 IDs per page (if I understood this correctly), which takes far too long for north of 5 million records, given that I have to do the filtering after retrieval too.
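
For scale: at 100 IDs per page, 5 million records means roughly 50,000 list calls before any fetching happens. Below is a sketch of a variant that filters each page as it arrives, so that neither the full ID list nor all the records need to sit in memory at once. It assumes, as in the answer above, that `index.list()` yields pages of IDs and that `index.fetch()` returns a mapping under 'vectors'. `ids_with_field` is a hypothetical helper name, and this does not reduce the number of requests, only the memory footprint.

```python
from typing import Any, Iterator, Optional


def ids_with_field(index: Any, field_name: str,
                   namespace: Optional[str] = None) -> Iterator[str]:
    """Yield IDs of records whose metadata contains field_name, one page at a time."""
    for page in index.list(namespace=namespace):
        page_ids = list(page)
        if not page_ids:
            continue
        # Fetch only the records on this page, then discard them after checking.
        res = index.fetch(ids=page_ids, namespace=namespace)
        for rec_id, rec in res["vectors"].items():
            try:
                has_field = field_name in rec["metadata"]
            except Exception:
                # Metadata is missing or empty, so the field cannot be present.
                has_field = False
            if has_field:
                yield rec_id
```

Usage would look something like `matching_ids = list(ids_with_field(index, "collection", namespace="allproducts"))`, or iterate over the generator directly to process IDs as they come in.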

Anyway, we’ve taken the route of making the data consistent, so hopefully we won’t need a workaround in the future.
I’ll raise a feature request for sure.

Feature Request : List with Metadata field filtering

