Checking if an ID exists in Pinecone vector database and retrieving the IDs only

gopalika.sharma · October 1, 2024, 1:17pm

Hi there! I am trying to get the IDs in my Pinecone index as a list so I can compare them against the IDs in my database before upserting new rows to the index. This is to avoid duplication: if a given ID already exists in the index, we will not upsert that row from the database.
This is what I tried, but was a bust:

def get_existing_ids_from_pinecone(pinecone_namespace: str) -> list:
    """
    Retrieve all indexed IDs from the specified Pinecone namespace.

    :param pinecone_namespace: The Pinecone namespace to retrieve indexed data from.
    :return: List of indexed ID values.
    """
    pinecone_project_name = os.environ['PINECONE_PROJECT_NAME']
    pinecone_index_name = os.environ['PINECONE_INDEX_NAME']
    
    
    index = pinecone_obj.Index(pinecone_index_name)

    # Use Pinecone's query functionality to get all vectors and their metadata
    ids_in_pinecone = []
    fetch_limit = 1000  # Adjust this limit based on your dataset size and performance
    last_retrieved = None  # Used to track pagination

    LOGGER.debug("Fetching existing indexed IDs from Pinecone.")

    while True:
        # Fetch vectors in batches using pagination to avoid overloading the system
        if last_retrieved:
            response = index.query(
                namespace=pinecone_namespace,
                vector=[0] * 3072,  # Use an arbitrary zero vector to query without filtering
                top_k=fetch_limit,
                include_metadata=True
            )
        else:
            response = index.query(
                namespace=pinecone_namespace,
                vector=[0] * 3072,  # Use an arbitrary zero vector to query without filtering
                top_k=fetch_limit,
                include_metadata=True
            )

        # Process the response and extract IDs
        matches = response.get("matches", [])
        if not matches:
            break  # Stop if no more data is available

        for match in matches:
            id_unique = match.get("id")
            if id_unique:
                ids_in_pinecone.append(id_unique)

        # Update last_retrieved for the next iteration (pagination)
        if len(matches) < fetch_limit:
            break  # Stop if fewer than fetch_limit records are returned, meaning we're done

        last_retrieved = matches[-1]["id"]

    LOGGER.debug(f"Number of IDs fetched from Pinecone: {len(ids_in_pinecone)}.")

    return ids_in_pinecone

This doesn’t work: it just hangs or times out if I increase fetch_limit. This is a serverless index, btw, with 3072 dimensions. I also want to mention that I was testing with 10K vectors, but the real dataset will have about 60K-70K, and speed is a big issue in this scenario. Being able to fetch all the IDs in a short time frame is important for our application.

The values highlighted in yellow in the image below show how the IDs look in our vector database after embedding and indexing; they are stored as strings under ‘id’.

I did see a lot of posts on retrieving a list of IDs in an index, but it's unclear how to apply them to my use case.
We only embed and index the rows which are not already embedded! Any help is appreciated. Let me know if you need more clarification or context!
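
For the "IDs only" part of the question, one possible approach is to page through the IDs directly instead of running a similarity query. The following is a minimal sketch, assuming a recent Pinecone Python client in which `Index.list(namespace=...)` is available for serverless indexes and yields one batch of ID strings per page. The helper names `list_existing_ids` and `rows_missing_from_index`, and the `"id"` key on each database row, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set


def list_existing_ids(index, namespace: str) -> Set[str]:
    """Collect every vector ID in a namespace by paging through index.list()."""
    existing: Set[str] = set()
    # Assumption: index.list() is a generator that yields one list of ID
    # strings per page and follows the pagination token internally.
    for id_batch in index.list(namespace=namespace):
        existing.update(id_batch)
    return existing


def rows_missing_from_index(rows: Iterable[Dict], existing_ids: Set[str]) -> List[Dict]:
    """Keep only the database rows whose 'id' is not already in the index."""
    # IDs are stored as strings in the index, so compare as strings.
    return [row for row in rows if str(row["id"]) not in existing_ids]
```

With the names from the post, this might be called as `existing = list_existing_ids(pinecone_obj.Index(pinecone_index_name), pinecone_namespace)`, followed by `rows_missing_from_index(db_rows, existing)`, where `db_rows` stands in for whatever the database query returns. Holding the IDs in a set keeps each membership check constant-time, which matters at 60K-70K IDs.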

gopalika.sharma · October 1, 2024, 8:47pm

I figured it out! Thank you, this can be marked done!
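
The thread does not record which fix was used. For readers who only need to check whether a known set of candidate IDs exists, a lookup by ID is one option. This is a minimal sketch and not necessarily what the poster ended up doing. It assumes the Pinecone Python client's `Index.fetch(ids=..., namespace=...)`, whose response exposes a `vectors` mapping keyed by the IDs that were found. The helper name `find_existing_ids` and the batch size of 100 are assumptions for illustration.

```python
from typing import Sequence, Set


def find_existing_ids(index, candidate_ids: Sequence[str], namespace: str,
                      batch_size: int = 100) -> Set[str]:
    """Return the subset of candidate_ids that is already present in the index."""
    found: Set[str] = set()
    for start in range(0, len(candidate_ids), batch_size):
        batch = list(candidate_ids[start:start + batch_size])
        response = index.fetch(ids=batch, namespace=namespace)
        # IDs that are not in the index are simply absent from response.vectors.
        found.update(response.vectors.keys())
    return found
```

Note that a fetch also returns the vector values, so with 3072 dimensions each response is considerably heavier than an ID-only listing. This sketch fits best when the candidate set is small relative to the index.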