Inserting data as vectors from SQL Database

I have a profiles table in SQL with around 50 columns, and only 244 rows.
I have created a view with only 2 columns, ID and content, and in content I concatenated all the data from the other columns in a format like this:
FirstName: John. LastName: Smith. Age: 70, Likes: Gardening, Painting. Dislikes: Soccer.

Then I created the following code to index all content from the view into Pinecone, and it works so far. However, I noticed something strange.

  1. There are over 2,000 vectors and it still has not finished. The first iterations were really fast, but now each iteration takes over 18 seconds, and it says it will take over 40 minutes to finish upserting.

What am I doing wrong? Or is it normal?

    import pandas as pd
    import pinecone
    import pyodbc
    import streamlit as st
    from uuid import uuid4
    from stqdm import stqdm
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # PINECONE_*, OPENAI_*, DATABASE_* constants and tiktoken_len are defined elsewhere
    pinecone.init(
        api_key=PINECONE_API_KEY,  # find at app.pinecone.io
        environment=PINECONE_ENV  # next to api key in console
    )

    st.title('Work in progress')
    embed = OpenAIEmbeddings(deployment=OPENAI_EMBEDDING_DEPLOYMENT_NAME, model=OPENAI_EMBEDDING_MODEL_NAME, chunk_size=1)

    cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+DATABASE_SERVER+'.database.windows.net;DATABASE='+DATABASE_DB+';UID='+DATABASE_USERNAME+';PWD='+ DATABASE_PASSWORD)
    query = "SELECT * from views.vwprofiles2;"
    df = pd.read_sql(query, cnxn)
    index = pinecone.Index("default")

    batch_limit = 100

    texts = []
    metadatas = []

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=20,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
    )

    for _, record in stqdm(df.iterrows(), total=len(df)):
        # First get metadata fields for this record
        metadata = {
            'IdentityId': str(record['IdentityId'])
        }
        # Now we create chunks from the record text
        record_texts = text_splitter.split_text(record['content'])
        # Create individual metadata dicts for each chunk
        record_metadatas = [{
            "chunk": j, "text": text, **metadata
        } for j, text in enumerate(record_texts)]
        # Append these to the current batches
        texts.extend(record_texts)
        metadatas.extend(record_metadatas)
        # If we have reached the batch_limit, we can add texts
        if len(texts) >= batch_limit:
            ids = [str(uuid4()) for _ in range(len(texts))]
            embeds = embed.embed_documents(texts)
            index.upsert(vectors=zip(ids, embeds, metadatas))
            texts = []
            metadatas = []

    # Flush whatever is left over once the loop has finished
    if len(texts) > 0:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))

Hi @levalencia

I see you are using the OpenAI embeddings. Is it possible that the embedding generation is slowing you down because of the OpenAI rate limit? Otherwise, looking at the code quickly, you are following the recommended best practices from the documentation.

Best

Thanks for the quick reply.

What I noticed is that I have 244 rows, but I got almost 8,000 vectors.
The content field in my database can vary from 1,500 characters up to 5,000 characters.

When I checked the index contents, I noticed that a single vector does not hold an entire row of information; each row is split into many vectors, and then the query results are really bad.

So I tried the following approach, which embeds the entire content at once, but this takes something like 20 minutes for each iteration:

    texts = df['content'].tolist()
    for i in stqdm(range(len(texts))):
        rowcontent = texts[i]
        metadata = {
            'IdentityId': str(i)
        }
        #embeds = openai.Embedding.create(input = rowcontent, model=OPENAI_EMBEDDING_MODEL_NAME)['data'][0]['embedding']
        # Note: embed_documents expects a list of strings; a bare string like this
        # is iterated character by character, producing one embedding per character
        embeds = embed.embed_documents(rowcontent)
        index.upsert(vectors=zip([str(uuid4())], embeds, [metadata]))

Yeah, you were not upserting whole rows but chunks of them. You were splitting each row into 400-token chunks with some overlap, so your 244 rows grew into thousands of vectors.

Splitter you used:

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=20,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
    )
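
A quick way to confirm that, reusing the df and text_splitter from your first snippet, is to count the chunks before embedding anything:

    # Count how many chunks the splitter produces per row before upserting anything
    chunk_counts = [len(text_splitter.split_text(content)) for content in df['content']]
    print(f"rows: {len(chunk_counts)}, total chunks/vectors: {sum(chunk_counts)}")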

The code you provided now looks better to me if you want to upsert whole rows. 20 min per iteration is… waaaay too much. I would suggest you take a look at what the OpenAI embeddings look like when they are returned, and check whether they hold multiple embeddings or just one.

Maybe what you could also do is create a dummy vector, metadata and id, and upsert them without creating embeddings, to see how long that takes. Then you can see which part takes the longest and continue solving the issue from there. (I usually put timers and stopwatches everywhere to see how long each part of the process takes.)
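
A minimal sketch of that kind of timing check, assuming the index and embed objects from the first snippet and 1536-dimensional ada-002 embeddings:

    import time
    from uuid import uuid4

    # Time an upsert of dummy vectors (no embedding calls involved)
    dummy = [(str(uuid4()), [0.1] * 1536, {"chunk": 0, "text": "dummy"}) for _ in range(100)]
    start = time.time()
    index.upsert(vectors=dummy)
    print(f"upsert of 100 dummy vectors: {time.time() - start:.2f}s")

    # Time the embedding call on its own
    start = time.time()
    embed.embed_documents(["some sample profile text"] * 100)
    print(f"embedding 100 texts: {time.time() - start:.2f}s")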

Hope you solve the issue!

Hi! @levalencia

I measured the time it takes to upsert in my environment. Here are the conditions and results.

Environment used:

  • Used pinecone.GRPCIndex’s upsert
  • The upsert was performed with a batch size of 200
  • Test run in my local environment (Japan)
  • According to Google’s speed test, download speed is 77Mbps and upload speed is 5Mbps

About the data to upsert:

  • Data file size (in CSV format) is about 369MB
  • The total number of tokens to upsert is 1,147,909, and the total number of vectors is 9,800

Results under the above conditions:

  • The time it took to complete the upsert was about 15 minutes. Please note that this time only accounts for the execution of index.upsert(vectors=batch, namespace='content'), and does not include the time to generate the vectors.
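
For reference, a batched GRPC upsert like the one timed above looks roughly like this (the index name and the vectors list of (id, values, metadata) tuples are placeholders, not the actual test data):

    import pinecone

    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
    index = pinecone.GRPCIndex("default")  # requires the pinecone-client[grpc] extra

    batch_size = 200
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch, namespace='content')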

We hope these results are helpful.


As people have mentioned, it could be an OpenAI issue, so why not try a local embedding model and see if that works? As others have noted, your dataset isn't that large, so it shouldn't take this long to upsert, especially since you are following best practices and batching the data.

Some snippets of code that might be helpful:

# Local encoder (imports added for completeness; re is used in create_sections below)
import re
import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# Create the index if it doesn't exist.  Of note: use the encoder get_sentence_embedding_dimension() to create the vector dimension!
def create_search_index():
    if args.verbose: print(f"Ensuring search index {args.index} exists")

    if args.index not in pinecone.list_indexes():
        pinecone.create_index(
            name=args.index,
            dimension=encoder.get_sentence_embedding_dimension(),
            metric='cosine'
        )
    else:
        if args.verbose: print(f"Search index {args.index} already exists")


# Pretty much what you're doing - chunking.  just use the local encoder instead of OpenAI.
def create_sections(filename, page_map):
    for i, (section, pagenum) in enumerate(split_text(page_map)):
        yield (re.sub("[^0-9a-zA-Z_-]","_",f"{filename}-{i}"), 
            encoder.encode(section).tolist(),
            {
            "id": re.sub("[^0-9a-zA-Z_-]","_",f"{filename}-{i}"),
            "content": section,
            "category": args.category,
            "sourcepage": blob_name_from_file_page(filename, pagenum),
            "sourcefile": filename
            })
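
To actually get those tuples into Pinecone, you could consume the generator in batches, roughly like this (the filename and batch size are just examples, page_map is whatever page mapping you already have, and index is an existing Pinecone index handle):

from itertools import islice

sections = create_sections("profiles.pdf", page_map)
while True:
    batch = list(islice(sections, 100))
    if not batch:
        break
    index.upsert(vectors=batch)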

A vector database is fantastic for ranking similarities in data. A classic example is a large corpus of text: for instance, let’s say you have several Wikipedia articles on wild animals. Embedding that text allows the user to ask a question about a “monkey” and receive results with high cosine similarity like “ape”, “gorilla”, “chimpanzee”, etc.

While creating a blob of your data and putting it into Pinecone may be a novel exercise, if you already have your data in a normalized, parsed columnar format, you may find your search yields better results with simple SQL statements (select * from table where likes = ‘Gardening’). If you have a column called “Biography” where the person has entered paragraphs of textual data about themselves, that would be a strong candidate for a vector database.

Thanks a lot, Sean, for your reply; the last paragraph made a lot of sense to me.

However, the customer wants to be able to chat with their database. In my case it's a profile table with lots of columns, and only one of them (About me) may contain actual long text; the other fields will be short, for example: Favorite Food, Favorite Music, Favorite Movie, etc.

In LangChain there is something called the SQLDatabaseToolkit, which, together with OpenAI, can translate English questions to SQL, and the results are great. However, this operation can take up to 15 seconds, so the chatbot is very slow.
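
The LangChain setup being described is roughly the following sketch (connection string and model name are placeholders):

    from langchain.agents import create_sql_agent
    from langchain.agents.agent_toolkits import SQLDatabaseToolkit
    from langchain.chat_models import ChatOpenAI
    from langchain.sql_database import SQLDatabase

    # Placeholder connection string for the Azure SQL database
    db = SQLDatabase.from_uri(
        "mssql+pyodbc://user:password@server.database.windows.net/db?driver=ODBC+Driver+17+for+SQL+Server"
    )
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm), verbose=True)

    # Each question triggers several LLM calls plus the SQL round trip,
    # which is where the multi-second latency comes from
    answer = agent.run("Which profiles list Gardening as a like?")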

For that reason, I wanted to index everything in Pinecone as text and then query Pinecone with LangChain and OpenAI to get answers in plain English too.

However, I am not getting good results: they are fast, but not accurate.

So I wonder if, instead of putting all the text into one field, I should use each column as a metadata attribute and only embed the long texts (e.g. About me) as vectors.
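
Something like this rough sketch is what I have in mind (the short column names are made up, and it assumes the dataframe is read from the full profiles table rather than the concatenated view):

    for _, record in df.iterrows():
        # Short, structured columns become metadata (hypothetical column names)
        metadata = {
            'IdentityId': str(record['IdentityId']),
            'FavoriteFood': record['FavoriteFood'],
            'FavoriteMusic': record['FavoriteMusic'],
        }
        # Only the long free-text column gets embedded
        vector = embed.embed_documents([record['AboutMe']])[0]
        index.upsert(vectors=[(str(record['IdentityId']), vector, metadata)])

    # Then filter on the structured fields and search only the free text
    results = index.query(
        vector=embed.embed_query("enjoys working outdoors"),
        filter={"FavoriteFood": {"$eq": "Sushi"}},
        top_k=5,
        include_metadata=True,
    )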

Ha!

I have just the thing for you - I built it during Pinecone’s hackathon last week (thanks @rschwabco for putting it on!!). Check out GitHub - seanconnolly2000/pinecone-hackathon - look at the SQL_Schema.py file and calls.py to see how I teach ChatGPT the table schema (you could do better!) and then get the SQL joins from it using SQL Translate. I also use function_call, so LangChain is not necessary.
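
Not the actual repo code, but the general shape of that function_call approach looks something like this (the schema string, model and function name are just illustrative):

    import json
    import openai

    functions = [{
        "name": "run_sql",
        "description": "Run a T-SQL query against the profiles database",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The SQL query to execute"}
            },
            "required": ["query"],
        },
    }]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": "Schema: profiles(IdentityId, FirstName, LastName, Likes, ...)"},
            {"role": "user", "content": "Which profiles list Gardening as a like?"},
        ],
        functions=functions,
        function_call="auto",
    )
    # The model returns the SQL as structured function-call arguments
    sql = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])["query"]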

Enjoy!

Page not found, can you check?