Failed to upload PDF file in Pinecone Assistant

Hello community, I’m new here. I want to ask about uploading my PDF file. Why does it keep failing? My PDF has been compressed to 3 MB. Can anyone help? Thank you.

Hey @nerstuff24, I can try to help! But I should first clarify that Pinecone doesn’t directly handle PDF file uploads. It’s a vector database that stores and searches vector embeddings.

→ To use PDF content with Pinecone, you need to:

  1. Extract and chunk your PDF content into smaller text segments

  2. Convert the text to vector embeddings using an embedding model

  3. Upsert the vectors into your Pinecone index
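
To make step 1 a bit more concrete, here is a minimal sketch of character-based chunking with overlap. The function name `chunk_text` and the default sizes (1000 characters, 200 overlap) are my own assumptions for illustration, not anything Pinecone prescribes; you may prefer splitting on sentences or tokens instead.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (illustrative helper)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars].strip()
        if chunk:
            chunks.append(chunk)
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap means the end of one chunk is repeated at the start of the next, which can help when a sentence straddles a chunk boundary.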

→ Upsert Limits to Consider

When working with data in Pinecone, there are specific limits for upsert operations:

  • Max batch size: 2 MB or 1,000 records when upserting vectors; 96 records when upserting text

  • Max metadata size per record: 40 KB

  • Max length for a record ID: 512 characters
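
If it helps, the limits above can be checked client-side before you send anything. This is only a sketch: the helper names are hypothetical, sizes are estimated from the JSON-encoded record (an assumption about how the limit is measured), 1 KB is taken as 1024 bytes, and every field except `_id` is counted toward the metadata limit to stay on the conservative side.

```python
import json
from collections.abc import Iterable, Iterator

# Limits quoted above; the byte interpretations are assumptions.
MAX_BATCH_BYTES = 2 * 1024 * 1024
MAX_TEXT_RECORDS = 96
MAX_ID_CHARS = 512
MAX_METADATA_BYTES = 40 * 1024


def json_size(obj) -> int:
    """Approximate size in bytes of obj when serialized as UTF-8 JSON."""
    return len(json.dumps(obj, ensure_ascii=False).encode("utf-8"))


def validate_record(record: dict) -> None:
    """Raise ValueError if a record exceeds the ID or metadata limits."""
    if len(record["_id"]) > MAX_ID_CHARS:
        raise ValueError(f"record ID longer than {MAX_ID_CHARS} characters")
    # Conservative: treat everything except the ID as metadata.
    metadata = {k: v for k, v in record.items() if k != "_id"}
    if json_size(metadata) > MAX_METADATA_BYTES:
        raise ValueError(f"metadata for {record['_id']!r} exceeds 40 KB")


def batch_records(
    records: Iterable[dict],
    max_records: int = MAX_TEXT_RECORDS,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[list[dict]]:
    """Yield lists of records that stay within the count and size limits."""
    batch, batch_bytes = [], 0
    for record in records:
        validate_record(record)
        size = json_size(record)
        # Start a new batch if adding this record would break either limit.
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch can then be handed to whichever upsert call your SDK version provides. For records that already contain vectors, `max_records=1000` would match the limit listed above.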

Example Data Structure

Here’s how you would structure your PDF content for Pinecone:

python

{
    "_id": "document1#chunk1",
    "chunk_text": "First chunk of the document content...",  # Text to convert to a vector.
    "document_id": "document1",  # This and subsequent fields are stored as metadata.
    "document_title": "Introduction to Vector Databases",
    "chunk_number": 1,
    "document_url": "https://example.com/docs/document1",
    "created_at": "2024-01-15",
    "document_type": "tutorial"
}
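
As a rough sketch, records in the shape shown above could be generated from a list of text chunks like this. The function `build_records` and its parameters are hypothetical names I made up for illustration; only the field names come from the example record.

```python
from datetime import date
from typing import Optional


def build_records(
    document_id: str,
    document_title: str,
    chunks: list[str],
    document_url: str,
    document_type: str = "tutorial",
    created_at: Optional[str] = None,
) -> list[dict]:
    """Turn text chunks into records shaped like the example above."""
    created = created_at or date.today().isoformat()
    return [
        {
            # "<document_id>#chunk<n>" keeps IDs unique and easy to trace.
            "_id": f"{document_id}#chunk{number}",
            "chunk_text": text,
            "document_id": document_id,
            "document_title": document_title,
            "chunk_number": number,
            "document_url": document_url,
            "created_at": created,
            "document_type": document_type,
        }
        for number, text in enumerate(chunks, start=1)
    ]
```

Putting the document ID in the record ID as a prefix makes it easier to find every chunk of one PDF later, for example when you re-upload a newer version of the file.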

For PDF processing with Pinecone, you’ll need to:

  1. Use a PDF parsing library (like PyPDF2, pdfplumber, etc.) to extract text

  2. Chunk the text into manageable segments

  3. Convert chunks to embeddings

  4. Upsert to Pinecone in batches under the 2 MB limit
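
For the extraction step, something like the following might work as a starting point. It assumes the `pypdf` package (the maintained successor to PyPDF2) is installed; `extract_pages` and `join_pages` are names I chose for this sketch. Scanned or image-only PDFs will usually come back with empty text and would need OCR instead.

```python
def extract_pages(pdf_path: str) -> list[str]:
    """Return the extracted text of each page in the PDF."""
    # Imported here so join_pages below stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def join_pages(pages: list[str]) -> str:
    """Collapse whitespace within each page and drop pages with no text."""
    cleaned = (" ".join(page.split()) for page in pages if page and page.strip())
    return "\n\n".join(cleaned)
```

If `join_pages(extract_pages("your.pdf"))` comes back empty, the file probably contains no extractable text layer, which is worth ruling out before looking at upload limits.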

Could you provide more details about what specific error you’re encountering and what tool or method you’re using to upload your PDF?