Unable to upload Amazon public Pinecone dataset

Hi, this is the code I use to try to upload the public dataset “amazon-toys-quora-all-minilm-l6-bm25”:


I get this error:

	"name": "PineconeApiException",
	"message": "(400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Wed, 15 May 2024 19:16:52 GMT', 'Content-Type': 'application/json', 'Content-Length': '92', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '341', 'x-pinecone-request-id': '4313797377394377488', 'x-envoy-upstream-service-time': '6', 'server': 'envoy'})
HTTP response body: {\"code\":3,\"message\":\"Sparse vector size 2211 exceeds the maximum size of 1000\",\"details\":[]}

Is this limit fixed? Is it documented somewhere? How can I work around it?


Hi @dhruv.anand, and thanks for your question!

I’m trying to get oriented on the code you shared - is there a specific library you’re trying to use or are you following specific docs or a tutorial?

Anything you can share about the libraries you’re trying to use would help me better debug what might be going wrong.


Hi @ZacharyProser, sorry I forgot to add the initial lines of code I used:

import pinecone_datasets

amz_ds = pinecone_datasets.load_dataset("amazon_toys_quora_all-MiniLM-L6-bm25")

I’m following the directions on this page: Use public Pinecone datasets - Pinecone Docs.

I also tried uploading the dataset the other way (using iter_documents and upsert), but got the same error.

Hope this clarifies things.


@ZacharyProser could you have a look at this?

Hi @dhruv.anand,

Based on:

HTTP response body: {\"code\":3,\"message\":\"Sparse vector size 2211 exceeds the maximum size of 1000\",\"details\":[]}

I’m wondering if passing a batch_size as noted here changes anything?
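
The error message refers to the size of a single sparse vector (2211 entries against a maximum of 1000), so a possible workaround is to trim each oversized sparse vector before upserting. This workaround is an assumption and was not confirmed in this thread. The following is only a minimal sketch, assuming each record's sparse part is a dict with parallel `indices` and `values` lists. `truncate_sparse` and `MAX_SPARSE` are hypothetical names, not part of the Pinecone client or of `pinecone_datasets`. Note that trimming discards the lowest-weight terms, which may change retrieval results.

```python
from typing import Dict, List

# Limit quoted in the error message above; treat it as an assumption
# and check the current Pinecone documentation before relying on it.
MAX_SPARSE = 1000


def truncate_sparse(sparse: Dict[str, List], max_size: int = MAX_SPARSE) -> Dict[str, List]:
    """Return a copy of `sparse` keeping at most `max_size` entries,
    chosen by largest absolute weight (hypothetical helper)."""
    indices, values = sparse["indices"], sparse["values"]
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if len(indices) <= max_size:
        return {"indices": list(indices), "values": list(values)}
    # Positions of the heaviest entries, largest absolute weight first.
    keep = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)[:max_size]
    keep.sort()  # restore the original ordering of the surviving entries
    return {
        "indices": [indices[i] for i in keep],
        "values": [values[i] for i in keep],
    }


# Small illustration with a limit of 2 instead of 1000.
example = {"indices": [10, 20, 30, 40], "values": [0.1, 0.9, 0.5, 0.2]}
trimmed = truncate_sparse(example, max_size=2)
print(trimmed)
```

Each record's sparse vector would be passed through a helper like this before the batch is handed to upsert. Skipping oversized records entirely is an alternative if dropping terms is not acceptable.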

Also, in the future, if you can link to a public Google Colab or a Jupyter notebook on GitHub, that would greatly help us debug. Otherwise we have to assemble bits of code from across your responses that may or may not still be relevant.

