Hello everyone,
I am trying to preprocess documents and upsert them into Pinecone using Haystack.
from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs, fetch_archive_from_http
# This fetches some sample files to work with
doc_dir = "data/tutorial8"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial8.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# Convert every file in the directory into Haystack Document objects
all_docs = convert_files_to_docs(dir_path=doc_dir)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
# process() returns a list of Document objects with the text in each one's 'content' field
docs_default = preprocessor.process(all_docs)
# document_store is the Pinecone document store (its setup is not shown here)
document_store.write_documents(docs_default)
Then I got this error:
ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'Content-Length': '155', 'x-pinecone-request-latency-ms': '136', 'date': 'Wed, 31 Jan 2024 13:33:43 GMT', 'x-envoy-upstream-service-time': '32', 'server': 'envoy', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"code":3,"message":"Dense vectors must contain at least one non-zero value. Vector ID 1f6ca8a2bd6c9903813607120d8d48bc contains only zeros.","details":}
But when I do this:
from pprint import pprint
pprint(docs_default[0])
it returns:
<Document: {'content': 'BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT, which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers. ', 'content_type': 'text', 'score': None, 'meta': {'name': 'bert.pdf', '_split_id': 0}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '1f6ca8a2bd6c9903813607120d8d48bc'}>
So I really don't understand why this vector contains only zeros, given that the document itself clearly has content.
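One thing that stands out in the printed document is 'embedding': None, so it may be worth checking which documents have no embedding, or an all-zero one, before writing them. The snippet below is only a rough sketch of that check: `Doc` is a hypothetical stand-in with just an `id` and an `embedding` field (it is not the Haystack `Document` class), and `find_unembedded` is a helper name made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a document: an id plus an optional embedding."""
    id: str
    embedding: Optional[Sequence[float]] = None


def find_unembedded(docs: List[Doc]) -> List[str]:
    """Return the ids of documents whose embedding is missing or contains only zeros."""
    bad_ids = []
    for doc in docs:
        # `any(...)` is False for an empty or all-zero vector
        if doc.embedding is None or not any(v != 0 for v in doc.embedding):
            bad_ids.append(doc.id)
    return bad_ids


docs = [
    Doc(id="a", embedding=None),             # no embedding at all
    Doc(id="b", embedding=[0.0, 0.0, 0.0]),  # embedding made only of zeros
    Doc(id="c", embedding=[0.1, 0.0, -0.2]), # usable embedding
]
print(find_unembedded(docs))  # ['a', 'b']
```

With the real documents, the same test could be applied to each document's `embedding` attribute to see whether the failing ID is among the ones without a vector.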