I’m trying to split PDF documents into chunks (using LangChain), convert the chunks to OpenAI embeddings, and store them in my Pinecone index. I’m having trouble with the storing part. The PDFs are split properly, and when I log the embeddings locally they look correct, but when I put it all together and send everything to Pinecone, the vectors appear empty in the index dashboard. The expected number of vectors arrives and they land in the proper namespaces, but when I query the vector store, they come back empty. I’ve tried deleting my index and creating a new one, but no luck.
Also, less important: if anyone knows how to change my index’s region, some tips would be greatly appreciated.
Thank you!
Here’s my code:
import fs from "fs";
import pdfParse from "pdf-parse";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";

const filePath = "path/to/file";
// Read the PDF file
const pdfData = fs.readFileSync(filePath);
// Parse the PDF and extract the text content
const pdfDocument = await pdfParse(pdfData);
const text = pdfDocument.text;
// Split the text into 1000-character chunks
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
});
// Turn an array of strings into documents
const chunks = await textSplitter.createDocuments([text]);
console.log('creating vector store...');
const embeddings = new OpenAIEmbeddings();
// initPinecone is my own helper (not shown)
const pinecone = await initPinecone();
const index = pinecone.Index(process.env.PINECONE_INDEX_NAME);
// Embed the PDF documents and store them in the index
await PineconeStore.fromDocuments(chunks, embeddings, {
  pineconeIndex: index,
  namespace: 'test-pdf5',
  // textKey: 'pageContent',
});
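To back up the claim that the embeddings look fine locally, here is a minimal sketch of the kind of check I mean. `findBadVectors` is a hypothetical helper of my own, not part of LangChain or Pinecone, and the sample vectors are made up and far shorter than real embeddings.

```javascript
// Hypothetical helper (not part of LangChain or Pinecone): flags embedding
// vectors that are missing, have the wrong length, or contain only zeros.
function findBadVectors(vectors, expectedDimension) {
  const problems = [];
  vectors.forEach((vector, i) => {
    if (!Array.isArray(vector) || vector.length === 0) {
      problems.push({ index: i, reason: "empty" });
    } else if (vector.length !== expectedDimension) {
      problems.push({ index: i, reason: "wrong dimension" });
    } else if (vector.some((v) => !Number.isFinite(v))) {
      problems.push({ index: i, reason: "non-finite value" });
    } else if (vector.every((v) => v === 0)) {
      problems.push({ index: i, reason: "all zeros" });
    }
  });
  return problems;
}

// Example with tiny made-up vectors (real embeddings would be much longer).
const sample = [[0.1, 0.2, 0.3], [], [0, 0, 0], [0.4, 0.5]];
const badVectors = findBadVectors(sample, 3);
console.log(badVectors);
```

In my case the input would come from something like `await embeddings.embedDocuments(chunks.map((c) => c.pageContent))`, and the expected dimension would be whatever the index was created with (1536 if the embeddings come from text-embedding-ada-002).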
I’ve also tried the example code from the LangChain docs and an example project I found on YouTube, and both have the same issue.
Here’s the example from the docs:
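import { PineconeClient } from "@pinecone-database/pinecone";
import { Document } from "langchain/document";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";

// Create and initialize the Pinecone client
const client = new PineconeClient();
await client.init({
  apiKey: process.env.PINECONE_API_KEY,
  environment: process.env.PINECONE_ENVIRONMENT,
});
const pineconeIndex = client.Index(process.env.PINECONE_INDEX_NAME);
// Sample documents to embed
const docs = [
  new Document({
    metadata: { foo: "bar" },
    pageContent: "pinecone is a vector db",
  }),
  new Document({
    metadata: { foo: "bar" },
    pageContent: "the quick brown fox jumped over the lazy dog",
  }),
  new Document({
    metadata: { baz: "qux" },
    pageContent: "lorem ipsum dolor sit amet",
  }),
  new Document({
    metadata: { baz: "qux" },
    pageContent: "pinecones are the woody fruiting body and of a pine tree",
  }),
];
// Embed the documents and store them in the "test" namespace
await PineconeStore.fromDocuments(docs, new OpenAIEmbeddings(), {
  pineconeIndex,
  namespace: "test",
});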
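For reference, this is roughly what I would expect each stored vector to look like, sketched as plain JavaScript. It assumes Pinecone's `{ id, values, metadata }` vector shape. `toRecords`, the `doc-N` ids, and the `"text"` metadata key are my own hypothetical choices for illustration, not necessarily what `PineconeStore` does internally, and the vector values are made up.

```javascript
// Hypothetical helper: pairs each document with its embedding and builds
// records in Pinecone's { id, values, metadata } vector shape.
// The textKey default of "text" is an assumption about where the chunk text goes.
function toRecords(docs, vectors, textKey = "text") {
  if (docs.length !== vectors.length) {
    throw new Error(`got ${docs.length} docs but ${vectors.length} vectors`);
  }
  return docs.map((doc, i) => ({
    id: `doc-${i}`,
    values: vectors[i],
    metadata: { ...doc.metadata, [textKey]: doc.pageContent },
  }));
}

// Plain objects stand in for LangChain Document instances here.
const exampleDocs = [
  { metadata: { foo: "bar" }, pageContent: "pinecone is a vector db" },
  { metadata: { baz: "qux" }, pageContent: "lorem ipsum dolor sit amet" },
];
const exampleVectors = [[0.1, 0.2], [0.3, 0.4]]; // made-up values
const records = toRecords(exampleDocs, exampleVectors);
console.log(JSON.stringify(records, null, 2));
```

What I see in my index instead is the right number of entries per namespace, but each one looks empty when I query the vector store.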