BM25 Sparse Encoding in JavaScript

I love Pinecone! I'm trying to implement hybrid search for my application, but it's a full-stack SvelteKit app, so I'd like to run the hybrid search function in JavaScript.

The Python text client is amazing and includes a BM25 encoder so that you can easily create sparse vectors. However, I don't see an equivalent way to do this with the Pinecone client in JavaScript.

Does anyone have experience with, or know how to create, sparse vectors in JavaScript to use for Pinecone hybrid search? Thank you!

Hi @kennyliaobiz, and thank you for your question!

It looks like the wink-nlp library has a convenience method for encoding with BM25:

// Load winkNLP, its English language model, and the BM25 vectorizer utility.
const winkNLP = require('wink-nlp');
const model = require('wink-eng-lite-web-model');
const BM25Vectorizer = require('wink-nlp/utilities/bm25-vectorizer');

const nlp = winkNLP(model);
const its = nlp.its;
const bm25 = BM25Vectorizer();

// Sample corpus.
const corpus = ['Bach', 'J Bach', 'Johann S Bach', 'Johann Sebastian Bach'];
// Train the vectorizer on each document, using its tokens. The tokens are
// extracted using the .out() API of winkNLP.
corpus.forEach((doc) => bm25.learn(nlp.readDoc(doc).tokens().out(its.normal)));

// Returns the vector of the new document, "Johann Bach symphony", which is
// first tokenized using winkNLP.
bm25.vectorOf(nlp.readDoc('Johann Bach symphony').tokens().out(its.normal));
// -> [0.092717254, 0, 0.609969519, 0, 0]
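One thing to note: Pinecone expects sparse vectors as parallel `indices` and `values` arrays, while `bm25.vectorOf` returns a dense array with one slot per learned term. A minimal sketch of the conversion, assuming every vector comes from the same trained vectorizer so term positions can serve as stable indices (the `toSparse` helper is my own, not part of either library):

```javascript
// Convert a dense BM25 vector (as returned by bm25.vectorOf) into
// Pinecone's sparse format: parallel arrays of the non-zero positions
// and their weights. Positions in the dense vector are used as the
// sparse indices, which is only consistent if all vectors come from
// the same trained vectorizer.
function toSparse(denseVector) {
  const indices = [];
  const values = [];
  denseVector.forEach((value, i) => {
    if (value !== 0) {
      indices.push(i);
      values.push(value);
    }
  });
  return { indices, values };
}

// Example, using the vector from the snippet above:
toSparse([0.092717254, 0, 0.609969519, 0, 0]);
// -> { indices: [0, 2], values: [0.092717254, 0.609969519] }
```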

Hope that helps!

Best,
Zack

Hi, I just used this method to generate sparse vectors, but the out() method returns strings that bm25.vectorOf can't handle, because the parameter of bm25.vectorOf is Tokens. When I removed the out() call, this error was thrown instead: wink-nlp: this operation doesn't make sense without any learning; use learn() API first.

@edanielwu Yeah, I was having trouble with the winkNLP library too. I ended up going a different route for my application since I want to leverage Pinecone's Python text client. I think it is more active and has more features than the JavaScript implementation.

When I want to generate a lot of sparse vectors, I do it on my machine with the BM25 encoder from the Pinecone Python client. Then, for generating sparse vectors from user searches in my app, I created a very simple FastAPI endpoint hosted on Vercel that uses the same Pinecone Python client to return a sparse vector. Pretty simple, and it works well.


Thanks. Looks like I'll have to solve this in Python.