Sparse Vector generation using node package wink-nlp?

Dear Pinecone community,

I’m very new to this, and I’m trying to use sparse vectors with Node.js. I’m having trouble generating them, as there’s no documentation on how to do it in JS.

There’s no Node.js version of pinecone-text, but I found that wink-nlp can do BM25 vectorization:

BM25 Vectorizer : winkNLP - NLP in Node.js

How can I use it to generate a sparse vector’s indices and values so I can upsert it to Pinecone?

Using this ecommerce-search example as a guide, I tried the following:

// Require wink-nlp, model and its helper.
const model = require('wink-eng-lite-web-model');
const nlp = require('wink-nlp')(model);
const its = nlp.its;

// Require the BM25 Vectorizer.
const BM25Vectorizer = require('wink-nlp/utilities/bm25-vectorizer');

// Instantiate a vectorizer with the default configuration — no input config
// parameter indicates use default.
const bm25 = BM25Vectorizer();

// Sample corpus to train - fit tf-idf values on my corpus
const corpus = [
  'Turtle Check Men Navy Blue Shirt',
  'Peter England Men Party Blue Jeans',
  'Titan Women Silver Watch',
  'Manchester United Men Solid Black Track Pants',
  'Puma Men Grey T-shirt',
];

// Train the vectorizer on each document, using its tokens.
// The tokens are extracted using the .out() API of winkNLP.
corpus.forEach((doc) => {
  const tokens = nlp.readDoc(doc).tokens();
  // Learn the BM25 token weights from this document's tokens.
  bm25.learn(tokens.out(its.normal));
});

const v = bm25.vectorOf(
  nlp.readDoc('Turtle Check Men Navy Blue Shirt').tokens().out(its.normal)
);
console.log(v);

I believe this output contains the sparse vector’s values, but how do I get the corresponding indices?
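In case it clarifies what I’m after: my guess is that the indices are just the positions of the non-zero entries in the dense array that `vectorOf` returns, so I sketched a small helper. `toSparse` is my own hypothetical function, and I’m assuming `vectorOf` returns one slot per term in the learned vocabulary — please correct me if that’s wrong:

```javascript
// Convert a dense vector into Pinecone's sparse { indices, values }
// format by keeping only the non-zero entries. Assumes the input is a
// plain array of numbers, one slot per term in the learned vocabulary.
function toSparse(denseVector) {
  const indices = [];
  const values = [];
  denseVector.forEach((value, index) => {
    if (value !== 0) {
      indices.push(index);
      values.push(value);
    }
  });
  return { indices, values };
}

// Quick check with a made-up dense vector:
console.log(toSparse([0, 0.52, 0, 1.21, 0]));
// → indices [1, 3], values [0.52, 1.21]
```

Then I’d presumably pass `toSparse(v)` as `sparseValues` in the Pinecone upsert, alongside a dense vector, but I’m not sure that’s the right approach.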

Thank you!
