Locality Sensitive Hashing (LSH): The Illustrated Guide

Locality sensitive hashing (LSH) is a widely popular technique used in approximate nearest neighbor (ANN) search. The solution to efficient similarity search is a profitable one — it is at the core of several billion (and even trillion) dollar companies.

Big names like Google, Netflix, Amazon, Spotify, Uber, and countless more rely on similarity search for many of their core functions.


This is a companion discussion topic for the original entry at https://www.pinecone.io/learn/locality-sensitive-hashing/

How can you get which values are candidate pairs from the bucket, like the bucket contains the signatures and the candidate pairs are in the form of signatures so i actually do not know which ID should i check

I’m not sure for the LSH index specifically, but for other indexes there is a get_ids method that you can call on the index that returns a “swig object” that you convert to numpy with faiss.vector_to_array. I can also see here that it is possible to retrieve the data from index.codes for LSH, but I believe this the vector codes themselves (not IDs)

The function ‘shingle’ takes a string and an integer. When calling the shingle function, ‘a = shingle(a,k)’, k is not defined. The resulting shingles '{‘e space st’, 'ew by the ', 'h flew by ‘, ‘he space s’,…’ don’t look anything like the example. If I set k=2 and try it again, the resulting shingles look more like the example. '{‘pa’, ‘fi’, 'h ‘, ‘ce’, ‘ta’, ‘yi’, ‘ly’, ‘th’,…’

Is k supposed to be 2?

There is a section that illustrates the creation of a new set ‘vocab’; however, there is no actual code for creating ‘vocab’

Is this all that is needed?

>>> vocab = set()
>>> vocab.update(a,b,c)

This

a_1hot

Is discussed, but never defined in code. It seems to be some transformation of ‘vocab’