Not getting accurate results - ImageBind Image Segmentation Search

Hello, new user here and really enjoying working with Pinecone.

I am building an app using Next.js + the Vercel AI SDK, and I am getting very inaccurate results when using cosine search after embedding images. Let me rephrase that… image-to-image search gives great results, but when I describe an image with text and search against Pinecone, the results are very, very bad. Here is a code example.

Stack: Replicate (ImageBind), Pinecone

Here is how I generate embeddings

// Generate an embedding for an image with ImageBind (vision modality)
const response = await replicate.run(
  "daanelson/imagebind:0383f62e173dc821ec52663ed22a076d9c970549c209666ac3db181618b7a304",
  {
    input: {
      input: imageUrl,
      modality: "vision",
    },
    wait: {
      interval: 500,
    },
  }
);
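For reference, this is roughly how the embedding then gets upserted into Pinecone with the Node.js client. This is just a sketch: the index name and metadata fields are placeholders, and I am assuming the Replicate response is the raw embedding array (which is also what the query code below assumes).

import { Pinecone } from "@pinecone-database/pinecone";

// Placeholder setup – swap in your own API key and index name
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("imagebind-demo");

// `response` is the embedding array returned by replicate.run above
await index.upsert([
  {
    id: imageId, // your own unique ID for the image (placeholder)
    values: response as number[],
    metadata: { url: imageUrl },
  },
]);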

When I query, it looks something like this:

// Embed the query image the same way
const response = await replicate.run(
  "daanelson/imagebind:0383f62e173dc821ec52663ed22a076d9c970549c209666ac3db181618b7a304",
  {
    input: {
      input: imageUrl,
      modality: "vision",
    },
    wait: {
      interval: 500,
    },
  }
);



      const queryResponse = await index.query({
        vector: response,
        topK: 5,
        includeMetadata: true,
      });

Or with text:

// Generate an embedding for the text input
const response = await replicate.run(
  "daanelson/imagebind:0383f62e173dc821ec52663ed22a076d9c970549c209666ac3db181618b7a304",
  {
    input: {
      text_input: textInput,
      modality: "vision",
    },
    wait: {
      interval: 500,
    },
  }
);



      const queryResponse = await index.query({
        vector: response,
        topK: 5,
        includeMetadata: true,
      });

I played around and created other indexes, like one using dot product, and was getting significantly better results, but I don't think that metric is best for what I am attempting to do.
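For anyone curious, creating an index with a different metric looks roughly like this with the current Pinecone Node.js client. The index name and the serverless spec below are placeholders/assumptions; the dimension matches ImageBind's 1024-dimensional embeddings.

import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Placeholder index name and serverless spec
await pinecone.createIndex({
  name: "imagebind-dotproduct",
  dimension: 1024,       // ImageBind embeddings are 1024-dimensional
  metric: "dotproduct",  // Pinecone also accepts "cosine" and "euclidean"
  spec: { serverless: { cloud: "aws", region: "us-east-1" } },
});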

Is there anything I can do from the Pinecone side with Node.js to make the search better?

Another Edit: I was only testing with about 5-10 records… maybe that had something to do with it? Would inserting a lot more data help me get better results?

So just another update: I ended up adding about 300 records and am still getting very inaccurate results. Here are the two images I am attempting to compare (image attachments omitted here). My percentage match is around 20%, and the correct result (the second image) comes back at the bottom of the list.

Hello rosman!

I am Jacky, a senior developer advocate at Pinecone. I have played around with and love using multimodal embedding models, so seeing someone else apply them is exciting to me. What surprises me more is that you got ImageBind to work; it was one of the first models I tried, but I could not get it running. I see that you are able to run it on Replicate… I need to try that out.

In terms of your examples, I can't tell whether this is an issue with Pinecone's retrieval accuracy or with how ImageBind embeds and represents your image data.

Could you try a base case that you know works, i.e., an example Meta provides with their own images or data, and compare it with your implementation?
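For example, you could compute the similarity between a text embedding and an image embedding directly in Node, with Pinecone out of the loop, to see whether the scores already look off coming straight out of ImageBind. A minimal sketch, assuming both embeddings are plain number arrays of the same length:

// Quick sanity check: cosine similarity between two embeddings,
// computed locally so Pinecone is out of the picture.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. compare the text embedding against each image embedding you upserted
console.log(cosineSimilarity(textEmbedding, imageEmbedding)); // placeholder variables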

Also, as you already mentioned, make sure that you're using the similarity metric (cosine, dot product, Euclidean) that Meta specifies when you create your Pinecone index.
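One related note: if you L2-normalize the embeddings before upserting, dot product and cosine produce the same ranking, which makes it easier to compare metrics apples-to-apples. A small sketch; the `index`, `embedding`, and ID below are placeholders.

// L2-normalize an embedding so dot product and cosine rank results identically
function l2Normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  return vec.map((v) => v / norm);
}

// Placeholder upsert showing where the normalization would happen
await index.upsert([{ id: "img-1", values: l2Normalize(embedding) }]);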

Oh, one more thing… You mentioned that your text description of what you wanted to search for yielded poor results, while image-to-image search yielded good results. What was your text description?

In my experience with Google's multimodal embedding model, image-to-image search tends to give very high percentage matches, while text descriptions tend to give lower scores. I wonder if this is a side effect of such embedding models, i.e., you have to be extremely accurate in your description?

Check out this GitHub issue of theirs on which similarity measure to use: Is OK to use cosine_similarity instead softmax for VISION x TEXT ? · Issue #72 · facebookresearch/ImageBind · GitHub

Just checking back in. I ended up switching to OpenAI CLIP text + image embeddings and am getting significantly better results, but still not exactly where I want them to be.

The text vector search is SUPER good with Pinecone, but the image embedding search still isn't the best. I have to make sure that my AI assistant extracts some common metadata to get accurate results, which is a little disheartening because I cannot always depend on the assistant to give me the correct data.
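Concretely, the metadata the assistant extracts ends up as a filter on the Pinecone query, roughly like this. The field names and values here are placeholders, not my real schema.

// Query with a metadata filter built from whatever the assistant extracted.
// `category` and `color` are placeholder fields – yours will differ.
const queryResponse = await index.query({
  vector: queryEmbedding,
  topK: 5,
  includeMetadata: true,
  filter: {
    category: { $eq: "sneakers" },
    color: { $in: ["red", "white"] },
  },
});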

I am more than happy to share some details with you personally, without posting too much in a public forum, if that might get me more help. I am just still super confused about the image embeddings.

Before, I had to be super accurate; now I can give it some very small details and the text vector search is ON POINT. I would say it is correct about 95% of the time. I am trying to run some benchmarks to get more details.