Should I put all data and their different fields in one index or use namespaces?

DonBish · March 29, 2023, 7:00pm

I have a Movie dataset with the following fields:

id, title, genres, original_language, overview, production_companies, release_date, budget, revenue, runtime, vote_average, vote_count, credits, keywords

I want to be able to run semantic search over this dataset using all the fields. For example, a search query could be:

A movie that is in English, is about space, has the actor Matthew McConaughey and is more than two hours run time.

What is the best practice to index this data?

Should I upsert all the fields into one index, or should I separate each field into its own namespace and then use aggregation?

Thanks

rschwabco · October 17, 2023, 5:31pm

Hello @DonBish, and sorry for the very late reply.

In most cases, you would use both embeddings and metadata to perform the search. For example, it makes sense to embed and index a text field such as “title”, and to store the other, categorical data as metadata. You’d most likely use other means to build the query, extracting categorical values such as language or runtime from the search text and assigning them to the matching metadata fields. With that said, you can also experiment with embedding the categorical and semantic data together and see whether your query embeddings still correspond well to the results.
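
As a rough illustration of that split, here is a minimal sketch in plain Python. It makes no Pinecone calls. The helper names `split_record` and `build_filter` are hypothetical, the choice to embed `overview` alongside `title` is an assumption, and so is treating `genres` and `credits` as lists of strings. The filter uses the `$eq`, `$gt`, and `$in` operators from Pinecone's metadata filter syntax; check the current documentation before relying on the exact shape.

```python
from typing import Any, Optional


def split_record(movie: dict[str, Any]) -> tuple[str, str, dict[str, Any]]:
    """Split one movie row into (id, text to embed, metadata).

    Assumption: free text (title + overview) is embedded, while
    categorical and numeric fields are stored as metadata.
    """
    text = f"{movie['title']}. {movie['overview']}"
    metadata = {
        "original_language": movie["original_language"],
        "runtime": movie["runtime"],
        "genres": movie["genres"],    # assumed to be a list of strings
        "credits": movie["credits"],  # assumed to be a list of strings
    }
    return str(movie["id"]), text, metadata


def build_filter(
    language: Optional[str] = None,
    min_runtime: Optional[int] = None,
    actor: Optional[str] = None,
) -> dict[str, Any]:
    """Build a metadata filter from constraints already parsed out of the query."""
    clauses: dict[str, Any] = {}
    if language is not None:
        clauses["original_language"] = {"$eq": language}
    if min_runtime is not None:
        clauses["runtime"] = {"$gt": min_runtime}
    if actor is not None:
        clauses["credits"] = {"$in": [actor]}
    return clauses


# Illustrative row; the values are placeholders, not taken from the dataset.
sample = {
    "id": 1,
    "title": "Example Space Movie",
    "overview": "A crew travels through space to find a new home.",
    "original_language": "en",
    "runtime": 150,
    "genres": ["Science Fiction"],
    "credits": ["Matthew McConaughey"],
}

vector_id, text_to_embed, metadata = split_record(sample)

# For the example query: English, more than two hours, a given actor.
# "about space" is the semantic part and would go into the query embedding.
query_filter = build_filter(
    language="en", min_runtime=120, actor="Matthew McConaughey"
)
```

With this layout, `text_to_embed` would go through whatever embedding model you use and be upserted together with `metadata` into a single index, and `query_filter` would be passed as the filter of the query alongside the embedding of the semantic part ("a movie about space"). How the constraints get parsed out of the natural-language query is left open here.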