How to best use Sentence Encoders

How do others handle encoding their data with USE (Universal Sentence Encoder) or SBERT? Say my dataset is 4 sentences per item and 100 items: do you process all 400 sentences at a time, combine the 4 into one long sentence and process 100 at a time, or process each sentence one at a time? I could not determine when you really want to opt for one over the other. Finally, how does this scale to full-blown documents, say a PDF with 20 pages? I believe there is also a limit on the input length these models accept…
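To make the options concrete, here is a minimal sketch of the first two strategies. The `encode` function is just a stub standing in for the real model call (with sentence-transformers it would be `model.encode(list_of_texts)`; the item count and contents are made up):

```python
def encode(texts):
    # Stub encoder: returns one fixed-size "vector" per input text.
    # A real encoder (USE / SBERT) would return dense embeddings here.
    return [[float(len(t)), float(t.count(" "))] for t in texts]

# 100 items, 4 sentences each (dummy data)
items = [["sent a", "sent b", "sent c", "sent d"] for _ in range(100)]

# Option 1: flatten and encode all 400 sentences -> one vector per sentence
flat = [s for item in items for s in item]
per_sentence = encode(flat)   # 400 vectors, fine-grained search

# Option 2: join each item's sentences -> one vector per item
joined = [" ".join(item) for item in items]
per_item = encode(joined)     # 100 vectors, coarser but cheaper to store

print(len(per_sentence), len(per_item))
```

Option 1 gives finer-grained retrieval (you can match a single rule sentence); option 2 gives one vector per item, which is simpler to store but blurs the individual sentences together.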

Let me give a use case for each scenario:

I have a game with complex rules on each card, and the game has thousands of cards. I want to import this dataset, with additional metadata for each card, into the DB for search. Do I process the encoding of the rules one card at a time, several cards batched together, or all of them at once (which may not be possible, since there are thousands of cards)?
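For the card scenario, here is roughly what I imagine batching would look like. Again the encoder is a stub, and the batch size of 64 and the card data are just placeholders:

```python
def batched(seq, size):
    # Yield successive fixed-size chunks of a sequence.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def encode(texts):
    # Stand-in for model.encode(texts); returns dummy vectors.
    return [[float(len(t))] for t in texts]

# Dummy dataset: 1,000 cards with rules text and metadata
cards = [{"name": f"card_{i}", "rules": f"rules text {i}"} for i in range(1000)]

# Encode a manageable batch at a time instead of all cards at once
embeddings = []
for batch in batched(cards, 64):
    embeddings.extend(encode([c["rules"] for c in batch]))

print(len(embeddings))  # one vector per card
```

The idea being that the batch size bounds memory use, so thousands of cards can be processed without ever loading them all into the encoder at once.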

I have 10,000 PDFs I want to search against. Do I:

  • Use the encoder and send in a batch of the paragraphs from each PDF?
  • Use the encoder and send all the sentences in the PDF in one batch? What happens with really large PDFs?
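For the large-PDF case, the only approach I could think of is chunking each paragraph before encoding. A crude sketch, using a word-count cap as a stand-in for the model's real token limit (the 256-word limit and the document contents are purely illustrative):

```python
MAX_WORDS = 256  # stand-in for the encoder's actual token limit

def chunk_paragraph(paragraph, max_words=MAX_WORDS):
    # Split an over-long paragraph into word-count-bounded chunks
    # so each chunk fits under the encoder's input limit.
    words = paragraph.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)] or [""]

# Dummy "PDF": one short paragraph, one paragraph of 600 words
doc = ["short paragraph one", ("word " * 600).strip()]

chunks = [c for p in doc for c in chunk_paragraph(p)]
print(len(chunks))  # the long paragraph is split into multiple chunks
```

Each chunk would then get its own embedding, which sidesteps the input-length limit but multiplies the number of vectors per document.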

I’m not quite sure how to best handle this.