About creating a wikipedia_snippets_streamed dataset in other languages

Samreen · December 13, 2022, 3:06am

Hello all
How can we create a dataset similar to vblagoje/wikipedia_snippets_streamed in another language? Where does the start_paragraph (int32) field come from?
What should be the format of the gold standard passage retrieval dataset for evaluating DPR?

jamesbriggs · December 16, 2022, 4:57pm

Hi, it should be possible to scrape the Wikipedia website in other languages using a package like Selenium.
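
As a rough sketch of that scraping step (not a tested pipeline): the snippet below builds the URL of an article in a given language edition and pulls its paragraph text with Selenium. The helper names `wikipedia_url` and `fetch_paragraphs`, the headless Chrome setup, and the `#mw-content-text p` selector are my own assumptions about the page markup, so adjust them as needed.

```python
from urllib.parse import quote


def wikipedia_url(lang: str, title: str) -> str:
    """Build the article URL for a language edition such as 'de' or 'hi'."""
    # Wikipedia uses underscores instead of spaces in article paths.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


def fetch_paragraphs(lang: str, title: str) -> list[str]:
    """Return the non-empty paragraph texts of one article (needs Selenium and Chrome)."""
    # Imported here so the URL helper still works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(wikipedia_url(lang, title))
        # Assumption: article body paragraphs are <p> tags under #mw-content-text.
        elements = driver.find_elements(By.CSS_SELECTOR, "#mw-content-text p")
        return [el.text.strip() for el in elements if el.text.strip()]
    finally:
        driver.quit()
```

Calling `fetch_paragraphs("de", "Berlin")` would then give you a list of paragraphs for the German article, which you can cut into snippets.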
I’m not sure, but I’d guess start_paragraph is the index of the paragraph in the original Wikipedia page where the snippet of text in passage_text begins. Either way, it isn’t used for training or evaluating a retriever model like DPR.
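
To illustrate that guess, here is a minimal sketch of how a field like start_paragraph could be produced when cutting an article into snippets. This is only my assumption about the logic, not the actual script behind the dataset; `build_snippets` and `max_words` are hypothetical names.

```python
def build_snippets(paragraphs, max_words=100):
    """Group consecutive paragraphs into passages of roughly max_words words.

    Each snippet records the index of the paragraph it starts at.
    """
    snippets = []
    current, start = [], 0
    for idx, paragraph in enumerate(paragraphs):
        if not current:
            start = idx  # index of the first paragraph in this passage
        current.append(paragraph)
        # Close the passage once it has reached the word budget.
        if sum(len(p.split()) for p in current) >= max_words:
            snippets.append(
                {"start_paragraph": start, "passage_text": " ".join(current)}
            )
            current = []
    if current:  # flush the trailing, shorter passage
        snippets.append({"start_paragraph": start, "passage_text": " ".join(current)})
    return snippets
```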
The dataset should consist of pairs of questions and their relevant contexts (passages). With those pairs, you can use a ranking evaluator to measure retrieval performance. For training (if needed), you can use something like multiple negatives ranking loss.
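
A small pure-Python sketch of both ideas, under my own assumptions: the gold data is a list of question/positive-passage pairs, the retriever is any function that returns passage IDs ranked best first, and the metrics are recall@k and mean reciprocal rank. The field names (`question`, `positive_id`) and function names are hypothetical, and the loss is written out by hand rather than taken from a library.

```python
import math


def evaluate(gold, retrieve, k=10):
    """Score a retriever on gold pairs.

    gold: list of {"question": str, "positive_id": str}
    retrieve: function mapping a question to passage IDs, best first
    """
    hits, reciprocal_ranks = 0.0, 0.0
    for example in gold:
        ranked = retrieve(example["question"])
        positive = example["positive_id"]
        if positive in ranked[:k]:
            hits += 1.0
        if positive in ranked:
            reciprocal_ranks += 1.0 / (ranked.index(positive) + 1)
    n = len(gold)
    return {"recall_at_k": hits / n, "mrr": reciprocal_ranks / n}


def multiple_negatives_ranking_loss(scores):
    """Cross-entropy over in-batch negatives.

    scores[i][j] is the similarity of question i and passage j; passage i is
    the positive for question i, and every other passage in the batch acts
    as a negative.
    """
    total = 0.0
    for i, row in enumerate(scores):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)
```

Library implementations of this loss typically also apply a scale factor to the similarity scores, which I’ve left out here to keep the idea visible.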

I hope that helps!