How can we create a dataset similar to vblagoje/wikipedia_snippets_streamed in another language? Where does the start_paragraph (int32) field come from?
What should be the format of the gold standard passage retrieval dataset for evaluating DPR?
Hi, it should be possible to scrape the Wikipedia website in other languages using a package like Selenium.
I’m not sure, but I’d guess start_paragraph refers to the paragraph within the original Wikipedia page from which the snippet of text contained in passage_text came. Nonetheless, this isn’t used for training or evaluating a retriever model like DPR.
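To make that guess concrete, here is a minimal sketch of how such snippets could be built once you have the paragraphs of a page as plain strings. The function names, the 100-word passage size, and the reduced set of fields are my own assumptions for illustration, not how the original dataset was actually produced.

```python
def _record(article_title, start_par, end_par, words):
    # Hypothetical record layout, loosely modelled on the dataset's field names.
    return {
        "article_title": article_title,
        "start_paragraph": start_par,
        "end_paragraph": end_par,
        "passage_text": " ".join(words),
    }


def make_snippets(article_title, paragraphs, max_words=100):
    """Split one article into passages of at most `max_words` words.

    `paragraphs` is the list of paragraph strings of the page, in page order.
    """
    snippets = []
    words = []      # words collected for the passage being built
    start_par = 0   # index of the paragraph where the current passage starts
    end_par = 0     # index of the paragraph that supplied the latest word

    for par_idx, paragraph in enumerate(paragraphs):
        for word in paragraph.split():
            if not words:
                start_par = par_idx
            words.append(word)
            end_par = par_idx
            if len(words) == max_words:
                snippets.append(_record(article_title, start_par, end_par, words))
                words = []

    # Keep the last, possibly shorter, passage.
    if words:
        snippets.append(_record(article_title, start_par, end_par, words))
    return snippets
```

Under this reading, start_paragraph is just bookkeeping that points back into the source page, which is consistent with it not being needed by the retriever.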
As for the evaluation dataset, it should contain pairs of questions and their relevant contexts (the gold passages). With that, you can use a ranking evaluator to measure retrieval performance. For training (if needed), you can use something like multiple negatives ranking loss.
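As a rough illustration of what a ranking evaluation over question/context pairs looks like, here is a small library-free sketch. The ID scheme, the `gold` and `ranked` structures, and the function name are assumptions made up for this example; in practice you would feed in the passage IDs returned by your retriever, or use an existing evaluator from your retrieval library.

```python
def evaluate_ranking(gold, ranked, k=10):
    """Compute MRR@k and top-k accuracy.

    gold:   dict mapping question ID -> set of relevant passage IDs
    ranked: dict mapping question ID -> list of retrieved passage IDs, best first
    """
    reciprocal_ranks = []
    hits = 0
    for question_id, relevant in gold.items():
        top_k = ranked.get(question_id, [])[:k]
        # 1-based rank of the first relevant passage in the top k, if any.
        rank = next(
            (i for i, pid in enumerate(top_k, start=1) if pid in relevant),
            None,
        )
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)

    n = len(gold)
    return {
        "mrr": sum(reciprocal_ranks) / n if n else 0.0,
        "top_k_accuracy": hits / n if n else 0.0,
    }


# Hypothetical gold standard: each question is paired with its relevant passage(s).
gold = {"q1": {"p1"}, "q2": {"p5"}}
# Hypothetical retriever output for the same questions.
ranked = {"q1": ["p3", "p1"], "q2": ["p2", "p4"]}
scores = evaluate_ranking(gold, ranked, k=2)
```

Here top-k accuracy counts a question as solved if at least one of its gold passages appears in the top k results.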
I hope that helps!