MartinThoma/wili_2018
This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
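To make the classification task concrete, here is a minimal sketch of a language identifier based on character trigram profiles and cosine similarity. It is not the method of the paper; it is only one simple baseline for this kind of task. The function names (`char_ngrams`, `train_profiles`, `identify`), the labels `eng`, `deu`, `fra`, and the toy sentences are all hypothetical and are not taken from WiLI-2018.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(samples):
    """Build one aggregated n-gram profile per language label."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def identify(paragraph, profiles):
    """Return the label whose profile is most similar to the paragraph."""
    query = char_ngrams(paragraph)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy training data (illustrative only, not from WiLI-2018).
TOY_SAMPLES = [
    ("eng", "the quick brown fox jumps over the lazy dog and runs through the forest"),
    ("deu", "der schnelle braune fuchs springt über den faulen hund und läuft durch den wald"),
    ("fra", "le renard brun rapide saute par-dessus le chien paresseux et court dans la forêt"),
]

profiles = train_profiles(TOY_SAMPLES)
print(identify("the dog runs over the hill", profiles))
print(identify("der hund läuft durch den wald", profiles))
```

With a real dataset, the toy samples would be replaced by labeled training paragraphs, one profile per language, and accuracy would be measured on held-out paragraphs.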