Corpora
The language columns give the number of documents in each language.

Corpus | Size (compressed) | Time span | English | French | German | Greek | Italian | Spanish | Swedish | Ukrainian | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
Round 1 | | | | | | | | | | | |
EU Press Corner | 7.2 MB | June 2020 | 335 | 276 | 266 | 120 | 123 | 115 | 122 | 0 | europresscorner-202006-xml.zip |
EUR-Lex | 23.3 MB | June 2020 | 352 | 345 | 345 | 344 | 342 | 342 | 343 | 0 | eurlex-202006-xml.zip |
Global Voices | 13.6 MB | June 2020 | 571 | 446 | 51 | 328 | 539 | 595 | 5 | 66 | global-voices-20200611-xml.zip |
MEDISYS | 2,036.0 MB | December 2019 to April 2020 | 1,450,251 | 325,178 | 272,645 | 146,763 | 661,514 | 832,639 | 37,615 | 15,395 | medisys-201912-xml_ir.zip, medisys-202001-xml_ir.zip, medisys-202002-xml_ir.zip, medisys-202003-p1-7-xml_ir.zip, medisys-202004-xml_ir.zip |
Wikipedia | 13.7 MB | June 2020 | 731 | 357 | 364 | 103 | 271 | 342 | 111 | 121 | wikipedia-20200611-xml.zip |
Total documents by language | | | 1,452,240 | 326,599 | 273,761 | 147,658 | 662,789 | 833,763 | 38,196 | 15,582 | |
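
After downloading one of the archives listed above, a quick sanity check is to count the XML files it contains. The sketch below is only a starting point: the internal layout of the archives is not described here, so grouping by top-level folder is an assumption, and `count_xml_members` is a hypothetical helper name. A single XML file may also hold more than one document, so these counts need not match the document counts in the table.

```python
import sys
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_xml_members(zip_path):
    """Count .xml members of a zip archive, grouped by top-level folder.

    Members stored at the archive root are grouped under the empty string.
    The folder-based grouping is an assumption about the archive layout.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            parts = PurePosixPath(name).parts
            group = parts[0] if len(parts) > 1 else ""
            counts[group] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_xml.py wikipedia-20200611-xml.zip
    for path in sys.argv[1:]:
        result = count_xml_members(path)
        print(path, dict(result), "total:", sum(result.values()))
```

If the archives turn out to be organised differently (for example one large XML file per language), the grouping rule would need to change accordingly.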