NEMLAR Written Corpus
ID: ELRA-W0042
This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text drawn from 13 categories. It aims to be a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The categories are:
• Political news: 48,000 words
• Political debate: 30,000 words
• Islamic text (Preaching and others): 29,000 words
• Phrases of common words: 8,500 words
• Text from broadcast news: 5,500 words
• Business: 20,000 words
• Arabic literature: 30,000 words
• General news: 100,000 words
• Interviews: 56,000 words
• Scientific press: 50,000 words
• Sports press: 50,000 words
• Dictionary entries explanation: 52,000 words
• Legal domain text: 21,000 words
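The category sizes listed above can be checked against the stated total of about 500,000 words. The following Python sketch tallies them and computes each category's share. The names `CATEGORY_WORDS`, `total_words` and `shares` are illustrative only and are not part of the corpus distribution. The figures are the rounded values from the list, so the shares are approximate.

```python
# Category sizes in words, copied from the list above (rounded figures).
CATEGORY_WORDS = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}


def total_words(counts):
    """Return the sum of all category sizes."""
    return sum(counts.values())


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = total_words(counts)
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"Categories: {len(CATEGORY_WORDS)}")
    print(f"Total words: {total_words(CATEGORY_WORDS):,}")
    # List categories from largest to smallest share.
    for name, pct in sorted(shares(CATEGORY_WORDS).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {pct:.1f}%")
```

With the listed figures, the 13 categories sum to 500,000 words, and General news is the largest category at one fifth of the corpus.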
The data cover the period from the late 1990s to 2005.
The corpus is provided in 4 different versions:
• Raw text
• Fully vowelized text
• Text with Arabic lexical analysis
• Text with Arabic POS-tags
Diacritics, lexical analysis and POS-tags were generated by RDI’s tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or manually selects another one (in about 4% of cases).
The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.