"Le Monde Diplomatique" Arabic tagged corpus 
Corpus étiqueté du journal "Le Monde Diplomatique" en arabe
ID: ELRA-W0049
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04).
Each text comes with three files:
- raw text in Arabic,
- vowelised text in Arabic,
- one XML file containing the morphological annotation of the text.
Each word in a text carries several pieces of information, such as its size, its rank in the text and the number of the paragraph in which it occurs. Each word corresponds to one node in the XML file, and each node records the following positional features of the word in the text:
- Paragraph number in the text, i.e. paragraph where the word can be found,
- Sentence number in the paragraph,
- Sentence number in the text,
- Rank of the word in the text,
- Rank of the first character of the word in the text,
- Word size.
Annotation information for each word is added as "sub-nodes":
- Word as it appears in the non-vowelised text,
- Vowelised word,
- Word lemma,
- Grammatical category of the word.
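The node layout described above could be read along the following lines. This is only a minimal sketch: the description does not give the actual XML schema, so every element name (text, word, raw, vowelised, lemma, pos) and attribute name (paragraph, sentence_in_paragraph, sentence_in_text, rank, char_offset, size) is a hypothetical placeholder, and the two sample words are invented for illustration rather than taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample. Element and attribute names are hypothetical placeholders,
# NOT the real schema of the ELRA-W0049 annotation files.
SAMPLE = """<text>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="1" char_offset="0" size="3">
    <raw>كتب</raw>
    <vowelised>كَتَبَ</vowelised>
    <lemma>كَتَبَ</lemma>
    <pos>verb</pos>
  </word>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="2" char_offset="4" size="5">
    <raw>الولد</raw>
    <vowelised>الوَلَدُ</vowelised>
    <lemma>وَلَد</lemma>
    <pos>noun</pos>
  </word>
</text>"""

# The six positional features, assumed here to be stored as node attributes.
POSITIONAL = ("paragraph", "sentence_in_paragraph", "sentence_in_text",
              "rank", "char_offset", "size")
# The four annotation features, assumed here to be stored as sub-nodes.
ANNOTATIONS = ("raw", "vowelised", "lemma", "pos")


def parse_words(xml_text):
    """Return one dict per word node, merging positions and annotations."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.iter("word"):
        # Positional features are converted to integers.
        entry = {name: int(node.attrib[name]) for name in POSITIONAL}
        # Annotation sub-nodes are kept as text.
        for name in ANNOTATIONS:
            entry[name] = node.findtext(name)
        words.append(entry)
    return words


words = parse_words(SAMPLE)
for w in words:
    print(w["rank"], w["raw"], w["vowelised"], w["lemma"], w["pos"])
```

Keeping the positional features and the annotations in a single record per word makes it straightforward to filter by paragraph or sentence, or to count lemmas and grammatical categories; the names would need to be adapted to those actually used in the distributed files.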