NEMLAR Broadcast News Speech Corpus
Corpus oral d’actualités radiophoniques NEMLAR
This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Written Corpus (ELRA-W0042) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).
The NEMLAR Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four radio stations: Medi1, Radio Orient, RMC (Radio Monte Carlo) and RTM (Radio Television Maroc).
Each broadcast contains between 25 and 30 minutes of news and interviews. The recordings were made in three periods between 30 June 2002 and 18 July 2005. All files were recorded in linear PCM format at 16 kHz, 16 bit.
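As a rough illustration of what these recording parameters imply for storage, the following Python sketch estimates the size of the uncompressed audio. The channel count is not stated in this description, so mono is an assumption here, and the function name `pcm_size_bytes` is hypothetical.

```python
# Sketch: size of uncompressed linear PCM audio at 16 kHz, 16 bit.
SAMPLE_RATE_HZ = 16_000   # from the description
BYTES_PER_SAMPLE = 2      # 16 bit = 2 bytes
CHANNELS = 1              # ASSUMPTION: mono; not stated in the description


def pcm_size_bytes(duration_s, sample_rate=SAMPLE_RATE_HZ,
                   bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the raw audio size in bytes, ignoring any file header."""
    return int(duration_s * sample_rate * bytes_per_sample * channels)


# One 30-minute broadcast and the whole corpus of about 40 hours.
one_broadcast = pcm_size_bytes(30 * 60)
whole_corpus = pcm_size_bytes(40 * 3600)

print(f"30-minute broadcast: {one_broadcast / 1e6:.1f} MB")
print(f"40 hours of audio:   {whole_corpus / 1e9:.3f} GB")
```

Under the mono assumption this gives about 57.6 MB per 30-minute broadcast and about 4.6 GB for 40 hours; a stereo recording would double both figures.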
The transcription was done with Transcriber, using the additional patch for Arabic. The transcriptions were therefore entered in Arabic characters, and the software generated the transliterations automatically. The following annotation levels are included:
• Orthographic transcription of speech (in news, not in music, commercials, etc.), including Named Entities
• Speakers and speaker turns
• Segment markers (portions of maximum 10 seconds)
• Topic/story boundaries
• Background noises (stationary and instantaneous noise events)
• Change of background
• Word boundaries
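To make the speaker-turn and segment levels more concrete, here is a minimal Python sketch that reads them from a Transcriber-style XML document. It assumes the transcriptions are stored in Transcriber's native XML layout, with `Turn` elements carrying `speaker`, `startTime` and `endTime` attributes and `Sync` elements marking segment starts; the file format actually shipped with the corpus is not stated in this description, so check the corpus documentation first. The sample document, its placeholder text and the function name `iter_segments` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: hypothetical sample in Transcriber-style XML, not corpus data.
SAMPLE = """<Trans>
<Episode>
<Section type="report" startTime="0.0" endTime="12.5">
<Turn speaker="spk1" startTime="0.0" endTime="12.5">
<Sync time="0.0"/>
first segment
<Sync time="6.2"/>
second
<Event desc="b" type="noise" extent="instantaneous"/>
segment
</Turn>
</Section>
</Episode>
</Trans>"""


def iter_segments(xml_text):
    """Yield (speaker, start, end, text) for each segment of each turn."""
    root = ET.fromstring(xml_text)
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker")
        turn_end = float(turn.get("endTime"))
        segments = []  # each entry: [start_time, [text pieces]]
        for child in turn:
            if child.tag == "Sync":
                segments.append([float(child.get("time")), []])
            # Text following any child element is stored in its .tail.
            if segments and child.tail and child.tail.strip():
                segments[-1][1].append(child.tail.strip())
        for i, (start, pieces) in enumerate(segments):
            # A segment ends where the next one starts, or with the turn.
            end = segments[i + 1][0] if i + 1 < len(segments) else turn_end
            yield speaker, start, end, " ".join(pieces)


for speaker, start, end, text in iter_segments(SAMPLE):
    # The description gives 10 seconds as the maximum segment length.
    flag = "" if end - start <= 10 else " (longer than 10 s)"
    print(f"{speaker} [{start:.1f}-{end:.1f}] {text}{flag}")
```

Text is collected from element tails so that noise events placed in the middle of a segment do not split or truncate its transcription.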
A lexicon of 62,000 words with transliterations, frequencies and Arabic SAMPA transcriptions is also included.
The database is distributed on one ISO 9660 DVD-ROM volume. It has been validated by an external partner, and a validation report is provided.