The aim of the corpus is to compile a large collection of high-quality recordings of spontaneous Estonian speech and to segment them phonetically at several levels. The project started in autumn 2006.
The total size of the corpus is approximately 60 hours of speech from 100 speakers with different dialectal and social backgrounds. The speakers come from different age groups. They are invited to participate in person and are aware of the purpose of the recordings.
Most of the recordings are made in a recording studio, and some during fieldwork. The signal of each speaker is recorded on a separate channel. In the studio the speakers sit about 3 meters apart to minimize the effect of overlapping speech. For the fieldwork recordings, headset microphones are used. Recordings are saved as uncompressed PCM WAV files. Background information about the recordings is collected in a text file. Segmentation and annotation files are saved as Praat TextGrid files and have the same filenames as the recordings they segment.
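The storage conventions above (uncompressed PCM WAV, one channel per speaker, a TextGrid with the same filename) can be checked with a short script. The following is a minimal sketch using only the Python standard library; the function names, the `.TextGrid` extension and the example filename are assumptions made for illustration and are not taken from the corpus tools.

```python
import wave
from pathlib import Path


def describe_recording(wav_path):
    """Return basic facts about a WAV file and the TextGrid path expected for it."""
    wav_path = Path(wav_path)
    with wave.open(str(wav_path), "rb") as wav:
        info = {
            "channels": wav.getnchannels(),          # one channel per speaker
            "sample_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
            "uncompressed": wav.getcomptype() == "NONE",
        }
    # Assumption: the annotation file differs from the recording only in its extension.
    info["textgrid"] = wav_path.with_suffix(".TextGrid")
    return info


def split_channels(wav_path):
    """Return one bytes object of raw PCM samples per channel (i.e. per speaker)."""
    with wave.open(str(wav_path), "rb") as wav:
        n_channels = wav.getnchannels()
        width = wav.getsampwidth()                   # bytes per sample
        frames = wav.readframes(wav.getnframes())
    frame_size = n_channels * width                  # samples are interleaved per frame
    channels = []
    for ch in range(n_channels):
        offset = ch * width
        channels.append(b"".join(
            frames[i + offset:i + offset + width]
            for i in range(0, len(frames), frame_size)
        ))
    return channels
```

For a hypothetical file `dialogue01.wav`, `describe_recording("dialogue01.wav")["textgrid"]` points to `dialogue01.TextGrid` in the same directory, and `split_channels` returns the signal of each speaker separately.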
Segmentation and annotation
Segmentation and annotation are done with the Praat program (www.praat.org). Recordings are segmented manually at several levels (an automatic segmentation program is also being developed and tested). The following tiers are used:
- Words (in orthographic spelling)
- Phonemes (transcribed in SAMPA adjusted for Estonian)
- Syllables (short vs. long, open vs. closed)
- Prosodic feet
- Intonation phrases or inter-pausal units
- Changes in voice quality (e.g. creaky voice)
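To illustrate how such a tier layout looks in a file, the sketch below writes an empty annotation skeleton in Praat's long TextGrid text format. The tier names are hypothetical labels chosen for this example, and every tier is assumed to be an interval tier; the tier names and tier types actually used in the corpus are not stated here.

```python
# Hypothetical tier names mirroring the six annotation levels listed above.
TIERS = ["words", "phonemes", "syllables", "feet", "phrases", "voice_quality"]


def empty_textgrid(duration, tiers=TIERS):
    """Build a TextGrid (long text format) with one empty interval per tier."""
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {duration}",
        "tiers? <exists>",
        f"size = {len(tiers)}",
        "item []:",
    ]
    for index, name in enumerate(tiers, start=1):
        lines += [
            f"    item [{index}]:",
            '        class = "IntervalTier"',
            f'        name = "{name}"',
            "        xmin = 0",
            f"        xmax = {duration}",
            "        intervals: size = 1",
            "        intervals [1]:",
            "            xmin = 0",
            f"            xmax = {duration}",
            '            text = ""',
        ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file that shares the recording's filename gives an annotator a file with all six tiers in place, ready for manual segmentation in Praat.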