Monolingual text corpus Languages
Linguality Linguality type: Monolingual
114,000,000 Tokens
2.6 Gb
Classification Text type: quasi-spoken
Register: formal
Annotation Semantic Annotation - Named Entities StandOff: True
Segmentation level: Word Group
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (named entities (person names, organizations, locations compatible with NKJP hierarchy) detected by Nerf)
Start date: 03/01/2011
End date: 11/26/2011
Segmentation StandOff: False
Segmentation level: Utterance
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (segmentation into utterances taken over from the transcripts; each utterance is marked with the speaker identifier (resolved in the transcript header))
Annotation Tools: scripts developed internally
Start date: 03/01/2011
End date: 11/26/2011
Segmentation StandOff: True
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (individual utterances split into sentences)
Start date: 03/01/2011
End date: 11/26/2011
Other StandOff: False
Segmentation level: Paragraph
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (representation of the session structure (taken over from the transcripts) in <div> elements)
Annotation Tools: scripts developed internally
Start date: 03/01/2011
End date: 11/26/2011
Morphosyntactic Annotation - B Pos Tagging Tagset: NKJP tagset
StandOff: True
Segmentation level: Word
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (MSD tag variants (all available morphosyntactic interpretations) output by Morfeusz, then disambiguated by Pantera tagger)
Start date: 03/01/2011
End date: 11/26/2011
Morphosyntactic Annotation - Pos Tagging Tagset: NKJP tagset
StandOff: True
Segmentation level: Word
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (POS tag (CTAG) variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger)
Start date: 03/01/2011
End date: 11/26/2011
Segmentation Tagset: NKJP tagset
StandOff: True
Segmentation level: Word
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (individual sentences split into tokens (word-like segments – see documentation of Morfeusz SGJP for details))
Start date: 03/01/2011
End date: 11/26/2011
Lemmatization StandOff: True
Segmentation level: Word
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (lemma variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger)
Start date: 03/01/2011
End date: 11/26/2011
Structural Annotation StandOff: True
Segmentation level: Word
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (syntactic words (word-like compounds) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details)
Start date: 03/01/2011
End date: 11/26/2011
Syntactic Annotation - Shallow Parsing StandOff: True
Segmentation level: Word Group
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (syntactic groups (phrase-like constructs) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details)
Start date: 03/01/2011
End date: 11/26/2011
Other StandOff: False
Segmentation level: Paragraph
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic (transcripts of one session day represented as a set of files: header.xml – header information about the individual session days, text_structure.xml – session structure and individual utterances, ann_segmentation.xml.gz – compressed sentence-level and token-level segmentation, ann_morphosyntax.xml.gz – disambiguated morphosyntactic description (lemma, POS tag and MSD tag), ann_words.xml.gz – syntactic words, ann_groups.xml.gz – syntactic groups, ann_named.xml.gz – named entities)
Annotation Tools: scripts developed internally
Start date: 03/01/2011
End date: 11/26/2011
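The file set listed above can be checked and read with a few lines of Python. This is a minimal sketch, not the tooling used to build the corpus. The file names come from the record. The assumption that each session day sits in its own directory, the function names, and the placeholder XML written by make_demo_session are all hypothetical and serve only to make the sketch runnable without the corpus.

```python
import gzip
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# File names as listed in the record; one directory per session day
# is an assumption made for this sketch.
PLAIN_FILES = ["header.xml", "text_structure.xml"]
COMPRESSED_LAYERS = {
    "segmentation": "ann_segmentation.xml.gz",
    "morphosyntax": "ann_morphosyntax.xml.gz",
    "words": "ann_words.xml.gz",
    "groups": "ann_groups.xml.gz",
    "named": "ann_named.xml.gz",
}


def expected_files():
    """Return every file name expected for one session day."""
    return PLAIN_FILES + list(COMPRESSED_LAYERS.values())


def missing_files(session_dir):
    """Return the expected file names that are absent from session_dir."""
    root = Path(session_dir)
    return [name for name in expected_files() if not (root / name).is_file()]


def load_layer(session_dir, layer):
    """Parse one gzip-compressed annotation layer and return its root element."""
    path = Path(session_dir) / COMPRESSED_LAYERS[layer]
    with gzip.open(path, "rb") as handle:
        return ET.parse(handle).getroot()


def make_demo_session(session_dir):
    """Write tiny placeholder files (hypothetical content, not real TEI)."""
    root = Path(session_dir)
    for name in PLAIN_FILES:
        (root / name).write_text("<placeholder/>", encoding="utf-8")
    for layer, name in COMPRESSED_LAYERS.items():
        xml = f'<placeholder layer="{layer}"/>'
        with gzip.open(root / name, "wb") as handle:
            handle.write(xml.encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_session(tmp)
        print(missing_files(tmp))
        print(load_layer(tmp, "named").get("layer"))
```

With the real corpus, load_layer would return the root of the TEI document for the chosen layer. The element and attribute names inside it are described in the NKJP documentation and are deliberately not guessed here.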
Creation Creation mode: Automatic
Creation mode details: Texts from terms 1-4 were converted from HTML files; texts from terms 5-6 were converted from XML files delivered by the Sejm. An audio and video sample from day 3 of sitting 89, term 6, was added as an example of multimodal content.
Original Sources Sprawozdanie Stenograficzne. Kancelaria Sejmu Rzeczypospolitej Polskiej, ul. Wiejska 4/6/8, 00-902, Warszawa, Poland. Wydawnictwo Sejmowe, 1991-2011. ISSN 08672768.
Creation Tools Spejd, a shallow parser for Polish
Pantera, a Brill tagger for Polish
Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
Nerf, a named entity recognizer for Polish