Polish Sejm Corpus


PSC

http://clip.ipipan.waw.pl/PSC

ID: 401

The Polish Sejm Corpus contains annotated utterances of Polish Sejm members from terms of office 1-6 (years 1991-2011). Corpus files contain information about text segmentation (paragraphs, sentences, tokens), disambiguated morphosyntactic description (lemma, POS tag, MSD tag), syntactic description (syntactic words and groups) and named entities (person names, locations, organizations).

The data is a valuable source of linguistic information: it is a large (100 M segments) collection of quasi-spoken content, and it forms the textual basis for the audio/video recordings of sessions, which started in 2011 and are planned to be appended to the corpus successively.


Distribution

Availability

Available - Unrestricted Use

Licence

BSD-style

Restrictions: Attribution

Fee: free of charge

Download location: hidden

Distribution Access/Medium: Downloadable

text

Monolingual text corpus

Languages

Polish

Linguality

Linguality type: Monolingual

Size

114,000,000 Tokens

2.6 GB

Classification

Text type: quasi-spoken

Register: formal

Annotation

Semantic Annotation - Named Entities

StandOff: True

Segmentation level: Word Group

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (named entities (person names, organizations, locations compatible with NKJP hierarchy) detected by Nerf)

Annotation Tools:

Nerf

Start date: 03/01/2011

End date: 11/26/2011

Segmentation

StandOff: False

Segmentation level: Utterance

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (segmentation into utterances taken over from the transcripts; each utterance is marked with the speaker identifier (resolved in the transcript header))

Annotation Tools:

scripts developed internally

Start date: 03/01/2011

End date: 11/26/2011
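The utterance-level layer described above can be read with a few lines of Python. This is a minimal sketch, not the corpus maintainers' tooling: it assumes that utterances are encoded with the TEI `<u>` element and that the speaker identifier sits in its `who` attribute as a `#`-prefixed reference. The sample XML and the speaker identifiers `SpeakerA` and `SpeakerB` are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}u'."""
    return tag.rsplit("}", 1)[-1]


def iter_utterances(xml_text):
    """Yield (speaker_id, text) for every <u> element, whatever its namespace."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "u":
            # Assumption: 'who' holds a reference like '#SpeakerA'.
            speaker = element.get("who", "").lstrip("#")
            # Collapse internal whitespace so line breaks in the XML do not leak out.
            text = " ".join("".join(element.itertext()).split())
            yield speaker, text


# Invented sample, shaped like a TEI transcript fragment.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>
  <u who="#SpeakerA">Otwieram posiedzenie.</u>
  <u who="#SpeakerB">Dziękuję, panie marszałku.</u>
</div></body></text></TEI>"""

for speaker, text in iter_utterances(sample):
    print(speaker, "->", text)
```

Resolving a speaker identifier to a name would additionally require reading the transcript header, whose exact structure is not described in this record.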

Segmentation

StandOff: True

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (individual utterances split into sentences)

Annotation Tools:

Pantera

Start date: 03/01/2011

End date: 11/26/2011

Other

StandOff: False

Segmentation level: Paragraph

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (representation of the session structure (taken over from the transcripts) in <div> elements)

Annotation Tools:

scripts developed internally

Start date: 03/01/2011

End date: 11/26/2011

Morphosyntactic Annotation - B Pos Tagging

Tagset: NKJP tagset

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (MSD tag variants (all available morphosyntactic interpretations) output by Morfeusz, then disambiguated by Pantera tagger)

Annotation Tools:

Morfeusz SGJP

Start date: 03/01/2011

End date: 11/26/2011

Morphosyntactic Annotation - Pos Tagging

Tagset: NKJP tagset

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (POS tag (CTAG) variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger)

Annotation Tools:

Pantera

Start date: 03/01/2011

End date: 11/26/2011

Segmentation

Tagset: NKJP tagset

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (individual sentences split into tokens (word-like segments – see documentation of Morfeusz SGJP for details))

Annotation Tools:

Morfeusz SGJP

Start date: 03/01/2011

End date: 11/26/2011
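Because the token layer is stand-off (StandOff: True), tokens point back into the text rather than containing it. The sketch below shows how such a pointer could be resolved. It assumes NKJP-style pointers of the form `file#string-range(id,offset,length)`. Whether the PSC files use exactly this form should be checked against the data. The paragraph identifier `p-1` and the sample sentence are hypothetical.

```python
import re

# Assumed pointer shape: "<file>#string-range(<element id>,<offset>,<length>)".
POINTER = re.compile(
    r"^(?P<file>[^#]*)#string-range\("
    r"(?P<id>[^,]+),(?P<offset>\d+),(?P<length>\d+)\)$"
)


def resolve_pointer(pointer, texts_by_id):
    """Return the substring that a stand-off pointer refers to."""
    match = POINTER.match(pointer)
    if match is None:
        raise ValueError(f"unsupported pointer: {pointer!r}")
    text = texts_by_id[match["id"]]
    start = int(match["offset"])
    return text[start:start + int(match["length"])]


# Hypothetical paragraph text keyed by a hypothetical element id.
paragraphs = {"p-1": "Otwieram posiedzenie."}
pointers = [
    "text_structure.xml#string-range(p-1,0,8)",
    "text_structure.xml#string-range(p-1,9,11)",
    "text_structure.xml#string-range(p-1,20,1)",
]
tokens = [resolve_pointer(pointer, paragraphs) for pointer in pointers]
print(tokens)  # ['Otwieram', 'posiedzenie', '.']
```

Keeping the text in one file and the annotation in others is what allows several layers (tokens, words, groups, named entities) to refer to the same character spans without duplicating the text.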

Lemmatization

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (lemma variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger)

Annotation Tools:

Morfeusz SGJP

Start date: 03/01/2011

End date: 11/26/2011

Structural Annotation

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (syntactic words (word-like compounds) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details)

Annotation Tools:

Spejd

Start date: 03/01/2011

End date: 11/26/2011

Syntactic Annotation - Shallow Parsing

StandOff: True

Segmentation level: Word Group

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (syntactic groups (phrase-like constructs) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details)

Annotation Tools:

Spejd

Start date: 03/01/2011

End date: 11/26/2011

Other

StandOff: False

Segmentation level: Paragraph

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (transcripts of one session day represented as a set of files: header.xml – header information about the individual session days, text_structure.xml – session structure and individual utterances, ann_segmentation.xml.gz – compressed sentence-level and token-level segmentation, ann_morphosyntax.xml.gz – disambiguated morphosyntactic description (lemma, POS tag and MSD tag), ann_words.xml.gz – syntactic words, ann_groups.xml.gz – syntactic groups, ann_named.xml.gz – named entities)

Annotation Tools:

scripts developed internally

Start date: 03/01/2011

End date: 11/26/2011
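The per-day file set listed above can be inspected without knowing the full schema. The following sketch is only an illustration under stated assumptions: each session day is assumed to sit in its own directory (the record does not describe the directory layout), and the function names are invented here. It opens each layer, decompresses the `.xml.gz` files on the fly, and counts element names.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# File names as listed in this record; the one-directory-per-day layout is an assumption.
LAYER_FILES = [
    "header.xml",
    "text_structure.xml",
    "ann_segmentation.xml.gz",
    "ann_morphosyntax.xml.gz",
    "ann_words.xml.gz",
    "ann_groups.xml.gz",
    "ann_named.xml.gz",
]


def load_layer(session_dir, filename):
    """Parse one layer file, transparently decompressing *.gz files."""
    path = Path(session_dir) / filename
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return ET.parse(handle).getroot()


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}seg'."""
    return tag.rsplit("}", 1)[-1]


def summarize_session(session_dir):
    """Return {filename: Counter of element names} for the layers that exist."""
    summary = {}
    for filename in LAYER_FILES:
        if (Path(session_dir) / filename).exists():
            root = load_layer(session_dir, filename)
            summary[filename] = Counter(local_name(el.tag) for el in root.iter())
    return summary
```

Calling `summarize_session("path/to/session-day")` with a real directory would show which elements each layer contains, which is a reasonable first step before writing layer-specific parsers.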

Creation

Creation mode: Automatic

Creation mode details: Texts from terms 1-4 were converted from HTML files; texts from terms 5-6 were converted from XML files delivered by the Sejm. An audio and video sample from day 3 of sitting 89, term 6, was added as an example of multimodal content.

Original Sources

Sprawozdanie Stenograficzne. Kancelaria Sejmu Rzeczypospolitej Polskiej, ul. Wiejska 4/6/8, 00-902, Warszawa, Poland. Wydawnictwo Sejmowe, 1991-2011. ISSN 08672768. http://www.sejm.gov.pl

Creation Tools

Spejd, a shallow parser of Polish
Pantera, a Brill tagger for Polish
Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
Nerf, a named entity recognizer for Polish

Resource Creation

Funding Project

Central and South-East European Resources (CESAR)

URL: http://www.cesar-pro...

Funding Types: EU Funds, National Funds, Own Funds

Funders: European Commission (50%), Polish Ministry of Science and Higher Education (40%), Institute of Computer Science, Polish Academy of Sciences (10%)

Funding Country: Poland

Project duration: 02/01/2011 - 01/31/2013

Metadata

Created: 10/17/2011

Last Updated: 11/26/2011

Source: CESAR

Version

Version: 1.0

Last Updated: 11/12/2011
