N-grams from Hungarian National Corpus – META-SHARE

Last view: 2024-12-20

N-grams from Hungarian National Corpus

HNCNgrams

ID: NGRAM-HNC

The Hungarian National Corpus (HNC) is divided into five subcorpora by regional language variant and also into five subcorpora by text genre. The subcorpus to be studied can be chosen by any combination of these, which makes the HNC an appropriate tool for studying differences not just between text genres but also between language variants. HGC aims to be a representative general-purpose corpus of present-day standard Hungarian.
HGC is based on the Hungarian National Corpus, with higher quality and a finer level of analysis and annotation: detailed morphosyntactic analysis and disambiguation with an updated processing toolchain, NP chunking, named entity recognition, distributional analysis, and built-in post-processing (multilevel frequency lists, subsequent searches on previous results). HGC is extended up to the 1 gigaword threshold, with extended metadata and cleared IPR.

Distribution

Availability

Available - Unrestricted Use

Licence

MS-NC-NoReD

Restrictions: Academic - Non Commercial Use

Distribution Access/Medium: Downloadable

Licensors:

Distribution rights holders:

Research Institute for Linguistics, Hungarian Academy of Sciences

Contact Person

textngram

Monolingual textngram corpus

Languages

Hungarian

Linguality

Linguality type: Monolingual

Size

50,730,372 wordform bigrams

124,314,331 wordform trigrams

177,712,578 wordform 4-grams

200,548,340 wordform 5-grams

33,014,184 lemma bigrams

99,457,540 lemma trigrams

162,616,966 lemma 4-grams

194,983,488 lemma 5-grams

NGram

Order: 5

Base item: Word
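
For readers unfamiliar with the terms, the following is a minimal sketch of what word-based n-grams of order 2 to 5 look like when counted from a token sequence. It is not the toolchain used to build this resource; the function name `count_ngrams` and the toy tokens are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_ngrams(tokens, min_order=2, max_order=5):
    """Count contiguous n-grams for every order from min_order to max_order.

    Hypothetical helper for illustration; returns {order: Counter of tuples}.
    """
    counts = {n: Counter() for n in range(min_order, max_order + 1)}
    for n in counts:
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy input, not taken from the corpus.
tokens = ["a", "b", "a", "b", "c", "a"]
counts = count_ngrams(tokens)

for order in sorted(counts):
    print(order, len(counts[order]), "distinct n-grams")
```

Under the same assumption, lemma n-grams would be counted by passing a list of lemmas instead of surface wordforms to the same function.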

Metadata

Created: 01/25/2013

Last Updated: 02/18/2013

Metadata Creator
