Danish Gigaword Corpus

http://catalog.elra.info/product_info.php?products_id=1369

ID: ELRA-W0318

The Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is:
1. representative;
2. accessible;
3. a suitable common starting point for Danish NLP models.
The present version 1.0 was collected from various websites. Domains are distributed as follows:
- Legal: 308.8 million words
- Social Media: 261.4 million words
- Subtitles: 130.1 million words
- Debates: 108.4 million words
- Conversations: 0.7 million words
- Web: 101.02 million words
- Encyclopedia: 55.6 million words
- Literature: 31.3 million words
- Manuals: 2.6 million words
- Books: 2.1 million words
- Religion: 0.6 million words
- News: 40 million words
- Other: 1.2 million words
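The figures above can be checked against the "over a billion words" claim with a few lines of Python. This is only an illustrative sketch: the dictionary copies the word counts from the list, and the names `DOMAIN_MILLIONS` and `domain_shares` are made up for this example and are not part of the corpus distribution.

```python
# Word counts in millions, copied from the domain list in this record.
DOMAIN_MILLIONS = {
    "Legal": 308.8,
    "Social Media": 261.4,
    "Subtitles": 130.1,
    "Debates": 108.4,
    "Conversations": 0.7,
    "Web": 101.02,
    "Encyclopedia": 55.6,
    "Literature": 31.3,
    "Manuals": 2.6,
    "Books": 2.1,
    "Religion": 0.6,
    "News": 40.0,
    "Other": 1.2,
}


def domain_shares(counts):
    """Return the total and each domain's fraction of that total."""
    total = sum(counts.values())
    return total, {name: value / total for name, value in counts.items()}


total, shares = domain_shares(DOMAIN_MILLIONS)

# Print the domains from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{DOMAIN_MILLIONS[name]:>8.2f} M  {share:6.1%}")
print(f"Total: {total:.2f} million words")
```

The listed domains sum to roughly 1,044 million words, with Legal and Social Media together making up a little over half of the total.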
Data is presented as plain text in UTF-8, one file per document. Accompanying metadata gives, among other things, the author, the time or location of the document's creation, and an API hook for re-retrieving the document.
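Because each document is a separate UTF-8 plain-text file, the corpus can be streamed one file at a time. The sketch below assumes the files sit somewhere under a single root directory. This record does not describe the actual directory layout, file extensions, or metadata format, so the helper names `iter_documents` and `count_words` and the traversal strategy are hypothetical. Any metadata files stored alongside the documents would need to be filtered out.

```python
from pathlib import Path


def iter_documents(root):
    """Yield (path, text) for every regular file under root, decoded as UTF-8."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield path, path.read_text(encoding="utf-8")


def count_words(root):
    """Rough word count: whitespace-separated tokens summed over all files."""
    return sum(len(text.split()) for _, text in iter_documents(root))
```

Whitespace splitting is only a rough approximation of a word count. The tokenisation behind the catalogue's own figures is not stated here, so totals computed this way may differ from the numbers listed above.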

Distribution

Availability

Available - Restricted Use

Start date: 01/28/2022

Licence

ELRA END USER

Restrictions: Academic - Non Commercial Use

For Non Members of ELRA

Fee: 0.00

User Nature: Commercial

ELRA VAR

Restrictions: Commercial Use

For Members of ELRA

Fee: 0.00

User Nature: Commercial

ELRA END USER

Restrictions: Academic - Non Commercial Use

For Members of ELRA

Fee: 0.00

User Nature: Commercial

ELRA VAR

Restrictions: Commercial Use

For Members of ELRA

Fee: 0.00

User Nature: Academic

ELRA END USER

Restrictions: Academic - Non Commercial Use

For Members of ELRA

Fee: 0.00

User Nature: Academic

ELRA VAR

Restrictions: Commercial Use

For Non Members of ELRA

Fee: 0.00

User Nature: Commercial

ELRA VAR

Restrictions: Commercial Use

For Non Members of ELRA

Fee: 0.00

User Nature: Academic

ELRA END USER

Restrictions: Academic - Non Commercial Use

For Non Members of ELRA

Fee: 0.00

User Nature: Academic

Contact Person

Mapelli Valérie

text

Monolingual text corpus

Languages

Danish

Linguality

Linguality type: Monolingual

Size

no size available

Metadata

Created: 05/12/2005

Version

Version: 1.0

Last Updated: 01/28/2022
