Corpora

From MLML


This page offers an overview of the MLML corpus holdings, organised by corpus source and by corpus language. In total, we hold over 200 corpora covering 33 languages, a number of which are represented by several varieties.

Corpora by Source

Corpora not from the Linguistic Data Consortium

The corpora listed here are those obtained from sources other than the Linguistic Data Consortium. In each case the link leads to a quick synopsis of the corpus's contents, with links to the official website and, where applicable, to additional information.

Corpora from the Linguistic Data Consortium (LDC)

The code to the left of the corpus name is the catalogue number assigned by the LDC. The letter immediately following the year inside that code indicates the type of corpus: "L" identifies a lexicon, "S" indicates an audio corpus, "T" identifies a text corpus, and "V" indicates a video corpus. The corpora are arranged by year and then by their LDC catalogue number. Each link leads to the LDC page for the corpus.
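The type-letter convention described above can be read off a catalogue number mechanically. The following is a minimal sketch, not an official LDC tool: it assumes codes of the form "LDC", a two- or four-digit year, one type letter, and a serial number, and it assumes that two-digit years belong to the 1990s. The function name `parse_ldc_code` and the sample codes are illustrative only.

```python
import re
from typing import NamedTuple

# Type letters as described in the paragraph above.
CORPUS_TYPES = {"L": "lexicon", "S": "audio", "T": "text", "V": "video"}

# Assumed shape: "LDC", a 4- or 2-digit year, a type letter, a serial number.
_CODE = re.compile(r"^LDC(\d{4}|\d{2})([LSTV])(\d+)$")


class LdcCode(NamedTuple):
    year: int
    corpus_type: str
    serial: int


def parse_ldc_code(code: str) -> LdcCode:
    """Split a catalogue number into year, corpus type, and serial number."""
    match = _CODE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised LDC code: {code!r}")
    year_text, letter, serial = match.groups()
    year = int(year_text)
    if len(year_text) == 2:
        year += 1900  # assumption: two-digit years are from the 1990s
    return LdcCode(year, CORPUS_TYPES[letter], int(serial))


if __name__ == "__main__":
    for sample in ("LDC2015S01", "LDC93S1"):
        print(sample, parse_ldc_code(sample))
```

Codes with letters other than the four listed on this page are rejected with a `ValueError` rather than guessed at.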

2015 LDC Corpora

2014 LDC Corpora

2013 LDC Corpora

2012 LDC Corpora

2010 LDC Corpora

2009 LDC Corpora

2008 LDC Corpora

2007 LDC Corpora


2006 LDC Corpora

2005 LDC Corpora


2004 LDC Corpus

2003 LDC Corpora

2002 LDC Corpus

2001 LDC Corpus

2000 LDC Corpus

1999 LDC Corpus

1998 LDC Corpora

1997 LDC Corpora

1996 LDC Corpora

1995 LDC Corpora

1994 LDC Corpora

1993 LDC Corpora

Corpora by Language

In this section corpora are listed by language. Varieties and dialects are given their own sub-sections where appropriate; LDC corpora are placed under the language variety named in the LDC corpus summary, or directly under the language when the summary names none. Within each list, non-LDC corpora come first in alphabetical order, followed by LDC corpora in order of their catalogue number. As before, each item links to its corpus page (for non-LDC corpora) or to the LDC corpus page (for LDC corpora), and the letter immediately following the year in the LDC catalogue number indicates the type of corpus: "L" identifies a lexicon, "S" indicates an audio corpus, "T" identifies a text corpus, and "V" indicates a video corpus. A corpus that contains multiple languages is listed under each of those languages.
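The listing order within each language can be expressed as a sort key. This is a rough sketch under stated assumptions: each entry is represented by a single label, which is the corpus name for non-LDC corpora and the catalogue number for LDC corpora; "catalogue number order" is taken to mean ascending year, then type letter, then serial number; and two-digit years are taken to be from the 1990s. The function `listing_key` and the sample labels are hypothetical.

```python
import re

# Assumed catalogue-number shape: "LDC", year, one letter, serial number.
_LDC = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)$")


def listing_key(label: str):
    """Sort key: non-LDC names alphabetically first, then LDC codes."""
    match = _LDC.match(label)
    if match is None:
        # Non-LDC corpus: group 0, compared case-insensitively by name.
        return (0, label.casefold())
    year_text, letter, serial = match.groups()
    year = int(year_text)
    if len(year_text) == 2:
        year += 1900  # assumption: two-digit years are from the 1990s
    # LDC corpus: group 1, compared by year, type letter, serial number.
    return (1, year, letter, int(serial))


if __name__ == "__main__":
    labels = [
        "LDC2004T12",
        "Example Corpus B",
        "LDC93S1",
        "example corpus a",
        "LDC2004T05",
    ]
    for label in sorted(labels, key=listing_key):
        print(label)
```

Comparing the numeric parts as integers avoids the pitfall of plain string sorting, which would place "LDC2004T05" before "LDC93S1".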

Albanian (Tosk)

Arabic

Egyptian Arabic

Gulf Arabic

Maghrebi Arabic

Multiple Dialects

North Levantine Arabic

Standard Arabic

South Levantine Arabic

Bulgarian

Central Kurdish (Sorani Dialect)

Chinese

Mandarin Chinese

Wu Chinese

Xiang Chinese

Croatian

Czech

Dari

English

French

German

Gullah Creole

Hausa

Hindi

Italian

Multiple Dialects

Japanese

Korean

Mandekan

Paharia

Mal Paharia

Sauria Paharia

Kumarbhag Paharia

Persian (Farsi)

Farsi / Iranian Persian

Polish

Portuguese

Russian

Sea Island Creole English

Spanish

Swedish

Tamil

Thai

Trinidadian Creole English

Turkish

Vietnamese

Yémba

Yoruba

Standard Yoruba

Lucumí