Corpora and Corpus-based Computational Linguistics:

Manuel Barbera's Web Resources Reference Guide.

Welcome to Manuel Barbera's Reference Guide to Corpora and Corpus-based Computational Linguistics Resources on the Web (CLR Guide for short).

There are about 1600 exhaustively annotated files. As a rule, descriptions are taken directly from the sites they refer to and are only slightly adapted or translated into English. I usually place particular emphasis on the availability (conditions, costs, etc.) of the resources.

Besides providing institutional and general references, these pages aim principally to gather information on specific languages, especially the "exotic" and lesser-known ones. At present I have not yet systematically combed the data on the "major" European languages, and I have surely missed many resources for "minor" languages as well (they are often hard to track down). Even so, 111 languages are already represented, and this is only a start.

References are given systematically only to text corpora, and not to speech corpora (i.e. audio recordings) or dictionary corpora (i.e. lexical databases). These, like automatic translation pages, are reported only occasionally, when they are of particular interest. In the future, however, I hope to add a speech corpora section as well.

A special section was devised for multilingual and parallel corpora, by virtue of their increasing importance for translation and language teaching.

For dictionary resources, however, you can easily refer to Robert Beard's Web of Dictionary Online, which has since changed its name; for "words and expressions you most likely won't find in a normal dictionary", see instead Hans-Christian Holm's TAD (The Alternative Dictionaries).

The section on References & Educational focuses mainly on (1) institutions, journals and personal pages; (2) standards and formats; (3) courses, lessons and papers. Other useful but less easily classified references are given as well.

The section on Tools may look large, but it is really little more than an outline: it does not strive for completeness, though I hope it can be useful all the same.

As for e-Texts, I consider here only the main collection and link pages, especially those Electronic Libraries that are of some interest for Corpus Linguistics (1) as a source of digitized texts or (2) as queryable quasi-corpora. For a more codicological, library-oriented and antiquarian approach you can refer to the rich reference page Libro Antico (by Angela Nuovo, Aldo Coletto and Graziano Ruffilli).

In the Localized Resources files I provide only links to language-specific corpora, e-texts and NLP resources in general. For other kinds of linguistic information you can refer to the Linguist List, the Ethnologue, the SIL (formerly the Summer Institute of Linguistics), the Yamada Language Center, Jennifer's Language Page, and the rich archives of Languages-on-the-web, which boasts 30,000 language links. The European Minority (or Minoritized) Language site is a more specific but equally useful starting page; for Native American languages, the American Indian Language Link Page, the Native Languages Page and Brave Arrow's Native Links are good starting points; for Creole languages you can instead look at the Creolist Archives.

This is the third major release of my CLR Guide, but the site is still in progress (and will always remain so, I hope). The links in red have already been checked; those in blue are taken from other link collections and have not yet been checked. Keeping a site like this uniformly up to date is nearly impossible: since last spring I have been dating every addition, revision or simple check I make (and any file without a date implicitly goes back to Summer 2000). This way the reader at least knows how "fresh" the file being consulted is.

I apologize for any mistakes or inaccuracies: my Guide is still almost in its infancy (even though earlier versions of these files have already been online at the Trieste University SSLMIT and at the Stuttgart EURALEX sites; this third release of my CLR Guide was supported by a MURST grant), and in any case such a work is by its very nature imperfect and in need of improvement. A Web reviewer's job is definitely never done. Please e-mail me any corrections or suggestions; additions, especially for lesser-known languages, are welcome.

And, last but not least, many thanks to all the people that have already linked to my pages: Artifara (cf. home), Beom-mo Kang (cf. home), Biblioteca del Dipartimento di Scienze filologiche e letterarie of Turin University (cf. the Library main page), Bruce from UCLA, CELT, Cesáreo Calvo Rigual (cf. home), David Lee (cf. home), Elio Jucci (alternative page), EURALEX, Heok-Seung Kwon (cf. home), Hongyin Tao (cf. home), Menno van Zaanen (cf. home), Montserrat Civit Torruella (cf. home), Onofrio Carruba (cf. home), Pilar Sánchez-Gijón (cf. home), Raffaele Cocchi (cf. home), Roger Levy at Stanford, Ryan Stansifer (cf. home), Star Thrower Publishing, Tanja Gaustad (cf. home), UCREL, Virginia Tech University Libraries (Slavic, East European, and former USSR pages; cf. home), Yonsei University (cf. home).


General Resources.

(1) Corpora and Corpus Linguistics.
(2) Multilingual and Parallel Corpora.
(3) Electronic Literary Text Archives.
(4) References, Standards & Educational Resources.
(5) Tools.

Localized Resources.

(A-D) Afrikaans - Albanian - Albanian (Caucasic) - Arabic - Armenian - Australian lgs. - Awabakal (Yuin-Kuric) - Azerbaijani - Barbadian (Creole English) - Basque - Bengali - Berbice (Creole Dutch) - Bulgarian - Catalan - Chinese (incl. Cantonese) - Chiricahua (Apache) - Commonwealth Antillean Creole French - Commonwealth Windward Islands Creole English - Czech - Danish - Dutch
(E) English (Modern) - English (Old & Middle) - Esperanto - Estonian
(F-I) Farsi - Finnish - French - French Antillean Creole French - Frisian - Gaelic - Georgian - German - Gothic - Greek (Classic and Modern) - Gujarati - Gulf of Guinea Creole Portuguese - Guyana Creole English - Guyanais (Creole French) - Haitian (Creole French) - Hebrew - Hindi - Hungarian - Icelandic (incl. Old Norse) - Indo-European - Indonesian - Irish (incl. Ogamic, Old & Middle Irish) - Italian
(J-R) Jamaican Creole English - Japanese - Karelian - Korean - Krio (Sierra Leone Creole English) - Kru (Liberian Pidgin English) - Latin - Latvian - Leeward Islands Creole English - Lithuanian - Livonian - Louisiana Creole French - Macaísta (Macau Creole Portuguese) - Malay - Maltese - Mambila - Manx - Mari (Eastern Meadow) - Mauritian Creole (Isle de France CF) - Mescalero (Apache) - Miskito Creole English - Mitchif (French-Cree mixed language) - Nahuatl - Neapolitan - Negerhollands (Creole Dutch) - Norwegian - Occitan - Palenquero (Creole Spanish) - Panjabi - Polish - Portuguese (incl. Brazilian & Galego-Portuguese) - Romanian - Russian
(S-Z) Sardinian - Saxon (Old) - Scots - Serbo-Croatian - Singhalese - Slavonian (Old Church Slavonian) - Slovak - Slovene - Spanish - Sumerian - Swahili - Swedish - Tagalog - Taino - Tamil - Tetun (East Timorese) - Thai - Tibetan - Tocharian (A & B) - Tok Pisin (Creole English) - Turkish - Ukrainian - Upper Guinea Creole Portuguese - Urdu - Uzbek - Veps - Vietnamese - Virgin Islands Creole English - Welsh - West African Pidgin English.


Manuel Barbera, 2000 August 28 - Last revisions 2004 March 25.

!! New layout 2001 August 30 !!

HTML Version by Eva Cappellini & Manuel Barbera.
Best viewed with Internet Explorer.
This work is released under a Creative Commons Licence.