Localized Resources .2.

(A-D) Afrikaans - Albanian - Albanian (Caucasic) - Arabic - Armenian - Australian lgs. - Awabakal (Yuin-Kuric) - Azerbaijani - Barbadian (Creole English) - Basque - Bengali - Berbice (Creole Dutch) - Bulgarian - Catalan - Chinese (incl. Cantonese) - Chiricahua (Apache) - Commonwealth Antillean Creole French - Commonwealth Windward Islands Creole English - Czech - Danish - Dutch
(E) English (Modern) - English (Old & Middle) - Esperanto - Estonian
(F-I) Farsi - Finnish - French - French Antillean Creole French - Frisian - Gaelic - Georgian - German - Gothic - Greek (Classic and Modern) - Gujarati - Gulf of Guinea Creole Portuguese - Guyana Creole English - Guyanais (Creole French) - Haitian (Creole French) - Hebrew - Hindi - Hungarian - Icelandic (incl. Old Norse) - Indoeuropean - Indonesian - Irish (incl. Ogamic, Old & Middle Irish) - Italian
(J-R) Jamaican Creole English - Japanese - Karelian - Korean - Krio (Sierra Leone Creole English) - Kru (Liberian Pidgin English) - Latin - Latvian - Leeward Islands Creole English - Lithuanian - Livonian - Louisiana Creole French - Macaísta (Macau Creole Portuguese) - Malay - Maltese - Mambila - Manx - Mari (Eastern Meadow) - Mauritian Creole (Isle de France CF) - Mescalero (Apache) - Miskito Creole English - Mitchif (French-Cree mixed language) - Nahuatl - Neapolitan - Negerhollands (Creole Dutch) - Norwegian - Occitan - Palenquero (Creole Spanish) - Panjabi - Polish - Portuguese (incl. Brazilian & Galego-Portuguese) - Romanian - Russian
(S-Z) Sardinian - Saxon (Old) - Scots - Serbo-Croate - Singhalese - Slavonian (Old Church Slavonian) - Slovak - Slovene - Spanish - Sumerian - Swahili - Swedish - Tagalog - Taino - Tamil - Tetun (East Timorese) - Thai - Tibetan - Tocharian (A & B) - Tok Pisin (Creole English) - Turkish - Ukrainian - Upper Guinea Creole Portuguese - Urdu - Uzbek - Veps - Vietnamese - Virgin Islands Creole English - Welsh - West African Pidgin English.

I provide here language-specific links to corpora, e-texts and NLP resources in general. Resources already presented in the previous sections are also repeated here whenever relevant.

(E)

English (Modern).

ACCENZ (Computerised Corpus of English in New Zealand): see WSC (Wellington Spoken Corpus).
ACE (Australian Corpus of English)

ACE was the first systematically compiled heterogeneous (untagged) corpus in Australia, designed to support a variety of linguistic research. Interest in the differentiation between Australian, British and American English meant that a corpus modelled on the Brown and LOB corpora would provide ready comparisons. It would also serve as a strategic sample of current Australian English, and as a reference corpus for comparisons with more specialised, homogeneous corpora in Australia. ACE matches the Brown and LOB corpora in most aspects of its structure and constituency, so that direct interdialectal comparisons can be made on a comparable range of printed genres. Yet the desire to create an up-to-date corpus of Australian English prompted the decision not to match Brown and LOB chronologically, i.e. with data drawn from publications of the early 1960s. Instead, ACE consists of material from 1986.
A version on CD-ROM is available from ICAME: you can also download a small sample, and a manual as well.

ACL/DCI Corpus of American English

(Association for Computational Linguistics / Data Collection Initiative Corpus):
http://morph.ldc.upenn.edu/Catalog/readme_files/acldci.readme.html
The ACL Data Collection Initiative was founded "to oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties". Towards this goal, the ACL/DCI has acquired several hundred million words of text, has modified much of it so as to make it more accessible for research purposes, and has distributed tapes containing portions of this data to more than 40 research sites. The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879). SGML provides a labelled bracketing of the text, with labels permitted to have associated feature-value pairs. The ACL/DCI corpus of American English is available in different distributions. Contact: Linguistic Data Consortium, 441 Williams Hall, University of Pennsylvania, Philadelphia, PA 19104-6305; Phone +898-0464; FAX: (+1 215) 573-2175x.
+ The LDC CD-ROM version contains texts from: Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, copyright 1979, William Collins Sons Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy; and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990, 1991, University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.
Available from the LDC through membership or for $100.
+ For a CD-ROM corpus containing a complete, Treebank-style parsing of the three-year WSJ archive from the ACL/DCI corpus (about 30 million words of text), see BLLIP 1987-89 WSJ Corpus (Release 1), available from the LDC.

Alex Catalogue of Electronic Texts: http://www.lib.ncsu.edu/stacks/alex-index.html

The Alex Catalogue of Electronic Texts is a collection of digital documents (freely queryable online and downloadable) in the subject areas of English literature, American literature, and Western philosophy. The Catalogue is not only an archive of downloadable texts: you can also search the content of the texts you locate and run queries online. For example, you can search for Mark Twain's The Adventures Of Huckleberry Finn. Simple. You can then search the content of The Adventures for words like fish and belly to get a description of Huck Finn's father. Moreover, you can search the content of multiple documents simultaneously. For example, you can first locate all the documents in the collection authored by Mark Twain. Next, you can search selected documents for something like slav* (which includes slave, slaves, slavery, etc.) to draw out themes across texts. For more information see this site in the E-Texts section.
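
To make the truncation query mentioned above concrete, here is a minimal sketch of how a pattern such as slav* can be matched against several texts at once. It is not the Alex Catalogue's own search engine, whose implementation is not described here; the function names and the two tiny sample "documents" are invented for illustration.

```python
import re


def truncation_to_regex(pattern):
    """Turn a truncation query such as 'slav*' into a whole-word regex."""
    # Escape the query, then let each '*' stand for any run of word characters.
    body = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def search(texts, pattern):
    """Return, for each document title, the words that match the query."""
    rx = truncation_to_regex(pattern)
    hits = {}
    for title, text in texts.items():
        found = rx.findall(text)
        if found:
            hits[title] = found
    return hits


# Hypothetical miniature collection; the sentences are made up.
texts = {
    "doc_a": "The slave spoke of slavery and of other slaves.",
    "doc_b": "A river, a raft, and a long summer.",
}

print(search(texts, "slav*"))
# {'doc_a': ['slave', 'slavery', 'slaves']}
```

Only documents with at least one hit are returned, which mirrors the idea of drawing out a theme across selected texts.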

AMALGAM Corpora: http://www.scs.leeds.ac.uk/ccalas/amalgam/amalghome.htm

The AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora. They are developing a Multi-tagged Corpus and Multi-Treebank, i.e. a single text-set annotated with all the above tagging and parsing schemes. Useful demos are already online:
+ AMALGAM Multi-tagged Corpus (180 Eng. sentences).
+ AMALGAM Multi-Treebank (60 Eng. sentences).
For more information see this site in the Reference, Standards etc. section.
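
As a rough illustration of what mapping between tagsets involves, the sketch below converts a sentence tagged with a few Brown-style tags into Penn Treebank-style tags by simple table lookup. This is not the AMALGAM algorithm: real mappings are often one-to-many and context dependent, and the choice of these two tagsets, the table and the sample sentence are assumptions made only for this example.

```python
# Hypothetical, deliberately tiny lookup table (Brown-style -> Penn-style).
BROWN_TO_PENN = {
    "AT": "DT",    # article -> determiner
    "NN": "NN",    # singular common noun
    "NNS": "NNS",  # plural common noun
    "JJ": "JJ",    # adjective
    "BEZ": "VBZ",  # "is" -> verb, 3rd person singular present
    "IN": "IN",    # preposition
}


def map_tags(tagged, table, unknown="UNMAPPED"):
    """Map each (word, tag) pair through the table, flagging unknown tags."""
    return [(word, table.get(tag, unknown)) for word, tag in tagged]


# Invented sample sentence with Brown-style tags.
sentence = [("the", "AT"), ("corpus", "NN"), ("is", "BEZ"), ("large", "JJ")]

print(map_tags(sentence, BROWN_TO_PENN))
# [('the', 'DT'), ('corpus', 'NN'), ('is', 'VBZ'), ('large', 'JJ')]
```

Tags missing from the table are marked rather than silently dropped, so gaps in the mapping stay visible.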

American News Stories Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 1608 KB corpus made of stories from the Associated Press news network, December 1979. Compiled by Glenn Akers. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

American Verse Project: http://www.hti.umich.edu/english/amverse/

This project of the University of Michigan Humanities Text Initiative (HTI) is assembling an electronic archive of volumes of American poetry prior to 1920. All texts are freely readable and downloadable in either HTML or SGML format. Simple, boolean or co-occurrence searches can be submitted across the entire American Verse Project collection; there is also an interface for searching only personally selected works in the collection.

Anaphoric Treebank: http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#ana

A subsample of the AP corpus, annotated to show the reference of pronouns and lexical cohesion. Approximately 100,000 words.

ANC (American National Corpus): http://americannationalcorpus.org/

The ANC project (led by Catherine McLeod, Nancy Ide, Charles Fillmore and others) is fostering the development of a corpus comparable to the BNC, covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language. Creation of a corpus of American English will significantly contribute to language and linguistic research, as well as provide a rich national resource for use in education at all levels. A consortium of publishers of American English dictionaries and companies with interests in language processing has been formed. Consortium members are providing both materials for inclusion in the corpus and initial financial support for the project. The LDC is providing staff time to perform the initial clean-up and base-level encoding of the data and will manage distribution of the corpus. The ANC will contain a core corpus of at least 100 million words, comparable across genres to the BNC. Beyond this, the corpus will include an additional component of potentially several hundreds of millions of words, chosen to provide both the broadest and largest selection of texts possible. The ANC will be encoded according to the specifications of XCES, the XML version of CES. Initially, the corpus will contain only textual data across a variety of genres, including transcriptions of spoken data. Audio speech data, video, etc. will be added in a later phase, depending on funding. All data will be distributed freely for non-commercial research purposes from the outset. Commercial use will be limited to members of the ANC Consortium throughout the development process and for five years after the full corpus becomes available. The project is still in its infancy, since it was conceived only in 1999 (see the Proposal paper online). [2001 April 29].

AP Treebank (The Associated Press Treebank):

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#apt
A skeleton-parsed corpus of American newswire reports. 1,000,000 words.

APHB Treebank (The American Printing House for the Blind Treebank):

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#aphb
A skeleton-parsed corpus of a wide range of English texts. 200,000 words.

ARCHER Historical Corpus: [homepage missing]

The Representative Corpus of Historical English Registers has about 2 million words of British and American English texts between 1650 and 1990, with both written and speech-based registers. Presented in the two following papers: [1] Biber, Douglas, Edward Finegan & Dwight Atkinson (1994a). "ARCHER and Its Challenges: Compiling and Exploring a Representative Corpus of Historical English Registers". Creating and Using English Language Corpora. Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993, ed. by Udo Fries, Gunnel Tottie & Peter Schneider, 1–13. Amsterdam & Atlanta, GA: Rodopi. [2] Biber, Douglas, Edward Finegan, Dwight Atkinson, Ann Beck, Dennis Burges & Jena Burges (1994b). "The Design and Analysis of the ARCHER Corpus: A Progress Report [A Representative Corpus of Historical English Registers]". Corpora Across the Centuries. Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25–27 March 1993, ed. by Merja Kytö, Matti Rissanen & Susan Wright, 3–6. Amsterdam & Atlanta, GA: Rodopi. [Last checked 2001 August 26].
Contact: Douglas Biber (see this page), Department of English, Northern Arizona University, Flagstaff, AZ 86011-6032, USA.

Arizona Corpus: [homepage missing]

The Arizona Corpus of Elementary Student Writing is said to contain 5,000 essays written by native English, Spanish and Navajo student residents of the state of Arizona. No other information is available (this information comes from a page by Przemyslaw Kaszubski on computerised learner corpora). [2001 May 1].

BAF French - English Parallel Corpus: http://www-rali.iro.umontreal.ca/arc-a2/BAF/

The BAF Corpus is a corpus of French - English bi-texts, i.e. of pairs of French and English texts which are mutual translations, and whose sentences have been aligned. This corpus has been built by the CITI computer-assisted translation group (TAO). Most of the texts are of institutional genre (Canadian Hansard, UN reports, etc.), but a few scientific papers and a literary work were also included. The whole corpus has about 400,000 words for each language. BAF Version 1.1 is already available and can be freely downloaded in UNIX GZ format, in ZIP format, and file by file in TXT and CES formats. A description, alignment conventions, encoding documentation, and a COAL Tools suite are also freely available on the site. [2001 April 23].

Berkshire Probate Inventories Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 383 KB English corpus made of transcriptions from: Mss. in the Berkshire County Record Office. Compiled by C. R. J. Currie. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Bible of University of Maryland Parallel Corpus Project: http://benjamin.umd.edu/parallel/bible.html

The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PostScript papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].

Birkbeck Spelling Error Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 1684 KB corpus compiled by Roger Mitton. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

BLLIP 1987-89 WSJ Corpus (Release 1): http://morph.ldc.upenn.edu/Catalog/LDC2000T43.html

This two CD-ROM newswire corpus contains a complete, Treebank-style parsing of the three-year WSJ archive from the ACL/DCI corpus -- about 30 million words of text. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistically-based methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP. This corpus both overlaps and supplements the 1-million-word Penn Treebank collection of parsed and POS-tagged WSJ texts. Available only from the LDC, through membership or for $100.

Blues Lyric Poetry Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 1483 KB English corpus transcribed from: Blues lyric poetry: an anthology / Michael Taft. -- New York; London: Garland, 1983. -- (Garland reference library of the humanities ; v. 361). Compiled by Michael Taft. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

BNC (British National Corpus): http://info.ox.ac.uk/bnc/

The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. The Corpus is designed to represent as wide a range of modern British English as possible. The written part (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age, region and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins. The corpus comprises 4,124 texts, of which 863 are transcribed from spoken conversations or monologues. Each text is segmented into orthographic sentence units, within which each word is automatically assigned a word class (part of speech) code. There are six and a quarter million sentence units in the whole corpus. Segmentation and word-classification were carried out automatically by the CLAWS stochastic part-of-speech tagger developed at the University of Lancaster. The classification scheme used for the corpus distinguishes some 65 parts of speech, which are described in the accompanying documentation. The corpus is encoded according to the Guidelines of the TEI using SGML (ISO standard 8879). [Last rev. 2001 April 23].
+ BNC World Edition on CD-ROM. The cost (excluding VAT) is £250 for a full networked licence, or £50 (a very interesting price indeed!) for a single user licence (in addition, VAT at 17.5% where applicable is payable on orders within Europe, and a small fixed fee is charged for postage and packing). This includes the BNC licence fee (valid for five years) of £10, two CD-ROMs, documentation and (networked version only) source code for the SARA system. Please note however that the current version of the BNC (version 1.0) cannot be distributed outside the EU because of copyright restrictions applicable to a few texts. A new version which will not be restricted in this way is currently in production. The single-user system is designed for use on standalone computers running under any Microsoft Windows 32-bit system (i.e. Windows 95, 98, ME, NT or 2000). It needs at least 6 Gb of free disk space, and 64 Mb of RAM. Better performance will be obtained with more memory (128+ Mb) and faster (over 300 MHz) processors. The networked system is designed for use on a local TCP/IP network running under any version of Unix. The server has been successfully run under versions of Linux, Solaris, and Digital Unix and on a variety of hardware platforms: a minimum of 8 Gb of disk space is needed and at least 128 Mb of memory. The amount of memory used depends chiefly on the number of client sessions running and on the complexity of queries posed. A fully-featured Windows 32 client, which can be installed on any PC connected to the TCP/IP network, is supplied as well.
+ BNC Online is a new service which allows anyone with access to the internet to search the British National Corpus online. Several levels of access are provided. (1) You can make a simple search directly from the web browser you are currently using, freely and without registering; the restricted search interface will not return more than 50 hits, with a maximum of one sentence of context for each, but it will support any legal CQL query. (2) You can register for a free temporary user name to experiment without restrictions with this online service. A temporary username costs nothing, but expires after thirty days. (3) To take full advantage of the BNC Online service, however, you must first download the SARA client software and install it on your PC. SARA is a special purpose browser and concordance generator, designed specifically for use with the BNC. It is free of charge to all BNC licensees. At present it is only available for Microsoft 32-bit Windows systems (Windows 95, 98, or NT). (4) Once you have made a temporary 30-day registration, if you want to continue your use of the service after this trial period, you must pay a 60 pound fee to receive a full registration. This includes a three year licence to use the system, one copy of a detailed user manual, and free updates of the client software. This fee entitles you to a license for one or two machines for a year. Compared with other similar services it is surely a great offer! [Rev. 2001 April 23].
+ A small Spoken English subsection of the BNC constitutes 50% of the tagged and freely available CHRISTINE Corpus.

BoLC (It-En) (Bononia Legal Corpus): http://www.cilta.unibo.it/SITOBOLC_ITA.htm (English also)

The BOnonia Legal Corpus (BOLC), developed since 1997 at CILTA (Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann') at Bologna University by Rema Rossini Favretti and Fabio Tamburini, currently consists of two subcorpora, one English and one Italian, but it could be expanded at a later stage. Future availability is not known. For more details cf. under the Parallel Corpora section. [2001 April 23].

Brown Corpus (The Brown University Corpus):

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#brown
Approximately 1,000,000 words of American written English dating from 1961. The genre categories are parallel to those of the LOB corpus. Available as orthographic text only. For further information see the Corpus Bibliography and the Corpus Manual (also available at this address).
+ A version on CD-ROM is available from ICAME; you can also download a small sample.
+ Another version (tagged and parsed) comes from LDC as part of the Penn TreeBank.
+ A 130,000-word subset of the Brown Corpus constitutes the text basis of the annotated and freely available SUSANNE Corpus.
+ A 5 MB version compiled by W. Nelson Francis & Henry Kučera is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Bundesregierung Multilingual (Fr-De-En) Corpus (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-04.htm
Texts from Deutsche Bundesregierung. Cf. this site under Multilingual and Parallel Corpora section.
Available under subscription to TRACTOR.

Calgary Corpus: ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus

The Calgary text compression corpus. This corpus is used in the book Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression. Prentice Hall, Englewood Cliffs, NJ, 1990 to evaluate the practical performance of various text compression schemes. Several other researchers are now using the corpus to evaluate text compression schemes. Nine different types of English texts are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. All files are freely downloadable. The Calgary Corpus has now been superseded by the new Canterbury Corpus. [2001 April 28].

CALLHOME American English Transcripts: http://morph.ldc.upenn.edu/Catalog/LDC97T14.html

The text component of the package includes transcripts and documentation files for 120 unscripted telephone conversations between native speakers of English; a separate LDC catalog entry, LDC97S42, provides the speech data for these conversations, which are partitioned into separate subdirectories for "training" (80 conversations), "development test set" (20 conversations) and "evaluation test set" (20 conversations). The transcripts cover a contiguous 10 minute segment of each call in the training and development test sets, and a 5 minute segment of each call in the evaluation set, yielding a total of 18.3 hours of transcribed spontaneous speech, comprising about 230,000 words. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Complete auditing information on the speakers represented in the transcripts (including gender, channel quality and so on) is also included.
Available as an FTP file from the LDC, through membership or for $500.
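
The 18.3-hour total quoted above follows directly from the partition sizes and segment lengths given in the entry. The short sketch below simply redoes that arithmetic; the variable names are arbitrary, and the words-per-minute figure is only a rough derived estimate, since the 230,000-word count is itself approximate.

```python
# (number of conversations, transcribed minutes per conversation),
# as stated in the CALLHOME entry above.
SETS = {
    "training": (80, 10),
    "development test": (20, 10),
    "evaluation test": (20, 5),
}

total_minutes = sum(calls * minutes for calls, minutes in SETS.values())
total_hours = total_minutes / 60

print(total_minutes)          # 1100
print(round(total_hours, 1))  # 18.3

# Rough speaking rate implied by "about 230,000 words".
words_per_minute = 230_000 / total_minutes
print(round(words_per_minute))  # 209
```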

Canadian Poets Anthology Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 5 MB corpus made of English texts by 14 Canadian poets. Compiled by Sandra Djwa. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Canterbury Corpus: http://corpus.canterbury.ac.nz/

The Canterbury Corpus is a benchmark to enable researchers to evaluate lossless compression methods. This site includes test files and compression test results for many research compression methods. The Canterbury Corpus file set has been developed specifically for testing new compression algorithms. The files were selected based on their ability to provide representative performance results. This set of files is designed to replace the Calgary Corpus, which is now over ten years old. Several sets of results are available on this web site. As well as the new Canterbury Corpus, a corpus of large files has been tested, and results for the original Calgary Corpus are also available. The constituents of the Canterbury Text Compression Corpus proper range from "normal" English texts (such as Shakespeare) to source code and markup (C, Lisp, HTML etc.). In addition there are an Artificial Corpus of abnormal texts (such as alphabet, random texts etc.), a Large Corpus made of very large files, mainly English (ranging from the complete genome of Escherichia coli to the King James Bible), the old Calgary Corpus, and a Miscellaneous set. All (sub)corpora are freely downloadable as TAR-GZ or ZIP files. [2001 April 28].
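
For readers unfamiliar with how such a benchmark is used, here is a minimal sketch that runs the same inputs through several lossless compressors and compares the results in bits per character, a measure commonly reported for text compression. It uses only compressors from the Python standard library and two invented in-memory samples, not the actual corpus files or the methods tested on the site.

```python
import bz2
import lzma
import os
import zlib

# Compressors under test: name -> function taking and returning bytes.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def bits_per_character(data, compress):
    """Compressed size in bits divided by the number of input bytes."""
    return 8 * len(compress(data)) / len(data)


def benchmark(files):
    """Return {file name: {compressor name: bits per character}}."""
    return {
        name: {c: bits_per_character(data, fn) for c, fn in COMPRESSORS.items()}
        for name, data in files.items()
    }


# Hypothetical stand-ins for corpus files (made up for this example).
SAMPLES = {
    "repetitive_text": b"to be or not to be " * 500,
    "random_bytes": os.urandom(10_000),
}

for name, scores in benchmark(SAMPLES).items():
    print(name, {c: round(v, 2) for c, v in scores.items()})
```

Highly repetitive text should come out far below 8 bits per character, while random bytes should stay at or slightly above it. Contrasts of this kind are why a benchmark set includes both ordinary and "abnormal" files.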

CAPA (Contemporary American Poetry Archive): http://capa.conncoll.edu/

The Connecticut College CAPA is a huge electronic archive designed to make out-of-print volumes of American poetry available through the Web to readers, scholars, and researchers. All texts are stored in HTML and are freely downloadable. Contact: Wendy Battin.

CEEC (Corpus of Early English Correspondence): http://www.eng.helsinki.fi/doe/projects/ceec/

The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. To enable this, great attention has been paid to the authenticity of letters on the one hand and to the social representativeness of the writers on the other. The timespan covered is from 1417 to 1681, and the size of the whole untagged corpus is 2.7 million words. Because of widespread illiteracy, however, only the highest ranks of society are well represented, and women's letters form no more than one fifth of the full CEEC. For more information on the compilation principles see Nevalainen & Raumolin-Brunberg (eds.) (1996) and Keränen (1998).
+ The Corpus of Early English Correspondence Sampler (CEECS) represents the non-copyrighted materials included in the Corpus of Early English Correspondence. This means that the editors of the collections included here died over 70 years ago. We have also included some material (re)edited by us (see Henslowe and Marchall collections). The sampler corpus (CEECS) reflects the structure of the full CEEC only in some respects. The time covered is nearly the same (1418-1680) and here too the bulk of the material is at the latter end of the time span. 23 letter collections have been included with altogether 194 informants. The size of the CEECS is 450,000 words. It has been divided into two parts for technical reasons. CEECS1 covers the 15th and 16th centuries, with the exception of the Hutton collection, which goes on to the 17th century. CEECS2 consists of 17th century material; only 3 letters in Original 3 are from the late 16th century.
+ A version on CD-ROM is available from ICAME; you can also download a small sample, and a manual as well.

CELEX English Database: http://www.kun.nl/celex.

CELEX, the Dutch Centre for Lexical Information, has three separate databases, Dutch, English and German, all of which are open to external users. The latest release of the English database (E2.5), completed in June 1993, contains 52,446 lemmata representing 160,594 wordforms. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4 m. Dutch INL and the 17.9 m. English Collins/COBUILD text corpora. The CELEX database is open free of charge to all academic researchers and people associated with other not-for-profit research institutes (at least until 2001). Users will only be charged a one-off Dfl. 100,= for the CELEX User Guide. In order to log in to CELEX, a personal account should be obtained from Richard Piepenbrock, project manager: see at this page.

Chancery English Anthology Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 566 KB SGML tagged corpus made from J. Fisher, M. Richardson & J. L. Fisher, Anthology of Chancery English, U. of Tennessee Press, Knoxville 1984. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

CHILDES Database: http://childes.psy.cmu.edu/

Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages. The bulk of the collection is however English: there are 42 corpora from normally-developing English-speaking children, 3 corpora which include a complete morphological and part-of-speech analysis (namely: 1, 2, 3), 20 corpora from clinical subjects, 15 corpora from bilingual and second language learning subjects, and more. See under Multilingual and Parallel Corpora section for a fuller file.
All materials are freely available directly from the Site.

CHRISTINE Corpus (Geoffrey Sampson's Spoken British English Annotated Corpus):

http://www.grsampson.net/Resources.html
The CHRISTINE project by Geoffrey Sampson is setting out to extend the SUSANNE analytic scheme and the SUSANNE Corpus to cover spoken English, and particularly spontaneous, informal spoken English. 50% of the Corpus comes from the spoken part of the BNC (British National Corpus), 10% from the Emotion in Speech Corpus, and 40% from LLC (London-Lund Corpus). For more details on the annotation scheme cf. under the SUSANNE Corpus.
The "hagiographical" explanation of the name "Christine" that Geoffrey Sampson gives on his page is too charming not to be quoted in full, without cuts. "Before this project began, I referred to it as the 'Spoken SUSANNE project'. But it is useful to have a short, distinctive name for a separate research undertaking. Apart from anything else, we need a name in order to create structure in our mass of electronic files at Sussex. SUSANNE stood for 'Surface and underlying structural analyses of natural English'. (One of the N's was taken from 'analyses'.) But the name was also appropriate for reasons that I shan't go into here, having to do with the life of St Susanna. Our new project is 'daughter of SUSANNE'. But Susanna, as a holy virgin, had no daughter. So I chose a 'successor' name in terms of the calendar. St Susanna's day is 23rd July. July 24th is the day dedicated to SS Christina of Tyre and Christina the Astonishing. (It is also the day of our local Sussex saint, Lewina of Seaford - but 'Lewina' seemed too strange a name to make a satisfactory project title.) St Christina of Tyre makes a good patroness for a project on speech. We are told that, after being condemned to have her tongue cut out, she carried on speaking just as clearly as ever. Picking up her excised tongue, she threw it at the judge, blinding him in one eye. (A neat trick, which we shall have to bear in mind in case we have any trouble with Research Council assessors.) If you insist on an acronym, CHRISTINE can just about be twisted into that too: 'Chrestomathized speech trees in natural English'. (Ouch!) At any rate, it makes a distinctive and attractive name".
The CHRISTINE Corpus was released as "Stage I", ready and freely available for use, in 1999. The second release of CHRISTINE, available since August 2000, incorporates a minor change in the distribution of analytic information between the fields, to make it more compatible with SUSANNE and easier to read. It includes about 40% of the originally planned complete CHRISTINE Corpus. While it was in being, the project annotated considerably more material than the sample now published, but the remainder was not brought into a suitable state for publication by the end of the project. The current CHRISTINE Corpus was originally referred to as "CHRISTINE Stage I", in the expectation that it would soon be replaced by a larger corpus. It is still hoped that this can eventually be done, but the work remaining has turned out to be considerably more than was envisaged in 1999; hence the short name "CHRISTINE Corpus" is now used for the corpus currently available. However, CHRISTINE in its own right offers a structural analysis of a cross-section of 1990s spontaneous speech from all British regions, social classes, etc.
For a general presentation of the CHRISTINE project go to its proper homepage, which is http://www.grsampson.net/RChristine.html (but is advisable to refer first of all to the general one provided in the title line). There is also a full Documentation file, which is available as a single Web page. The Corpus itself can be freely downloaded by anonymous ftp: what you receive will be a gzipped tar file; use the "gunzip CHRISTINE1.tgz" to uncompress it into a CHRISTINE1.tar file, and "tar -xvf CHRISTINE1.tar" to unpack the archive into its constituent files.
[updated 2004 March 25].
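For readers who would rather script this step, the two shell commands quoted above can be collapsed into a single call to Python's standard tarfile module. This is only a minimal sketch: it assumes the archive has already been fetched and saved as CHRISTINE1.tgz in the working directory, and the output directory name "christine" is a hypothetical choice, not something the project prescribes.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Decompress and unpack a gzipped tar file; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" performs the gunzip and the tar extraction in one pass.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    archive = Path("CHRISTINE1.tgz")  # assumed local file name
    if archive.exists():
        for name in unpack_archive(archive, "christine"):
            print(name)
    else:
        print("CHRISTINE1.tgz not found; download it first")
```

The function returns the list of file names stored in the archive, which is a convenient way to check what was unpacked.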

Civil War Polemic Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 631 KB, 443,000-word English corpus transcribed from Normalised versions of Yale prose Milton and of contemporary editions, compiled by Thomas N. Corns. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Claremont Corpus of Elizabethan Verse: http://ota.ahds.ac.uk/ (search Catalog).

A 945 KB English corpus transcribed from the Claremont Corpus of Elizabethan Verse, Modern American spelling, compiled by Ward Elliott. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

CLAWS tagger: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/

CLAWS (Constituent Likelihood Automatic Word-tagging System) is the well-known POS tagger for English texts, which has been continuously developed at UCREL since the early 1980s. For more information see the full file in the Tools section.

COBUILD-Collins Bank of English: http://www.cobuild.collins.co.uk/boe_info.html

The Bank of English is a collection of samples of modern English language held on computer for analysis of words, meanings, grammar and usage. In October 2000 the latest release of the corpus amounted to 415 million words and it continues to grow. This huge collection is composed of a wide range of different types of writing and speech. It contains samples of the English language from hundreds of different sources: written texts come from newspapers, magazines, fiction and non-fiction books, brochures, leaflets, reports, letters, and so on; the spoken word is represented by transcriptions of everyday casual conversation, radio broadcasts, meetings, interviews and discussions, etc. The material is up-to-date, with the majority of texts originating after 1990. The Bank of English was launched in 1991 by COBUILD (a division of HarperCollins Publishers) and the University of Birmingham. Since 1980 COBUILD, which is based within the School of English at Birmingham University, has been collecting a corpus of texts on computer for dictionary compilation and language study. In 1991 HarperCollins decided on a major initiative to increase the scale of the corpus to 200 million words, to form the basic data resource for a new generation of authoritative language reference publications. [Last rev. 2001 April 23].
+ The COBUILD English Dictionary books and CD-ROMs are based on this database. They are now also distributed by Athelstan.
+ There is a free demo available.
+ A large corpus, the CobuildDirect Corpus, is now available online by subscription. It is composed of 56 million words of contemporary written and spoken text; a detailed description of the contents is downloadable by anonymous ftp. The subscription is expensive: £50 for a one-month unlimited-access trial (not renewable), £300 for 6 months of unlimited connection time, £500 for 12 months of unlimited connection time.
+ Limited free access to a CobuildDirect Corpus Sampler is available: you can type in some simple queries here and get a display of concordance (up to 40 lines) or collocation (up to 100 collocates) lines from the corpus. The query syntax allows you to specify word combinations, wildcards, part-of-speech tags, and so on.
+ Further access to larger corpora may be granted only by special arrangement.
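To make the idea of a concordance display concrete for newcomers, here is a minimal keyword-in-context sketch. It is not CobuildDirect's query engine and does not reproduce its query syntax; the function name, the simple word tokenizer and the default window size are all illustrative assumptions, and only the 40-line cap echoes the Sampler's limit mentioned above.

```python
import re


def concordance(text, keyword, width=3, limit=40):
    """Return up to `limit` keyword-in-context lines for `keyword`."""
    # Crude tokenizer (an assumption): lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
            if len(lines) == limit:
                break
    return lines


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat too."
    for line in concordance(sample, "sat", width=2):
        print(line)
```

Each returned line shows the keyword in square brackets with a few words of context on either side, which is the essence of what a concordancer displays.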

COLT (The Bergen Corpus of London Teenage Language):

http://www.hf.uib.no/i/Engelsk/COLT/index.html
The collection of the material took place in 1993. The aim of the project has been to create a corpus of British English teenage talk and make it available for research, first on the internet, next as an orthographically and prosodically transcribed CD-ROM version, and finally as a CD-ROM version with both text and sound. The recordings were made by 31 volunteer boys and girls aged 13 to 17 from five socially different school boroughs; these so-called 'recruits' were equipped with a Sony Walkman, a lapel microphone and a log book. The entire material of roughly half a million words was orthographically transcribed by trained transcribers employed by the Longman Group for transcribing The British National Corpus (BNC). A copy of this version of COLT was incorporated in the BNC. At the Bergen end, the orthographically transcribed material was subsequently submitted to careful editing, which involved correcting misinterpreted talk, reducing the number of passages and adding untranscribed talk. The edited version was then tagged for word classes in the same way as the BNC by a research team at Lancaster University.
Online search in the entire corpus requires subscription, but there is a free search in the pilot version consisting of 151 texts.
A version on CD-ROM is available from ICAME; see a small sample.
PDF manual downloadable.

Complete Works of William Shakespeare: http://tech-two.mit.edu/Shakespeare/

All the plays are here in HTML format. The texts are all in the public domain, but each is split across many files, and all are ultimately based on the free and easily available Moby Shakespeare, a part of the Moby Project. The most interesting feature of the site is the free search engine, but it was down the last time I checked. [2001 April 23].

Corpus Resources in the Slovene Language (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/LJU1/lju1.htm
It also contains a 500,000-word English-Slovene and Slovene-English corpus: cf. this site under Multilingual and Parallel Corpora. Available under subscription to TRACTOR.

Corvinus Library: http://www.hungary.com/corvinus/

In the Corvinus Virtual Library you will find freely readable and downloadable transcriptions of a good number of books on Hungarian history, published in the United States of America, either written in English or translated from Hungarian. Texts are usually in DOC format (with a few in PDF).

CPSA (Corpus of Professional Spoken American-English): http://www.athel.com/cpsa.html

This corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each. One sub-corpus consists mainly of academic discussions such as faculty council meetings and committee meetings related to testing. The second sub-corpus contains transcripts of White House press conferences, which are almost exclusively question-and-answer sessions. The transcripts making up the spoken American corpus have been selected because they appear to be relatively unedited. However, they have not been produced by linguists and so do not have all the features one might wish for. Both a tagged and an untagged version are available.
The corpus can be ordered from Athelstan at $49 (untagged ver.) and $79 (tagged ver.) for an individual user (the site licence is $179). There is also a free 50K sample corpus on the web.

CRATER: see under ITU or CRATER Parallel Corpus.
CSAE: see Santa Barbara Corpus or CSAE.
CSR-III Text Newswire Corpus (Continuous Speech Recognition training data):

http://morph.ldc.upenn.edu/Catalog/LDC95T6.html
The third ARPA Continuous Speech Recognition (CSR) Language Model Training Data is a four CD-ROM set for speaker-independent, large-vocabulary speech recognition systems. The text collection comprises both source text data (prepared by LDC and BBN) and derived statistical tables (compiled by CMU) of unigram, bigram and trigram word frequencies. The sources include all available WSJ texts, spanning 1987 through March 1994 and all AP and San Jose Mercury news data from the three TIPSTER corpus volumes.
Available only through LDC membership.
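The derived statistical tables mentioned above are, in essence, counts of word sequences of length one, two and three. The toy sketch below shows how such counts can be produced; it is not the CMU tooling used for the corpus, and the whitespace tokenizer and the sample sentence are assumptions made purely for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    # Hypothetical sample text, split on whitespace.
    words = "the stock rose and the stock fell".split()
    for n in (1, 2, 3):
        print(n, ngram_counts(words, n).most_common(2))
```

Keys are tuples of words, so the same function serves for unigram, bigram and trigram tables by changing n.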

Dedications Corpus: http://ota.ahds.ac.uk/ (search Catalog)

An 808 KB English corpus with the Dedications, etc., transcribed by Ralph Crane and compiled by T.H. Howard-Hill. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Digital Tradition Folksong Database: http://www.mudcat.org/threads.cfm

The Mudcat Café (cf. homepage), an online magazine dedicated to blues and folk music, maintains the Digitrad Lyric Database, a huge e-text archive with over 8000 popular lyrics in English, all freely downloadable in simple HTML format. Usually even the music is available. [2002 February 19].

DSO Corpus of Sense-Tagged English (Singapore Defence Science Organisation Sense-Tagged Corpus):

http://morph.ldc.upenn.edu/Catalog/LDC97T12.html
This corpus contains sense-tagged word occurrences for 121 nouns and 70 verbs which are among the most frequently occurring and ambiguous words in English. It was provided by Hwee Tou Ng of the Defence Science Organisation (DSO) of Singapore. It was first reported in the following paper at ACL-96 ("Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach", by Hwee Tou Ng and Hian Beng Lee, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40-47, Santa Cruz, California, USA, June 1996. Cf. this site). These occurrences are provided in about 192,800 sentences taken from the Brown corpus and the Wall Street Journal and have been hand tagged by students at the Linguistics Program of the National University of Singapore. WordNet 1.5 sense definitions of these nouns and verbs were used to identify a word sense for each occurrence of each word. In addition to providing the word occurrences in their full sentential context, the corpus includes complete listings of the WordNet 1.5 sense definitions used in the tagging.
Available from the LDC through membership or for $100.

Dublin-Essex Treebank project: http://www.compapp.dcu.ie/~away/Treebank/treebank.html

The Dublin-Essex Project aims at deriving linguistic resources from treebanks.
A sample of parsed English is already freely downloadable, and the project promises to release future research products, such as an English Lexicon and a Probabilistic Grammar, for free as well.

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

ECO (Early Canadiana Online): http://www.canadiana.org

Early Canadiana Online (ECO) is a full text online collection of more than 3,000 books and pamphlets (English and French languages) documenting Canadian history from the first European contact to the late 19th century. You can make simple queries online, but unfortunately you can download texts only one page at a time (English version).

Edinburgh Associative Corpus: http://ota.ahds.ac.uk/ (search Catalog).

The 5 MB English corpus compiled by George Kiss and Christine Armstrong with the Edinburgh Associative Thesaurus by George Kiss. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Edinburgh DOST Corpus of Older Scottish Texts: http://ota.ahds.ac.uk/ (search Catalog).

The 5 MB Edinburgh DOST corpus of Older Scottish texts, compiled by A.J. Aitken, Paul Bratley and Neil Hamilton-Smith. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Electronic Text Center at the University of Virginia: http://etext.lib.virginia.edu/

The Electronic Text Center's holdings are a large collection of browseable and searchable texts, and include approximately 45,000 on- and off-line humanities texts in twelve languages – but chiefly English –, with more than 50,000 related images. Unfortunately all this stuff is usually available only to U.Va students, faculty, and staff. See however this site: there are useful links as well.

Emotion In Speech Corpus: http://midwich.reading.ac.uk/research/speechlab/emotion/

The Emotion project was a joint research project between The Speech Laboratory at the University of Reading and The Department of Psychology at the University of Leeds. This project brings together expertise in phonetics and phonology and in cognitive psychology in order to examine emotional speech and to produce a database of such speech to put alongside the emotionally neutral material found in most spoken language databases.
They say that the project is now complete, that copies of the corpus have been deposited with the ESRC Data Archive, and that they hope to be able to make it available on CD-ROM, but they do not say how to obtain access to the Corpus now (only a small subsection of the texts is present in the free CHRISTINE Corpus). However they "are sorry but because of Copyright restrictions the Emotion in Speech Corpus is not available for distribution either online nor by other means". Let us hope ...

EngCG Parser: http://www.lingsoft.fi/doc/engcg/

EngCG, the Constraint Grammar Parser of English by Pasi Tapanainen (1993), performs morphosyntactic analysis (tagging) of English text. There is an online demo at the Lingsoft site. It is sold by Lingsoft: for availability (it is commercial software!) you have to ask info@lingsoft.fi.

EngCG-2 Tagger: http://www.ling.helsinki.fi/~avoutila/cg/index.html

EngCG-2, by Pasi Tapanainen and Atro Voutilainen, is a program that assigns morphological and part-of-speech tags to words in English text at a speed of about 3,000 words per second on a Pentium running Linux. It is an improved version of the original EngCG tagger, which is based on the Constraint Grammar framework advocated by a team of computational linguists in Helsinki, Finland. The documentation (and several articles) are available online.
EngCG-2 tagger with the English Constraint Grammar can be licensed from Conexor. There is also a free demo. Contacts: Tapanainen and Voutilainen.

Englex: http://www.sil.org/computing/catalog/englex.html

Englex is an English parsing description tool (viz. a morphological parsing lexicon of English) for PC-Kimmo by SIL Computing. It uses the standard orthography for English. It is intended for use with PC-Kimmo (or programs that use the PC-Kimmo parser, such as K-Text and K-Tagger). With such software and Englex, you can morphologically parse English words and text. Practical applications include morphologically preprocessing text for a syntactic parser and producing morphologically tagged text. Englex works under Win 3.1-98 + NT, DOS, MAC and Unix. Freely available under agreement to the SIL standard free license.

E-Server Org at CMU: http://english-www.hss.cmu.edu/

Finally a really free site from a university institution, and a very good one! The EServer at Carnegie Mellon University has been online for ten years; today it offers a huge collection of 29,113 free works online, mainly in TXT format, all in English, both originals and translations, covering a wide range of literary genres, from novels to drama, essays and journals (such as Bad Subjects and Cultronics). The texts are arranged more for online reading than for downloading, but, of course, you can easily download them as well. There is also an FTP server restricted to members (for membership see this page). There is no general catalogue, so you have to search for what you want.

ET10-63 Parallel Corpus: http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#et10

The ET10-63 corpus is a bilingual parallel corpus of English and French, containing EC official documents on telecommunications. The corpus is part-of-speech tagged and also lemmatized. Approximately 1,250,000 words of each language.

E-Text Archives: http://www.etext.org/index.shtml

Home to electronic texts of all kinds, from contemporary American amateur authors to Shakespeare, from mainstream and off-beat religious texts to profane personal poetry; there are e-zines of every kind, from the political to the technical; many texts come from Usenet. English is the prevalent language. All texts are freely downloadable from the E-text FTP.

European Free Trade Organization Multilingual (De-En) Corpus (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-08.htm
Cf. under Multilingual and Parallel Corpora. Available under subscription to TRACTOR.

FLOB Corpus (The Freiburg LOB Corpus)

In 1991 a group of students at Freiburg University was engaged in what at first sight must appear as an almost anachronistic activity: they were keying in extracts of roughly 2,000 words from British newspapers. The sampling model was the press section of the LOB corpus (see Sand/Siemund 1992). 1992 saw the beginning of a new Brown corpus. The ultimate aim was to compile parallel one-million-word corpora of the early 1990s to match the original LOB and Brown corpora as closely as possible, so as to provide linguists with an empirical basis to study language change in progress. The corpus is not tagged.
A version on CD-ROM is available from ICAME; you can also reach a small sample, and a manual as well. Contact: Christian Mair, Englisches Seminar I, Institut für Englische Sprache und Literatur, Albert-Ludwigs Universität, D-7800 Freiburg, Germany.

FROWN Corpus (The Freiburg-Brown Corpus)

In 1991, Christian Mair took the initiative to compile a set of corpora that would match the well-known and widely used Brown and LOB corpora with the only difference that they should represent the language of the early 1990s. The project started in April 1991 with the compilation of the British Press Section of the new FLOB corpus. 1992 saw the beginning of the new Freiburg Brown Corpus, Frown. The corpus is not tagged.
A version on CD-ROM is available from ICAME; a small sample is available; and a manual too.

Hansard Canadian English-French Parallel Corpus

The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament.
See the Multilingual and Parallel Corpora section.

Harpers Magazine 1879-1880 Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 578 KB English corpus made from the 1879-1880 issues of Harpers Magazine. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Helsinki Corpus of English Texts: Diachronic and Dialectal Part:

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#hels
The Helsinki Historical Corpus is a computerized collection of extracts of continuous text. The Corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a dialect part based on transcripts of interviews with speakers of British rural dialects from the 1970's. The aim of the Corpus is to promote and facilitate the diachronic and dialectal study of English as well as to offer computerized material to those interested in the development and varieties of language. For more details cf. the full file in the Old & Middle English section.

Heok-Seung Kwon's Corpus Linguistics Links: http://plaza.snu.ac.kr/~hskwon/corpus.html

A small but selective page of links to the main Corpus Linguistics resources on the Web, focusing on English. By Heok-Seung Kwon of Seoul National University, cf. his homepage. [2002 February 18].

Hong Kong South China Morning Post Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 6.9 MB, 1-million-word English corpus from the Hong Kong South China Morning Post Corpus, Feb-March 1992. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Hub-4 - 1996 CSR Language Model: http://morph.ldc.upenn.edu/Catalog/LDC98T31.html

A two-CD-ROM set containing data from transcribed news broadcasts, designated for use in the baseline language model (LM) for the 1996 CSR Hub 4 Evaluation. The LDC obtained the bulk of the data from broadcast news CD-ROMs produced by Primary Source Media, Inc. This portion includes the period from January 1992 to April 1996 and contains approximately 1 gigabyte of data uncompressed. This release also includes about 36 megabytes of material received on floppy disks covering the period from late May through June 1996, in a somewhat different format from the bulk of the data. The text data are presented in two forms: (1) a relatively unprocessed ("raw" or "sentence-tagged") form and (2) a fully processed ("conditioned", "verbalized-punctuation") form. Available only through LDC membership.

Hub-4 - English Broadcast News Transcripts 1996:

http://morph.ldc.upenn.edu/Catalog/LDC97T22.html (cf. also 1997)
The primary motivation for this collection is to provide training data for the DARPA "Hub-4" Project on continuous speech recognition in the broadcast domain. Transcripts have been made of all recordings in these publications, manually time-aligned to the phrasal level and annotated to identify boundaries between news stories, speaker turn boundaries and gender information about the speakers. The released version of the transcripts is in SGML format, and accompanying documentation and an SGML DTD file are included with the transcription release.
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. The speech files are available in a 19 disc training data set with one additional disc of development data and an additional disc of evaluation data.
The 1997 Broadcast News Speech Corpus contains in a set of 18 CD-ROMs a total of 97 hours of recordings from radio and television news broadcasts, gathered between June 1997 and February 1998. It has been prepared to serve as a supplement to the 1996 Broadcast News Speech collection (consisting of over 100 hours of similar recordings).
Both are available only through LDC membership.

Hub-5-LVCSR see Switchboard 1.2 Corpus
Humanist Electronic Discussion Group Corpus: http://ota.ahds.ac.uk/ (search Catalog).

The 5 MB English corpus transcribing Humanist, the complete electronic discussion group, 1987-1989. Compiled by Willard McCarty. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

IBM Manuals Treebank: http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#ibm

A skeleton-parsed corpus of computer manuals. 800,000 words.

IBM-Lancaster Spoken English Corpus: see under SEC.
"ICAME Journal": http://www.hd.uib.no/journal.html

The Journal of ICAME (International Computer Archive of Modern and Medieval English) has been published once a year since 1977, with articles, conference reports, reviews and notices related to corpus linguistics. For more information see the full file in the Reference section.

ICE (The International Corpus of English): http://www.ucl.ac.uk/english-usage/ice/index.htm

The International Corpus of English began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English. Each ICE corpus consists of spoken and written material produced after 1989. Each corpus (cf. the components of ICE) contains one million words. In order to ensure compatibility between the national corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. Cf. the ICE-GB (Great Britain component), ICE-NZ (New Zealand component) and ICE-EA (East African component). ICE also incorporates ICLE, the International Corpus of Learner English.

ICE - EA (The International Corpus of English - East African component):

http://www.tu-chemnitz.de/phil/english/real/eafrica
The East African component of ICE, the International Corpus of English, is a computerized (but not tagged) collection of spoken and written texts from Kenya and Tanzania. A version on CD-ROM is available from ICAME; a small sample is available both for the written section and for the spoken section. A PDF manual is downloadable as well.

ICE - GB (The Great Britain component of the International Corpus of English):

http://www.ucl.ac.uk/english-usage/ice-gb/index.htm.
ICE-GB is the British component of ICE and the first of the ICE corpora to be completed; it is now available. It contains one million words of spoken and written British English from the 1990s, POS-tagged, parsed, checked, and bundled with ICECUP, exploration software designed for parsed corpora.
+ The ICE-GB CD-ROM can be ordered from the site for £293.75.
+ The ICE-GB Sample Corpus consists of ten texts (over 20,000 words), fully parsed and annotated exactly as they are in ICE-GB, bundled with a full working version of ICECUP III. It is available for free download.

ICE - NZ (The New Zealand component of the International Corpus of English):

http://www.vuw.ac.nz/lals/corpora.htm
One million words of spoken and written New Zealand English collected in the years 1990 to 1996. ICE-NZ, the New Zealand component of ICE, consists of 600,000 words of speech and 400,000 words of written text. The WSC and the spoken component of ICE-NZ share 9 categories. Since informal conversational data in particular was so difficult to collect, there is an overlap of 339,248 words (173 files) between the two corpora to achieve economy in data collection.
At this page, you can find some Notes on collecting conversations for the ICE-NZ Corpus by Janet Holmes.

ICECUP (the ICE Utility Program): http://www.ucl.ac.uk/english-usage/ice-gb/icecup.htm

ICECUP is the corpus exploration program designed for ICE-GB. For more information cf. the full file in the Tools section.

ICLE (International Corpus of Learner English): http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html

ICLE (a part of the ICE project) is a computerized corpus of argumentative essays on different topics written by advanced learners of English (university students of English mainly in their second or third year). The ICLE project was launched in 1990 by Sylviane Granger, University of Louvain-la-Neuve, Belgium, and the corpus is made up of a number of subcorpora representing the following language backgrounds: Chinese, Czech, Dutch, Finnish, French, German, Japanese, Polish, Russian, Spanish, and Swedish. There is also a smaller comparable corpus of British and American undergraduate essays. The length of the essays varies between 500 and 1000 words. The homepage given above is properly that of the CECL (Centre for English Corpus Linguistics), which acts as the main ICLE page. For more details you must go to the pages of each component, viz. Swedish, Finnish, Polish, Brazilian, Czech.

Intellectual Property and Copyright Multilingual (Fr-En) Corpus (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-11.htm
Cf. this site under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.

IntraText Library: http://www.eulogos.it/default.htm

A small library of interactive hypertexts for free reading and search, maintained by Èulogos. All are literary texts, many of them religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).

ITU (CRATER) Parallel Corpus (International Telecommunications Union \ Crater Corpus):

http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
The European (English, French, and Spanish) Language Newspaper Text tagged corpus, freely queryable online. For more information see the Multilingual and Parallel Corpora section.

JOC-CES Multilingual (En-De-Fr-It-Sp) Corpus:

http://www.lpl.univ-aix.fr/projects/multext/MUL4.html
This corpus is made up of a set of extracts from the Official Journal of the European Community (JOC) and is CES (Corpus Encoding Standard) conformant. It is available with three levels of treatment: paragraph annotated (CESDOC), POS-tagged (CESANA) and parallel text aligned (CESALIGN).
Availability unknown: only a few samples can be downloaded.

JUMAN English Morphological Analyzer:

http://cactus.aist-nara.ac.jp/staff/matsu/misc/nlt.html (Japanese also)
The English morphological analyzer of the Nara Institute of Science and Technology deals with inflection of English nouns, verbs and adjectives. Since the treatment of the information given by inflection differs between systems, the detailed information is assumed to be written in grammar rules by the user. It is free without restriction. For fuller details cf. the NAIST Natural Language Tools. [2001 April 28].

JURIS Corpus (Justice Department Retrieval and Inquiry System Data):

http://morph.ldc.upenn.edu/Catalog/LDC98T32.html
The text data contained on this two CD-ROM set represent a release of the JURIS (Justice Department Retrieval and Inquiry System) data collection that has been made available to the Linguistic Data Consortium (LDC) by the U.S. Department of Justice. The time span of the text ranges from the 1700's to the early 1990's. There are 1664 individual text files in the corpus, 1011 on the first CD-ROM and 653 on the second. There are a total of 694,667 document units in the corpus and these can be categorized to some extent with regard to their content. The terminology and organization of categories are those used in the JURIS documentation.
Available from the LDC through membership or for $1500.

Kirchenmusik: Textsammlung: http://home.t-online.de/home/jo_vo/textlist.htm

The Textlist page of the Kirchenmusik online site (a good and well-known resource for music lovers) by Joachim Vogelsänger offers a huge and free collection of texts of oratorios, cantatas, sacred hymns and the like. Most are in German, but English texts are also well represented, ranging from Händel's Alexanderfest to Britten's Ceremony of Carols and Tippett's A Child of Our Time. All the texts are freely downloadable in simple HTML format. For more details cf. the full file in the E-Texts section. [2001 August 27].

Knut Hofland English - French Aligned Texts: http://kh.hd.uib.no/tactweb/en-fr.htm

A freely queryable online database of English-French aligned texts, processed by the same software (by Knut Hofland) used for the Oslo ENPC project.

Kolhapur Corpus: http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#kol

Approximately 1,000,000 words of Indian written English dating from 1978. The Kolhapur Corpus of Indian Written English is intended to be comparable to the Brown and the LOB corpora, and so to serve as source material for comparative studies of American, British and Indian English, which in turn are expected to lead to a comprehensive description of Indian English. The Indian Corpus is a representative corpus of sample texts printed and published in 1978; the texts were largely selected by a stratified random sampling process; again, the genre categories are parallel to those of the LOB corpus. Available as untagged orthographic text only.
A version on CD-ROM is available from ICAME; a small sample and a manual are available.

Lampeter Historical Corpus (The Lampeter Corpus of Early Modern English Tracts):

http://www.tu-chemnitz.de/phil/english/real/lampeter/lamphome.htm
The Lampeter (not tagged) Corpus of Early Modern English Tracts is a collection of texts on various subject matter published between 1640 and 1740 - a time that is marked by the rise of mass publication, the development of a public discourse in many areas of everyday life and, last but not least, the standardisation of British English. The Lampeter Corpus has been compiled over the last four years at Chemnitz University's REAL Centre and has been completed fairly recently, i.e. May 1998. It is available for scholarly research free of charge (but the service is not yet activated). Other available versions:
+ Oxford Text Archive version (Oxford University Computing Services - 13 Banbury Road, Oxford OX2 6NN - UK)
+ ICAME version (International Computer Archive of Modern and Medieval English); cf. also the alternative page. Also available from the Norwegian Computing Centre for the Humanities (Harald Haarfagresgt. 31, N-5007 Bergen, Norway), with a small downloadable sample.
+ A UCREL version is still advertised, but the official page is now that of REAL.
+ An SGML-TEI Light 7.8 MB version is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Late Modern English Prose Corpus: http://ota.ahds.ac.uk/

(search Catalog)
A 555 KB English corpus of Late Modern English Prose. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

LCCPW (Lancaster Corpus of Children's Project Writing): http://www.ling.lancs.ac.uk/lever/index.htm

The Lancaster Corpus of Children's Project Writing (LCCPW) is a digitized collection of project work produced by children aged between 9 and 11 in one primary school classroom, collected in 1994-1997 and built up in corpus format by Nick Smith and Roz Ivanič. It is also a component in a larger research programme: a longitudinal study of children's writing-for-learning, based on the writing of 8 - 12 year old children. A demonstration of this "strange" and highly remarkable corpus took place at the CL 2001 Congress (Lancaster April 1), on which I rely for this file. The corpus is mounted on a public-access website to provide easy navigation between its elements. It consists of: (a) a hyperdocument showing the original visual form of these texts, which can be navigated in a variety of ways; (b) an electronic version of the texts themselves, transcribed from handwriting using a system developed at Lancaster; (c) an electronic version of the texts tagged for POS. It surely deserves (more than) a visit! [2001 May 2].

Le Monde Diplomatique English Corpus (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-16.htm
Articles in English from Le Monde Diplomatique (HTML texts). From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.

LEXIS Corpus (Sample of Spoken English): http://ota.ahds.ac.uk/

(search Catalog)
The 384 KB LEXIS Corpus (samples of 1963 spoken English). Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Library of Congress Electronic Texts and Publishing Resources:

http://lcweb.loc.gov/global/etext/etext.html
The Reference links-page of the Library of Congress for American electronic publishing.

Lieder and Songs Texts Page: http://www.recmusic.org/lieder/

This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: English has 1865 texts. For a more detailed description, see in the E-Texts section.

LLC (London-Lund Corpus): http://www.ucl.ac.uk/english-usage/about/history.htm

(cf. also ICAME and UCREL pages).
As the name implies, the London-Lund Corpus of Spoken English (LLC) derives from two projects. The first is the Survey of English Usage (SEU) at University College London, launched in 1959 by Randolph Quirk, who was succeeded as Director in 1983 by Sidney Greenbaum. The second project is the Survey of Spoken English (SSE), which was started by Jan Svartvik at Lund University in 1975 as a sister project of the London Survey. The spoken corpus of the Survey of English Usage has been transcribed with a sophisticated marking of prosodic and paralinguistic features. All the SEU texts, both written and spoken, have been analysed grammatically. In 1975 the Survey of Spoken English was established at Lund. Its initial aim was to make available, in machine-readable form, the spoken material which by then had been collected and transcribed in London: 87 texts totalling some 435 000 words (see Svartvik et al. 1982 for an account of the input procedures).
A version on CD-ROM is available from ICAME; you can also reach a small sample and a manual. As general contact, mail to ICAME.
+ A 5 MB version is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
+ Some transcriptions of the London-Lund corpus that took place as part of the Beach/CoMoPro projects at the University of Edinburgh, Centre for Cognitive Science, are freely downloadable by FTP. [Added 2001 August 5].

LLT (The Lancaster-Leeds Treebank):

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#llt
A manually parsed subsample of the LOB corpus showing the surface phrase structure of each sentence. The 45,000-word Lancaster-Leeds Treebank, which Geoffrey Sampson developed twenty years ago for Geoffrey Leech and Roger Garside's parsing project, was small but apparently the first in the field: the authors believe that they coined the term treebank itself, which has now come into general use in Computational Linguistics. The 45,000 words are taken from all the genre categories of the LOB corpus.
See also the ICAME LOB page.

LOB Corpus (The Lancaster/Oslo-Bergen Corpus):

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#lob
Approximately 1,000,000 words of British Written English dating from 1961. The corpus is made up of 15 different genre categories. Available as orthographic text, and tagged with the CLAWS1 part-of-speech tagging system. The Lancaster-Leeds Treebank (cf. LLT) and the Lancaster Parsed Corpus (cf. LPC) are analyzed subsamples of the LOB corpus. For manuals see the following pages: Corpus Manual (1978), Tagged Corpus Manual - 1986 (alternative page) and Local Online copy (password required).
A version on CD-ROM is available from ICAME; for a small sample follow this link. For a general contact, mail to ICAME.
+ A 5 MB tagged and horizontal format version compiled by Stig Johansson is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Longman-Lancaster Corpus: http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#llc

Approximately 14.5 million words of Written English from various geographical locations in the English-speaking world and of various dates and text types. Orthographic text only. Contact: Della Summers, Longman Dictionaries, Longman House, Burnt Mill, Harlow, Essex, CM20 2JE UK.

LPC (The Lancaster Parsed Corpus):

http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#lpc
A subsample of the LOB corpus, parsed by computer and manually corrected by several researchers. Approximately 140,000 words with samples from each of the 15 categories in the LOB corpus.
A version on CD-ROM is available from ICAME; you can also get a small introduction and a fuller PDF manual.

LUCY Corpus: http://www.grsampson.net/Resources.html

LUCY is Geoffrey Sampson's Structure in Written English in the UK Project, which began in January 2000 and ended in winter 2003. The LUCY Corpus is a structurally-annotated sample ("treebank") of present-day British written English, representing not only the polished writing of published documents, but also the less-skilled or unskilled writing of young adults at the end of secondary and beginning of tertiary education, and of children aged nine to twelve in various types of school and parts of the country. Like its sister treebanks, SUSANNE and CHRISTINE, LUCY uses the highly detailed and comprehensive scheme of structural annotation known as the "SUSANNE scheme", which is widely recognized as the most precise system of its kind available.
The aim of the LUCY project was to create a body of machine-readable data that will enable researchers to examine how the grammatical resources of the English language are actually used by people writing English in Britain at the turn of the millennium, and to compare written usage with usage in spontaneous speech. The material was selected in order to make the Corpus specially relevant for studies of young people's acquisition of writing skills. As well as samples drawn from recent published writing (which can be seen as in some sense representing the 'model' for writing-skills education), and from unpublished writing of various types produced by mature users of written English (for instance, business correspondence), LUCY includes samples of writing produced by young people destined for careers where the generation of written prose will be a significant element, but who have not yet finished acquiring mature writing skills.
For a general presentation of the LUCY project go to its own homepage, which is http://www.grsampson.net/RLucy.html (but it is advisable to refer first of all to the general one given in the title line). There is also a full Documentation file, which is available as a single Web page. The Corpus itself can be freely downloaded by anonymous ftp: what you receive will be a gzipped tar file; use "gunzip LUCYrf.tgz" to uncompress it into a LUCYrf.tar file, and "tar -xvf LUCYrf.tar" to unpack the archive into its constituent files.
[updated 2004 March 25].
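For readers without the gunzip and tar commands at hand, the same two unpacking steps can be done in one pass from Python. This is only a minimal sketch: the archive name LUCYrf.tgz comes from the description above, while the function name unpack_archive and the target directory "LUCYrf" are my own assumptions, and the archive is assumed to have been downloaded already into the current directory.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Unpack a gzipped tar archive into dest_dir; return sorted member names.

    unpack_archive and dest_dir are illustrative names, not part of the
    LUCY distribution.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" gunzips and reads the tar stream in a single step,
    # replacing the separate "gunzip" and "tar -xvf" commands.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = sorted(archive.getnames())
        archive.extractall(dest)
    return names


if __name__ == "__main__":
    archive = Path("LUCYrf.tgz")  # file name as given in the LUCY description
    if archive.exists():
        for name in unpack_archive(archive, "LUCYrf"):
            print(name)
    else:
        print("LUCYrf.tgz not found; download it first")
```

As with any archive fetched from the network, it is prudent to unpack into an empty directory of its own, as the sketch does.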

Melbourne-Surrey Corpus of Australian English: http://ota.ahds.ac.uk/

(search Catalog).
The 591 KB Melbourne-Surrey corpus of Australian English compiled by Knut Hofland. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

MEMEM (Michigan Early Modern English Materials): http://www.hti.umich.edu/m/memem/

The Michigan Early Modern English Materials (MEMEM) were compiled by Richard W. Bailey, Jay L. Robinson, James W. Downer, with Patricia V. Lehman and are freely queryable online. The Materials consist of citations collected for the modal verbs and certain other English words for the Early Modern English Dictionary. Many of the slips used in the work were the original Oxford English Dictionary slips, provided to the University of Michigan by the editors of the OED. The work included here was prepared electronically over a period of several years ending in 1975. The source file is ca. 16 megabytes and consists of ca. 50,000 records. According to the description page, the source files are available via anonymous ftp as several compressed files (5 MB in all) containing the Materials, the DTD constructed for the Materials, and the character DTD for the Materials; the links, however, are clearly wrong. [2001 April 23].

METER Corpus (MEasuring TExt Reuse Corpus):

http://www.dcs.shef.ac.uk/research/groups/nlp/meter/Metercorpus/metercorpus.htm
As part of the METER (MEasuring TExt Reuse) project, a team from the Departments of Computer Science and of Journalism (consisting of Robert Gaizauskas, Jonathan Foster, Yorick Wilks, John Arundel, Paul Clough, Scott Piao) has built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient search for related PA and newspaper texts, the corpus is annotated at two levels. First, each of the newspaper texts is assigned one of three coarse, global classifications indicating its derivation relation to the PA: wholly derived, partially derived or non-derived. Second, about 400 wholly or partially derived newspaper articles are annotated down to the lexical level, indicating for each phrase, or even individual word, whether it appears verbatim, rewritten or as new material. The hope is that this corpus will be of use for a variety of studies, including detection and measurement of text reuse, analysis of paraphrase and journalistic styles, and information extraction/retrieval. [2001 April 29].
+ A lot of documentation and materials are available directly from the site.
+ Beta 1.0 was released in a limited edition and presented at the CL2001 Congress held at Lancaster University in spring 2001. Contact.
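To give an idea of what "measuring text reuse" can mean in practice, here is a small sketch of one simple, commonly used measure: the share of a newspaper text's word trigrams that also occur in the newswire source. This is not the METER project's own method, which I have not seen in detail; the function names, the trigram size and the two thresholds are all assumptions chosen for illustration, and the three labels merely echo the coarse classification described above.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(source, candidate, n=3):
    """Share of the candidate's n-grams that also occur in the source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


def classify(score, high=0.6, low=0.2):
    """Map a containment score to one of three coarse labels.

    The thresholds are arbitrary illustration values (an assumption),
    not figures taken from the METER documentation.
    """
    if score >= high:
        return "wholly derived"
    if score >= low:
        return "partially derived"
    return "non-derived"


if __name__ == "__main__":
    # Invented example sentences, not taken from the corpus.
    wire = "The minister said the new law will take effect next month"
    article = "The minister said the new law will take effect in June"
    score = containment(wire, article)
    print(round(score, 2), classify(score))
```

A measure of this kind only captures verbatim overlap; rewritten material, which the corpus annotates separately, would need something more than exact n-gram matching.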

Michigan Early Modern English Corpus: http://ota.ahds.ac.uk/ (search Catalog).

The 5 MB Michigan Early Modern English Corpus was compiled by Richard W. Bailey and Jay L. Robinson. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Myriobiblos: http://www.myriobiblos.gr/

Myriobiblos, The E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to Modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].

Moby Project: http://www.dcs.shef.ac.uk/research/ilash/Moby/

The Moby lexicon project by Grady Ward has been placed into the public domain. There is a downloadable 25 MB tar-gzipped complete distribution, or each sub-project can be downloaded individually. For details cf. under the Corpora General section.

Moby Shakespeare: ftp://gatekeeper.dec.com/pub/data/shakespeare/

The Moby Shakespeare edition, a part of the Moby Project, is the only complete freeware e-text of all Shakespeare’s works. It is easily available, in more or less complete versions and formats, from nearly all repositories of English literary e-texts. For more information cf. under the E-Texts section. [2001 April 23].

Modern English Prose Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 490 KB corpus of Modern prose (15 2000-word samples; Chapter(s) from various mid-20th Century novels) compiled by Andrew Q. Morton and Neil Hamilton-Smith. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

MUC-VI Text Collection Corpus (Message Understanding Conference English Text):

http://morph.ldc.upenn.edu/Catalog/LDC96T10.html
The MUC VI corpus contains English texts used in the 1996 Message Understanding Conference. The texts for this corpus are taken from Dow Jones Inc., Reuters America Inc. and are protected by applicable copyright law.
Available only from the LDC, through membership or for $100.

NATO Multilingual (Fr-De-En) Corpus (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-06.htm
HTML texts from NATO. cf. this site under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.

New Dragon Book of Verse Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 287 KB corpus of Modern English Verse transcribed from The new dragon book of verse, edited by Michael Harrison and Christopher Stuart-Clark, Oxford: Oxford University Press, 1977. Compiled by Graham Roberts. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

New Scientist Article Corpus: http://ota.ahds.ac.uk/

(search Catalog).
A 5 MB English corpus made of articles from "New Scientist": Vol. 96, no. 1334 (2 Dec. 1982) - Vol. 98, no. 1357 (12 May 1983). Compiled by E. O. Winter. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

New York Newspaper Advertisements and News Items 1777-1779: http://ota.ahds.ac.uk/

(search Catalog).
A 1590 KB corpus of New York newspaper advertisements and news items 1777-1779. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Newdigate Newsletters Corpus

This is an electronic version of the first 2,100 manuscript newsletters (of a total of 3,950) in the Newdigate series. Most are addressed to Sir Richard Newdigate (d. 1710), Arbury, Warwickshire; they date from 13 January 1674 to 29 September 1715 and are at the Folger Shakespeare Library, Washington, D. C. They were issued on Tuesdays, Thursdays, and Saturdays by the Secretary of State and were usually written on three sides of a bifolium. Those in this corpus come up through 11 June 1692.
A version on CD-ROM is available from ICAME (alternative page); see also a small description. For a general contact, mail to ICAME.

North American News Text Corpus: http://morph.ldc.upenn.edu/Catalog/LDC95T21.html

The North American News Text corpus is composed of news text that has been formatted using TIPSTER-style SGML markup. The text is taken from the following sources: "Los Angeles Times" & "Washington Post", 05/94-08/97, 52 million words; "New York Times News Syndicate", 07/94-12/96, 173 million; "Reuters News Service" (General & Financial), 04/94-12/96, 85 million; "Wall Street Journal", 07/94-12/96, 40 million.
Available only through LDC membership.

North American News Text Supplement Corpus: http://morph.ldc.upenn.edu/Catalog/LDC98T30.html

This release of North American News Text provides a supplement to the LDC's earlier publication of similar materials (cf. North American News Text Corpus). The same TIPSTER-style SGML markup is used in formatting the data. The data sources are as follows: "Los Angeles Times" & "Washington Post", 09/97-04/98, 11 million words; "New York Times News Syndicate", 01/97-04/98, 116 million; "Associated Press World Stream English", 11/94-04/98, 143 million.
Available only through LDC membership.

Northern Ireland Speech Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 1453 KB transcribed corpus of Northern Ireland English speech. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Northwest Coast Indian mythology Corpus: http://ota.ahds.ac.uk/ (search Catalog).

The 1734 KB English corpus of British Columbian Indian myths from published & unpublished sources compiled by Randy Bouchard and Hilde Colenbrander. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

NPECTE (Newcastle-Poitiers Electronic Corpus of Tyneside English):

http://www.ncl.ac.uk/english/research/linguistics/npecte.htm
The NPECTE project is based on two separate corpora of recorded speech. The earlier of the two corpora was gathered during the Tyneside Linguistic Survey (TLS) in the late 1960s, and consists of 86 loosely-structured 30-minute interviews. The informants were drawn from a stratified random sample of Gateshead in North-East England, and were equally divided among various social class groupings of male and female speakers, with young, middle, and old-aged cohorts. The original reel-to-reel tapes have now been saved to audio cassette format, catalogued, archived and housed in the Catherine Cookson Archive of Tyneside and Northumbrian Dialect in the Department of English Literary and Linguistic Studies (DELLS), University of Newcastle upon Tyne. The more recent corpus was collected in the Tyneside area in 1994 for an ESRC-funded project ‘Phonological Variation and Change in Contemporary Spoken English’ (PVC). This data is in the form of 18 DAT tapes, each of which averages 60 minutes in length. Dyads of friends or relatives were encouraged to converse freely with minimal interference from the fieldworker, and informants were again equally divided between various social class groupings of male and female speakers in young, middle, and old-age cohorts. This material is housed in the Department of Speech, University of Newcastle upon Tyne. Recently, an AHRB grant was awarded under the Resource Enhancement Scheme to combine the TLS and PVC collections into a single corpus and to make it available to the research community in a variety of formats: digitized sound, phonetic transcription, standard orthographic transcription, and various levels of tagged text, all aligned. The project is due to begin on 1 September 2001. Current members are Joan Beal (Sheffield), Karen Corrigan (Newcastle; homepage), Marc Fryd (Poitiers), and Hermann Moisl (Newcastle; homepage). [2001 May 1].

NPtool: http://www.lingsoft.fi/doc/nptool/

NPtool, by Atro Voutilainen, is a fast and accurate system for extracting noun phrases from English texts. It is sold by Lingsoft. For availability (it is commercial software!) you have to ask info@lingsoft.fi.

OBI (The Online Book Initiative): http://ftp.std.com/obi/

The Online Book Initiative's "Online Book Repository" (OBR) is a large collection of English language texts (originals and translations) and related materials ranging from Shakespeare and The Bible to novels, poetry, standards documents, etc. The page is only an index, but it is speedy and all texts are ready to be freely downloaded. Contact.

OMACL (Online Medieval and Classical Library): http://sunsite.berkeley.edu/OMACL/

The Online Medieval and Classical Library (hosted by the Berkeley Digital Library SunSITE) is a collection of some of the most important literary works of Classical and Medieval civilization translated into English. Texts can be browsed and searched online and you can also freely download them in ZIP format from the OMACL FTP Site at the University of Kansas. At present there are only 30 texts available, and many of the larger texts are also available in multiple-file editions.

Online Classics Horror and Phantasy Fiction: http://home.swipnet.se/~w-60478/

This page collects links to every work of classic horror and fantasy fiction (lato sensu: Shakespeare, Goethe's Faust and Milton's Paradises are included as well!) available on the Internet. All texts are in English. All links are easy to access and download; only texts which are not easily accessible elsewhere are directly reproduced on this site. All texts are freely downloadable, usually in ZIP format.

Opera e-Libretto: http://www.geocities.com/voyerju/libretti.html

Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. Displayed libretti are in English (Purcell, Blow, Clay, Cellier, Edwards, Sullivan, Ford, Cadman, Gershwin, Yanelow, Hoiby, Neff + Engl. translation of Marschner’s Vampyr), Italian, French, German, Russian and Danish. All texts are in HTML, usually split into several files according to act divisions. For a more detailed file cf. under the E-Texts section. [2001 June].

Orwell's 1984 parallel English-Romanian Text (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/BUC/buc-01.htm
Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR (http://www.tractor.de/).

Oslo ENPC (English - Norwegian Parallel Corpus): http://www.hf.uio.no/iba/prosjekt/

The English-Norwegian Parallel Corpus (ENPC) of the University of Oslo consists of original texts and their translations (English to Norwegian and Norwegian to English). The focus has been on novels and fairly general non-fictional books. In order to include material by a range of authors and translators, the texts of the corpus are limited to text extracts (chunks of 10,000-15,000 words). The coding system used to mark up the ENPC follows the suggestions made by the Text Encoding Initiative (TEI) as presented in Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard, 1994). The English part of the ENPC has been tagged for part-of-speech (POS). The tagging was done automatically using the English Constraint Grammar parser (cf. EngCG Parser) developed by Atro Voutilainen and others. The Norwegian part of the corpus will not be tagged, for lack of a Norwegian tagger.
Access to the Corpus is currently restricted to researchers and students at the University of Oslo: cf. this site.
Only the manual is freely available online.
See under the Multilingual and Parallel Corpora section for more information.

OTA (Oxford Text Archive): http://ota.ahds.ac.uk/

(beware that the page has a lot of frames and Java)
A large catalogue of electronic texts, mainly of literary, philological and scholarly genres. English is prevalent but not exclusive. They also offer some linguistic corpora for free after you send a disclaimer statement (e.g. Lampeter Corpus, Northern Ireland Speech Corpus, SUSANNE Corpus): query their catalogue with search author=corpora. For more information see the E-Texts section.

Pamphlets of the American Revolution Corpus: http://ota.ahds.ac.uk/ (search Catalog).

A 414 KB corpus made up of selections from Pamphlets of the American Revolution by Bernard Bailyn. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.

Penn Treebank Corpus: http://www.cis.upenn.edu/~treebank/home.html

The Penn Treebank Project at the University of Pennsylvania (Penn) annotates naturally occurring text for linguistic structure. Most notably, it produces skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. Distributed by the LDC.
+ Treebank-1, the first CD-ROM release (LDC Catalog No.: LDC94T4B-3.1; no longer maintained), contained over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project. It also contained the first fully parsed version of the Brown