|(A-D)||Afrikaans - Albanian - Albanian (Caucasic) - Arabic - Armenian - Australian lgs. - Awabakal (Yuin-Kuric) - Azerbaijani - Barbadian (Creole English) - Basque - Bengali - Berbice (Creole Dutch) - Bulgarian - Catalan - Chinese (incl. Cantonese) - Chiricahua (Apache) - Commonwealth Antillean Creole French - Commonwealth Windward Islands Creole English - Czech - Danish - Dutch|
|(E)||English (Modern) - English (Old & Middle) - Esperanto - Estonian|
|(F-I)||Farsi - Finnish - French - French Antillean Creole French - Frisian - Gaelic - Georgian - German - Gothic - Greek (Classic and Modern) - Gujarati - Gulf of Guinea Creole Portuguese - Guyana Creole English - Guyanais (Creole French) - Haitian (Creole French) - Hebrew - Hindi - Hungarian - Icelandic (incl. Old Norse) - Indoeuropean - Indonesian - Irish (incl. Ogamic, Old & Middle Irish) - Italian|
|(J-R)||Jamaican Creole English - Japanese - Karelian - Korean - Krio (Sierra Leone Creole English) - Kru (Liberian Pidgin English) - Latin - Latvian - Leeward Islands Creole English - Lithuanian - Livonian - Louisiana Creole French - Macaísta (Macau Creole Portuguese) - Malay - Maltese - Mambila - Manx - Mari (Eastern Meadow) - Mauritian Creole (Isle de France CF) - Mescalero (Apache) - Miskito Creole English - Mitchif (French-Cree mixed language) - Nahuatl - Neapolitan - Negerhollands (Creole Dutch) - Norwegian - Occitan - Palenquero (Creole Spanish) - Panjabi - Polish - Portuguese (incl. Brazilian & Galego-Portuguese) - Romanian - Russian|
|(S-Z)||Sardinian - Saxon (Old) - Scots - Serbo-Croatian - Singhalese - Slavonic (Old Church Slavonic) - Slovak - Slovene - Spanish - Sumerian - Swahili - Swedish - Tagalog - Taino - Tamil - Tetun (East Timorese) - Thai - Tibetan - Tocharian (A & B) - Tok Pisin (Creole English) - Turkish - Ukrainian - Upper Guinea Creole Portuguese - Urdu - Uzbek - Veps - Vietnamese - Virgin Islands Creole English - Welsh - West African Pidgin English.|
I provide here language-specific links to corpora, e-texts and NLP resources in general. Resources already presented in the previous sections are also repeated here whenever relevant.
ACE was the first systematically compiled heterogeneous (untagged) corpus in Australia, designed to support a variety of linguistic research. Interest in the differentiation between Australian, British and American English meant that a corpus modelled on the Brown and LOB corpora would provide ready comparisons. It would also serve as a strategic sample of current Australian English, and as a reference corpus for comparisons with more specialised, homogeneous corpora in Australia. ACE matches the Brown and LOB corpora in most aspects of its structure and constituency, so that direct interdialectal comparisons can be made on a comparable range of printed genres. Yet the desire to create an up-to-date corpus of Australian English prompted the decision not to match Brown and LOB chronologically, i.e. with data drawn from publications of the early 1960s. Instead, ACE consists of material from 1986.
A version on CD-ROM is available from ICAME; you can also download a small sample, and a manual as well.
(Association for Computational Linguistics / Data Collection Initiative Corpus):
The ACL Data Collection Initiative was founded "to oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties". Towards this goal, the ACL/DCI has acquired several hundred million words of text, has modified much of it so as to make it more accessible for research purposes, and has distributed tapes containing portions of this data to more than 40 research sites. The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879). SGML provides a labelled bracketing of the text, with labels permitted to have associated feature-value pairs. The ACL/DCI corpus of American English is available in different distributions. Contact: Linguistic Data Consortium, 441 Williams Hall, University of Pennsylvania, Philadelphia, PA 19104-6305; Phone: +898-0464; Fax: (+1 215) 573-2175.
+ The LDC CD-ROM version contains texts from: Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, copyright 1979, William Collins Sons Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy; and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990, 1991, University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.
Available from the LDC through membership or for a $100 fee.
+ For a CD-ROM corpus containing a complete, Treebank-style parsing of the three-year WSJ archive from the ACL/DCI corpus (about 30 million words of text), see the BLLIP 1987-89 WSJ Corpus (Release 1), available from the LDC.
The Alex Catalogue of Electronic Texts is a collection of digital documents (freely searchable online and downloadable) in the subject areas of English literature, American literature, and Western philosophy. The Catalogue isn't only an archive of downloadable texts: you can also search the content of located texts and run queries online. For example, you can search for Mark Twain's The Adventures Of Huckleberry Finn. Simple. You can then search the content of The Adventures for words like fish and belly to get a description of Huck Finn's father. Moreover, you can search the content of multiple documents simultaneously. For example, you can first locate all the documents in the collection authored by Mark Twain. Next, you can search selected documents for something like slav* (which covers slave, slaves, slavery, etc.) to draw out themes across texts. For more information see this site in the E-Texts section.
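Truncation queries of the slav* kind described above can be approximated in a few lines of ordinary code. The sketch below is only an illustration of the idea (the function name, the sample text and the star-only pattern syntax are invented for the example, not part of the Alex Catalogue): the trailing * is expanded into a regular expression and matched across several texts at once.

```python
import re

def wildcard_search(pattern, texts):
    """Find words matching a 'slav*'-style truncation pattern in several texts.

    Returns {text_name: sorted list of matching word forms}. Only a
    trailing '*' is supported -- a simplification of what a real
    full-text engine offers.
    """
    # Turn 'slav*' into the regex \bslav\w*\b
    regex = re.compile(r"\b" + re.escape(pattern.rstrip("*")) + r"\w*\b",
                       re.IGNORECASE)
    return {name: sorted({m.group(0).lower() for m in regex.finditer(body)})
            for name, body in texts.items()}

texts = {"huck_finn": "He was sold as a slave; slavery and slaves were common."}
print(wildcard_search("slav*", texts))
```

Running the query over a dictionary of texts keyed by title mimics the "search selected documents" step: each matching word form is collected per document, so recurring themes surface across the whole selection.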
The AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) project is an attempt to create a set of algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora. They are developing a Multi-tagged Corpus and Multi-Treebank, i.e. a single text-set annotated with all of these tagging and parsing schemes. Useful demos are already online:
+ AMALGAM Multi-tagged Corpus (180 Eng. sentences).
+ AMALGAM Multi-Treebank (60 Eng. sentences).
For more information see this site in the Reference, Standards etc. section.
A 1608 KB corpus made of stories from the Associated Press news network, December 1979. Compiled by Glenn Akers. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
This project of the University of Michigan Humanities Text Initiative (HTI) is assembling an electronic archive of volumes of American poetry prior to 1920. All texts are freely readable and downloadable in either HTML or SGML format. Simple, boolean or co-occurrence searches can be submitted across the entire American Verse Project collection; there is also an interface for searching only personally selected works in the collection.
A subsample of the AP corpus, annotated to show the reference of pronouns and lexical cohesion. Approximately 100,000 words.
The ANC project (led by Catherine McLeod, Nancy Ide, Charles Fillmore and others) is fostering the development of a corpus comparable to the BNC, covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language. Creation of a corpus of American English will significantly contribute to language and linguistic research, as well as provide a rich national resource for use in education at all levels. A consortium of publishers of American English dictionaries and companies with interests in language processing has been formed. Consortium members are providing both materials for inclusion in the corpus and initial financial support for the project. The LDC is providing staff time to perform the initial clean-up and base-level encoding of the data and will manage distribution of the corpus. The ANC will contain a core corpus of at least 100 million words, comparable across genres to the BNC. Beyond this, the corpus will include an additional component of potentially several hundreds of millions of words, chosen to provide both the broadest and largest selection of texts possible. The ANC will be encoded according to the specifications of XCES, the XML version of CES. Initially, the corpus will contain only textual data across a variety of genres, including transcriptions of spoken data. Audio speech data, video, etc. will be added in a later phase, depending on funding. All data will be distributed freely for non-commercial research purposes from the outset. Commercial use will be limited to members of the ANC Consortium throughout the development process and for five years after the full corpus becomes available. The project is still in its infancy, since it was conceived only in 1999 (see the Proposal paper online). [2001 April 29].
A skeleton-parsed corpus of American newswire reports. 1,000,000 words.
A skeleton-parsed corpus of a wide range of English texts. 200,000 words.
The Representative Corpus of Historical English Registers has about 2 million words of British and American English texts between 1650 and 1990, with both written and speech-based registers. Presented in the two following papers:  Biber, Douglas, Edward Finegan & Dwight Atkinson (1994a). "ARCHER and Its Challenges: Compiling and Exploring a Representative Corpus of Historical English Registers". Creating and Using English Language Corpora. Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993, ed. by Udo Fries, Gunnel Tottie & Peter Schneider, 1–13. Amsterdam & Atlanta, GA: Rodopi.  Biber, Douglas, Edward Finegan, Dwight Atkinson, Ann Beck, Dennis Burges & Jena Burges (1994b). "The Design and Analysis of the ARCHER Corpus: A Progress Report [A Representative Corpus of Historical English Registers]". Corpora Across the Centuries. Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25–27 March 1993, ed. by Merja Kytö, Matti Rissanen & Susan Wright, 3–6. Amsterdam & Atlanta, GA: Rodopi. [Last checked 2001 August 26].
Contact: Douglas Biber (see this page), Department of English, Northern Arizona University, Flagstaff, AZ 86011-6032, USA.
The Arizona Corpus of Elementary Student Writing is said to contain 5,000 essays written by native English, Spanish and Navajo student residents of the state of Arizona. No other information is available (these data were taken from a page of Przemyslaw Kaszubski on computerised learner corpora). [2001 May 1].
The BAF Corpus is a corpus of French - English bi-texts, i.e. of pairs of French and English texts which are mutual translations, and whose sentences have been aligned. This corpus has been built up by the CITI computer assisted translation group (TAO). Most of the texts are of institutional genre (Canadian Hansard, UN reports, etc.), but a few scientific papers and a literary work were also included. The whole corpus has about 400,000 words for each language. BAF Version 1.1 is already available and can be freely downloaded in UNIX GZ and ZIP formats, and each file separately in TXT and CES formats. Description, alignment conventions, encoding documentation, and a COAL Tools suite are also freely available on the site. [2001 April 23].
A 383 KB English corpus made of transcriptions from: Mss. in the Berkshire County Record Office. Compiled by C. R. J. Currie. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
A 1684 KB corpus compiled by Roger Mitton. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
This two CD-ROM newswire corpus contains a complete, Treebank-style parsing of the three-year WSJ archive from the ACL/DCI corpus -- about 30 million words of text. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistically-based methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP. This corpus both overlaps and supplements the 1-million-word Penn Treebank collection of parsed and POS-tagged WSJ texts. Available only from the LDC through membership or for a $100 fee.
A 1483 KB English corpus transcribed from: Blues lyric poetry: an anthology / Michael Taft. -- New York; London: Garland, 1983. -- (Garland reference library of the humanities ; v. 361). Compiled by Michael Taft. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. The Corpus is designed to represent as wide a range of modern British English as possible. The written part (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age groups, regions and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins. The corpus comprises 4,124 texts, of which 863 are transcribed from spoken conversations or monologues. Each text is segmented into orthographic sentence units, within which each word is automatically assigned a word class (part of speech) code. There are six and a quarter million sentence units in the whole corpus. Segmentation and word-classification were carried out automatically by the CLAWS stochastic part-of-speech tagger developed at the University of Lancaster. The classification scheme used for the corpus distinguishes some 65 parts of speech, which are described in the accompanying documentation. The corpus is encoded according to the Guidelines of the TEI, using SGML (ISO 8879). [Last rev. 2001 April 23].
+ BNC World Edition on CD-ROM. The cost (excluding VAT) is £250 for a full networked licence, or £50 (a very interesting price indeed!) for a single user licence (in addition, VAT at 17.5% where applicable is payable on orders within Europe, and a small fixed fee is charged for postage and packing). This includes the BNC Licence Fee (valid for five years) of £10, two CD-ROMs, documentation and (networked version only) source code for the SARA system. Please note however that the current version of the BNC (version 1.0) cannot be distributed outside the EU because of copyright restrictions applicable to a few texts. A new version which will not be restricted in this way is currently in production. The single-user system is designed for use on standalone computers running under any Microsoft Windows 32-bit system (i.e. Windows 95, 98, ME, NT or 2000). It needs at least 6 Gb of free disk space, and 64 Mb of RAM. Better performance will be obtained with more memory (128+ Mb) and faster (over 300 MHz) processors. The networked system is designed for use on a local TCP/IP network running under any version of Unix. The server has been successfully run under versions of Linux, Solaris, and Digital Unix and on a variety of hardware platforms: a minimum of 8 Gb of disk space is needed and at least 128 Mb of memory. The amount of memory used depends chiefly on the number of client sessions running and on the complexity of queries posed. A fully-featured Windows 32 client, which can be installed on any PC connected to the TCP/IP network, is supplied as well.
+ BNC Online is a new service which allows anyone with access to the internet to search the British National Corpus online. Several levels of access are provided. (1) You can make a simple search directly from the web browser you are currently using, freely and without registering; the restricted search interface will not return more than 50 hits, with a maximum of one sentence of context for each, but it will support any legal CQL query. (2) You can register for a free temporary user name to experiment unrestrictedly with this online service. A temporary username costs nothing, but expires after thirty days. (3) To take full advantage of the BNC Online service, however, you must first download the SARA client software and install it on your PC. SARA is a special-purpose browser and concordance generator, designed specifically for use with the BNC. It is free of charge to all BNC licensees. At present it is only available for Microsoft 32-bit Windows systems (Windows 95, 98, or NT). (4) Once you have made a temporary 30-day registration, if you want to continue your use of the service after this trial period, you must pay a £60 fee to receive a full registration. This includes a three-year licence to use the system, one copy of a detailed user manual, and free updates of the client software. This fee entitles you to a licence for one or two machines for a year. Compared with other similar services it is surely a great offer! [Rev. 2001 April 23].
+ A small spoken English subsection of the BNC constitutes 50% of the tagged and freely available CHRISTINE Corpus.
The BOnonia Legal Corpus (BOLC), developed at CILTA (Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann') at Bologna University since 1997 by Rema Rossini Favretti and Fabio Tamburini, currently consists of two subcorpora, one English and the other Italian, but it could be expanded at a later stage. Future availability is not known. For more details cf. under the Parallel Corpora section. [2001 April 23].
Approximately 1,000,000 words of American written English dating from 1960. The genre categories are parallel to those of the LOB corpus. Available as orthographic text only. For further information see the Corpus Bibliography and the Corpus Manual (also available at this address).
+ A version on CD-ROM is available from ICAME; you can also download a small sample.
+ Another version (tagged and parsed) comes from LDC as part of the Penn TreeBank.
+ A 130,000-word subset of the Brown Corpus constitutes the text basis of the annotated and freely available SUSANNE Corpus.
+ A 5 MB version compiled by W. Nelson Francis & Henry Kučera is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
Texts from Deutsche Bundesregierung. Cf. this site under Multilingual and Parallel Corpora section.
Available under subscription to TRACTOR.
The Calgary text compression corpus. This corpus is used in the book Bell, T.C., Cleary, J.G. and Witten, I.H., Text Compression, Prentice Hall, Englewood Cliffs, NJ, 1990, to evaluate the practical performance of various text compression schemes. Several other researchers are now using the corpus to evaluate text compression schemes. Nine different types of English texts are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. All files are freely downloadable. The Calgary Corpus has now been superseded by the new Canterbury Corpus. [2001 April 28].
The text component of the package includes transcripts and documentation files for 120 unscripted telephone conversations between native speakers of English; a separate LDC catalog entry, LDC97S42, provides the speech data for these conversations, which are partitioned into separate subdirectories for "training" (80 conversations), "development test set" (20 conversations) and "evaluation test set" (20 conversations). The transcripts cover a contiguous 10-minute segment of each call in the training and development test sets, and a 5-minute segment of each call in the evaluation set, yielding a total of 18.3 hours of transcribed spontaneous speech, comprising about 230,000 words. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Complete auditing information on the speakers represented in the transcripts (including gender, channel quality and so on) is also included.
Available as an FTP file from the LDC through membership or for a $500 fee.
A 5 MB corpus made of English texts by 14 Canadian poets. Compiled by Sandra Djwa. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Canterbury Corpus is a benchmark to enable researchers to evaluate lossless compression methods. This site includes test files and compression test results for many research compression methods. The Canterbury Corpus file set has been developed specifically for testing new compression algorithms. The files were selected based on their ability to provide representative performance results. This set of files is designed to replace the Calgary Corpus, which is now over ten years old. Several sets of results are available on this web site. As well as the new Canterbury Corpus, a corpus of large files has been tested, and results for the original Calgary Corpus are also available. The constituents of the true Canterbury Text Compression Corpus range from "normal" English texts (such as Shakespeare) to computing sources (C, Lisp, HTML, etc.). In addition there is also an Artificial Corpus of abnormal texts (such as alphabets, random texts, etc.), a Large Corpus made from very large, mainly English files (ranging from the complete genome of Escherichia coli to the King James Bible), the old Calgary Corpus and a Miscellaneous Corpus. All (sub)corpora are freely downloadable as TAR-GZ or ZIP files. [2001 April 28].
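As an illustration of what such a benchmark measures, the sketch below compares the lossless compressors in the Python standard library on an arbitrary sample, reporting output size and bits per character (bpc), the figure usually quoted in Calgary/Canterbury results. It is a minimal stand-in, not the Canterbury methodology itself; the function name and sample data are invented for the example.

```python
import bz2
import lzma
import zlib

def compression_report(name, data):
    """Compare stdlib lossless compressors on one corpus file (as bytes).

    Prints, per scheme: compressed size in bytes and bits per input
    character -- the style of per-file figure reported for the
    Calgary and Canterbury corpora.
    """
    for label, compress in [("zlib", zlib.compress),
                            ("bz2", bz2.compress),
                            ("lzma", lzma.compress)]:
        out = compress(data)
        bpc = 8 * len(out) / len(data)   # bits per input character
        print(f"{name:12s} {label:5s} {len(out):8d} bytes  {bpc:5.2f} bpc")

# Highly repetitive sample text, so all three schemes compress it well.
sample = b"to be or not to be, that is the question " * 200
compression_report("sample.txt", sample)
```

Run over each file of a corpus in turn, a loop like this reproduces the shape of the published results tables: one row per file per scheme, allowing schemes to be compared across text types.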
The Connecticut College CAPA is a huge electronic archive designed to make out-of-print volumes of American poetry available through the Web to readers, scholars, and researchers. All texts are stored in HTML and are freely downloadable. Contact: Wendy Battin.
The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. To enable this, great attention has been paid to the authenticity of letters on the one hand and to the social representativeness of the writers on the other. The timespan covered is from 1417 to 1681, and the size of the whole untagged corpus is 2.7 million words. Because of widespread illiteracy, however, only the highest ranks of society are well represented, and women's letters form no more than one fifth of the full CEEC. For more information on the compilation principles see Nevalainen & Raumolin-Brunberg (eds.) (1996) and Keränen (1998).
+ The Corpus of Early English Correspondence Sampler (CEECS) represents the non-copyrighted materials included in the Corpus of Early English Correspondence. This means that the editors of the collections included here died more than 70 years ago. We have also included some material (re)edited by us (see the Henslowe and Marchall collections). The sampler corpus (CEECS) reflects the structure of the full CEEC only in some respects. The time covered is nearly the same (1418-1680) and here too the bulk of the material is at the latter end of the time span. 23 letter collections have been included, with altogether 194 informants. The size of the CEECS is 450,000 words. It has been divided into two parts for technical reasons. CEECS1 covers the 15th and 16th centuries, with the exception of the Hutton collection, which goes on to the 17th century. CEECS2 consists of 17th century material; only 3 letters in Original 3 are from the late 16th century.
+ A version on CD-ROM is available from ICAME; you can also download a small sample, and a manual as well.
CELEX, the Dutch Centre for Lexical Information, has three separate databases, Dutch, English and German, all of which are open to external users. The latest release of the English database (E2.5), completed in June 1993, contains 52,446 lemmata representing 160,594 wordforms. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4 m. Dutch INL and the 17.9 m. English Collins/COBUILD text corpora. The CELEX database is open to all academic researchers and people associated with other not-for-profit research institutes free of charge (at least until 2001). Users will only be charged Dfl. 100,= for the CELEX User Guide on a one-shot basis. In order to log in to CELEX, a personal account should be obtained from Richard Piepenbrock, project manager: see at this page.
A 566 KB SGML tagged corpus made from J. Fisher, M. Richardson & J. L. Fisher, Anthology of Chancery English, U. of Tennessee Press, Knoxville 1984. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages. The bulk of the collection is however English: there are 42 corpora from normally-developing English-speaking children, 3 corpora which include a complete morphological and part-of-speech analysis (namely: 1, 2, 3), 20 corpora from clinical subjects, 15 corpora from bilingual and second language learning subjects, and more. See under Multilingual and Parallel Corpora section for a fuller file.
All materials are freely available directly from the Site.
The CHRISTINE project by Geoffrey Sampson is setting out to extend the SUSANNE analytic scheme and the SUSANNE Corpus to cover spoken English, and particularly spontaneous, informal spoken English. 50% of the Corpus comes from the spoken part of the BNC (British National Corpus), 10% from the Emotion in Speech Corpus, and 40% from LLC (London-Lund Corpus). For more details on the annotation scheme cf. under the SUSANNE Corpus.
The "hagiographical" explanation of the name "Christine" that Geoffrey Sampson gives on his page is only too charming, and must be quoted literally and without cuts. "Before this project began, I referred to it as the 'Spoken SUSANNE project'. But it is useful to have a short, distinctive name for a separate research undertaking. Apart from anything else, we need a name in order to create structure in our mass of electronic files at Sussex. SUSANNE stood for 'Surface and underlying structural analyses of natural English'. (One of the N's was taken from 'analyses'.) But the name was also appropriate for reasons that I shan't go into here, having to do with the life of St Susanna. Our new project is 'daughter of SUSANNE'. But Susanna, as a holy virgin, had no daughter. So I chose a 'successor' name in terms of the calendar. St Susanna's day is 23rd July. July 24th is the day dedicated to SS Christina of Tyre and Christina the Astonishing. (It is also the day of our local Sussex saint, Lewina of Seaford - but 'Lewina' seemed too strange a name to make a satisfactory project title.) St Christina of Tyre makes a good patroness for a project on speech. We are told that, after being condemned to have her tongue cut out, she carried on speaking just as clearly as ever. Picking up her excised tongue, she threw it at the judge, blinding him in one eye. (A neat trick, which we shall have to bear in mind in case we have any trouble with Research Council assessors.) If you insist on an acronym, CHRISTINE can just about be twisted into that too: 'Chrestomathized speech trees in natural English'. (Ouch!) At any rate, it makes a distinctive and attractive name".
The CHRISTINE Corpus was released as "Stage I", ready and freely available for use, in 1999. The second release of CHRISTINE, available since August 2000, incorporates a minor change in the distribution of analytic information between the fields, to make it more compatible with SUSANNE and easier to read. It includes about 40% of the originally planned complete CHRISTINE Corpus: while it was in being, the project annotated considerably more material than the sample now published, but the remainder was not brought into a suitable state for publication by the end of the project. The current CHRISTINE Corpus was originally referred to as "CHRISTINE Stage I", in the expectation that it would soon be replaced by a larger corpus; it is still hoped that this can eventually be done, but the work remaining has turned out to be considerably more than was envisaged in 1999; hence the short name "CHRISTINE Corpus" is now used for the corpus currently available. However, CHRISTINE in its own right offers a structural analysis of a cross-section of 1990s spontaneous speech from all British regions, social classes, etc.
For a general presentation of the CHRISTINE project go to its proper homepage, which is http://www.grsampson.net/RChristine.html (but it is advisable to refer first of all to the general one provided in the title line). There is also a full Documentation file, which is available as a single Web page. The Corpus itself can be freely downloaded by anonymous ftp: what you receive will be a gzipped tar file; use "gunzip CHRISTINE1.tgz" to uncompress it into a CHRISTINE1.tar file, and "tar -xvf CHRISTINE1.tar" to unpack the archive into its constituent files.
[updated 2004 March 25].
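The gunzip and tar steps described above can also be carried out in one go with Python's standard tarfile module. The sketch below assumes only that you have a local copy of the downloaded archive; the function name is invented for the example.

```python
import tarfile

def unpack_corpus(archive_path, dest="."):
    """Uncompress and unpack a gzipped tar archive (e.g. CHRISTINE1.tgz)
    in one step -- equivalent to running 'gunzip' and then 'tar -xvf'.

    Returns the list of member names extracted.
    """
    # "r:gz" opens the archive with transparent gzip decompression.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)   # unpack all constituent files
    return names

# unpack_corpus("CHRISTINE1.tgz")  # run against your downloaded copy
```

Using "r:gz" mode means the intermediate CHRISTINE1.tar file never needs to exist on disk, which is the main convenience over the two-command route.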
A 631 KB (443,000 words) English corpus transcribed from normalised versions of Yale prose Milton and of contemporary editions, compiled by Thomas N. Corns. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 945 KB English corpus transcribed from the Claremont Corpus of Elizabethan Verse, Modern American spelling, compiled by Ward Elliott. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
CLAWS (Constituent Likelihood Automatic Word-tagging System) is the well-known POS tagger for English texts which has been continuously developed since the early 1980s at UCREL. For more information see the full file in the Tools section.
The Bank of English is a collection of samples of modern English language held on computer for analysis of words, meanings, grammar and usage. In October 2000 the latest release of the corpus amounted to 415 million words and it continues to grow. This huge collection is composed of a wide range of different types of writing and speech. It contains samples of the English language from hundreds of different sources: written texts come from newspapers, magazines, fiction and non-fiction books, brochures, leaflets, reports, letters, and so on; the spoken word is represented by transcriptions of everyday casual conversation, radio broadcasts, meetings, interviews and discussions, etc. The material is up-to-date, with the majority of texts originating after 1990. The Bank of English was launched in 1991 by COBUILD (a division of HarperCollins Publishers) and the University of Birmingham. Since 1980 COBUILD, which is based within the School of English at Birmingham University, has been collecting a corpus of texts on computer for dictionary compilation and language study. In 1991 HarperCollins decided on a major initiative to increase the scale of the corpus to 200 million words, to form the basic data resource for a new generation of authoritative language reference publications. [Last rev. 2001 April 23].
+ The COBUILD English Dictionary books and CD-ROMs are based on this database. Now they are distributed also by Athelstan.
+ There is a free demo available.
+ Now a large corpus, the CobuildDirect Corpus, is available online through subscription. It is composed of 56 million words of contemporary written and spoken text; a detailed description of the contents is downloadable by anonymous ftp. The subscription is expensive: £50 for a one-month unlimited access trial (not renewable), £300 for 6 months of unlimited connection time, £500 for 12 months of unlimited connection time.
+ There is limited free access to a CobuildDirect Corpus Sampler: you can type in some simple queries here and get a display of concordance lines (up to 40) or collocations (up to 100 collocates) from the corpus. The query syntax allows you to specify word combinations, wildcards, part-of-speech tags, and so on.
+ Further access to larger corpora may be granted only by special arrangement.
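The concordance display offered by such samplers is essentially a KWIC (keyword-in-context) listing. As a rough illustration of the idea only (this is not Cobuild's actual query engine; the function and sample text are invented), a minimal version can be sketched in Python:

```python
import re

def kwic(text, keyword, width=30, max_lines=40):
    """Return keyword-in-context lines for each match of `keyword`,
    capped at `max_lines` (the sampler's concordance limit is 40)."""
    lines = []
    for m in re.finditer(r'\b%s\b' % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}}[{m.group(0)}]{right:<{width}}")
        if len(lines) >= max_lines:
            break
    return lines

sample = ("The corpus contains many words. A corpus is a body of text. "
          "Linguists study the corpus with concordance tools.")
for line in kwic(sample, "corpus"):
    print(line)
```

Each output line centres one occurrence of the keyword between fixed-width left and right contexts, which is what makes collocational patterns easy to scan by eye.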
The collection of the material took place in 1993. The aim of the project has been to create a corpus of British English teenage talk and make it available for research, first on the internet, next as an orthographically and prosodically transcribed CD-ROM version, and finally as a CD-ROM version with both text and sound. The recordings were made by 31 volunteer boys and girls aged 13-17 from five socially different school boroughs, so-called 'recruits' equipped with a Sony Walkman, a lapel microphone and a log book. The entire material of roughly half a million words was orthographically transcribed by trained transcribers employed by the Longman Group for transcribing The British National Corpus (BNC). A copy of this version of COLT was incorporated in the BNC. At the Bergen end, the orthographically transcribed material was subsequently submitted to careful editing, which involved correcting misinterpreted talk, reducing the number of
Online search in the entire corpus requires subscription, but there is a free search in the pilot version consisting of 151 texts.
A version on CD-ROM is available from ICAME; see a small sample.
PDF manual downloadable.
All the plays are available in HTML format; the texts are all public domain, but the collection is broken into a lot of files and is ultimately based on the free and easily available Moby Shakespeare, a part of the Moby Project. The most interesting feature of the site is the free search engine, but it was down the last time I checked. [2001 April 23].
It also contains a 500,000-word English-Slovene and Slovene-English corpus: cf. this site under Multilingual and Parallel Corpora. Available under subscription to TRACTOR.
In the Corvinus Virtual Library you will find freely readable and downloadable transcriptions of a good number of books on Hungarian history, published in the United States of America, in the English language or translated from Hungarian. Texts are usually in DOC format (with a few PDF).
This corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each. One sub-corpus consists mainly of academic discussions such as faculty council meetings and committee meetings related to testing. The second sub-corpus contains transcripts of White House press conferences, which are almost exclusively question-and-answer sessions. The transcripts making up the spoken American corpus have been selected because they appear to be relatively unedited. However, they have not been produced by linguists and so do not have all the features one might wish for. There is a tagged and an untagged version available.
The corpus can be ordered from Athelstan at $49 (untagged ver.) and $79 (tagged ver.) for an individual user (the site licence is $179). There is also a free 50K sample corpus on the web.
The third ARPA Continuous Speech Recognition (CSR) Language Model Training Data is a four CD-ROM set for speaker-independent, large-vocabulary speech recognition systems. The text collection comprises both source text data (prepared by LDC and BBN) and derived statistical tables (compiled by CMU) of unigram, bigram and trigram word frequencies. The sources include all available WSJ texts, spanning 1987 through March 1994 and all AP and San Jose Mercury news data from the three TIPSTER corpus volumes.
Available only through LDC membership.
An 808 KB English corpus with the Dedications, etc., transcribed by Ralph Crane and compiled by T.H. Howard-Hill. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Mudcat Café (cf. homepage), an online magazine dedicated to blues and folk music, maintains the Digitrad Lyric Database, a huge e-text archive with over 8000 popular lyrics in English, all freely downloadable in simple HTML format. Usually even the music is available. [2002 February 19].
This corpus contains sense-tagged word occurrences for 121 nouns and 70 verbs which are among the most frequently occurring and ambiguous words in English. It was provided by Hwee Tou Ng of the Defence Science Organisation (DSO) of Singapore. It was first reported in the following paper at ACL-96 ("Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach", by Hwee Tou Ng and Hian Beng Lee, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40-47, Santa Cruz, California, USA, June 1996. Cf. this site). These occurrences are provided in about 192,800 sentences taken from the Brown corpus and the Wall Street Journal and have been hand tagged by students at the Linguistics Program of the National University of Singapore. WordNet 1.5 sense definitions of these nouns and verbs were used to identify a word sense for each occurrence of each word. In addition to providing the word occurrences in their full sentential context, the corpus includes complete listings of the WordNet 1.5 sense definitions used in the tagging.
Available from the LDC through membership or for $100.
The Dublin-Essex Project aims at deriving linguistic resources from treebanks.
Some parsed English samples are already freely downloadable, and they promise to make future research products, such as an English lexicon and a probabilistic grammar, freely available as well.
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
Early Canadiana Online (ECO) is a full text online collection of more than 3,000 books and pamphlets (English and French languages) documenting Canadian history from the first European contact to the late 19th century. You can make simple queries online, but unfortunately you can download texts only one page at a time (English version).
The 5 MB Edinburgh Associative Thesaurus, an English corpus compiled by George Kiss and Christine Armstrong. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The 5 MB Edinburgh DOST corpus of Older Scottish texts, compiled by A.J. Aitken, Paul Bratley and Neil Hamilton-Smith. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Electronic Text Center's holdings are a large collection of browseable and searchable texts, and include approximately 45,000 on- and off-line humanities texts in twelve languages – but chiefly English –, with more than 50,000 related images. Unfortunately all this stuff is usually available only to U.Va students, faculty, and staff. See however this site: there are useful links as well.
The Emotion project was a joint research project between The Speech Laboratory at the University of Reading and The Department of Psychology at the University of Leeds. This project brings together expertise in phonetics and phonology and in cognitive psychology in order to examine emotional speech and to produce a database of such speech to put alongside the emotionally neutral material found in most spoken language databases.
They say that the project is now complete, that copies of the corpus have been deposited with the ESRC Data Archive, and that they hope to be able to make it available on CD-ROM; but they do not say how access to this corpus can be obtained now (only a small subsection of the texts is present in the free CHRISTINE Corpus). Indeed, they "are sorry but because of Copyright restrictions the Emotion in Speech Corpus is not available for distribution either online nor by other means". Let us hope ...
EngCG, the Constraint Grammar Parser of English by Pasi Tapanainen (1993), performs morphosyntactic analysis (tagging) of English text. There is an online demo at the Lingsoft site. It is sold by Lingsoft: for availability (it is commercial software!) you have to write to email@example.com.
EngCG-2, by Pasi Tapanainen and Atro Voutilainen, is a program that assigns morphological and part-of-speech tags to words in English text at a speed of about 3,000 words per second on a Pentium running Linux. It is an improved version of the original EngCG tagger, which is based on the Constraint Grammar framework advocated by a team of computational linguists in Helsinki, Finland. The documentation (and several articles) are available online.
EngCG-2 tagger with the English Constraint Grammar can be licensed from Conexor. There is also a free demo. Contacts: Tapanainen and Voutilainen.
Englex is an English parsing description tool (viz. a morphological parsing lexicon of English) for PC-Kimmo by SIL Computing. It uses the standard orthography for English. It is intended for use with PC-Kimmo (or programs that use the PC-Kimmo parser, such as K-Text and K-Tagger). With such software and Englex, you can morphologically parse English words and text. Practical applications include morphologically preprocessing text for a syntactic parser and producing morphologically tagged text. Englex works under Win 3.1-98 + NT, DOS, MAC and Unix. Freely available under agreement to the SIL standard free license.
Finally a really free site from a university institution, and a very good one! The EServer at Carnegie Mellon University has been online for ten years; today it offers a huge collection of 29,113 free works online, mainly in TXT format, all in English, both originals and translations, covering a wide range of literary genres, from novels to drama, essays and journals (such as Bad Subjects and Cultronics). The texts are arranged more for online reading than for downloading, but, of course, you can easily download them as well. There is also an FTP server restricted to members (for membership see this page). There is no general catalogue, so you have to search for what you want.
The ET10-63 corpus is a bilingual parallel corpus of English and French, containing EC official documents on telecommunications. The corpus is part-of-speech tagged and also lemmatized. Approximately 1,250,000 words of each language.
Home to electronic texts of all kinds, from contemporary American amateur authors to Shakespeare, from mainstream and off-beat religious texts to profane personal poetry; there are e-zines of every kind, from the political to the technical; many texts come from Usenet. English is the prevalent language. All texts are freely downloadable from the E-text FTP.
In 1991 a group of students at Freiburg University was engaged in what at first sight must appear as an almost anachronistic activity: they were keying in extracts of roughly 2,000 words from British newspapers. The sampling model was the press section of the LOB corpus (see Sand/Siemund 1992). 1992 saw the beginning of a new Brown corpus. The ultimate aim was to compile parallel one-million-word corpora of the early 1990s to match the original LOB and Brown corpora as closely as possible, so as to provide linguists with an empirical basis for studying language change in progress. The corpus is not tagged.
A version on CD-ROM is available from ICAME; you can also reach a small sample, and a manual as well. Contact: Christian Mair, Englisches Seminar I, Institut für Englische Sprache und Literatur, Albert-Ludwigs Universität, D-7800 Freiburg, Germany.
In 1991, Christian Mair took the initiative to compile a set of corpora that would match the well-known and widely used Brown and LOB corpora with the only difference that they should represent the language of the early 1990s. The project started in April 1991 with the compilation of the British Press Section of the new FLOB corpus. 1992 saw the beginning of the new Freiburg Brown Corpus, Frown. The corpus is not tagged.
A version on CD-ROM is available from ICAME; a small sample is available; and a manual too.
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament.
See under Multilingual and Parallel Corpora section.
A 578 KB English corpus made up of the 1879-1880 issues of Harpers Magazine. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Helsinki Historical Corpus is a computerized collection of extracts of continuous text. The Corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a dialect part based on transcripts of interviews with speakers of British rural dialects from the 1970's. The aim of the Corpus is to promote and facilitate the diachronic and dialectal study of English as well as to offer computerized material to those interested in the development and varieties of language. For more details cf. the full file in the Old & Middle English section.
A 6.9 MB - 1 million words English corpus from the Hong Kong South China Morning Post Corpus, Feb-March 1992. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A two CD-ROM set containing data from transcribed news broadcasts, designated for use in the baseline language model (LM) for the 1996 CSR Hub 4 Evaluation. The LDC obtained the bulk of the data from broadcast news CD-ROMs produced by Primary Source Media, Inc. This portion covers the period from January 1992 to April 1996 and contains approximately 1 gigabyte of data uncompressed. This release also includes about 36 megabytes of material received on floppy disks covering the period from late May through June 1996, in a somewhat different format from the bulk of the data. The text data are presented in two forms: (1) a relatively unprocessed ("raw" or "sentence-tagged") form and (2) a fully processed ("conditioned", "verbalized-punctuation") form. Available only through LDC membership.
http://morph.ldc.upenn.edu/Catalog/LDC97T22.html (cf. also 1997)
The primary motivation for this collection is to provide training data for the DARPA "Hub-4" Project on continuous speech recognition in the broadcast domain. Transcripts have been made of all recordings in these publications, manually time-aligned to the phrasal level and annotated to identify boundaries between news stories, speaker turn boundaries and gender information about the speakers. The released version of the transcripts is in SGML format, and accompanying documentation and an SGML DTD file are included with the transcription release.
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. The speech files are available in a 19 disc training data set with one additional disc of development data and an additional disc of evaluation data.
The 1997 Broadcast News Speech Corpus contains in a set of 18 CD-ROMs a total of 97 hours of recordings from radio and television news broadcasts, gathered between June 1997 and February 1998. It has been prepared to serve as a supplement to the 1996 Broadcast News Speech collection (consisting of over 100 hours of similar recordings).
Both available only by membership to the LDC.
The 5 MB English corpus transcribing Humanist, the complete electronic discussion group, 1987-1989. Compiled by Willard McCarty. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A skeleton-parsed corpus of computer manuals. 800,000 words.
The Journal of ICAME (International Computer Archive of Modern and Medieval English) has been published once a year since 1977, with articles, conference reports, reviews and notices related to corpus linguistics. For more information see the full file in the Reference section.
The International Corpus of English began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English. Each ICE corpus consists of spoken and written material produced after 1989. Each corpus (cf. the components of ICE) contains one million words. In order to ensure compatibility between the national corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. Cf. the ICE-GB (Great Britain component), ICE-NZ (New Zealand component) and ICE-EA (East African component). ICE also incorporates ICLE, the International Corpus of Learner English.
The East African component of ICE, the International Corpus of English, is a computerized (but not tagged) collection of spoken and written texts from Kenya and Tanzania. A version on CD-ROM is available from ICAME; a small sample is available both for the written section and for the spoken section. A PDF manual is downloadable as well.
ICE-GB, the British component of ICE, is the first of the ICE corpora to be completed and is now available: one million words of spoken and written British English from the 1990s, tagged, parsed, checked, and bundled with the ICECUP exploration software designed for parsed corpora. ICE-GB is both POS-tagged and parsed.
+ The ICE-GB CD-ROM can be ordered from the site at a price of £293.75.
+ The ICE-GB Sample Corpus consists of ten texts – over 20,000 words – fully parsed and annotated, exactly as they are in ICE-GB, bundled with a full working version of ICECUP III. It is available for free download.
One million words of spoken and written New Zealand English collected in the years 1990 to 1996. ICE-NZ, the New Zealand component of ICE, consists of 600,000 words of speech and 400,000 words of written text. The WSC and the spoken component of ICE-NZ share 9 categories. Since informal conversational data in particular was so difficult to collect, there is an overlap of 339,248 words (173 files) between the two corpora to achieve economy in data collection.
At this page, you can find some Notes on collecting conversations for the ICE-NZ Corpus by Janet Holmes.
ICLE (a part of the ICE project) is a computerized corpus of argumentative essays on different topics written by advanced learners of English (university students of English mainly in their second or third year). The ICLE project was launched in 1990 by Sylviane Granger, University of Louvain-la-Neuve, Belgium, and the corpus is made up of a number of subcorpora representing the following language backgrounds: Chinese, Czech, Dutch, Finnish, French, German, Japanese, Polish, Russian, Spanish, and Swedish. There is also a smaller comparable corpus of British and American undergraduate essays. The length of the essays varies between 500 and 1000 words. The homepage shown hereupon is properly that of the CECL (Centre for English Corpus Linguistics), which acts as the main ICLE page. For more details you must go to the pages of each component, viz. Swedish, Finnish, Polish, Brazilian, Czech.
A small library of interactive hypertexts for free reading and search, maintained by Èulogos. All are literary texts, many religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
The European (English, French, and Spanish) Language Newspaper Text tagged corpus, freely queryable online. For more information see the Multilingual and Parallel Corpora section.
This corpus is made up of a set of pieces from the Official Journal of the European Community (JOC) and is CES (Corpus Encoding Standard) conformant. It is available with three levels of treatment: paragraph annotated (CESDOC), POS-tagged (CESANA) and parallel text aligned (CESALIGN).
Availability unknown: only a few samples are available for download.
http://cactus.aist-nara.ac.jp/staff/matsu/misc/nlt.html (Japanese also)
The English morphological analyzer of the Nara Institute of Science and Technology deals with inflection of English nouns, verbs and adjectives. Since the treatment of the information given by inflection differs between systems, the detailed information is assumed to be written in grammar rules by the user. It is free without restriction. For fuller details cf. the NAIST Natural Language Tools. [2001 April 28].
The text data contained on this two CD-ROM set represent a release of the JURIS (Justice Department Retrieval and Inquiry System) data collection that has been made available to the Linguistic Data Consortium (LDC) by the U.S. Department of Justice. The time span of the text ranges from the 1700's to the early 1990's. There are 1664 individual text files in the corpus, 1011 on the first CD-ROM and 653 on the second. There are a total of 694,667 document units in the corpus and these can be categorized to some extent with regard to their content. The terminology and organization of categories are those used in the JURIS documentation.
Available from the LDC through membership or at a price of $1500.
The Textlist page of the Kirchenmusik online site (a good and well-known resource for music lovers) by Joachim Vogelsänger unfolds a huge and free collection of texts of Oratorios, Cantatas, Sacred Hymns and the like. Most are in German, but English texts are also well represented, spanning from Händel's Alexanderfest to Britten's Ceremony of Carols and Tippett's A child of our time. All the texts are freely downloadable in simple HTML format. For more details cf. the full file in the E-Texts section. [2001 August 27].
An online freely queryable database of English-French aligned texts, processed by the same software (by Knut Hofland) used for the Oslo ENPC project.
Approximately 1,000,000 words of Indian written English dating from 1978. The Kolhapur Corpus of Indian Written English is intended to be comparable to the Brown and LOB corpora, and so to serve as source material for comparative studies of American, British and Indian English, which in turn is expected to lead to a comprehensive description of Indian English. The Indian Corpus is a representative corpus of sample texts printed and published in 1978; the texts were largely selected by a stratified random sampling process, and the genre categories are parallel to those of the LOB corpus. Available as untagged orthographic text only.
A version on CD-ROM is available from ICAME; a small sample and a manual are available.
The Lampeter (untagged) Corpus of Early Modern English Tracts is a collection of texts on various subjects published between 1640 and 1740 - a time marked by the rise of mass publication, the development of a public discourse in many areas of everyday life and, last but not least, the standardisation of British English. The Lampeter Corpus has been compiled over the last four years at Chemnitz University's REAL Centre and was completed fairly recently, i.e. May 1998. It is available for scholarly research free of charge (though the distribution service is not yet activated). Other available versions:
+ Oxford Text Archive version (Oxford University Computing Services - 13 Banbury Road, Oxford OX2 6NN - UK)
+ ICAME version (International Computer Archive of Modern and Medieval English); cf. also the alternative page. Also available from the Norwegian Computing Centre for the Humanities (Harald Haarfagresgt. 31, N-5007 Bergen, Norway), with a small downloadable sample.
+ a UCREL version is still advertised, but the official page is now that of REAL.
+ An SGML-TEI Light 7.8 MB version is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 555 KB English corpus of Late Modern English Prose. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Lancaster Corpus of Children's Project Writing (LCCPW) is a digitized collection of project work produced by children aged between 9 and 11 in one primary school classroom, collected in 1994-1997, and built up in corpus format by Nick Smith and Roz Ivanič. It is also a component in a larger research programme: a longitudinal study of children's writing-for-learning, based on the writing of 8-12 year old children. A demonstration of this "strange" and highly remarkable corpus took place at the CL 2001 Congress (Lancaster April 1), on which I rely for this file. The corpus is mounted on a public-access website to provide easy navigation between its elements. It consists of: (a) a hyperdocument showing the original visual form of these texts, which can be navigated in a variety of ways; (b) an electronic version of the texts themselves, transcribed from handwriting using a system developed at Lancaster; (c) an electronic version of the texts tagged for POS. It surely deserves (more than) a visit! [2001 May 2].
Articles in English from Le Monde Diplomatique (HTML texts). From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
The 384 KB LEXIS Corpus (samples of 1963 spoken English). Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Reference links-page of the Library of Congress for American electronic publishing.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: English has 1865 texts. For a more detailed description, see the E-Texts section.
(cf. also ICAME and UCREL pages).
As the name implies, the London-Lund Corpus of Spoken English (LLC) derives from two projects. The first is the Survey of English Usage (SEU) at University College London, launched in 1959 by Randolph Quirk, who was succeeded as Director in 1983 by Sidney Greenbaum. The second project is the Survey of Spoken English (SSE), which was started by Jan Svartvik at Lund University in 1975 as a sister project of the London Survey. The spoken corpus of the Survey of English Usage has been transcribed with a sophisticated marking of prosodic and paralinguistic features. All the SEU texts, both written and spoken, have been analysed grammatically. In 1975 the Survey of Spoken English was established at Lund. Its initial aim was to make available, in machine-readable form, the spoken material which by then had been collected and transcribed in London: 87 texts totalling some 435 000 words (see Svartvik et al. 1982 for an account of the input procedures).
A version on CD-ROM is available from ICAME; you can also reach a small sample and a manual. As general contact, mail to ICAME.
+ A 5 MB version is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
+ Some transcriptions of the London-Lund corpus that took place as part of the Beach/CoMoPro projects at the University of Edinburgh, Centre for Cognitive Science, are freely downloadable by FTP. [Added 2001 August 5].
A manually parsed subsample of the LOB corpus showing the surface phrase structure of each sentence, the 45,000-word Lancaster-Leeds Treebank, which Geoffrey Sampson developed twenty years ago for Geoffrey Leech and Roger Garside's parsing project, though small, was apparently the first in the field: the authors believe that they coined the term treebank itself, which has since come into general use in computational linguistics. The 45,000 words are taken from all the genre categories of the LOB corpus.
See also the ICAME LOB page.
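A treebank of this kind pairs each sentence with a labelled bracketing of its surface phrase structure. As a toy illustration of the data format only (this is not the actual Lancaster-Leeds annotation scheme, and the category labels are invented), a bracketed parse can be read into a nested structure like so:

```python
def parse_brackets(s):
    """Parse a bracketed tree string, e.g. '(S (NP the cat) (VP sat))',
    into nested Python lists: the label first, then the children."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def helper(i):
        # tokens[i] must open a constituent
        assert tokens[i] == "("
        node = []
        i += 1
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)   # recurse into a sub-constituent
                node.append(child)
            else:
                node.append(tokens[i])  # a label or a leaf word
                i += 1
        return node, i + 1             # skip the closing bracket

    tree, _ = helper(0)
    return tree

tree = parse_brackets("(S (NP the cat) (VP sat))")
print(tree)  # ['S', ['NP', 'the', 'cat'], ['VP', 'sat']]
```

Nested lists of this shape are the minimal machine-readable form a surface phrase-structure analysis can take; real annotation schemes add far richer category and function labels.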
Approximately 1,000,000 words of British Written English dating from 1960. The corpus is made up of 15 different genre categories. Available as orthographic text, and tagged with the CLAWS1 part-of-speech tagging system. The Leeds-Lancaster Treebank (LLT) and Lancaster Parsed Corpus (cf. LPC) are analyzed subsamples of the LOB corpus. For manuals see the following pages: Corpus Manual (1978), Tagged Corpus Manual - 1986 (alternative page) and Local Online copy (- password required).
A version on CD-ROM is available from ICAME; for a small sample follow this link. For a general contact, mail to ICAME.
+ A 5 MB tagged and horizontal format version compiled by Stig Johansson is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
Approximately 14.5 million words of Written English from various geographical locations in the English-speaking world and of various dates and text types. Orthographic text only. Contact: Della Summers, Longman Dictionaries, Longman House, Burnt Mill, Harlow, Essex, CM20 2JE UK.
A subsample of the LOB corpus, parsed by computer and manually corrected by several researchers. Approximately 140,000 words with samples from each of the 15 categories in the LOB corpus.
A version on CD-ROM is available from ICAME; you can also get a small introduction and a fuller PDF manual.
LUCY is Geoffrey Sampson's Structure in Written English in the UK Project, which began in January 2000 and ended in winter 2003. The LUCY Corpus is a structurally-annotated sample ("treebank") of present-day British written English, representing not only the polished writing of published documents, but also the less-skilled or unskilled writing of young adults at the end of secondary and beginning of tertiary education, and of children aged nine to twelve in various types of school and parts of the country. Like its sister treebanks, SUSANNE and CHRISTINE, LUCY uses the same highly detailed and comprehensive scheme of structural annotation (the "SUSANNE scheme"), which is widely recognized as the most precise system of its kind available.
The aim of the LUCY project was to create a body of machine-readable data that will enable researchers to examine how the grammatical resources of the English language are actually used by people writing English in Britain at the turn of the millennium, and to compare written usage with usage in spontaneous speech. The material was selected in order to make the Corpus especially relevant for studies of young people's acquisition of writing skills. As well as samples drawn from recent published writing (which can be seen as in some sense representing the 'model' for writing-skills education), and from unpublished writing of various types produced by mature users of written English (for instance, business correspondence), LUCY includes samples of writing produced by young people destined for careers where the generation of written prose will be a significant element, but who have not yet finished acquiring mature writing skills.
For a general presentation of the LUCY project go to its proper homepage, which is http://www.grsampson.net/RLucy.html (but it is advisable to refer first of all to the general one provided in the title line). There is also a full Documentation file, which is available as a single Web page. The Corpus itself can be freely downloaded by anonymous ftp: what you receive will be a gzipped tar file; use "gunzip LUCYrf.tgz" to uncompress it into a LUCYrf.tar file, and "tar -xvf LUCYrf.tar" to unpack the archive into its constituent files.
[updated 2004 March 25].
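The two unpacking commands above can be sketched as a short shell session. Since no ftp download is performed here, the first lines fabricate a stand-in LUCYrf.tgz (the file name comes from the LUCY instructions; its contents below are purely illustrative); only the gunzip and tar lines mirror the documented procedure.

```shell
# Fabricate a stand-in for the archive normally fetched by anonymous ftp.
set -e
mkdir -p LUCY && echo "sample corpus file" > LUCY/readme.txt
tar -czf LUCYrf.tgz LUCY && rm -rf LUCY   # stand-in for the ftp download

# The documented two-step unpacking procedure:
gunzip LUCYrf.tgz                         # uncompresses to LUCYrf.tar
tar -xvf LUCYrf.tar                       # unpacks the constituent files
```

The same two-step pattern applies to any of the gzipped tar archives mentioned on this page (e.g. SUSANNE.tgz); on modern systems a single "tar -xzf" would do both steps at once.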
The 591 KB Melbourne-Surrey corpus of Australian English compiled by Knut Hofland. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Michigan Early Modern English Materials (MEMEM) were compiled by Richard W. Bailey, Jay L. Robinson, James W. Downer, with Patricia V. Lehman, and are freely queryable online. The Materials consist of citations collected for the modal verbs and certain other English words for the Early Modern English Dictionary. Many of the slips used in the work were the original Oxford English Dictionary slips, provided to the University of Michigan by the editors of the OED. The work included here was prepared electronically over a period of several years ending in 1975. The source file is ca. 16 megabytes and consists of ca. 50,000 records. According to the description page, the source files are available via anonymous ftp as a compressed 5 MB file containing the Materials, the DTD constructed for the Materials, and the character DTD for the Materials; the links are, however, clearly wrong. [2001 April 23].
As part of the METER (MEasuring TExt Reuse) project, a team from the Departments of Computer Science and of Journalism (consisting of Robert Gaizauskas, Jonathan Foster, Yorick Wilks, John Arundel, Paul Clough and Scott Piao) has built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient search for related PA and newspaper texts, the corpus is annotated at two levels. First, each of the newspaper texts is assigned one of three coarse, global classifications indicating its derivation relation to the PA: wholly derived, partially derived or non-derived. Second, about 400 wholly or partially derived newspaper articles are annotated down to the lexical level, indicating for each phrase, or even individual word, whether it appears verbatim, rewritten or as new material. The hope is that this corpus will be of use for a variety of studies, including detection and measurement of text reuse, analysis of paraphrase and journalistic styles, and information extraction/retrieval. [2001 April 29].
+ A lot of documentation and material is available directly from the site.
+ Beta 1.0 was released in a limited edition and presented at the CL2001 Congress held at Lancaster University in spring 2001. Contact.
The 5 MB Michigan Early Modern English Corpus was compiled by Richard W. Bailey and Jay L. Robinson. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
Myriobiblos, the E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to Modern Greek; but there are also fewer texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].
The Moby lexicon project by Grady Ward has been placed in the public domain. There is a downloadable 25 MB tar-gzipped complete distribution, or each sub-project can be downloaded individually. For details cf. under the Corpora General section.
Moby Shakespeare edition, a part of the Moby Project, is the only complete freeware e-text of all Shakespeare’s works. It is easily available in more or less complete versions and formats from nearly all repositories of literary English e-texts. For more information cf. under the E-Texts section. [2001 April 23].
A 490 KB corpus of Modern prose (15 2000-word samples; Chapter(s) from various mid-20th Century novels) compiled by Andrew Q. Morton and Neil Hamilton-Smith. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The MUC VI corpus contains English texts used in the 1996 Message Understanding Conference. The texts for this corpus are taken from Dow Jones Inc., Reuters America Inc. and are protected by applicable copyright law.
Available only from the LDC, through membership or for a $100 fee.
A 287 KB corpus of Modern English Verse transcribed from The new dragon book of verse, edited by Michael Harrison and Christopher Stuart-Clark, Oxford: Oxford University Press, 1977. Compiled by Graham Roberts. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 5 MB English corpus made of articles from "New Scientist": Vol. 96, no. 1334 (2 Dec. 1982) - Vol. 98, no. 1357 (12 May 1983). Compiled by E. O. Winter. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 1590 KB corpus of New York newspaper advertisements and news items 1777-1779. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
This is an electronic version of the first 2,100 manuscript newsletters (of a total of 3,950) in the Newdigate series. Most are addressed to Sir Richard Newdigate (d. 1710), Arbury, Warwickshire; they date from 13 January 1674 to 29 September 1715 and are at the Folger Shakespeare Library, Washington, D. C. They were issued on Tuesdays, Thursdays, and Saturdays by the Secretary of State and were usually written on three sides of a bifolium. Those in this corpus come up through 11 June 1692.
A version on CD-ROM is available from ICAME (alternative page); see also a small description. For a general contact, mail to ICAME.
The North American News Text corpus is composed of news text that has been formatted using TIPSTER-style SGML markup. The text is taken from the following sources: "Los Angeles Times" & "Washington Post", 05/94-08/97, 52 million words; "New York Times News Syndicate", 07/94-12/96, 173 million; "Reuters News Service" (General & Financial), 04/94-12/96, 85 million; "Wall Street Journal", 07/94-12/96, 40 million.
Available only by LDC membership.
This release of North American News Text provides a supplement to the LDC's earlier publication of similar materials (cf. North American News Text Corpus). The same TIPSTER-style SGML markup is used in formatting the data. The data sources are as follows: "Los Angeles Times" & "Washington Post", 09/97-04/98, 11 million words; "New York Times News Syndicate", 01/97-04/98, 116 million; "Associated Press World Stream English", 11/94-04/98, 143 million.
Available only by LDC membership.
Northern Ireland English transcribed corpus of speech. 1453 Kb. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The 1734 KB English corpus of British Columbian Indian myths from published & unpublished sources compiled by Randy Bouchard and Hilde Colenbrander. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The NPECTE project is based on two separate corpora of recorded speech. The earlier of the two corpora was gathered during the Tyneside Linguistic Survey (TLS) in the late 1960s, and consists of 86 loosely-structured 30-minute interviews. The informants were drawn from a stratified random sample of Gateshead in North-East England, and were equally divided among various social class groupings of male and female speakers, with young, middle, and old-age cohorts. The original reel-to-reel tapes have now been salvaged to audio cassette format, catalogued, archived and housed in the Catherine Cookson Archive of Tyneside and Northumbrian Dialect in the Department of English Literary and Linguistic Studies (DELLS), University of Newcastle upon Tyne. The more recent corpus was collected in the Tyneside area in 1994 for an ESRC-funded project ‘Phonological Variation and Change in Contemporary Spoken English’ (PVC). This data is in the form of 18 DAT tapes, each of which averages 60 minutes in length. Dyads of friends or relatives were encouraged to converse freely with minimal interference from the fieldworker, and informants were again equally divided between various social class groupings of male and female speakers in young, middle, and old-age cohorts. This material is housed in the Department of Speech, University of Newcastle upon Tyne. Recently, an AHRB grant was awarded under the Resource Enhancement Scheme to combine the TLS and PVC collections into a single corpus and to make it available to the research community in a variety of formats: digitized sound, phonetic transcription, standard orthographic transcription, and various levels of tagged text, all aligned. The project is due to begin on 1 September 2001. Current members are Joan Beal (Sheffield), Karen Corrigan (Newcastle; homepage), Marc Fryd (Poitiers), and Hermann Moisl (Newcastle; homepage). [2001 May 1].
NPtool, by Atro Voutilainen, is a fast and accurate system for extracting noun phrases from English texts. It is sold by Lingsoft. For availability (it is commercial software!) you have to ask firstname.lastname@example.org.
The Online Book Initiative's "Online Book Repository" (OBR) is a large collection of English language texts (originals and translations) and related materials ranging from Shakespeare and The Bible to novels, poetry, standards documents, etc. The page is only an index, but it is speedy and all texts are ready to be freely downloaded. Contact.
The Online Medieval and Classical Library (held by the Berkeley Digital Library SunSITE) is a collection of some of the most important literary works of Classical and Medieval civilization translated into English. Texts can be browsed and searched online and you can also freely download them in ZIP format from the OMACL FTP Site at the University of Kansas. At present there are only 30 texts available, and many of the larger texts are also available in multiple-file editions.
This page collects links to every work of classic horror and fantasy fiction (lato sensu: Shakespeare, Goethe's Faust and Milton's Paradises are included as well!) available on the Internet. All texts are in English. All links are easy to access and download; only texts which are not easily accessible are directly reproduced on this site. All texts are freely downloadable, usually in ZIP format.
Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. Displayed libretti are in English (Purcell, Blow, Clay, Cellier, Edwards, Sullivan, Ford, Cadman, Gershwin, Yanelow, Hoiby, Neff + English translation of Marschner’s Vampyr), Italian, French, German, Russian and Danish. All texts are in HTML, usually broken into multiple files according to act divisions. For a more detailed description cf. under the E-Texts section. [2001 June].
The English-Norwegian Parallel Corpus (ENPC) of the University of Oslo consists of original texts and their translations (English to Norwegian and Norwegian to English). The focus has been on novels and fairly general non-fictional books. In order to include material by a range of authors and translators, the texts of the corpus are limited to text extracts (chunks of 10,000-15,000 words). The coding system used to mark up the ENPC follows the suggestions made by the Text Encoding Initiative (TEI) as presented in Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard, 1994). The English part of the ENPC has been tagged for part-of-speech (POS). The tagging was done automatically using the English Constraint Grammar parser (cf. EngCG Parser) developed by Atro Voutilainen et al. The Norwegian part of the corpus will not be tagged, for lack of a Norwegian tagger.
Access to the Corpus is at present restricted to researchers and students at the University of Oslo: cf. this site.
Only the manual is freely available online.
See under Multilingual and Parallel Corpora section for more infos.
(beware that the page has a lot of frames and Java)
A large catalogue of electronic texts, mainly of literary, philological and scholarly genre. English Language is prevalent but not exclusive. They offer also some linguistic corpora for free after sending a disclaimer statement (e.g. Lampeter Corpus, Northern Ireland Speech Corpus, SUSANNE Corpus): query their catalogue with search author=corpora. For more information see the E-Texts section.
A 414 KB corpus made by selections from Pamphlets of the American Revolution by Bernard Baylin. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The Penn Treebank Project at the University of Pennsylvania (Penn) annotates naturally occurring text for linguistic structure. Most notably, it produces skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. Distributed by the LDC.
+ Treebank-1, the first CD-ROM release (LDC Catalog No.: LDC94T4B-3.1; no longer maintained), contained over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project. It also contained the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank tag set. Tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS were also included. In addition, the CD-ROM included source code for programs that were used by the Penn Treebank project in creating portions of the data. Source code is also included for "tgrep".
+ Treebank-2. The Penn Treebank Project Release 2 CDROM features the new Penn Treebank II bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and treating bracketed data. This CDROM also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release. The contents of Treebank Release 2 are as follows: 1 million words of 1989 Wall Street Journal material annotated in Treebank II style; a small sample of ATIS-3 material annotated in Treebank II style; a 300-page style manual for Treebank II bracketing, as well as the part-of-speech tagging guidelines; the contents of the previous Treebank CDROM (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank I style); and tools for processing Treebank data, including "tgrep", a tree-searching and manipulation package (note that usability of this release of tgrep is limited: users of Sun sparc systems should have no problem, but others may find the software difficult or impossible to port).
+ Treebank-3. This CD-ROM contains the same Treebank-2 material (1 million words of 1989 Wall Street Journal material annotated in Treebank II style; a small sample of ATIS-3 material annotated in Treebank II style; a fully tagged version of the Brown Corpus) and some new material (Switchboard tagged, dysfluency-annotated, and parsed text; Brown parsed text). Available from the LDC through membership or for a $2,500 fee (Treebank-2 and Treebank-3 each). [Last Rev. 2001 April 28].
Starry.com archive of unpublished, prepublished and cyber-published American literature: from traditional novels written for the web to real "virtual novels". All texts are HTML, suitable for reading online rather than for downloading, but, of course, they can be freely downloaded as well. If you want fresh, modern and post-modern narrative raw texts in English, this is surely a good spot.
http://ota.ahds.ac.uk/ (search Catalog).
The 446 KB PIXI Corpus compiled by L. Gavioli and G. Mansfield from Service encounters in English and Italian bookshops, by L. Gavioli and G. Mansfield, Cooperativa Libraria Uni.Editrice Bologna 1990. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
Orthographic transcriptions of some 61,000 words of child language data. The corpus is parsed according to Hallidayan systemic-functional grammar. There is no prosodic information. You can get a small introduction and a fuller manual.
+ A CD-ROM version is available from ICAME.
+ The EPOW, the Edited Polytechnic of Wales Corpus, a 7.8 MB version is freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The oldest (since 1971) and largest project to put out-of-copyright literature online, freely available. The majority of the texts are in English, but there are also a few titles in other European languages. For more information see under the E-Texts section.
+ Sealsoft Literary Works is a page where you can make online searches on all the Gutenberg texts, treated as a corpus of more than 80 million words.
This page maintained by Lyle Neff (cf. homepage) is a rich database of online sources of opera libretti. A lot of e-texts (HTML format) are freely available directly from the site; others are only linked to. Besides libretti, secular songs and sacred vocal music are also dealt with. Languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other less specific musical and linguistic resources. [2001 June 20].
Parts of the German-English Translation Corpus are now available online.
See under Multilingual and Parallel Corpora section.
The 1858 KB English/ME corpus compiled by the Records of early English drama project. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 1459 KB corpus transcribed from The prologues and epilogues of the Restoration, a complete edition edited by Pierre Danchin, Nancy: Publications Université Nancy II, 1981-1988. Compiled by David Bond. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 79.1 KB English corpus made of Role-play transcripts from John Bro Transcripts of discussions in American, English/Chinese. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
Just released: the Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. Contact: Jack DuBois, Department of Linguistics, University of California at Santa Barbara, Santa Barbara, CA 93106, USA.
See also the research page and this other link.
It consists of 52,637 words of British spoken English, mainly taken from BBC radio broadcasts dating from the mid 1980s. In line with the conventions used in the LOB corpus project, each sample text is assigned to an overall category (indicated by a letter) and identified by a "part number" (indicated by a digit). In addition, each text is given an absolute number to indicate its position in the corpus as a whole. Available as orthographic text, tagged with the CLAWS2 part-of-speech tagging system, parsed, and prosodically annotated. There are also tapes of a standard suitable for the instrumental analysis of F0 values.
The CD-ROMs of the SEC are available from ICAME (alternative page); a small sample and a manual are available on the web. As general contact, mail to ICAME.
SENSEVAL is a project concerned with evaluating Word Sense Disambiguation systems. The first SENSEVAL took place in the summer of 1998, for English, French and Italian. The second is planned for Pisa, spring 2001. As a free demo, they let you have English dictionary entries and tagged examples for 35 words (see this link). For more cf. the References, Standards etc. section.
The SETIS Library (the Scholarly Electronic Text and Image Service) provides a rich collection of 18th-, 19th- and early 20th-century Australian texts. All are freely readable and downloadable in TEI2-conformant HTML format.
The Survey of English Usage (SEU) of the Department of English Language and Literature at University College London, was launched in 1959 by Randolph Quirk, who was succeeded as Director in 1983 by Sidney Greenbaum. The goal of the Survey of English Usage is to provide the resources for accurate descriptions of the grammar of adult educated speakers of English. For that purpose the major activity of the Survey has been the assembly and analysis of a corpus comprising samples of different types of spoken and written British English. The original target for the corpus of one million words has now been reached, and the corpus is therefore complete: the SEU corpus contains 200 samples or 'texts', each consisting of 5000 words, for a total of one million words; the texts were collected over the last 30 years (1953-1987), half taken from Spoken English and half from Written English; the Spoken English texts comprise both dialogue and monologue; the Written English texts include not only printed and manuscript material but also examples of English read aloud, as in broadcast news and scripted speeches. ICE-GB, the British part of ICE (the International Corpus of English project), is also included; the spoken texts make up the London-Lund Corpus (cf. LLC). Some 83,419 sentences are now available tagged and parsed for function. It includes ICECUP, a dedicated retrieval software. Contact: Survey of English Usage, University College, London, Gower Street, London WC1E, UK.
SHAXICON by Donald Foster is a lexical database that indexes all of the words that appear in the canonical plays 12 times or fewer, including a line-citation and speaking character for each occurrence of each word. (These are called "rare words", though they are not rare in any absolute sense -- "family [n.]" and "real [ad.]" are rare words in Shakespeare.) All rare-word variants are indexed as well, including the entire "bad" quartos of H5, 2H6, 3H6, Ham, Shr, and Wiv; also the nondramatic works, canonical and otherwise (Ven, Luc, PP, PhT, Son, LC, FE, the Will, "Shall I die" et al.); the additions to Mucedorus and The Spanish Tragedy, the Prologue to Merry Devil of Edmonton, all of Edward III and Sir Thomas More (hands S and D); Ben Jonson's Every Man in His Humour (both Q1 and F1) and Sejanus (F1); and more; but these other texts have no effect on the 12-occurrence cutoff that sets the parameters for SHAXICON's lexical universe.
Address queries to Professor Donald Foster, for availability.
Cf. also the Shakespeare Authorship home page.
It was started by Jan Svartvik at Lund University in 1975 as a sister project of the SEU. The corpus consists of 100 texts, each of 5000 words, totalling 500,000 running words of spoken British English. Information about the compilation of the corpus and explanation of the symbols (prosodic, phonetic, etc.) used on the CD-ROM can be found in the printed volume A Corpus of English Conversation, edited by J. Svartvik & R. Quirk, Lund Studies in English 56, Lund: Liber/Gleerups, 893 pp, (1980).
Contact: Jan Svartvik, Lund University, Department of English, Helgonabacken 14, S-223 62 Lund, Sweden.
Star Thrower Publishing provides modern and experimental American literature freely online. All texts are HTML, suitable for reading online rather than for downloading, but, of course, they can be freely downloaded as well. Go to this page and see what they have. [Last checked 2001 December 25].
The SUSANNE Corpus, that is to say Geoffrey Sampson's Written American English Annotated Corpus, contains a 130,000-word cross-section of written American English (it is based on a subset of the million-word Brown Corpus; for the British counterparts of SUSANNE cf. the CHRISTINE Corpus, based on samples of the spoken language, and the LUCY Corpus, based on samples of the written language), annotated in accordance with the SUSANNE (Surface and Underlying Structural ANalysis of Natural English) scheme.
This scheme, which is so far the first serious attempt anywhere to produce a comprehensive, fully explicit annotation scheme for English grammatical structure, is fully expounded in Sampson's book English for the Computer: The SUSANNE Corpus and analytic scheme (published by Clarendon Press, the scholarly imprint of Oxford University Press, in 1995), a must-read for anyone interested in English linguistics and in corpus tagging in general. The genesis of the SUSANNE scheme lay in work on statistics-based parsing techniques led by Geoffrey Leech and Roger Garside in the early 1980s at Lancaster University, where Geoffrey Sampson then worked. The SUSANNE scheme was one of the first attempts to lay down a set of annotation standards which resolve all doubts of the kinds just discussed, by specifying an explicit rule to decide each grey area. The scheme does not aim to be correct: that is, no claim is made that where a construction might be analysed in one way or another way, the SUSANNE analysis corresponds to how speakers process the construction psychologically, or anything of that sort. The SUSANNE philosophy is that it is more important that every linguistic form should have a predictable, explicitly defined analysis than that analyses should always be theoretically ideal. Collecting and registering data in terms of an explicit taxonomic scheme is a precondition for successful theorizing, not the other way round. However, the SUSANNE scheme has been developed and debugged through consultation between numerous researchers and through application to sizeable samples of British and American English; so, if not 'correct', the scheme can at least be described as consistent. When this work began in the early 1980s, no grammatically analysed corpora yet existed. The 45,000-word LLT (Lancaster-Leeds Treebank) which Geoffrey Sampson developed for Geoffrey Leech and Roger Garside's parsing project, though small, was apparently the first in the field.
As the virtues of corpus-based methods have been more widely appreciated in the 1990s, larger analysed corpora have come into existence, the largest of which (Mitchell Marcus's Penn Treebank) dwarfs the SUSANNE Corpus in size; but these resources have quite different aims: the SUSANNE Corpus was produced as an adjunct to the development of detailed analytic standards, and not to produce the largest possible quantity of analysed language material, using an analytic scheme which is only as subtle as is compatible with that aim.
For a general presentation of the SUSANNE project go to its proper homepage, which is http://www.grsampson.net/RSue.html (but it is advisable to refer first of all to the general one provided in the title line). There is also a full Documentation file, which is available as a single Web page. The Corpus itself can be freely downloaded by anonymous ftp in its latest version (Release 5, completed in August 2000, which is substantially revised from the previous release, which was circulated by the OTA, cf. below): what you receive will be a gzipped tar file; use "gunzip SUSANNE.tgz" to uncompress it into a SUSANNE.tar file, and "tar -xvf SUSANNE.tar" to unpack the archive into its constituent files.
+ Please note that the address given for the SUSANNE Corpus in English for the Computer is now out of date.
+ You can download SUSANNE also from a lot of other NLP resource sites on the web (the Stuttgart IMS ftp, for example). But the only safe source for the latest version is the author's site.
+ A 1483 KB version was freely available from the OTA (Oxford Text Archive) site (for non-commercial use only, after you have sent them a statement): but it's only version 4.
[updated 2004 March 25].
This release of Speech Under Simulated and Actual Stress (SUSAS) was created by the Robust Speech Processing Laboratory at Duke University under the direction of Professor John H. L. Hansen and sponsored by the Air Force Research Laboratory. The database is partitioned into four domains, encompassing a wide variety of stresses and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging from 22 to 76, were employed to generate in excess of 16,000 utterances. SUSAS also contains several longer speech files from four Apache helicopter pilots. These helicopter speech files were transcribed by the Linguistic Data Consortium and are available via ftp. A common highly confusable vocabulary set of 35 aircraft communication words makes up the database. All speech tokens were sampled using a 16-bit A/D converter at a sample rate of 8 kHz. This one-CD-ROM corpus is structured like the TIMIT database. Available only by membership to the LDC.
The Switchboard-1 Telephone Speech Corpus includes a speech and transcript component. It was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set; the original speech corpus (LDC catalogue number LDC93S7; intermediate version available) is no longer available at LDC and has been replaced with a revised version, SWITCHBOARD-1 Release 2.
SWITCHBOARD is a collection of about 2400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven "robot operator" system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic. Complete orthographic transcriptions were made for each conversation, with codes to identify overlapping portions (both speakers talking at the same time), certain non-speech events (laughter, coughs, etc) and interruptions/hesitations. Each conversation was also rated by transcribers for various quality factors (amount of cross-talk between channels, static and background noise, topicality, etc). In addition, each transcription was verified and then used in a forced speech-recognition algorithm to establish timing marks for word and utterance boundaries; transcriptions are provided in the corpus in both "plain text" and "time-aligned" forms. A description is published in the 1992 ICASSP Proceedings: Godfrey, McDaniel and Holliman, "SWITCHBOARD: A Telephone Speech Corpus for Research and Development".
SWITCHBOARD-1 Release 2 is sold (in a notebook-style binder with 23 CD-ROMs) by the LDC for $2,000, but the intermediate version is cheaper ($100).
The Switchboard Corpus of telephone conversations is annotated in the Penn Tree Bank (cf. Penn TreeBank) also for disfluency. See also a sample.
The TDT Pilot Study corpus was created to support an initiative in "topic detection and tracking". This initiative is directed toward computer processing of language data, both text and speech. The objective is to explore techniques for detecting the appearance of new and unexpected topics and for tracking their reappearance and evolution. The TDT corpus comprises a set of stories that includes both newswire (text) and broadcast news (speech). Each story is represented as a stream of text, in which the text is either taken directly from the newswire (Reuters) or is a manual transcription of the broadcast news speech (CNN). The corpus spans the period from July 1, 1994 to June 30, 1995. It contains approximately 16,000 stories, with about half taken from Reuters newswire and half from CNN broadcast news transcripts. An integral and key part of the corpus is the annotation of the corpus in terms of the events discussed in the stories. Twenty-five events were defined that span a variety of event types and that cover a subset of the events discussed in the corpus stories. Annotation data for these events are included in the corpus and provide a basis for training TDT systems.
Available through LDC membership or for a $200 fee.
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: finding topically homogeneous sections (segmentation), detecting the occurrence of new events (detection), and tracking the reoccurrence of old or new events (tracking). This CD-ROM release consists of the English text component of the TDT2 Multilanguage Text Corpus. The data were collected daily over a period of six months (January-June 1998) from the following sources. Newswire services: New York Times, Associated Press. Video sources: CNN Headline News, ABC World News Tonight. Radio sources: PRI The World, VOA English News Service.
Available through LDC membership or for a $2,000 fee.
The Topic Detection and Tracking (TDT) English and Mandarin Corpus.
See under Multilingual and Parallel Corpora section. The two subcorpora were also released separately, cf. TDT2 English Text corpus Version 2 and TDT2 Mandarin Text Corpus.
The Translational English Corpus is a computerized collection of contemporary translational English texts. It is freely available to the research community, together with a set of software tools that allow scholars to investigate the language of translated English. The corpus is continually being enlarged and the software tools refined. Software, tools and documentation are freely available on the site. [2001 April 23].
The main purpose of this interesting page by Tim Johns (homepage) is to show that it is possible to begin to use a "data-driven" approach to language learning and teaching even if you do not have access to established corpus resources. A secondary purpose is to discuss the potential of small, very specific corpora for ELT, providing also some simple recipes for cooking them. [2002 February 17].
The TIPSTER project is sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. The detection data comprise a new test collection built at NIST (National Institute of Standards and Technology) to be used both for the TIPSTER project and the related TREC project (Text Retrieval Conference). The test collection consists of 3 CD-ROMs of SGML-encoded documents distributed by the LDC, plus queries and answers (relevant documents) distributed by NIST.
The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data.
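The SGML-like layout described above is simple enough to handle with very small tools. The following Python sketch splits an invented two-document sample into documents and fields; the tag names (DOC, DOCNO, TEXT) follow common TREC practice, but the actual field inventory varies by source, so treat this as an illustrative assumption rather than a specification:

```python
import re

# Invented sample in the TREC/TIPSTER style: documents and fields
# delimited by SGML-like tags (not actual corpus data).
sample = """<DOC>
<DOCNO> WSJ860203-0001 </DOCNO>
<TEXT>
Example article body.
</TEXT>
</DOC>
<DOC>
<DOCNO> AP890101-0002 </DOCNO>
<TEXT>
Another article body.
</TEXT>
</DOC>"""

def split_docs(raw):
    """Return (docno, text) pairs from a TREC-style file."""
    docs = []
    for block in re.findall(r"<DOC>(.*?)</DOC>", raw, re.S):
        docno = re.search(r"<DOCNO>(.*?)</DOCNO>", block, re.S).group(1).strip()
        text = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S).group(1).strip()
        docs.append((docno, text))
    return docs

for docno, text in split_docs(sample):
    print(docno)
```

Since the collection is meant for retrieval testing rather than linguistic annotation, this kind of document/field splitting is usually all the preprocessing the format requires.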
Available through LDC membership or for a $500 fee. The three CDs are also available separately ($200 each): TIPSTER Volume 1; Volume 2; Volume 3.
The Thai On-Line Library of bilingual texts, maintained by the TIE Project (Thai Internet Educational), is a tool for Thai students of English, and for foreign students of Thai. TOLL includes a built-in Thai-English/English-Thai dictionary – look up words by clicking, typing, or cut-and-paste. For the benefit of foreign (and younger Thai) readers, TOLL is able to insert spaces between Thai words. TOLL serves several purposes. It is: (a) a test-bed for innovative Internet software development, (b) a workshop for research in new approaches to language education, (c) a low/no-cost delivery system for high-quality educational resources, (d) a starting point in the long research struggle to build sophisticated Thai/English translation software.
See under Multilingual and Parallel Corpora section for more.
The 993 KB English corpus of Tottel's Miscellany (1557), Scholars' Press Ltd, Leeds 1966 (TACT format). Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The 546 KB ME corpus of The Towneley plays, compiled by Michael J. Preston. Transcribed from: The Towneley plays / re-edited from the unique ms. by George England; with side-notes and introduction by Alfred W. Pollard, London: Published for the Early English Text Society by Oxford University Press, 1897 (Early English Text Society. Extra series; no. 71). Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch. There is no TRIPTIC page on the web and all the information here is taken from Michael Barlow's Parallel Corpora Page. For further information see under Multilingual and Parallel Corpora section. Contact: Hans Paulussen.
It contains texts in English, French and Spanish from the Office of Conference Services at the UN in New York between 1988 and 1993.
See under Multilingual and Parallel Corpora section.
The 122 KB English corpus of Undergraduate verse, compiled by George Roberts, Jo Lloyd and Mick Herron. Transcribed from Oxford University Poetry Society (undergraduate verse 1987-8). Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
Texts from the Universal Copyright Convention, 8000 words in English, 1971. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
The Universal Library, hosted by the Computer Science Department of Carnegie Mellon University, provides a good page of resources on e-texts in the Web. Mainly (but not exclusively) English ones.
Texts from US Army Foreign Military Studies Office (FMSO), Fort Leavenworth, KS, USA. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
HTML texts from US Army Center of Military History, Washington, DC, USA. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
Texts from the US Government, Washington DC, USA. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
A 3.4 MB English corpus made by Speeches from the USA Presidential campaign 1992 (includes speeches and biographies of the contenders). Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The USC Marketplace Broadcast News Corpus contains approximately 40 hours of audio data, which was recorded daily between May 1, 1996 and September 18, 1996. Corresponding transcript files were created by Federal Document Clearing House and enhanced by the LDC to include: story boundaries, disfluency markers, and speaker and gender identification. In keeping with Hub-4 style transcription conventions, LDC spelled all digit strings in standard orthography. Commercial and music segments, while a part of the audio publication, were excluded from the transcripts. The timestamps mark the beginning of each speaker turn relative to the beginning of the recording and are precise to the 100th of a second. Although the transcripts were created using Hub-4 conventions, the second and third pass quality checks, typically required by government sponsored evaluation projects, were skipped.
Available only from the LDC through membership or for an $800 fee.
A rich archive of Literary English Texts, but, unfortunately, restricted to Toronto University students and staff. Always the same old story.
The VISL system, from the Department of Language and Communication, University of Southern Denmark - Odense, provides online queries to plain-text corpora in Danish, German, English and Spanish, and to a tagged Portuguese corpus. The service is for members only (see at this page).
This corpus, developed at Birmingham by Wang Lixun (homepage), is still in progress. The present aim is to create a 10 million words parallel corpus (half English, half Chinese), for research and language-teaching purposes. For more details, cf. under the Parallel Corpora section. [2001 April 23].
On this page by R. J. C. Watt there are online concordances (KWIC format, made with the Concordance tool), with the e-texts alongside, of a few English poems (Shelley, Coleridge, Keats, Blake, Hopkins). They seem quite nice.
The Web Concordancer site, by the Virtual Language Centre of the Polytechnic University of Hong Kong, presents a few indexed corpora (English, French, Chinese, Japanese) that can be freely browsed with the ConcApp program. Corpora available include the Brown Corpus, Sherlock Holmes stories, South China Morning Post, etc. [2002 February 17].
The texts of the main works of the famous French philosopher Gilles Deleuze, freely available directly on his site. Besides the French originals, English and Spanish translations are available as well, so you can construct at least a three-language parallel text (if not a true parallel corpus).
One million words of spoken New Zealand English collected in the years 1988 to 1994. Ninety-nine percent of the data (545 out of 551 extracts) was collected in the years 1990 to 1994. Of the remaining files, four were collected in 1988 (4 oral history interviews) and four in 1989 (4 social dialect interviews). The WSC was formerly known as A Computerised Corpus of English in New Zealand (ACCENZ). The corpus consists of 2,000-word extracts (where possible) and comprises different proportions of formal, semi-formal and informal speech. Both monologue and dialogue categories are included and there is broadcasting as well as private material collected in a range of settings.
The WSC (together with the WWC) can be bought directly from the University of Wellington for NZ$100 (NZ$200 for institutions).
A version on CD-ROM is available from ICAME; a small sample and a manual are available.
The basic aim of the Wellington Corpus of Written New Zealand English is to provide a computerised (not tagged) sample of written New Zealand English which will allow direct comparisons with the Brown University Corpus of American English, the Lancaster-Oslo/Bergen Corpus of British English and, especially, with the Macquarie Corpus of Australian English. Since the Australian Corpus was not available while the Wellington Corpus was being developed, the New Zealand Corpus was based largely on the LOB Corpus, both in terms of content and also in terms of coding practice, though with one extremely significant difference. Both the Brown and the LOB corpora collected material published in 1961. By the time planning for the Wellington corpus began, it was known that there was an Australian project underway which would use 1986 as its baseline. Since it was realised from the outset that comparisons with Australian data would be of vital importance if any distinct New Zealand variety of written English was to be established, the year 1986 was also taken as the baseline for the Wellington Corpus. However, not enough suitable material was published in New Zealand in 1986, and in practice the Wellington Corpus, while most of the material it uses was published in 1986 or 1987, covers the years 1986-1990. Only minor differences in the coding exist between the LOB and the Wellington Corpora.
The WWC (together with the WSC) can be bought directly from the University of Wellington for NZ$100 (NZ$200 for institutions).
A version on CD-ROM is available from ICAME; a small sample and a manual are available.
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. There are free web demos of some of their tools also for English. For more details cf. Tools section.
XTAG is an on-going project at Penn (i.e. the University of Pennsylvania; cf. the Penn Tools file) to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism. Both the XTAG English Grammar, released on 2.24.2001, and the XTAG Tools are freely downloadable. There are also a lot of User Manuals and selected papers dealing with the various components of XTAG. For more details cf. under the Reference section. [2001 April 27].
London Newspapers from the mid 1660s to the beginning of the twentieth century. In development. Contact: Dr. Udo Fries, Englisches Seminar, University of Zurich, Platten str. 47, CH-8032 Zurich, Switzerland.
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English (henceforth the Brooklyn Corpus) is a selection of texts from the Old English section of the Helsinki Corpus of English Texts (henceforth the Helsinki Corpus), annotated to facilitate searches on lexical items and syntactic structure. It aims to be for Old English what the PPCME is for Middle English and the Penn Treebank is for Modern English. It is intended for the use of students and scholars of the history of the English language. The Brooklyn Corpus contains 106,210 words of Old English text; the samples from the longer texts are 5,000 to 10,000 words in length. The texts (see list) represent a range of dates of composition, authors, and genres. The texts are syntactically and morphologically annotated, and each word is glossed. The size of the corpus is approximately 12 megabytes. The syntactic annotation scheme was based on the one developed at the University of Pennsylvania for the first edition of the Penn-Helsinki Parsed Corpus of Middle English. The intent was to make the syntactic annotation of the two corpora as similar as possible, while taking into account the syntactic and morphological differences between Old and Middle English. The Brooklyn Corpus is available without fee for educational and research purposes, but it is not in the public domain. Downloading the manual is unrestricted, but the texts themselves and the PERL search scripts are available only to users who agree formally to the conditions of use by filling out the access request form and returning it via e-mail to Susan Pintzuk. The manual for the Brooklyn Corpus is available in PS, RTF and TXT formats. [Last checked 2001 August 5].
The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. The timespan covered is from 1417 to 1681, and the size of the whole not tagged corpus is 2.7 million words. For more details cf. under the Modern English section.
See under the Modern English section.
The CME is an online-queryable collection of Middle English texts (see the Bibliography) assembled from works contributed by University of Michigan faculty and from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus by the HTI (Humanities Text Initiative). All texts in the archive are valid SGML documents, tagged in conformance with the TEI Guidelines, and converted to the TEI Lite DTD for wider use. The full 61-text collection (or any selection of it you define) can be freely searched online, and every text can be freely and integrally accessed in HTML format. [2001 August 27].
The Old English machine-readable corpus is a complete record of surviving Old English except for some variant manuscripts of individual texts. This edition of the Corpus has been constructed from a version supplied by the Dictionary of Old English project in 1998. It represents a more correct version of the Corpus and supersedes all previous versions. The Corpus comprises over 3000 different texts from ca. 450-1100 A.D. The texts have been divided into meaningful units, usually editorial sentences. Non-Old English words and roman numerals are marked using the FOREIGN tag; no attempt is made to distinguish Greek and Latin. Words identified by the DOE editors as unclear in MS or emended in the edited sources are marked using the CORR tag. CORR and FOREIGN elements may be nested within each other, but do not overlap. Sadly the corpus is not available, since access is restricted to Harvard people. [2001 May 1].
The Helsinki Historical Corpus is a computerized collection of extracts of continuous text. The Corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a dialect part based on transcripts of interviews with speakers of British rural dialects from the 1970's. The aim of the Corpus is to promote and facilitate the diachronic and dialectal study of English as well as to offer computerized material to those interested in the development and varieties of language. It is in plain orthographic text, tokenized but not tagged. The Helsinki corpus contains samples from texts covering the Old, Middle, and Early Modern English periods (850-1710), for about 1,500,000 words in total.
A version on CD-ROM is available from ICAME; a small sample and a manual are available. As general contact, mail to ICAME.
The Helsinki Corpus of Older Scots has been compiled at the University of Helsinki as a supplement to the diachronic part of the Helsinki Corpus of English Texts: Diachronic and Dialectal. It consists of 830,000 untagged words from texts dated 1450-1700. See the full file under the Scots section.
The Innsbruck Computer-Archive of Middle English Texts is made up of three distinct subcorpora.
+ The Prose Corpus of ICAMET, consisting of about 2 million untagged words, is a compilation of 129 texts (March 1999) of Middle English prose (1100-1500), digitized from extant editions and constantly enlarged by further files. The corpus can, of course, well be used for (comparative) linguistic analysis. But since it is a full-text database, it particularly aims at target groups of users who, unlike those of the Helsinki Corpus, are not so much interested in extracts of texts, but in their complete versions. The corpus thus allows literary, historical and topical analyses of various kinds, particularly studies of cultural history. As to language analysis, it invites linguists to raise questions of style, rhetoric or narrative technique, for which one would want a lengthier piece of text or even the complete text. Contact: e-mail. A manual is available with instructions on how to order it (DM 29). A version on CD-ROM is available from ICAME (alternative page). There is a small sample available.
+ The Letter Corpus of ICAMET contains 254 complete letters, arranged diachronically, from different sources. The letters were written between 1386 and 1688. The corpus particularly encourages, apart from language analysis, pragmatic and sociolinguistic studies, but also analyses concerning cultural life and lifestyle.
+ The ICAMET Varia Corpus is a mixture of tagged, normalized, translated and otherwise manipulated or synoptized texts, the interventions being motivated by the wish to make Middle English or Modern English texts fully accessible and more easily comparable with each other. The Varia Corpus is a field of constant experimenting, so regular updates are to be expected.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].
This page of the Forgotten Ground Regained site provides a good collection of medieval Germanic (mainly Middle English) poetry texts, ranging from Old English to Middle English and Scots, with some hints of Old Norse and Old (High and Low) German. All texts are freely downloadable (but often split across several files). [Rev. 2001 December 2].
A 664 KB corpus of Middle English alliterative verse. Samples are from 14 mediaeval poems for approx 500 lines each; normalised by Hoyt N. Duggan. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
A 99.6 KB corpus of Middle English texts taken from the Anthology of Middle English texts by Dr Santiago Gonzalez y Fernandez-Corugedo, Various 1990. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
This excellent site (a part of ORB, the On-Line Reference Book for Medieval Studies), is maintained by Catherine Ball (see her homepage). It displays a lot of useful reference resources for Old English scholars and students. Besides pages on language, art, coins, fonts, history, sound files, materials like CD-ROMs, instructional software and courses, and the Englisc List (a forum for modern composition in Old English; there are also amusing compositions like the "Ode to a Semiconductor" in Early West Saxon ...), the following page is particularly worth noting:
+ Texts & MSS. A rich index to electronic editions of Old English texts, translations, and images of Anglo-Saxon manuscripts available on the Web. [2001 July 13].
The Penn-Helsinki Parsed Corpus of Middle English (PPCME) is a corpus of prose text samples of Middle English, annotated for syntactic structure to allow searching, not only for words and word sequences but also for syntactic structure, created at the University of Pennsylvania (Penn). It is designed for the use of students and scholars of the history of English, especially the historical syntax of the language. The samples are drawn largely from the Middle English section of the diachronic part of the Helsinki Corpus of English Texts: Diachronic and Dialectal, with certain additions and deletions. However, the size of the samples is considerably larger. [2001 April 28].
+ PPCME 2 is the current and completely revised edition of the corpus. This new edition includes a total of 1.3 million words of running text and contains larger samples of the corpus texts. For the earliest Helsinki time period, all texts except one are exhaustively sampled; the one exception is the Ancrene Riwle sample, which contains approximately 50,000 words. For later Helsinki time periods, two texts per period were expanded to 50,000 words. The remaining texts are represented by the Helsinki Corpus sample. The corpus comprises 55 text samples, each of which is given in three forms: a text file, a part-of-speech tagged file and a parsed file. In addition, there is a file with philological and bibliographical information about each text. There is also a manual describing in detail the annotation scheme of the corpus. The annotation scheme for the corpus follows the basic formatting conventions of the modern English Penn Treebank. The corpus-specific annotation scheme of the PPCME2 was devised by Anthony Kroch and Ann Taylor. Taylor supervised and carried out the annotation work.
+ CorpusSearch is a Java-based, linguistically intuitive query tool available for use with the PPCME2. The program was written by Beth Randall as a replacement for the somewhat inconvenient regular-expression-based search tools available up to now. CorpusSearch can also be used with other syntactically annotated corpora in the Penn Treebank style, including specifically the Brooklyn Corpus of Old English. There is an extensive, freely downloadable user manual for CorpusSearch.
+ The PPCME2 and the CorpusSearch program are available on CD-ROM, either separately or together (order form). There is a charge of $200 for a five-user subscription to the corpus and $50 for a five-user site license for the search program. Contact Anthony Kroch if more extensive licenses are desired.
+ PPCME1 was smaller and used a simpler annotation scheme than its improved and extended version, the PPCME2. The PPCME1 contains 510,000 words of text and is still available at no cost to scholars and students. The PPCME1's syntactic annotation is similar to that of the PPCME2, marking clausal embeddings and clause-internal constituent structure, but in less detail. The major differences from the PPCME2 are that: words are not tagged for part of speech; the internal structure of noun phrases is not indicated; the annotation of several complex sentence and phrase types is less detailed. Documentation for the PPCME1, search tools, and the corpus itself are available at the corpus site. Downloading the documentation and tools is unrestricted, but the texts themselves are accessible only to users who agree formally to the conditions of use (free with license). The PPCME1 was conceived by Anthony Kroch of the University of Pennsylvania. The corpus annotation scheme was devised by Kroch and Ann Taylor. Taylor supervised and carried out the annotation work.
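The Penn Treebank-style annotation used by these corpora is a labeled bracketing, which a small recursive reader can process. The following Python sketch parses an invented, simplified tree and recovers its terminal words; the example and its labels are illustrative only (the real PPCME2 label set is documented in the corpus manual):

```python
# Minimal reader for Penn Treebank-style labeled bracketing.
# The example tree is invented, not an actual corpus analysis.

def tokens(s):
    """Split a bracketed parse into '(', ')' and atom tokens."""
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(toks):
    """Build a nested-list tree (label first, children after) from the tokens."""
    tok = toks.pop(0)
    if tok == "(":
        node = []
        while toks[0] != ")":
            node.append(parse(toks))
        toks.pop(0)  # drop the closing ')'
        return node
    return tok

def leaves(tree):
    """Yield the terminal words of a tree, skipping node labels."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree[1:]:  # tree[0] is the label
            yield from leaves(child)

tree = parse(tokens("(IP (NP (PRO he)) (VBD spak) (ADVP (ADV ofte)))"))
print(" ".join(leaves(tree)))  # → he spak ofte
```

In practice one would use a dedicated query tool such as CorpusSearch rather than a hand-rolled reader, but the sketch shows why the bracketed format makes structural searching tractable at all.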
The 5 MB Complete Corpus of Old English (i.e. the Toronto Dictionary of Old English corpus). Compiled by the University of Toronto Centre for Medieval Studies. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
The 606 KB corpus of the York miracle play cycle, compiled by Michael J. Preston. Transcribed from: The York plays, edited by Richard Beadle, London: Edward Arnold, 1982 (York medieval texts. Second series). From ms. Add. 35290 in the British Library. Freely available from OTA (Oxford Text Archive) site for non-commercial use only after you have sent them a statement.
It is an archive of original and translated literary texts in Esperanto. Huge, but the texts are stored in HTML format, freely readable online, and there are no more download-friendly versions.
Mirror sites at Antwerp (Belgium) and Chokyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Estonian CHILDES corpus is downloadable as a zipped file.
See under Multilingual and Parallel Corpora section for a fuller file.
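CHILDES transcripts use the CHAT format, in which header lines begin with "@", speaker tiers with "*" (e.g. *CHI:) and dependent coding tiers with "%" (e.g. %mor:). The following Python sketch extracts one speaker's utterances from a small invented transcript; the utterances are made up for illustration and are not from any actual corpus:

```python
# Invented CHAT-format fragment (CHILDES transcription conventions).
sample = """@Begin
@Participants: CHI Target_Child, MOT Mother
*CHI: more cookie .
%mor: qn|more n|cookie .
*MOT: you want more cookie ?
*CHI: yes .
@End"""

def utterances(chat, speaker):
    """Collect the utterances of one speaker, ignoring @ headers and % tiers."""
    out = []
    for line in chat.splitlines():
        if line.startswith("*" + speaker + ":"):
            out.append(line.split(":", 1)[1].strip())
    return out

print(utterances(sample, "CHI"))  # → ['more cookie .', 'yes .']
```

The CLAN programs distributed with CHILDES do this and much more; the sketch only shows how regular the tier structure is.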
This page, from the Department of Asian and Pacific Linguistics - Institute of Cross-Cultural Studies - Tokyo University, maintained by Kazuto Matsumura, provides a few free Estonian e-texts: the 1918 Declaration of Independence, five legal texts and three newspaper articles. All HTML.
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
The corpus is queryable online. The page is in Estonian only.
1 million word corpus of Estonian, TEI tagged, from Dept of Computer Science and Dept of General Linguistics, University of Tartu, Tartu, Estonia.
Available under subscription to TRACTOR.
This 1-million-word corpus comprises an untagged text (one sentence per line) and a collection of TEI-tagged texts. All are downloadable as Zip files.