(1) - Corpora and Corpus Linguistics.
(2) - Multilingual and Parallel Corpora.
(3) - Electronic Literary Text Archives.
(4) - References, Standards & Educational Resources.
(5) - Tools.
This section collects: (1) sites and institutions involved in various aspects of corpus linguistics (for sites that are chiefly corpus builders and providers, go to section 2.1 "Corpora and Corpus Linguistics"; for sites that only develop tools, go to section 2.5 "Tools"); (2) sites dealing with standards, encoding schemes, and scripting languages relevant for corpus linguistics; (3) link pages and bibliographies focusing on corpora; (4) tutorials, courses, and educational resources for CL and NLP in general; (5) journals and mailing lists; (6) relevant homepages of CL scholars; (7) other unclassified but useful resources. Please note that, as a rule, I have omitted references to conference homepages.
The Association for Computers and the Humanities is an international professional organization. Since its establishment, it has been the major professional society for people working in computer-aided research in literature and language studies, history, philosophy, and other humanities disciplines, and especially research involving the manipulation and analysis of textual materials. The official journal of the ACH is "Computers and the Humanities". Subscription: individual regular membership costs US $65 and includes 6 issues of the CH journal. [2001 April 26].
The official reference site of the Association for Computational Linguistics. It offers membership information, the page of the ACL journal "Computational Linguistics", conference calls and minor announcements. The ACL also maintains the valuable NLSR tool pages and the NLP/CL Universe search engine. [Last rev. 2001 April 26].
Founded in 1947, ACM is the world's first educational and scientific computing society. Today it has over 80,000 members. Books for sale, a Digital Library, journals and magazines, conferences, proceedings and other information are among what you can find on their site. Membership costs $95 (with reduced rates for students) and gives you access to the ACM Digital Library. [2001 April 23].
Adam Berger is a PhD student in the Computer Science Department at Carnegie Mellon University, working with John Lafferty. There is a useful page on language modelling within the maxent/minimum divergence framework. This page contains information, mostly of a tutorial nature, on the use of discrete exponential models in natural language processing. There is also some free software to download: a Trigger Toolkit and Align, a bilingual sentence-alignment system.
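As a rough illustration of what a discrete exponential (maximum entropy) model computes, the sketch below evaluates p(y | x) proportional to exp(sum_i w_i * f_i(x, y)). The feature functions, weights, and labels are invented for this example and are not taken from Berger's materials.

```python
import math

def maxent_probs(x, labels, features, weights):
    """Return p(y | x) for every label under a conditional exponential model."""
    # Unnormalized score for each label: exp of the weighted feature sum.
    scores = {
        y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
        for y in labels
    }
    z = sum(scores.values())  # normalizing constant Z(x)
    return {y: s / z for y, s in scores.items()}

# Hypothetical binary feature functions f_i(x, y) and hand-picked weights.
features = [
    lambda x, y: 1.0 if x.endswith("ing") and y == "VERB" else 0.0,
    lambda x, y: 1.0 if x[0].isupper() and y == "NOUN" else 0.0,
]
weights = [1.5, 0.8]

probs = maxent_probs("running", ["NOUN", "VERB"], features, weights)
```

In a real system the weights would be estimated from training data (for instance by iterative scaling or gradient methods) rather than set by hand.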
The Association for Literary and Linguistic Computing (ALLC) was founded in 1973 with the purpose of supporting the application of computing in the study of language and literature. As the range of available and relevant computing techniques in the humanities has increased, the interests of the Association's members have necessarily broadened, to encompass not only text analysis and language corpora, but also image processing and electronic editions. The ALLC's membership is international, is drawn from across the humanities disciplines, and includes students and established scholars alike. Membership of the Association is by subscription to its journal, the LLC, and costs £46/US$77 (individuals, 4 issues per year). [2001 April 26].
The AMALGAM project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora. Software has been developed to tag text with up to 8 annotation schemes. This software is available by email and, shortly, using a web-browser. They are developing a Multi-tagged Corpus and Multi-Treebank, i.e. a single text-set annotated with all the above tagging and parsing schemes. Some useful demos are already online:
+ AMALGAM Multi-tagged Corpus (180 Eng. sentences).
+ AMALGAM Multi-Treebank (60 Eng. sentences).
The Amalgam Project provides various resources (besides the Amalgam MultiTagger and Corpora), in particular a useful web guide to different tagsets in common use (such as Brown, ICE, UPenn, LLC, LOB, POW, SEC etc.).
Armazi is the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) page on "Fundamentals of an Electronic Documentation of Caucasian Languages and Cultures". It deals mainly with developing encoding standards (cf. the Encoding standards for the languages of the Caucasus project) and computer-based media (cf. the Computer models for Caucasian languages project) for Georgian and other Caucasian languages. It also hosts important Georgian projects, such as the Digitization of Old Georgian texts from the Gelati school Project from the Gelati Academy of Sciences, and the Digitization of the Albanian palimpsest manuscripts from Mt. Sinai project. There are also some links to e-text resources from the TITUS server, both for Georgian and for Laz, Svan and Mingrelian. Note that these pages are encoded in Unicode (UTF-8): the special characters they contain can only be displayed and printed after installing a font that covers Unicode, such as the freely downloadable TITUS font TITUS Cyberbit Basic. [2001 May 18; Rev. 2001 August 30].
Started in Aug 1991, arXiv.org (formerly xxx.lanl.gov) is a fully automated electronic archive and distribution server for research papers. Covered areas include physics and related disciplines, mathematics, nonlinear sciences, computational linguistics, and neuroscience. Users can retrieve papers from the archive either through an on-line world wide web interface, or by sending commands to the system via e-mail. Similarly, authors can submit their papers to the archive either using the on-line world wide web interface, using ftp, or using e-mail. Authors can update their submissions if they choose, though previous versions remain available. Texts are usually in TeX / LaTeX format and can be freely retrieved. [2001 April 23].
In this interesting paper, Rosie Jones and Rayid Ghani propose a way of building corpora for lesser-studied languages by extracting data from the web. Namely, they present an approach to language-specific query-based sampling which, given a single document in a target language, can find many more examples of documents in that language, by automatically constructing queries to access such documents on the world wide web. They propose a number of methods for building search queries to quickly obtain documents in the target language. The paper is freely downloadable in PS format; PDF is also available. [2001 May 1].
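The general idea of constructing a search query from a seed document can be sketched as follows. This is only one simple, assumed strategy (pick the most frequent terms of the seed text); it is not claimed to be any of the specific methods proposed in the paper, and the function name and parameters are hypothetical.

```python
import re
from collections import Counter

def build_query(seed_text, n_terms=3):
    """Build a query string from the n most frequent terms of a seed document."""
    tokens = re.findall(r"\w+", seed_text.lower())
    counts = Counter(tokens)
    # most_common() orders terms by descending frequency.
    return " ".join(term for term, _ in counts.most_common(n_terms))

# Toy seed text; a real seed would be a full document in the target language.
query = build_query("kitap kitap kitap ev ev okul", n_terms=2)
```

In a query-based sampling loop, the documents returned for such a query would be added to the collection and used to build further queries.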
An interesting paper by Jim Cowie, Evgeny Ludovik, and Ron Zacharski dealing with a "text collector" web spider. A person using the spider specifies a target language / code-set pair and one or more starting URLs. The spider collects web pages that match this specification. This tool has been successfully used to create a moderate-sized corpus (50MB) of Turkish text, as well as smaller corpora of Arabic and Russian text. This paper provides a general description of the design of the spider. In addition, the paper presents a detailed description of the algorithm used for language identification and compares the algorithm to those suggested by other researchers. [2001 May 1].
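For readers unfamiliar with automatic language identification, here is a minimal sketch of one common family of techniques: comparing character-trigram profiles by cosine similarity. It is not the algorithm described in the paper, and the tiny training samples are invented; real profiles would be built from much larger texts.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count character trigrams, padding so word boundaries are captured."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
        math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))

# Toy training samples (assumptions for illustration only).
profiles = {
    "en": trigram_profile("the cat sat on the mat and the dog ran"),
    "tr": trigram_profile("bir kedi ve bir köpek evde oturuyor"),
}
```

A spider could apply such a check to each fetched page and keep only pages identified as the target language.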
+ How to get things done with awk? Sakari Mattila's page is a short but effective introduction to AWK. The best choice for a first reading.
+ H. Churchyard's Awk Link Page is one of the best sources of information on (G)AWK.
+ The GAWK Manual (1993) by Diane Barlow Close, Arnold D. Robbins, Paul H. Rubin & Richard Stallman. This is edition 0.15 (Free Software Foundation) of the classic manual, intended both as tutorial and reference, for the 2.15 version of the GNU implementation. Freely available.
+ AWK Language Programming (1996) by Arnold D. Robbins. This edition (1.0, January 1996) of the User's Guide for GNU AWK is newer, but still based on the GAWK Manual above.
+ A GAWK distribution, legally free, can be ordered at delivery cost ($ 25) from the GNU Organization.
+ All versions from 2.15.1 to 3.1 (the latest) can be freely downloaded from the GNU Organization FTP site.
+ Check also Cameron Laird's & Kathryn Soraiz's Choosing a scripting language paper, and the Do-It-Yourself site, with language and text tools in Perl and Tcl/Tk.
Beom-mo Kang is professor of Linguistics at Korea University in Seoul; his research also deals with corpus and computational linguistics and with the "computers in the Humanities" field. This page is a rich repository of links to Computational Corpus Linguistics resources of general interest and, most notably, of specific Korean content. A very useful page, but sadly so far only available in Korean; an English version of Beom-mo Kang’s personal page does exist, however. Contact. [2001 April 23].
A good corpus linguistics bibliography by John Caruana (homepage) from Malta. [2001 August 4].
An interesting paper from SunWorld - October 1997. This paper introduces the basic concepts of scripting and explains how the "big three" languages (Perl, Tcl, Python) compare. Cameron Laird's personal page on choosing a scripting language expands and updates this topic. [2002 February 19].
CataList is the official catalogue of LISTSERV lists. From this page, you can browse any of the 49,554 public LISTSERV lists on the Internet, search for mailing lists of interest, and get information about LISTSERV host sites. This information is generated automatically from LISTSERV's Lists database and is always up to date. [2001 October 8].
Catherine N. Ball (e-mail) teaches at Georgetown University. Her rich personal page offers information on and links to her many interesting activities and research interests, ranging from a good Tutorial on corpus linguistics, to Dead Language Acquisition (noteworthy!), Perl programming (there are also some downloadable tools), Old English (cf. the excellent Old English Pages, and the Germanic Pater Noster collection) and linguistic representation of the Sound of the World's Animals. [2001 July 13].
A general three-hour tutorial by Catherine Ball (see her homepage), derived from her Corpus Linguistics course held at Georgetown University in Spring 1997. A good general introduction. [Checked 2001 July 13].
The Corpus Encoding Standard Document CES 1 (Version 1.5, last modified 20 March 2000) is the first version of the Corpus Encoding Standard (CES), which is part of the EAGLES Guidelines. The CES is designed to be optimally suited for use in language engineering research and applications, in order to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing applications. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language) compliant with the specifications of the TEI Guidelines. The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora. The CES is being developed at the New York University - Department of Computer Sciences - Vassar College in a bottom-up fashion, starting with minimal specifications and expanding based upon feedback resulting from its use, and the input of the research community in general. The CES 1 Document is fully online in HTML format. [2001 April 29].
Mostly personal and institutional information. His research deals with the creation and exploitation of models of language which combine the insights of modern linguistic theories with the flexibility and practicality of probabilistic approaches. Contact.
A freely browsable HTML edition of the book by Chris Brew and Marc Moens. The book has three main aims: familiarity with tools and techniques for handling text corpora, knowledge of the characteristics of some of the available corpora, and a secure grasp of the fundamentals of statistical natural language processing. Specific objectives include: 1. Grounding in the use of UNIX corpus tools. 2. Understanding of probability and information theory as they have been applied to computational linguistics. 3. Knowledge of fundamental techniques of probabilistic language modelling. 4. Experience of implementation techniques for corpus tools. Perl and AWK scripting languages are well covered. [Last check 2001 July 11].
Published in June 1999 by MIT Press, this is the most recent thorough introduction to statistical approaches to natural language processing. This page is only the companion website for the book, with some information about it and sample chapters from it. At this page you will find instead some promotional material, with links to online bookstores where you can buy the book.
Chris Manning works on systems and formalisms that can intelligently process and produce human languages. Particular research interests include probabilistic models of language and statistical natural language processing, text understanding and mining, constraint-based theories of grammar (HPSG and LFG), computational lexicography (involving work in XML, XSL, and information visualization), information extraction, and syntactic typology. His pages, besides personal and institutional information, also provide many of his papers.
Christopher Manning's Fall 1994 CMU course syllabus as a downloadable postscript file.
Here are other, less polished announcements to complement Chris Manning's main page. It is an untidy page, but full of interesting stuff.
A rich annotated list of resources in corpus-based computational linguistics. It covers a lot of topics, and is a very useful reference site (it also proved invaluable for collecting my pages!), although some parts need updating. Contact: Christopher Manning.
Christopher Manning's Spring 1996 Carnegie Mellon University course materials on Statistical Natural Language Processing.
The Centre de Llenguatge i Computació (Universitat de Barcelona), formerly LaReLC (Laboratori de Recerca en Lingüística Computacional), works mainly on Hispanic NLP and Lexical Acquisition (AQUILEX project). In collaboration with DLSI-UPC it has contributed to the development of NLP tools and to the maintenance of the DLSI-UPC/CLiC-UB Tools demo, which can be queried online. The old LaReLC-UB site is still working, but it is better to refer to the new CLiC one. [2001 April 30; rev. 2001 October 28].
The Computation and Language E-Print Archive (Cmp-Lg) was a fully automated electronic archive and distribution server for papers on computational linguistics, natural-language processing, speech processing, and related fields. Founded in April 1994 by Stuart Shieber (homepage), the Cmp-Lg service has since been absorbed into, and superseded by, the CoRR (Computing Research Repository).
Colibri is a Newsletter (sent out from the Utrecht Institute of Linguistics every Wednesday afternoon MET) and WWW-site on language and speech technology and logic, sponsored by FoLLI (European Association for Logic, Language and Information) and OzsL (Dutch Research School in Logic). Colibri is specifically aimed at people interested in the fields of natural language processing, speech processing and/or logic. Colibri contains messages of general interest and ones of regional interest. As an example, there is a Dutch "sub-Colibri" covering the Netherlands and Flanders. Subscribers can choose which subsections they wish to receive. Subscriptions (cf. details on this page) are free and can cover any combination of a thematic area and a region. By default, the Colibri newsletter will only contain short messages (at most 40 lines). For longer messages only announcements will be made.
The ACL (Association for Computational Linguistics) journal, published by the MIT Press, is one of the primary forums for research on computational linguistics and natural language processing.
It is the official journal of The Association for Computers and the Humanities (ACH). It has been published since 1996 by Kluwer. Subscription to 6 issues per year costs EUR 376.50 / USD 377.00 (institution) or EUR 162.50 / USD 163.00 (individuals), but note that the journal is included in ACH membership, which is considerably cheaper (individual regular membership costs US $65). [2001 April 26].
Developed mainly for the English language by Finnish scholars, constraint grammars started in the late 1980s, and the first robust version was built in the Esprit II project (1989-1992). Later, the EngCG syntax was essentially rewritten by Timo Järvinen in the Bank of English project (1993-1995), where 200 million words were analysed using the EngCG. The parsing software (Pasi Tapanainen) and morphological disambiguation grammar (Atro Voutilainen), cf. the EngCG-2 Tagger and the EngCG Parser, have been developed further to make the EngCG more applicable for further analysis. Applications are beginning to appear for other languages as well, cf. the SweCG POS Disambiguator. In fact, constraint grammars have proved especially useful for tagger, parser and disambiguation software. This page also offers links to other CG resources.
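To give a feel for how constraint-based disambiguation works, here is a toy sketch: each word starts with all its possible readings, and rules discard readings that are impossible in context, never removing the last one. The rule, tag names, and data layout are invented for illustration and do not reproduce EngCG rule syntax or its actual grammar.

```python
def apply_remove_rule(sentence, target, prev_tag):
    """Remove reading `target` from a word when the preceding word is
    unambiguously `prev_tag`. A word's last remaining reading is kept.

    `sentence` is a list of (word, set_of_readings) pairs.
    """
    result = [(word, set(tags)) for word, tags in sentence]
    for i in range(1, len(result)):
        _, prev_tags = result[i - 1]
        _, tags = result[i]
        if prev_tags == {prev_tag} and target in tags and len(tags) > 1:
            tags.discard(target)
    return result

# Hypothetical ambiguous input: "run" may be a noun or a verb.
sentence = [("the", {"DET"}), ("run", {"NOUN", "VERB"})]

# Toy constraint: a verb reading is discarded directly after a determiner.
disambiguated = apply_remove_rule(sentence, target="VERB", prev_tag="DET")
```

A full constraint grammar applies many such rules, with richer context conditions, until no further readings can be removed.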
The Content Analysis Resources Quantitative Analysis of Texts, Transcripts and Images page provides some useful links, especially for software.
The CORPORA list is the main mailing list for Corpus Linguistics people, and it is open for information and questions about text corpora such as availability, aspects of compiling and using corpora, software, tagging, parsing, bibliography, etc. The list is unmoderated, but (they say) it may be moderated in the future. Messages are sent automatically to all the members on the list (more than 1200). At the moment, only members can send messages to the list. To subscribe, send a message to MAJORDOMO@UIB.NO with the following line in the body of the letter: subscribe corpora. You can also freely consult the Corpora List Archive in Hypermail. [2001 April 23].
A link page to various corpus linguistics resources on the Web maintained in Italian by Federico Zanettin (homepage). [2001 April 23].
The online Computing Research Repository (CoRR) was established in September 1998 in order to provide a single repository to which researchers from the whole field of computing could submit reports and have them published on the web in 24 hours. The CoRR, through a partnership of the ACM (Association for Computing Machinery), arXiv.org e-Print Archive, and NCSTRL (Networked Computer Science Technical Reference Library), is freely available to all members of the community at no charge. Several formats are accepted, from TeX to PDF. The CoRR has superseded and absorbed the Cmp-Lg E-Print Archive. [2001 April 23].
The CRIBeCu (Centro di Ricerca per i BEni CUlturali) provides some tools for Computational Textual Analysis (e.g. the commercial TReSy engine for XML/SGML and SAM, a free tool for text indexing), and some SGML-encoded Italian literary texts that can be queried online (cf. CRIBeCu Italian Texts Online).
A good list of links on corpus linguistics maintained by Cristiana De Santis (e-mail) from CILTA (Bologna University Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann'). Particularly worth noting are the sections on e-text sources and concordancing tools. [2001 July 7].
The Computing Research Laboratory (CRL) at New Mexico State University is a non-profit, self-supporting research enterprise committed to basic research and software development in advanced computing applications. CRL's basic research efforts are concentrated on practically all extant approaches to multilingual processing of natural language texts.
+ CRL Software: they have some good software (Xconcord, Cíbola/Oleada etc.) which they offer freely once you have signed the CRL Software License Agreement, thereby obtaining a username and password to log in (it's easy: they don't ask you for money or embarrassing questions!).
The Guide to Digital Resources for the Humanities (home) edited by Sarah Porter, Michael Fraser, Sophie Clarke provides a very rich, although raw, list of resources. This is the fourth edition of the CTI Textual Studies Guide to Digital Resources; only the Table of Contents is currently available online. At this moment the only way to access the full Guide is to buy the printed edition, but over the coming months the online version will expand (at least so they promise) to make the full content available through the web. The printed version can be ordered by following this link. The Guide aims to give an overview of digital resources which may have application for Higher Education teaching and research in the disciplines supported by the Centre (Literary Studies in all languages and periods, Literary Linguistics, Philosophy, Theology & Religious Studies, Classics, Film and Media Studies, and Drama). The Guide is currently being updated. Revised sections will be made available as soon as they are completed.
This is the home of the Comprehensive Unification Formalism, a unification-based grammar formalism, developed in the ESPRIT project DYANA and extended within projects DYANA2 and B5 in the Sonderforschungsbereich 340 "Sprachtheoretische Grundlagen für die Computerlinguistik" at the IMS Stuttgart. There is a good description by Jochen Dörre and Michael Dorna of the CUF formalism available as PS file; besides this there is also a manual (PS, or online HTML) and Esther König's Tutorial (PS gzipped or TEX). The CUF implementation is freely available after license.
These are the materials of three educational lectures given by Daniel Hardt in the Fall of 1997 at the Center for Sprogteknologi / Centre for Language Technology. Topics range from corpora, to basic Unix commands ([e]grep, etc.), to concordances, trigrams and Machine Learning. [2001 May 1].
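Two of the techniques mentioned, concordancing and trigram counting, can be sketched in a few lines. The lectures themselves rely on Unix commands; the Python below is only an assumed analogue for readers who prefer a scripting language, and the function names and sample text are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, keyword, window=2):
    """Return keyword-in-context lines with `window` words on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def trigram_counts(tokens):
    """Count every sequence of three consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

tokens = tokenize("The cat sat on the mat. The cat slept.")
concordance = kwic(tokens, "cat", window=1)
trigrams = trigram_counts(tokens)
```

Each concordance line shows one occurrence of the keyword in its immediate context, which is the basic view offered by most concordancers.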
A rich and well-organized collection of links, meant mainly for linguists/language teachers (and not computational linguists/NLP researchers). Especially worth noting are the sections on English Corpora, neatly arranged in thematic subsections, and on CALL-based methods. There are also sections for Courses, FAQs, Tools, References, Journals, Conferences, etc. [2001 November 28, last checked 2002 October 16].
This index page of the Department of Computer Science - University of Sheffield provides some information on the projects in progress at Sheffield. [2001 April 27].
The main research fields of the Departament de Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya) are related to the use of multilingual lexical resources, information extraction from documents, design of NL interfaces, basic NLP techniques (tagging, parsing, sense disambiguation), NL understanding and Knowledge Representation. The group has been working as a multidisciplinary group since 1986, together with linguists from the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). This collaboration was developed in several projects, among which is a suite of NLP tools, viz. MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). A demo of the full suite, viz. the DLSI-UPC/CLiC-UB Tools, can be freely queried online. Availability is otherwise unknown: contact Núria Castell i Ariño. [2001 April 30].
These useful pages on Language Exploration and Manipulation Tools for Translators, Writers, and Language Students made by Jon Fernquest (Yangon, Myanmar; e-mail) are a good introduction to Scripting Languages for Computer Aided Language Learning Systems (CALL Glue). The focus is on Perl and Tcl/Tk (but there's also something on AWK and Python) tutorial and links. [2001 July 11].
The Expert Advisory Group on Language Engineering Standards (EAGLES) is an initiative of the European Commission, within DG XIII Linguistic Research and Engineering Programme, which aims to accelerate the provision of standards for: (a) very large-scale language resources (such as text corpora, computational lexicons and speech corpora); (b) means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools; (c) means of assessing and evaluating resources, tools and products. Numerous well-known companies, research centres, universities and professional bodies across the European Union are collaborating under the aegis of EC DGXIII to produce the EAGLES Guidelines which set out recommendations for de facto standards and for good practice in the above areas of language engineering. The EAGLES initiative is coordinated by Consorzio Pisa Ricerche, Pisa, Italy which also manages the EAGLES home page and the EAGLES free ftp server. [Last check 2001 April 29].
+ EAGLES documentation and Guidelines can be freely accessed and downloaded from the Browse Page or directly from the ftp server.
+ The EAGLES project is now continued at a world-wide level by the ISLE initiative.
+ The CES (Corpus Encoding Standard) is also part of the EAGLES Guidelines.
+ The XCES (Corpus Encoding Standard for XML) is under beta release.
+ Cf. also the GLOSIX (Document LSD 2) on character encoding within the EAGLES framework.
Active in the fields of Gender theory, Corpora of Romance newspapers, Phraseology, Linguistics and new media, her Corpus Linguistics page, Online Korpusanalyse mit Hilfe von TactWeb (cf. details under Corpora general section), offers some small but useful Italian, French and Spanish corpora that can be freely queried online via TACTweb. [2001 April 23].
A "very provisional" (it has been online since October 24, 2000), he says, but also very useful collection of references possibly relevant to the design of encoding / markup for ANE texts, made by Robin Cover.
Èulogos is a commercial Italian site for Language Engineering. It maintains some free services: (1) an Italian online Morphological Dictionary, based on the SLI (Sistema Lessicale Integrato) technology, (2) the nine-language IntraText library, and (3) the Italian Censor readability GULPEASE and basic vocabulary test (e-mail submission).
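For reference, the GULPEASE readability index for Italian is usually given as 89 + (300 * sentences - 10 * letters) / words, with higher values indicating easier text. The sketch below applies that published formula with a deliberately naive tokenizer and sentence splitter; it is an assumption-laden illustration, not a description of how the Èulogos Censor service computes its scores.

```python
import re

def gulpease(text):
    """Compute the GULPEASE index: 89 + (300*sentences - 10*letters) / words."""
    # Words are runs of letters (Unicode-aware, so accented letters count).
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        raise ValueError("text contains no words")
    letters = sum(len(w) for w in words)
    # Naive sentence count: runs of terminal punctuation, at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 89 + (300 * sentences - 10 * letters) / len(words)

score = gulpease("Il gatto dorme. Il cane corre.")
```

Very short inputs like the one above can score above 100; the index is meant for longer passages.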
EURALEX is the European Association for Lexicography: an international association which was founded in 1983, with the aims of furthering all aspects of the broad field of lexicography, and of promoting the exchange of ideas and information. It is committed to the development of lexicography in all European languages (as well as other non-European languages). Corpus linguistics now plays a major role in this. EURALEX sponsors the International Journal of Lexicography (IJL). Personal membership is available to individuals who are interested in lexicography. The annual subscription for Full Membership is (GBP) £37.00 in Europe and (USD) $63 outside Europe. It includes a subscription to 4 issues of the IJL and membership of the EURALEX mailing list. [Last checked 2001 April 26].
Fabio Tamburini’s (cf. home) 2000-2001 course on Fundamentals of Computational linguistics (for Applied linguistics). A concise but clear introduction to Corpus linguistics from a technological point of view ("La linguistica dei corpora da un punto di vista tecnologico") is freely available in PDF format (but only in Italian). [2001 October 8. Updated 2002 February 19].
Fabio Tamburini’s main interests span corpus linguistics, speech processing and computational linguistics. He is currently involved in Computational Linguistics and Corpus Linguistics projects as a researcher of the CILTA (Bologna University Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann'). On his homepage there are freely downloadable PDF versions of his papers. [2001 July 7].
Databases, properly speaking, are not covered by my CLR Guide, but there are still obvious connections with standard NLP (e.g. importing the output of a concordancer to build up a database, etc.). This page provides a general introduction to the topic and a useful and rich collection of links, and is a good starting point for corpus linguistics people interested in the subject. [2002 February 17].
The homepage of the renowned father of SUSANNE, CHRISTINE and LUCY and author of Educating Eve and Empirical Linguistics (two highly recommended readings, available everywhere, including Amazon). There is a lot of information on his projects and activities, and his complete bibliography, with online versions of recent articles. But most of all there is the invaluable downloadable research resources page, the reference page for all the corpora and tools produced by Sampson and his research team, providing (a) links to web pages describing the resources and to full documentation, and (b) links allowing you to freely download the resources themselves. [Updated 2004 March 25].
GLOSIX Document LSD 2, Part 1-1 (Version 0.5, last modified 28 April 1996) is the Multext / Eagles introduction to character encoding. ISO and Unicode standards are specifically dealt with. [2001 July 10].
Text Analysis Info is a free information source for everything that deals with the analysis of content of human communication, mostly but not limited to text. It also deals with programs that support the coding of audio, video, or even chatroom sources. Strictly speaking, this site does not deal with NLP or Corpus Linguistics, but the fields overlap, and it is a very rich resource, especially for software. [2002 February 18].
This page is only the schematic description of a stimulating EALC 222 Winter 2002 course held at UCLA by Hongyin Tao (homepage), but provides also some good references, especially in CJK computational analysis. [2002 February 17].
The Journal of ICAME (International Computer Archive of Modern and Medieval English) has been published once a year since 1977, with articles, conference reports, reviews and notices related to corpus linguistics. The ICAME subscription fee is 250 Norwegian kroner (NOK) and includes ICAME mailing list membership. Contents of back issues 9-24 are available on the site. [2001 April 26].
http://www.ilsp.gr/info_eng.html (Greek also)
The Institute for Language and Speech Processing - Institóutos Epexergasías tóu Lógou was founded in Athens, Greece, with the aim of supporting the development of Language Technology. Among the activities of ILSP is the development of Language Technologies for Greek. Specifically, ILSP develops environments for translating from and into the Greek language, as well as computational tools and products which assist the translation task; develops CD-ROMs for computer assisted Greek language learning; creates electronic dictionaries (monolingual and multilingual), computational lexica and electronic dictionaries for children; develops prototypes for speech recognition, synthesis and compression; creates text correction tools. Cf. also the HNC (Hellenic National Corpus). [2001 May 1].
A raw but rich link page on Information Extraction, maintained by Ion Muslea.
The International Journal of Corpus Linguistics (IJCL) presents a wide range of views on the role of corpus linguistics in language research, lexicography and natural language processing. It has been published twice a year since 1996 by Kluwer (cf. the Kluwer IJCL page). Contents and abstracts (and some full papers as well) are available online for all the issues. IJCL also offers a Discussion Forum. Subscriptions cost NLG 298.-- / EUR 135.23 (incl. postage/handling) per year; supplementary special issue "Text Corpora and Multilingual Lexicography", NLG 120.-- / EUR 54.45; complete set (Vol. 6 plus special issue) NLG 418.-- / EUR 189.68 (incl. postage/handling). [2001 April 26].
The International Journal of Lexicography (IJL) was launched in 1988 and is sponsored by EURALEX. Interdisciplinary as well as international, it is concerned with all aspects of lexicography, including issues of design, compilation and use, and with dictionaries of all languages, though the chief focus is on dictionaries of the major European languages - monolingual and bilingual, synchronic and diachronic, pedagogical and encyclopedic. The Journal recognizes the vital role of lexicographical theory and research, and of developments in related fields such as computational linguistics, and welcomes contributions in these areas; corpus linguistics, in fact, is a frequent topics and the number IX(1996)3 is monographically devote to it. Subscription to 4 issues per year costs £96/US$167, but is already comprised in the membership fee of EURALEX, which costs only £37/US$63 (special offer to new members £25). [2001 April 26].
A small library of interactive hypertexts for free reading and searching, maintained by Èulogos. All the texts are literary, and many are religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
The purpose of the IQLA (an emanation of the LDV, i.e. Linguistische Datenverarbeitung / Computerlinguistik an der Universität Trier) is to promote the development of all aspects of quantitative linguistics and to stimulate world-wide communication among scientists working in QL. In order to realise these objectives, the Association publishes a newsletter, holds international conferences, establishes chapters, and sponsors other activities consistent with its objectives. Its official journal is the JQL. IQLA personal membership normally costs 70 US$ per year (for other conditions cf. this page) and includes a subscription to the association journal. [Last checked 2001 August 22].
ISLE, the world-wide continuation of the EAGLES project, is both the name of a project and the name of an entire set of co-ordinated activities regarding the Human Language Technology (HLT) field. ISLE acts under the aegis of the EAGLES initiative (Expert Advisory Group for Language Engineering Standards), which has seen successful development and broad deployment of a number of recommendations and de facto standards. The project general coordinator is Antonio Zampolli.
+ The aim of ISLE is to develop HLT standards within an international framework, in the context of the EU-US International Research Cooperation initiative. There is increasing Asian interest in the initiative and in the relevance of standards in the field of HLT. Its objectives are to support national projects, HLT RTD projects and the language technology industry in general by developing, disseminating and promoting de facto HLT standards and guidelines for language resources, tools and products. ISLE targets three areas: multilingual lexicons, natural interaction and multimodality (NIMM), and evaluation of HLT systems. These areas were chosen not only for their relevance to the current HLT call but also for their long-term significance. (1) For multilingual computational lexicons, ISLE will: extend EAGLES work on lexical semantics, necessary to establish inter-language links; design standards for multilingual lexicons; develop a prototype tool to implement lexicon guidelines and standards; create exemplary EAGLES-conformant sample lexicons and tag exemplary corpora for validation purposes; develop standardised evaluation procedures for lexicons. (2) For NIMM, a rapidly innovating domain urgently requiring early standardisation, ISLE will develop guidelines for: the creation of NIMM data resources; interpretative annotation of NIMM data, including spoken dialogue in NIMM contexts; annotation of discourse phenomena. (3) For evaluation, ISLE will work on: quality models for machine translation systems; maintenance of previous guidelines - in an ISO based framework (ISO 9126, ISO 14598). There will be intensive interaction among the groups, as several topics lie within the sphere of interest of more than one group; in this way a broadly-based consensus will be achieved.
+ The first results of this major standardization initiative are already online at the ILC site; all documents can be freely downloaded. [2001 April 26].
Lengoaia Naturalaren Prozesamendurako IXA Taldea has been working for more than ten years on Natural Language Processing, and all the results it has achieved are related to Basque. The site provides some information on NLP projects involving the Basque language and presents some of the most important results of the Group, such as: MORFEUS, a Basque morphological analyzer; EDBL (The Lexical DataBase for Basque), a database of about 70,000 entries; EUSLEM, a Basque lemmatizer/tagger; and XUXEN, a speller for Basque. Only the last is commercial software (distributed by HIZKIA Informatika, Atrium - le Forum, F-64100 Baiona, e-mail). Information on the availability of the other products is lacking. You can however make inquiries to the group’s e-mail. [2001 April 30].
James Allen's research interests range from natural language understanding, discourse and knowledge representation to common-sense reasoning and planning, focusing on dialogue, planning and plan recognition, and temporal reasoning. There are links to papers and projects (e.g. TRAINS, the Natural Spoken Dialogue and Interactive Planning project now continued by TRIPS) in these fields, and to a couple of parser tools. [2001 May 18].
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/doc/notes/corpora.txt (alternative page)
This dated but still useful page is the electronic version of Chapter 10 (pp. 263-310) of the book: Jane A. Edwards & Martin D. Lampert (eds), Talking Data: Transcription and Coding in Discourse Research, London and Hillsdale (NJ), Erlbaum, 1993.
Besides resources for teachers of English to speakers of other languages and for students of English as a second language, she offers on the following page a lot of links to resources and tools concerning collocation, concordancing, etc.
Joakim Nivre's web-based course in statistical natural language processing is meant to provide the basic material for a distance learning course, although some local supervision or tutoring will normally be required. There is also an inventory of available tools and resources for statistical NLP, including the Viterbi Triclass Tagger.
A very rich bibliography on Corpus Linguistics and Written Corpora by Joaquim Llisterri from Universitat Autònoma de Barcelona. Useful! [2001 May 1].
John Elliot's main area of interest is in the field of unsupervised natural language learning. In particular, he searches for generic human and inter-species language universals in order to devise computational methods by which language can be discriminated from non-language and core structural syntactic elements of unknown languages can be detected. Aims of his research include: contributing to the understanding of language structure and the detection of intelligent language-like features in signals, to aid the search for extra-terrestrial intelligence. [2001 July 28].
John Lafferty and Roni Rosenfeld's Spring 1997 Carnegie Mellon University course on Language and Statistics offers on the web at least a syllabus with some bibliography.
This paper in Spanish by Joseba Abaitua (e-mail) of the Universidad de Deusto is the text of a seminar on "La ingeniería lingüística en la sociedad de la información" held at Soria (Fundación Duques de Soria), 17-21 July 2000. It is a rich and detailed reference on bilingual parallel and comparable corpora, provided with a large bibliography that makes this page even more useful. [2002 February 23].
The JQL, an international forum for the publication and discussion of research on the quantitative characteristics of language and text in an exact mathematical form, is the Official Journal of the IQLA (International Quantitative Linguistics Association). The Journal of Quantitative Linguistics is important reading for all researchers in the following disciplines who are interested in quantitative methods and observations: linguistics, mathematics, statistics, artificial intelligence, cognitive science, and stylistics. There are also contents of the individual issues and abstracts available on the site. Subscription to the IQLA costs normally 70 US$ per year (price for non-student individual; for other conditions cf. this page) and entails membership to IQLA.
A new (dated Summer 2001) CL resources link page by Kerstin Fischer of Bremen University. [2001 August 4].
Seeing all the stuff they have can make you sick, but their resources list is also well commented, so you could read it as a sort of tutorial. In German.
Do you need some Linux howto? Try the LDP: the Linuxdoc Org has one of the best collections on the Web of howtos, guides, FAQs, man pages and the like. See also the Doc directory of the MetaLab.unc.edu FTP.
These are the online materials for a course at Brown University, held by Thomas Dean, Sonia Leach and Hagit Shatkay. Lots of neatly arranged info. The home page provides only the introduction and all the stuff is at this page.
By Brigitte Krenn and Christer Samuelsson.
A new collection of links to Corpus linguistics and Corpora resources by Montserrat Civit Torruella of the Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya. [2002 November 8].
If you really want to do any serious NLP you need a Unix OS. Linux is the free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone. Linux is great, free and well documented: so why don't you try it? The official Linux Org site is the best place to start. There you can find links to distributions, documentation, support etc.
For documentation see also the LDP site and the Doc directory of the MetaLab.unc.edu FTP.
It has been published since 1986 by Oxford University Press on behalf of the Association for Literary and Linguistic Computing (ALLC). LLC is an international journal which publishes material on all aspects of computing and information technology applied to literature and language research and teaching. Papers include results of research projects, description and evaluation of techniques and methodologies, and reports on work in progress. Corpus linguistics is a frequent topic; cf. especially volumes VIII(1993)4 and IX(1994)1 with papers from the 1992 Pisa Workshop on Corpora. Subscription to 4 issues per year costs £46/US$77 (individuals; other conditions are advertised on the page referred to above) and entails membership of the ALLC. [2001 April 26].
The Edinburgh Language Technology Group is a research and development group working in the area of natural language engineering, based in the Institute for Communicating and Collaborative Systems of Edinburgh's Division of Informatics. Among the various resources, cf. the LTG Software, the LTG Helpdesk FAQ, the Edinburgh Tools, the Tokenization FAQ, etc.
The Helpdesk FAQ of the Edinburgh Language Technology Group (LTG) is a gold mine of information on Computational Linguistics. Many of the questions concern issues related to corpora and tagging. The files (edited by scholars such as Chris Brew and Colin Matheson) are usually clear and rich in references and links.
MATE aims to develop a preliminary form of standard and a workbench for the annotation of spoken dialogue corpora. Specifically, MATE will treat spoken dialogue corpora at multiple levels, focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and communicative difficulties, as well as inter-level interaction. The results of the project will be of particular benefit to developers of spoken language dialogue systems but will also be directly useful for other applications of language engineering. The MATE Workbench, developed at Edinburgh by the LTG, is now freely available. [2001 April 26].
MetaLab archives over 55 gigabytes of Linux programs and documentation freely available for download via FTP and/or WWW access. It is the reference and of course free FTP site for Linux people. Especially if you are a newbie, first take a look at the welcome page. There are a lot of mirror sites (cf. the list at the following ftp).
The METER (MEasuring TExt Reuse) project, developed at the Department of Computer Sciences of Sheffield University (DCS-Sheffield), aims to investigate the issue of text reuse and explore NLP/LE techniques for detecting and measuring text reuse. Currently, this project focuses on the domain of journalism. However, it is envisaged that the techniques developed in this project will be applicable to a wide range of genres/domains. In this project, various techniques are being explored to address the issue of text reuse, including the n-gram approach, the dot-plot technique and text alignment algorithms. In the long run, integrating all the successful techniques and algorithms explored in this project, a system will be developed which will be capable of detecting and measuring the probability of derivation for a suggested derived text. Once finished, such a system will be applicable in various areas such as plagiarism detection, information extraction/retrieval, etc. One of the results of this project is the METER Corpus. [2001 April 29].
A basic bibliography of corpus-based computational linguistics. Some e-mail addresses for difficult-to-find publications are also provided at this page.
Perhaps the most famous page of links to computational resources on the web. A classic. Contact. [Last checked 2001 August 4].
One of the best introductions to Parallel Corpora, this page gives sources of information concerning tools, texts, and research related to parallel corpora. Contact. [Last checked 2001 August 4].
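To give an idea of what an n-gram approach to text reuse can look like, here is a minimal sketch in Python. It is not the METER project's own code: the function names, the whitespace tokenization, the trigram default and the containment measure are all assumptions chosen for illustration. The score is the share of the candidate text's word n-grams that also occur in the suspected source.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also appear in the source.

    Hypothetical illustrative measure: 0.0 means no shared n-grams,
    1.0 means every n-gram of the candidate is found in the source.
    """
    candidate_grams = word_ngrams(candidate, n)
    if not candidate_grams:
        return 0.0  # candidate too short to contain any n-gram
    shared = candidate_grams & word_ngrams(source, n)
    return len(shared) / len(candidate_grams)


# Invented example sentences, used only to exercise the functions.
source_text = "the minister said the plan would be reviewed next week"
derived_text = "the minister said the plan would be dropped"
score = containment(derived_text, source_text)
```

A higher score suggests heavier borrowing, but any threshold for calling a text "derived" would have to be tuned on real data such as the METER Corpus.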
NCSTRL (pronounced "ancestral") is an international collection of computer science research reports and papers made available for non-commercial use from a number of participating institutions and archives. Texts can be freely retrieved. [2001 April 23].
The NLP/CL Universe is a very useful Web catalog/search engine maintained by the ACL that is devoted to Natural Language Processing and Computational Linguistics Web sites. It has existed since March 18, 1995. [2001 April 26].
A good Korean reference site on NLP - at least if you are interested in the "Korean" point of view on NLP and, of course, if you know a bit of the language, because the site, except for the home page and navigation frame, is strictly in Korean ... [2001 April 26].
EAGLES description of what parallel corpora are. Short but clear.
Parseit is a mailing list for English teachers, students, and others who want to use easy programming languages like Python, Perl, Awk, Tcl/Tk and Visual Basic to accomplish linguistic tasks like concordancing, parsing text, or creating online CALL activities for their students. Last time I checked, however, the page was down [2001 July 11].
The first annual parsing contest based on a fixed set of sentences and a fixed set of tasks to be performed on that set of sentences is held by Ergo Linguistic Technologies. The contest will be based on a comparison of results for one hundred sentences (included at end of this message) and various tasks that can be performed as a result of those parses. That is, the comparison will be based on the actual parse tree and the ability to use that parsed output to generate theory independent output and to perform various NLP tasks.
By Daniel Jurafsky and James Martin. Covers Statistical NLP stuff, as well as symbolic NLP and speech.
+ Check also Cameron Laird's & Kathryn Soraiz's Choosing a scripting language and the Do-It-Yourself site, with language and text tools in Perl and Tcl/Tk.
+ Perl's roots are in UNIX but you will find Perl on a wide range of computing platforms (Windows as well). Because Perl is an interpreted language, Perl programs are highly portable across systems. Perl.com is the main source for Perl resources on the web, ranging from tutorials to software downloads: Perl is, of course, Open Source software and you can download it for free as source code or as a pre-compiled binary distribution.
Class notes prepared by Phil Benson, Hong Kong University, for an MA in Applied Linguistics, April 1997.
This page is the short outline (in Catalan) of a course held by Pilar Sánchez-Gijón (home) of Dep. de Traducció i d'Interpretació, UAB. [2002 February 22].
The Pizza Chef pages, acting as a TEI tagset selector, will help you design your own TEI-conformant document type definition (DTD) in either SGML or XML format. The TEI Guidelines define several hundred SGML elements and associated attributes, which can be combined to make many different DTDs, suitable for many different purposes, either simple or complex. With the aid of the Pizza Chef, you can build a DTD that contains just the elements you want, suitable for use with any SGML or XML processing system.
This project is a first result of an initiative taken by the Portuguese Ministry of Science and Technology to improve the area of computational processing of the Portuguese language. The project is part of the Ministry's aim to grant native speakers of Portuguese easy access to the ever-increasing information society. This site provides a lot of useful information on Portuguese language processing and also online access to some Portuguese Corpora, cf. Corpora do Processamento computacional do português.
Python is another open source scripting language that, like Tcl, Perl, or AWK, can be used for NLP (check also Cameron Laird's & Kathryn Soraiz's Choosing a scripting language paper). This is the official home site, where you can find nearly all you may want to know about Python. [2002 February 19].
A small Italian link page to corpora, courses, reference and statistics resources by Raffaele Cocchi (homepage). [2001 August 5. Updated 2002 February 19].
This is a site devoted to the linguistic topic of Rhetorical Structure Theory. It is maintained by Bill Mann. RST raises issues about communication, semantics, and especially the nature of the coherence of texts. However, RST was originally developed as part of studies of computer-based text generation. A team at Information Sciences Institute (part of University of Southern California) was working on computer-based authoring. In about 1983 part of the team (Bill Mann, Sandy Thompson and Christian Matthiessen) noted that there was no available theory of discourse structure or function that provided enough detail to guide programming any sort of author.
+ For a good intro see the following link.
+ For software cf. this page; for RSTtool especially see this http.
This outline of the course Sabine Reich held at the Englisches Seminar, Universität zu Köln is short but clear and with good bibliographical references and useful links, though limited to English language. [2001 April 23].
The South East Asian Computing And Linguistics Center does pure and applied research in computing, linguistics, and natural language processing. It focuses on Thai, Lao, Khmer, and Burmese, and the problems they present for information technology in both applied and academic disciplines. Among other useful information it hosts the CRCL (Center for Research in Computational Linguistics - Bangkok), the TIE (Thai Internet Educational) / TOLL (Thai-English On-Line Library) projects and the SEALDA (Southeast Asian Language Data Archives).
SENSEVAL is a project concerned with evaluating Word Sense Disambiguation systems. There are now many computer programs for automatically determining which sense a word is being used in. One would like to be able to say which were better, which worse, and also which words, or varieties of language, presented particular problems to which programs. SENSEVAL is designed to meet this need. The first SENSEVAL took place in the summer of 1998, for English, French and Italian, culminating in a workshop held at Herstmonceux Castle, Sussex, England on September 2-4. The second is planned for Pisa, Spring 2001. As a free demo, they offer English dictionary entries and tagged examples for 35 words.
Systemic-Functional Linguistics (SFL) is a theory of language centred around the notion of language function. While SFL accounts for the syntactic structure of language, it places the function of language as central (what language does, and how it does it), in preference to more structural approaches, which place the elements of language and their combinations as central. SFL starts at social context, and looks at how language both acts upon, and is constrained by, this social context. SFL grew out of the work of JR Firth, a British linguist of the 30s, 40s, and 50s, but was mainly developed by his student M. A. K. Halliday. He developed the theory in the early sixties (seminal paper, Halliday 1961), based in England, and moved to Australia in the seventies, establishing the department of linguistics at the University of Sydney.
+ For a general intro go to this page.
+ SFL has been prominent in computational linguistics, especially in Natural Language Generation (NLG). Penman, an NLG system started at Information Sciences Institute in 1980, is one of the three main such systems, and has influenced much of the work in the field. John Bateman (Darmstadt, Germany) has extended this system into a multilingual text generator, KPML. Robin Fawcett in Cardiff has developed another systemic generator, called Genesys. Mick O'Donnell has developed yet another system, called WAG.
+ One of the earliest and best-known parsing systems is Winograd's SHRDLU, which uses system networks and grammar as a central component. Since then, several systems have been developed using SFL (e.g., Kasper, O'Donnell, O'Donoghue, Cummings, Weerasinghe), although this work hasn't been as central to the field as that in NLG.
By Philip Resnik.
for Computational Linguistics): http://www.clres.com/siglex.html
SIGLEX provides an umbrella for a variety of research interests ranging from lexicography and the use of online dictionaries to computational lexical semantics. Here you can find information about, and links to, publicly available online lexical resources (dictionaries and corpora). SIGLEX is trying to include a full and comprehensive set of links to available electronic corpora and lexicons/dictionaries for use in natural language processing.
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/ (Japanese also)
A big reference archive by Kenji Kita, Tokushima University, covering a lot of topics, ranging from NLP, corpora and speech analysis to Chinese and Japanese language processing. I point out especially the following pages:
+ NLP and Computational Linguistics, dealing mainly with institutional references (associations, research organizations, universities, companies and online proceedings).
+ Corpora & Text Resources, rich also in minor-language data.
+ Software Tools for NLP, one of the richest reference lists available on the Web!
+ Web Resources in Japan, rich but Japanese only.
+ Chinese Language Processing, smaller than the Japanese one but in English. [2001 April 28].
The page of the Statistical Natural Language Processing CPS 370 1997 Courses provides a lot of useful links to important papers available online. [Last checked 2001 April 29].
Steven Abney, formerly an assistant professor of Computational Linguistics at the University of Tübingen (so his page may soon move!), researches mainly grammatical inference and parsing ("grammatical inference is basically about writing computer programs to learn (human) languages. Parsing is about computing the meaning of sentences, once you've learned the language"). His CASS Partial Parser is freely downloadable. Contact.
A downloadable PS version of Abney's paper.
Both HTML readable online and PS downloadable versions are available. Editorial Board: Ronald A. Cole (Editor in Chief), Joseph Mariani, Hans Uszkoreit, Annie Zaenen, Victor Zue; Managing Editors: Giovanni Battista Varile, Antonio Zampolli; Sponsors: National Science Foundation, European Commission; Additional support was provided by: Center for Spoken Language Understanding, Oregon Graduate Institute (USA), University of Pisa (Italy).
An exact and up-to-date introduction to Corpus Linguistics edited by Ronald A. Cole. Contents by Chapter: Chapter 1: Spoken Language Input; Chapter 2: Written Language Input; Chapter 3: Language Analysis and Understanding; Chapter 4: Language Generation; Chapter 5: Spoken Output Technologies; Chapter 6: Discourse and Dialogue; Chapter 7: Document Processing; Chapter 8: Multilinguality; Chapter 9: Multimodality; Chapter 10: Transmission and Storage; Chapter 11: Mathematical Methods; Chapter 12: Language Resources; Chapter 13: Evaluation.
This tutorial was prepared by Susan Hockey (homepage), University of Alberta, for a workshop given at the North American Symposium on Corpora in Linguistics and Language Teaching, University of Michigan, Thursday 20 May 1999, 9am - 12pm. Main covered topics are sources, design and encoding of corpora, and analysis tools (mainly for frequency lists, concordance, collocations and POS tagging). There is also a good bibliography with useful references and links. [2001 April 26].
A small page of links to WSD (word sense disambiguation) and corpus linguistics by Tanja Gaustad (see homepage). [2001 July 28].
+ There are binary installers for Windows and Macintosh, and source releases for UNIX platforms; all are freely downloadable (downloads range in size from 2 to 3.5 Megabytes) from the site. On the Tcl Developer site you will also find all the information, documentation, tutorials and news you may need.
+ Check also Cameron Laird's & Kathryn Soraiz's Choosing a scripting language and the Do-It-Yourself site, with language and text tools in Perl and Tcl/Tk.
This page (dated 19 June 1995) provides links to some preliminary but useful material which Chris Brew prepared and collected in association with a reading group studying Charniak's "Statistical Language Learning".
The Text Encoding Initiative (now continued by a new consortium, cf. this page) is an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research, and to satisfy a broad range of uses by the language industries more generally. The TEI Guidelines provide the most widespread SGML/XML based standards for electronic text encoding now available. Guidelines and other useful material are freely available directly from the TEI site.
+ The old TEI site, alleged to be dead in summer 1999, is still alive (August 2000).
+ There is also a handy queryable web edition of the TEI P3 Guidelines online at the University of Michigan.
A rich page on Data-driven Learning by Tim Johns (homepage), focussing mainly on classroom concordancing, providing a good bibliography; samples of DDL materials produced by participants in a workshop in Usti nad Labem (North Bohemia) 21st-25th March 2000; a description of the work undertaken at Birmingham under an EU-funded Lingua-Socrates project on the development of Multiconcord, a Windows-based Multilingual Parallel Concordancer for classroom use; and a lot of other information and links related to these topics. [2001 April 23].
The main purpose of this interesting page by Tim Johns (homepage) is to show that it is possible to begin to use a "data-driven" approach to language learning and teaching even if you do not have access to established corpus resources. A secondary purpose is to discuss the potential of small, very specific corpora for ELT, providing also some simple recipes for cooking them. [2002 February 17].
Tony Berber Sardinha (homepage) provides some useful material and information on Corpus Linguistics (Brazilian and English) and links to corpora online.
These tutorial pages are a supplement to the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson published by Edinburgh University Press (ISBN: 0-7486-0808-7 cased; 0-7486-0482-0 paperback).
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), was started in 1992 as part of the TIPSTER Text program. Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies.
The Tuscan Word Centre is a non-profit Association (situated in hill country in the heart of Tuscany, midway between Pisa and Florence) devoted to promoting the scientific study of language. TWC organises one-week high-level courses for language researchers and workers in the language industries. TWC Courses concentrate on: Use of electronic corpora for different purposes, including: translation, automatic or machine-aided language processing, tagging, parsing etc.; Language teaching support; Language learning assistance; Lexicography and language reference. Other Activities of TWC: Advisory and consultancy services; Language processing and software evaluation; Project design and management; Language product development; Organisation of academic and professional events, e.g. conferences, seminars, workshops. [2001 April 26].
A good introduction to Corpus annotation, from POS Tagging, to grammatical parsing, word sense annotation, anaphoric annotation and prosodical annotation.
It is the freely downloadable gzipped PS file of the DECIDE [MLAP 93-19] deliverable D-1b, Nov. 94, 133 pages. The paper is dated 1994, so it cannot be very up-to-date, but it is still a good benchmark of query tools and surely recommended reading.
This is the (English language) homepage of the UNED Group in Natural Language processing of Felisa Verdejo. On this page there is information on the activities of the group, with links; on this other one there are links to some useful free services, such as: an online demo of Porter’s stemmer; an online version of Brill’s POS Tagger for English; an online version of MACO+ Morphological analyzer for Spanish, alone or in combination with Relax POS tagger; an automatic Spanish to English online translation system; etc. [2001 April 30].
The Unicode Consortium, made up of software corporations and researchers, is aimed at standardizing international character encoding: the Unicode Standard is the biggest effort in character standardization after ASCII, and is often the actual foundation for internationalization and localization of software. The Unicode site maintains useful resources, such as a FAQ and the complete Unicode 3.0 collection of Character Charts. There is also a public FTP.
The page of the Corpus Research Group at the University of Birmingham is maintained by Oliver Mason. Besides some useful links and information there is also some freely downloadable software: Cue, Qwick and Qtag. There is also a free service of tagging by e-mail for plain TXT English texts. [Rev. 2001 November 27].
The Machine Learning research group at UT Austin, led by Raymond J. Mooney, focuses on combining empirical and knowledge-based learning techniques, including applications such as natural language acquisition, knowledge refinement, learning for planning, and recommender systems. This page provides some demos of software developed by the Group, and a good list of links related to these topics.
This large ICT4LT module by Marie-Noëlle Lamy and Hans Jørgen Klarskov Mortensen (with introduction by Graham Davies) aims to introduce language teachers to the use of concordances and concordance programs in the modern foreign languages classroom. It provides a useful introduction to concordancing as well. [2002 February 17].
The page of Using the Web to Solve Crossword Puzzles CPS 370 1998 Courses provides some links to papers related to this topic which are available online. [2001 April 29].
This is the online html version of the ELRA Work Package 3 first draft. It’s the reference guide by the Lancaster people to corpora validation (authors: Tony McEnery and Lou Burnard, with Andrew Wilson and Paul Baker): from tagset and markups, to EAGLES guidelines and mappings. [2001 April 26].
A rich introduction to Corpus Linguistics by the W3-Corpora Project at the University of Essex.
This site is a collection of online resources for research in the field of information retrieval and information extraction from the web. These pages contain materials that are related to the state of the art IR (Information Retrieval) and IE (Information Extraction) techniques used for and on the web. Such techniques use, as well as traditional techniques, hypertext structure and meta-data, the structure and nature of the web, observed human behaviour on the web, other search engines, and more.
On this page there is a Beta release of XCES, which instantiates the EAGLES Guidelines of CES (Corpus Encoding Standard) for XML. It is being developed by the New York University - Department of Computer Sciences - Vassar College and by the Equipe Langue et Dialogue at the LORIA. XCES is under development, and so is its documentation. Because the XML framework provides means to go well beyond the capabilities of SGML, this development is taking several forms: (1) XML support for additional types of annotation and resources, including discourse/dialogue, lexicons, and speech; (2) creation of additional XSLT scripts to perform common operations and transduce among formats (including different annotation formats); (3) development of a set of XML schemas instantiating an abstract data model for linguistic annotations, together with a hierarchy of derived types for a broad range of annotation types; and (4) creation of a repository of annotation formats for "off the shelf" use or easy modification via the XCES schemas. Seven DTDs for XCES are however already available, and you can download them individually or as a single ZIP file. [2001 April 29].
The XML Cover Pages is a comprehensive online reference work for the Extensible Markup Language (XML) and its parent, the Standard Generalized Markup Language (SGML). The reference collection features extensive documentation on the application of the open, interoperable "markup language" standards, including XSL, XSLT, XPath, XLink, XPointer, HyTime, DSSSL, CSS, SPDL, CGM, ISO-HTML, and others. On this rich site you can find most of the links you need to "markup language" resources available on the web. It is also a guide to many text collections using SGML. [Checked 2001 July 14].
XTAG is an ongoing project at Penn (i.e. the University of Pennsylvania; cf. the Penn Tools file) to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism. XTAG also serves as a system for the development of TAGs and consists of a parser, an X-windows grammar development interface and a morphological analyzer. There is also an ongoing project to develop a Korean XTAG system. Both the XTAG English Grammar released on 2.24.2001 and the XTAG Tools are freely downloadable. There are also many User Manuals and selected papers dealing with the various components of XTAG. [2001 April 27].
This accurate page on Learner Corpora and Second Language Acquisition (formerly hosted at Lancaster University, now at Meikai University) provides a large selection of links to resources on this branch of Corpus Linguistics and related topics. There is also a section with freely downloadable papers. Perhaps the best reference on Learner Corpora on the Web. [2001 May 2. Rev. 2002 September 4].
Zipf's law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), is the observation that the frequency of occurrence of some event (P), as a function of its rank (i) when the rank is determined by that same frequency of occurrence, is a power-law function P_i ~ 1/i^a with the exponent a close to unity. Zipf's law is a classic in statistical NLP. Its most famous example is the frequency of English words. On this page you can see a count of the top 50 words in 423 TIME magazine articles (245,412 word occurrences in total), with "the" as number one (appearing 15,861 times), "of" as number two (7,239 times), "to" as number three (6,331 times), etc. When the number of occurrences is plotted as a function of the rank (1, 2, 3, etc.), the functional form is a power-law function with exponent close to 1.
The Zipf's Law page (prepared by Wentian Li of Rockefeller University, New York City) offers a detailed presentation of the law, its history, its application and a huge bibliography.
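As a rough illustration of the rank/frequency relation described above, the following Python sketch estimates the exponent a by fitting a straight line to log(frequency) against log(rank). The function names rank_frequency and zipf_exponent are made up for this example and do not come from the Zipf's Law page; the only data used are the three TIME counts quoted above, far too few for a reliable estimate, so the result is indicative only.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts ordered from most to least frequent (rank 1, 2, ...)."""
    return [count for _, count in Counter(tokens).most_common()]


def zipf_exponent(counts):
    """Estimate a in P_i ~ 1/i^a as the negated least-squares slope
    of log(count) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Counts quoted in the entry for "the", "of" and "to" (ranks 1 to 3).
# Three points are an assumption-laden toy sample, not the full top-50 list.
time_counts = [15861, 7239, 6331]
print("estimated exponent:", round(zipf_exponent(time_counts), 2))
```

With a real corpus, one would pass the full token list through rank_frequency first and fit over many more ranks; an exponent near 1 is what the law predicts.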