General Resources.

(1) - Corpora and Corpus Linguistics.
(2) - Multilingual and Parallel Corpora.
(3) - Electronic Literary Text Archives.
(4) - References, Standards & Educational Resources.
(5) - Tools.

(5) – Tools.

This "Tools" section focuses mainly on corpus-oriented NLP software (especially taggers, parsers, chunkers, corpus query systems, etc.), but text analysers (concordancers, etc.) are also taken into account, as are other applications insofar as they are of interest for NLP and for corpus maintenance and querying. As a rule, automatic speech recognition systems, translation tools, e-dictionaries and typing facilities for exotic languages have been left aside. The software resources on the Web are truly vast and any selection is bound to be somewhat arbitrary; moreover, important links may have escaped me: please e-mail me any addition you wish to suggest! Note that link sites are listed here only if they are mainly concerned with tools; you can find more general reference pages in section 2.4 "References, Standards & Educational Resources".

@nnotate: http://www.coli.uni-sb.de/sfb378/negra-corpus/annotate.html

@nnotate is a tool for the efficient semi-automatic annotation of corpus data. It facilitates the generation of context-free structures and additionally allows crossing edges. Functions for the manipulation of such structures are provided. Terminal nodes, non-terminal nodes, and edges are labelled. It was used for the NEGRA project. @nnotate runs under Solaris and Linux. It needs the GNU C-Compiler, Tcl/Tk, and an installation of mSql.
@nnotate will be freely available for research purposes: contact Thorsten Brants.

AGFL Grammar Work Lab: http://www.cs.kun.nl/agfl/

The AGFL (Affix Grammar over Finite Lattices) Grammar Work Lab has a collection of software systems for Natural Language Processing, based on the AGFL formalism. The AGFL formalism has been developed by the Department of Software Engineering, University of Nijmegen. It is a formalism in which context-free grammars can be described by means of two-level grammars: a context-free grammar is augmented with a feature level for expressing agreement between syntactic categories. The formalism is suited for specifying both morphological and syntactic analysis. Grammar rules can be extended with transduction parts, which specify the output. The default output consists of parse trees. Furthermore, mechanisms are provided to handle open classes of words, which enables the construction of robust parsers.
There is a manual both in online html and in downloadable postscript format.
AGFL is distributed under the GNU General Public License; registration is made at this page. The latest stable public release of the AGFL is version 1.9.0. You can obtain it from their FTP site by following the present link. Linux, Solaris and Win 95-8+NT versions available.

Alembic Workbench: http://www.mitre.org/technology/alembic-workbench/

A workbench for the development of tagged corpora, including a tagger based on Brill's TBL approach. Basically, Alembic Workbench is an SGML-based annotation system. Apart from the usual kinds of textual annotations, the workbench enables various kinds of specialized annotations including co-reference annotations (cf. the Message Understanding Conference markup rules), various kinds of user-defined inter-tag pointers, and (shortly) general template annotation (aka relations, frames, or events). The Alembic multi-lingual NLP system provides access to taggers for a wide variety of extraction levels, and applications have now been built for several languages. The software has a sophisticated visualisation component. It runs on Sun workstations and is freely distributed, but a license is required.

Align: http://www.cs.cmu.edu/~aberger/software.html

Align is a C++ freely downloadable package by Adam Berger for aligning, at the sentence level, a pair of text files which are translations of one another. The problem Align was designed to solve is this: you have a pair of text files which are translations of one another. Each file may contain "spurious" (extra) sentences, not appearing in the other file. The translations may also be impressionistic. Relying on dynamic programming and a user-provided routine for calculating the probability of a word-to-word translation between the two languages, Align will (ideally, anyway) weave an optimal sentence-to-sentence alignment between the two files. Align takes as input a pair of ascii files to be aligned. Each file contains one "sentence" per line, the words of which are space-delimited. That is, newlines delimit sentences, and spaces delimit words. I put the word "sentence" in quotes because Align doesn't actually care what syntactic units appear on each line; however, the output of Align will be an alignment between lines of the input files. (If you so desire, you may put paragraphs or just phrases on each line, to align at a coarser or finer level of granularity).
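The idea described above (dynamic programming over two line-per-sentence files, driven by a user-provided word-to-word translation probability) can be sketched as follows. This is only a minimal illustration, not Adam Berger's implementation: the function names, the fixed `skip_cost` penalty for "spurious" lines, and the way a line-to-line score is derived from word probabilities are all assumptions made for the example.

```python
import math

def align(src, tgt, word_prob, skip_cost=2.0):
    """Align two lists of lines; returns (i, j) pairs, None marks an unmatched line.

    word_prob(s_word, t_word) is the user-provided translation probability.
    skip_cost is an assumed fixed penalty for leaving a line unaligned.
    """
    def match_cost(s, t):
        s_words, t_words = s.split(), t.split()
        if not s_words or not t_words:
            return 2 * skip_cost
        total = 0.0
        for w in s_words:
            # best translation of w among the words of the candidate line
            best = max(word_prob(w, v) for v in t_words)
            total += -math.log(max(best, 1e-9))
        return total / len(s_words)

    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i > 0 and j > 0:   # align src[i-1] with tgt[j-1]
                moves.append((cost[i - 1][j - 1] + match_cost(src[i - 1], tgt[j - 1]),
                              (i - 1, j - 1)))
            if i > 0:             # src[i-1] is spurious
                moves.append((cost[i - 1][j] + skip_cost, (i - 1, j)))
            if j > 0:             # tgt[j-1] is spurious
                moves.append((cost[i][j - 1] + skip_cost, (i, j - 1)))
            if moves:
                cost[i][j], back[i][j] = min(moves)

    # Follow the back-pointers from the end to recover the alignment.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi < i and pj < j:
            pairs.append((pi, pj))
        elif pi < i:
            pairs.append((pi, None))
        else:
            pairs.append((None, pj))
        i, j = pi, pj
    pairs.reverse()
    return pairs

# Toy translation table (hypothetical values, for illustration only).
TOY = {("the", "le"): 0.9, ("cat", "chat"): 0.9,
       ("good", "bonne"): 0.9, ("night", "nuit"): 0.9}

def toy_prob(s_word, t_word):
    return TOY.get((s_word, t_word), 0.001)

english = ["the cat", "hello world", "good night"]
french = ["le chat", "bonne nuit"]
print(align(english, french, toy_prob))
```

With the toy table, the second English line has no counterpart and is left unaligned, so the result is `[(0, 0), (1, None), (2, 1)]`.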

Altmann Fitter 2.0: http://www.ldv.uni-trier.de:8080/~iqla/soft.html

Altmann Fitter is an interactive program, sponsored by the IQLA and running under Windows 95 and Windows NT, which provides more than 200 discrete distributions for all areas of empirical research. Its functions include a selection assistant and automatic fitting. Goodness of fit is tested by Chi-square, P(Chi-square), and the contingency coefficient C. Optimisation parameters can be configured. The software comes with comprehensive documentation (user manual: 15 pp.; distributions and bibliography: 141 pp.). For information or a demo version please contact: RAM-Verlag, Stüttinghauser Ringstr. 44, D-58515 Lüdenscheid - Germany; Fax: +49 2351 973071.
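To give an idea of what fitting a discrete distribution and testing it by Chi-square involves, here is a minimal sketch. It is not the program's own code: the choice of the Poisson distribution, the function names and the pooling of the last class are assumptions made for the example, and P(Chi-square) and the coefficient C are left out.

```python
import math

def poisson_expected(observed):
    """Expected frequencies under a Poisson fit.

    observed[k] is the number of items with value k; the last class is
    treated as "k or more" when computing the expected frequencies.
    The parameter is estimated by the sample mean (an assumed, simple choice).
    """
    n = sum(observed)
    lam = sum(k * f for k, f in enumerate(observed)) / n
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(len(observed) - 1)]
    probs.append(1.0 - sum(probs))  # pooled tail class
    return [n * p for p in probs]

def chi_square(observed, expected):
    """Pearson's Chi-square statistic: sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequency data, e.g. word lengths counted in syllables minus one.
observed = [120, 85, 40, 12, 3]
expected = poisson_expected(observed)
print(round(chi_square(observed, expected), 3))
```

The smaller the statistic (relative to the degrees of freedom), the better the fit.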

AMALGAM Multi-Tagger: http://www.scs.leeds.ac.uk/ccalas/amalgam/amalgsoft.html

The AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora. The AMALGAM Multi-Tagger (actually a retrained version of the Brill tagger) has been developed to tag text with up to 8 annotation schemes. The tagger is intended for English text only and will not work for other languages. The system lets you enter plain text and have it tagged with any of the 8 tagging schemes. It can be freely used via e-mail and, "shortly" (but nothing has happened since 20 August 1996 ...), through a web browser.
For more information see AMALGAM (Reference, Standard etc. section).

AMPLE: http://www.sil.org/computing/catalog/ample.html

AMPLE is a morphological parser for linguistic exploration developed by SIL Computing; it works under Win 3.1-98 + NT, DOS, MAC and Unix. When given the necessary information about the morphology of a language, AMPLE will analyze each word in a text and break it into morphemes. AMPLE is oriented to the "item and arrangement" approach to the description of morphological phenomena. It can handle nonconcatenative phenomena only indirectly. AMPLE can work together with other SIL applications.

APP (Apple Pie Parser): http://cs.nyu.edu/cs/projects/proteus/app/

The parser by Satoshi Sekine (Department of Computer Science - New York University) is a bottom-up probabilistic chart parser which finds the parse tree with the best score using a best-first search algorithm. The grammar (of English) in the distribution is a semi-context-sensitive grammar with two non-terminals; it was automatically extracted from the Penn Treebank (PTB), the syntactically tagged corpus made at the University of Pennsylvania. The framework of the algorithm was reported at the International Workshop on Parsing Technologies 1995. The grammar is thus acquired fully automatically from a syntactically tagged corpus, rather than by the human labor, or statistically aided human labor, used in many conventional projects. Although there are some problems with this strategy, such as the availability of such a corpus and domain restrictions, the performance of the grammar is fairly good. The parser generates a syntactic tree just like the PTB bracketing. Although PTB has argument structure labels, this parser does not produce such labels. APP simply tries to make a parse tree that is as accurate as possible for reasonable sentences, where "reasonable sentences" means, for example, sentences in newspapers or well-written documents. Hence it aims neither to parse reasonable but ill-formed sentences (as in conversation) nor to reject absolutely ill-formed sentences. You may be surprised that the parser can make a parse tree for a sentence with number disagreement, or that it cannot correctly parse a very simple English sentence, but this is a result of how APP is designed. The author knows that the performance is not the best compared with the state-of-the-art parsers reported recently. However, he thinks that the main difference between his parser and those parsers is the use of lexical information, and he is planning to incorporate this information into the parser and, hopefully, to release new versions (the latest one is Ver. 5.9 of April 1997). Don't be misled by the apples: APP runs as usual under Unix (and experimentally under Win NT), not on the Mac! An executable for Windows is now also available.
The APP is freely downloadable by FTP as a TAR gzipped file.

ARIES Natural Language Tools: http://www.mat.upm.es/~aries/

Lexicons and morphological analysis for Spanish: the ARIES Natural Language Tools make up a lexical platform for the Spanish language. They include: a large Spanish lexicon, lexical maintenance and access tools and morphological analyser/generator. There is a free demo for single words, or you can submit a text by e-mail (following this link) for word morphological analysis and spelling check, but the real lexicons and C/C++ access tools cost money.

Automorphology: http://humanities.uchicago.edu/faculty/goldsmith/Automorphology/

Automorphology is a system (for Windows) by John Goldsmith for automatically learning the morphological forms of words in a corpus. It is freely downloadable.

AWK: cf. the file in the References & Standards section.

Bilingua Language Engineering: http://www.bilingua.com/maine.htm

Bilingua Language Engineering provides solutions for translation, lexicography, terminology, and language for special purposes. The Bilingua team have been engaged in cross-cultural communication and training for over thirty years and in the development of language software since 1986. They produce the well-known Duplex system with Term Shuttle and Term Lists.

Birmingham E-Mail Tagger: http://www.clg.bham.ac.uk/tagger.html

The Birmingham Corpus Research Group has been working with parts-of-speech tagging since 1994, and there are two main results of their work, both free, the Q-Tagger and the e-mail tagging service. If you have a (reasonably short) English text that you want to have tagged, you can send it by e-mail: this email will be automatically processed by the tagger, and the result will be sent back to you within a very short time. This service is fully automated and the tagger can only cope with plain text. Please do not send your text as an attachment or in a compressed format, as this will not be processed properly. The text should be in the main body of the mail.
+ The tagset used at Birmingham is also available on the Web. [Rev. 2001 November 27].

Brill's Tagger: http://www.cs.jhu.edu/~brill/code.html

Brill's transformation-based learning tagger is one of the best-known freely available taggers. It comes with a lexicon based on the Wall Street Journal. The whole thing is in Perl and C (the bits which need to be fast are in C).
Both program and manual are freely downloadable by anonymous FTP, from the /Programs (the software) and /Papers (the documentation) directories. The program alone is also directly downloadable over the Web from Eric Brill's Code Page.
+ There is also a DOS compilation of Brill's Tagger, made with djgpp by Takahashi (cf. Takahashi Software Plaza). It is freely downloadable, but it does NOT include the original archive: please get the original archive before you run the MS-DOS executables.
+ There is also an online version maintained by the UNED Grupo de Procesamiento del Lenguaje Natural. [Added 2001 April 30].

Canterbury Text Compression Corpus: http://corpus.canterbury.ac.nz/

The Canterbury Corpus is a benchmark to enable researchers to evaluate lossless compression methods; it replaces the Calgary Corpus. This site includes freely downloadable test files and compression test results for many research compression methods. For a full description see under the English Section. [2001 April 28].

CASS Partial Parser: http://www.sfs.nphil.uni-tuebingen.de/~abney/

The CASS Partial Parser made by Steven Abney is a partial parser designed for use with large amounts of noisy text. Robustness and speed are primary design considerations. The package consisting of Cass and utility programs is called SCOL. The most recent bug fixes were made on 20 Jun 97 (version 0.1d). Version 0.1e (24 Sep 97) contains no substantive changes, only some minor modifications to make it compile more smoothly. It has been successfully compiled on a sun4m running SunOS 4.1.3_U1, a sun4u running SunOS 5.5.1, and an i686 running Linux 2.0.24 (architecture from uname -m, OS from uname -sr). It is freely downloadable together with its manual. Contact.

Categorial Grammar Laboratory:

gopher://nic.merit.edu:7055/40/linguistics/software/mac/cglab1.11.cpt.hqx
Directly downloadable from the gopher above. It is a fully functional demo version of a program for writing and testing categorial grammars.

ChaSen: http://chasen.aist-nara.ac.jp/

ChaSen is a free Japanese Morphological analyser by the Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). Version 1.0 was officially released on 19 February 1997; the last release is 2.5.1, 2002/1/30. It grew out of developing JUMAN version 2.0 and has made a significant improvement in system performance. This tool was christened with the Japanese name for the tea whisk because it was developed by the NAIST, situated at Takayama (Nara), which is famous for producing a tea whisk used in traditional Japanese tea ceremony. ChaSen can be freely downloaded from the site in UNIX and Linux versions; the dictionary (IPADIC), with which it works, is available also for Windows. [2002 February 17].

Chunkie NP-Chunker: http://www.coli.uni-sb.de/~skut/chunker/

The task of a chunker consists in roughly grouping words into rather simple phrases, such as non-recursive NPs, PPs, APs or verbal complexes. Approaches to chunking vary, but have a few features in common: (a) Since the output of a chunker usually serves as input for further processing, a reasonable accuracy is more important than a wide coverage. Thus, only rarely do chunkers go beyond the recognition of base (non-recursive) phrases. (b) Top-down information is not available, so that the chunker has to rely on hints provided by local lexical contexts, i.e., short sequences of words and/or parts of speech. Chunkie is based on a principle similar to that underlying standard POS tagging techniques. It assigns tree fragments to sequences of POS tags. The most likely sequence of tree segments is determined using Viterbi search on the basis of trigram frequencies (in other words, it is a 2-nd order Markov Model). For this kind of tagging, it uses the excellent HMM-based tagger TnT. The chunker is capable of recognising trees of depth <= 3, which means that it can be used for parsing more complex structures than just base NPs. Depending on parametrisation, 8 - 9 chunks out of 10 are assigned the correct internal structure. Training sample: 12,000 sentences from the NEGRA newspaper corpus. Release is scheduled later this year, maybe in October.
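The Viterbi search over trigram statistics mentioned above can be sketched as follows. This is not Chunkie's or TnT's code: the function names, the "B"/"I"/"O" labels standing in for tree fragments, and the toy (unnormalised) scores are all assumptions made for the example. The observations are POS tags, and the hidden states are the chunk labels to be assigned.

```python
import math

START = "<s>"  # padding symbol for the two positions before the sentence

def viterbi_trigram(observations, states, trans, emit):
    """Return the most likely state sequence under a second-order model.

    trans(a, b, c) is the score of state c after states a, b;
    emit(s, o) is the score of observing o in state s.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    if not observations:
        return []
    # best[(a, b)]: log score of the best path whose last two states are a, b
    best = {(START, s): log(trans(START, START, s)) + log(emit(s, observations[0]))
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for (a, b), score in best.items():
            for c in states:
                cand = score + log(trans(a, b, c)) + log(emit(c, obs))
                if (b, c) not in new_best or cand > new_best[(b, c)]:
                    new_best[(b, c)] = cand
                    pointers[(b, c)] = a   # remember the state two steps back
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state pair.
    pair = max(best, key=best.get)
    path = [pair[1]]
    for pointers in reversed(backpointers):
        a = pointers[pair]
        path.append(pair[0])
        pair = (a, pair[0])
    path.reverse()
    return path

# Toy model (hypothetical, unnormalised scores): B = chunk start, I = inside, O = outside.
STATES = ["B", "I", "O"]
EMIT = {"B": {"DT": 0.8, "NN": 0.2},
        "I": {"NN": 0.9, "DT": 0.05, "VB": 0.05},
        "O": {"VB": 0.9, "NN": 0.05, "DT": 0.05}}

def toy_trans(a, b, c):
    # "I" may not follow the sentence start or an "O"; everything else is equally likely
    return 0.0 if (b in (START, "O") and c == "I") else 1 / 3

def toy_emit(state, tag):
    return EMIT[state].get(tag, 0.0)

print(viterbi_trigram(["DT", "NN", "VB"], STATES, toy_trans, toy_emit))
```

For the tag sequence DT NN VB the toy model returns `['B', 'I', 'O']`, i.e. a two-word chunk followed by a word outside any chunk.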

Cíbola/Oleada: http://crl.nmsu.edu/Research/Projects/oleada/

Cíbola and Oleada, developed by the Computing Research Laboratory (CRL) at New Mexico State University, are two related systems that provide multilingual text processing technology to language instructors, learners, translators, and analysts. The systems consist of a set of component tools that have been designed with a user-centered methodology. This methodology takes observations made of users in real-work environments as the starting point for interface design. Iterative refinements are made as a result of continued user observation and testing. These systems are developed using a Unicode text processing capability represented by the Multilingual Unicode Text Toolkit, or MUTT. The individual components comprise Multilingual Text, Concordance, Dictionaries, Custom Databases, Translation Memory, Glossaries, Document Management, and Word Frequencies. Learners use OLEADA to: (a) identify relevant texts; (b) view words/expressions in context; (c) discover the frequency of words in texts; (d) segment Chinese or Japanese texts; (e) retrieve parallel English translations; (f) examine real life examples.
+ Cf. also the OLEADA page.
+ Various unsupported versions of these tools are freely available for download after you have signed the CRL Software License Agreement (it's easy: they don't ask you for money or embarrassing questions!) and obtained a password and username to log in.

CLAN programs (incl. Concordancer and Analyzer): http://childes.psy.cmu.edu/

(mirror sites: Antwerp - Belgium, Chukyo - Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. They are based on the CHAT format which makes them easily analyzed by using the CLAN programs. In particular the CLAN concordancer is freely available; cf. also the manual.

CLAWS tagger: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/

UCREL POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the BNC. CLAWS has consistently achieved 96-97% accuracy (the precise degree of accuracy varying according to the type of text). Judged in terms of major categories, the system has an error-rate of only 1.5%, with c.3.3% ambiguities unresolved, within the BNC. [Rev. 2001 April 28].
+ A site licence for academic institutions for CLAWS4 may be purchased from UCREL for the huge fee of £750. This fee includes introductory assistance and an information pack which contains program details, relevant papers, a reference list and tagset examples. The system requires a UNIX (SPARC) workstation running SunOS4.x, or a UNIX (SPARC) workstation running Solaris with binary compatibility installed.
+ Besides this, there is an in-house tagging service at Lancaster University. Text is sent to UCREL in electronic form (the submit form is on the web page), UCREL marks up POS tags using CLAWS4 tagger and delivers the completed tagged text within an agreed time schedule. The cost of this service is based on actual text itself, any new material, necessary adjustments, the size and any specific individual requirements.
+ Free CLAWS WWW trial service. The trial service offers free access (you are asked only for e-mail) to the latest version of the tagger, CLAWS4, with either C5 or C7 tagsets. You can enter up to 300 words of English running text. If you enter more, it will be cut off after the 300th word.

CMA - Chinese Morphological Analyzer: http://www.basistech.com/products/Chinese-analyzer.html

The Chinese Morphological Analyzer (CMA) from Basis Technology is a portable engine that incorporates comprehensive Chinese dictionaries for segmenting Chinese texts in both Traditional Chinese and Simplified Chinese scripts. CMA can segment Chinese text into words, index and search large collections of Chinese documents (or text fields in databases), generate lists of words from free-running Chinese text, and identify parts of speech and word-formation processes. The price is not stated, and to request an evaluation version you have to send them an e-mail. There is, however, also an online demo, for both Simplified and Traditional Chinese. [2002 February 17].

CMU-Cambridge Statistical Language Modeling toolkit:

http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
(Something downloadable).

COAL Tools suite: http://www-rali.iro.umontreal.ca/arc-a2/BAF/Programmes.html

COAL is a simple format for representing alignments (i.e. parallel segmentations of a pair of texts such that the Nth segment of the first text and the Nth segment of the second text are mutual translations) and mappings (i.e. sets of pairs of positions which mark exact correspondences between the two texts) in bitexts (i.e. pairs of texts in two different languages that are mutual translations). At RALI, a number of small tools (in Perl and Emacs Lisp) have been written, mainly for the BAF Corpus project, for managing texts in COAL format. The full suite is freely downloadable as a single gzipped TAR file. [2001 April 23].

Cogilex: http://www.cogilex.com/

Cogilex, a company of computational linguists and knowledge engineers, offers expert tools and services for natural language processing. Besides providing software (e.g. QuickTag and QuickParse), they can also act as advisors, designers, developers or quality-control experts on your NLP-based projects.

Conc: http://www.sil.org/computing/conc/

Conc is a concordance generator for the Macintosh by SIL Computing. It produces keyword-in-context concordances of words in a text. It can handle both ordinary flat text and multiple-line interlinear text. In the case of interlinear text (produced by IT), it can concord morphemes and also correspondences between two annotation lines. Conc can also do letter concordances to facilitate phonological analysis. Conc permits the user to limit the concordance to just those words that match a specified pattern (GREP expression). Freely available under agreement to the SIL standard free license.
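A keyword-in-context concordance restricted to words matching a pattern can be sketched in a few lines. This is only an illustration of the technique, not Conc's own behaviour: the function name, the character-based context window and the use of Python regular expressions in place of Conc's GREP expressions are assumptions made for the example.

```python
import re

def kwic(text, pattern, width=8):
    """Return (left context, keyword, right context) for every word matching pattern.

    pattern is a regular expression that must match the whole word;
    width is the number of context characters kept on each side.
    """
    text = " ".join(text.split())           # normalise line breaks and runs of spaces
    wanted = re.compile(pattern, re.IGNORECASE)
    hits = []
    for m in re.finditer(r"\w+", text):
        if wanted.fullmatch(m.group()):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((left, m.group(), right))
    return hits

sample = "The cat sat on the mat. The dog sat too."
for left, word, right in kwic(sample, r"sat"):
    # right-align the left context so that the keywords line up in a column
    print(f"{left:>8}[{word}]{right}")
```

A pattern such as `[cm]at` limits the concordance to "cat" and "mat", in the same spirit as restricting the concordance to words that match a specified expression.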

ConcApp: http://vlc.polyu.edu.hk/pub/concapp/

ConcApp is a free concordancing program by the Virtual Language Centre of the Polytechnic University of Hong Kong. Version 1 runs on Win 3.1-95, Version 2 on Win 95, 98 and NT. Both can be freely downloaded with full setup in zip format (v.1 and v.2) or also in reduced packages. The Web Concordancer site presents some applications of ConcApp to a few small indexed corpora (English, French, Chinese, Japanese). [2002 February 17].

Concordance: http://www.rjcw.freeserve.co.uk/

The Concordance software by R. J. C. Watt, although only recently released (1999), already has, according to its author, registered users in twenty-four countries. The program is proving valuable to anyone who needs to study texts closely or analyse them in depth. It is being used in: (a) Language teaching and learning; (b) Literary scholarship; (c) Translation and language engineering; (d) Corpus linguistics; (e) Natural language software development; (f) Lexicography; (g) Content analysis in many disciplines including accountancy, history, marketing, musicology, politics, geography, and media studies. It can make full concordances to texts of any size, limited only by available disk space and memory, or fast concordances, picking your selection of words from the text. It supports various Western languages and character sets (not just English) and user-definable alphabets. The built-in file viewer can display files of unlimited size, and the built-in editor allows fast editing of files up to 16MB. There are also file conversion tools, from OEM to ANSI character sets and from Unix to PC files. Concordances can be saved and exported as plain text, as a single HTML file, or as a Web Concordance (viz. linked HTML files, ready for publishing on the Web: cf. some samples on the Web Concordances site).
+ The new Version 2.0.0 was released on 18 December 2000; it also runs on Win 2000. Free upgrades are available for registered users.
+ Concordance can be ordered online. It has a reasonable price ($89, $40 for each additional copy). You must first download the unregistered shareware version (which lasts only 30 days and is fully functional, but adds 'Unregistered' messages to the files you create with it), then order and pay for your registration. [Rev. 2001 Nov. 28].

Conexor: http://www.conexor.fi/main.htm

Conexor is a linguistic software company based in Finland. Conexor was founded in 1997 to develop and sell linguistic software for use in applications related to information technology, speech processing, human-machine communication, lexicography, grammar and style improvement, and terminology. At present, they sell computer programs for linguistic analysis of English texts, e.g. the EngCG-2 tagger and the EngCG Parser (both with the Constraint Grammar of English).

Corpus Wizard: http://www2d.biglobe.ne.jp/~htakashi/software/CW_E.HTM

You have to click through some frames to reach the Corpus Wizard page, where there are free download links but little information. The description below is instead taken from the Euralex Computing Tool at this page. Corpus Wizard for Windows 95, 3.1 and NT, by Hamagushi Takahashi, is a kind of concordancer which can produce and sort KWIC concordances. You can extract more selective results from the concordance, and you can use regular expressions. English, French and Japanese are officially supported. Corpus Wizard also has some other utilities, such as FREQ. Corpus Wizard is posted on WinSite.com, so you can get it from various FTP sites around the world.
The software is shareware and the fee is low (cf. this page): Corpus Wizard for Win16 $6 (Single User) – Corpus Wizard Plus! $6 (Single User) $30 (Site Licence) – Corpus Wizard for Win32 $35 (Single User) $50 (Site Licence) – KPL Text Processing Utils. $6 (Single User) $30 (Site Licence) – DeHTML for Windows $5 (Single User) $20 (Site Licence) – KPL Printing $10 (Single User) $40 (Site Licence).

CRCL (Center for Research in Computational Linguistics): http://seasrc.th.net/main/main.htm

The CRCL (Center for Research in Computational Linguistics, Bangkok) pages produced by Doug Cooper for the SEASRC (South East Asian Computing And Linguistics Center) lie at the intersection of computing and linguistics in Southeast Asia. SEASRC publicizes and encourages cooperative research activity in and around Thailand, and provides data, tools, and contacts to scholars around the world. There are many valuable and usually free resources (especially for Thai) on this site (cf. the index), ranging from the TIE project (with the TOLL bilingual texts) to fonts and related tools. A great site!

CUE Corpus Storage System: http://www-clg.bham.ac.uk/CUE/

CUE is a development system by the Corpus Research Group - University of Birmingham for applications that require access to corpus data. It provides a high level of abstraction that makes it easy to select corpora, to get concordance lines for a certain query, and also to compute collocational scores. A developer no longer has to deal with files and characters, but instead handles corpora, tokens and corpus positions. CUE is a collection of Java class libraries, which allows for platform independent development. It uses data compression to keep the space requirements of corpora at a minimum, and through the use of an inverted index the speed of retrieval is extremely fast.
+ Version 1.3, the first public release, is now freely downloadable. At present there is only an outline of the documentation, together with the API of the most important classes, but more documentation will be supplied as time permits. In order to access your corpus data through CUE you will need to index the data; documentation of this is available in a separate file.
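The kind of abstraction described above (tokens and corpus positions instead of files and characters, with an inverted index for fast retrieval) can be illustrated with a small sketch. It is written in Python rather than Java and does not reproduce the CUE API: the class and method names are hypothetical, and compression and collocational scores are left out.

```python
from collections import defaultdict

class TinyCorpus:
    """A toy corpus with an inverted index from word form to corpus positions."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.index = defaultdict(list)
        for position, token in enumerate(self.tokens):
            # each word form points to every position where it occurs
            self.index[token.lower()].append(position)

    def positions(self, word):
        """All corpus positions of a word, found without scanning the text."""
        return self.index.get(word.lower(), [])

    def concordance(self, word, span=2):
        """Concordance lines as (left tokens, node, right tokens)."""
        return [(self.tokens[max(0, p - span):p],
                 self.tokens[p],
                 self.tokens[p + 1:p + 1 + span])
                for p in self.positions(word)]

corpus = TinyCorpus("the cat sat on the mat".split())
print(corpus.positions("the"))
print(corpus.concordance("sat"))
```

Looking a word up costs one dictionary access, whatever the size of the corpus; the price is that the data has to be indexed once before it can be queried.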

CWB (The IMS Corpus Workbench): http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/

A set of powerful pre-compiled tools developed by Oliver Christ and Bruno Maximilian Schulze of the IMS for querying large text corpora. Tokenized texts can be processed into an internal storage format, and single tokens or groups of tokens can be queried using regular expressions. Results can be viewed, sorted, grouped, and printed. The core components of the suite are CQP (the Corpus Query Processor), which is the query language interpreter, and the Xkwic sorting utility. CQP is a command-language-based query interpreter which can be used independently or through Xkwic, which is an X Window graphical user interface. The CQP system supports a restricted set of annotations and parallel corpora, but not SGML syntax.
The software is available free of charge, under license, for research and educational purposes (for Solaris, Linux, and IRIX machines); cf. the present link. Contact: Arne Fitschen

Dan Melamed's Tools: http://www.cis.upenn.edu/~melamed/ (alternative page)

On Dan Melamed’s page at Penn (i.e. the University of Pennsylvania; cf. the Penn Tools file) there are many miscellaneous tools developed by the author, nearly all freely downloadable by FTP, viz.: a statistical machine Translation Toolkit; a simple Perl Simulated Annealing Program; XTAG Morpholyzer Post-processors for English stemming; a GB to ASCII Converter for punctuation and numerals; Good-Turing Smoothing Software; 170 general text processing tools (mostly in Perl 5); 75 text statistics tools (mostly in Perl 5); 40 bitext geometry tools (mostly in Perl 5). There are also links to other people's tools and to many freely downloadable PS papers by Melamed himself. Contact. [Last Rev. 2001 April 27].

DBT: http://www.ilc.pi.cnr.it/pisystem/terza.htm

DBT is the Textual Data Base component of the PiSystem suite made by Eugenio Picchi at the Pisa ILC. It is the most widespread software of this category in Italy (it is, for example, used by the popular LIZ text collection of Italian Literature).
+ Commercial versions of DBT are licensed by the CNR and can be bought from the LEXIS distribution house, via Acireale 19 - ROMA - Italia; Tel./Fax: +39-6-70302626; E-mail; cf. also the web page.
+ DBT is now (April 2001) available at a discount price (480.000 IL) also from Libreria Chiari.
+ There is also a web version with an online demo.
+ For a detailed list of the modules of this procedure see the following page.

Definite Clause Grammar Laboratory:

gopher://nic.merit.edu:7055/40/linguistics/software/mac/defclauselab1.11.cpt.hqx
A fully functional demo version of a MAC program for writing and testing definite clause grammars.

DELIS: http://www.linglink.lu/le/projects/delis/index.html

DELIS (Descriptive Lexical Specifications and Tools for Corpus-based Lexicon building) aimed to provide tools for efficient lexicographic corpus construction, exploration and selective retrieval of lexicographically relevant material. It provided an easy to use and well documented descriptive scheme for lexical representation, improving consistency over manual and semi-automatic data entry. The tools also support importation and exportation of lexical information. The user community of the DELIS tools includes: (a) Lexicographers in dictionary publishing houses. (b) Glossary builders and terminologists in translation and documentation companies. The objectives of the project focused on lexicon-based syntactically oriented retrieval of corpus evidence from morphosyntactically and syntactically annotated text corpora (Search Condition Generator) and exemplification with a fragment of semantically, syntactically and morphosyntactically described verbs of perception and communication in 5 languages: EN, FR, DK, NL, IT. As a formalism for lexical representation the project has employed a 'Frame Semantics' for lexical semantic description, HPSG-like syntax and Typed Feature Structures (TFS). Corpus tools are programmed in C (and X/MOTIF GUI: Xkwic). The Search Condition Generator provides support for lexicon-driven corpus querying and TFS encoding of lexical descriptions based on a model as a starting point. The tool generates corpus queries in the format of English Constraint Grammar (ENGCG: Helsinki). The project, still in progress, produced prototypes of a data entry facility: TFS-mode for emacs; hierarchy viewer, TFS feature structure viewer (XmFed) and a more widely usable Search Condition Generator (e.g. for BNC). Several hundreds of sentences have been encoded in detail for each language (20+ types of semantic, syntactic and morphosyntactic annotations). 
A TFS dictionary has been produced with entries for perception verbs of EN, FR, IT, DK, NL, related to the corpus sentences. In addition reports on the methodology with detailed examples are available. Contact: Ulrich Heid (Project Manager) - Universität Stuttgart, IMS-CL - Azenbergstr. 12 - 70174 Stuttgart - Germany -- Tel: +49-711-121-1373 - Fax: +49-711-121-1366 - Email: heid@ims.uni-stuttgart.de.

DLSI-UPC/CLiC-UB Tools Demo: http://nipadio.lsi.upc.es/cgi-bin/demo/demo.pl

This Demonstration page of Morphosyntactic analysis, tagging and parsing of unrestricted text allows you to freely submit some sentences in Spanish, Catalan or English to the full suite of tools developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). The components of the suite are MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). [2001 April 30; last checked 2001 October 28].

Emdros: http://emdros.org/

Emdros (Engine for MdF Database Retrieval, Organization, and Storage) by Ulrik Petersen (cf. his homepage) is a theory-neutral Open-Source text database engine for storage and retrieval of analyzed or annotated text. It uses the MQL query language for asking relevant questions of the data. MQL is a descendant of QL, an extremely powerful query language that Crist-Jan Doedens developed in his PhD thesis to go with the MdF model. Emdros implements the EMdF text database model; the primary advantage of this particular model of text over XML's data model is that object types (such as pages and chapters) need not be hierarchically structured or embedded, but may overlap. In addition, objects (such as a clause or a phrase) need not be contiguous, but may have gaps. Emdros can output its results in XML. The XML carries its own standalone DTD and validates with a validating parser.
Emdros has wide applicability in fields that deal with analyzed or annotated text. Application domains include linguistics, publishing, text processing, and any other fields that deal with annotated text. Emdros is good both for corpus linguistics (large amounts of text) and for field linguistics (smaller amounts of data). MQL supports both simple and complex queries on the data. Queries on syntactic analyses are particularly well supported, but all other linguistic levels are supported as well. Queries for constructions like subject inversion, embedded relative clauses, elliptic clauses, PPs with postpositions, and DP phrases with pre-head complements can all be easily and intuitively formulated, provided the underlying data has the required categories.
Emdros is licensed under the GNU GPL, and the documentation as well as the Linux and Win32 binaries can be freely downloaded from this page. [2005 January 5].
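The two properties that set this text model apart from a strict XML hierarchy (objects that overlap without nesting, and objects with gaps) can be illustrated with a small sketch. It is not Emdros code and not MQL: the class name `TextObject`, its methods, and the idea of numbering the text units with integers are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextObject:
    """Hypothetical annotated object: a type name plus the set of
    integer text positions (e.g. word numbers) that it covers."""
    object_type: str
    positions: frozenset

    def has_gaps(self):
        # An object is discontiguous when its positions do not form
        # one unbroken run from the smallest to the largest.
        if not self.positions:
            return False
        span = max(self.positions) - min(self.positions) + 1
        return len(self.positions) != span

    def overlaps(self, other):
        # True overlap: the two objects share positions, but neither
        # is wholly contained in the other (so they cannot be nested).
        shared = self.positions & other.positions
        nested = (self.positions <= other.positions
                  or other.positions <= self.positions)
        return bool(shared) and not nested


# A page and a chapter that cross each other, and a clause with a gap.
page = TextObject("page", frozenset(range(1, 11)))
chapter = TextObject("chapter", frozenset(range(8, 21)))
clause = TextObject("clause", frozenset({3, 4, 7, 8}))
```

With these sample objects, `page.overlaps(chapter)` holds because the two share positions 8 to 10 without either containing the other, which a single XML element tree cannot express directly.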

EngCG Parser: http://www.lingsoft.fi/doc/engcg/

EngCG, the Constraint Grammar Parser of English by Pasi Tapanainen (1993), performs morphosyntactic analysis (tagging) of English text. There is an online demo at the Lingsoft site. It is commercial software sold by Lingsoft: for availability you have to ask info@lingsoft.fi.

EngCG-2 Tagger: http://www.ling.helsinki.fi/~avoutila/cg/index.html

EngCG-2, by Pasi Tapanainen and Atro Voutilainen, is a program that assigns morphological and part-of-speech tags to words in English text at a speed of about 3,000 words per second on a Pentium running Linux. It is an improved version of the original EngCG tagger, which is based on the Constraint Grammar framework advocated by a team of computational linguists in Helsinki, Finland. The documentation (and several articles) is available online.
EngCG-2 tagger with the English Constraint Grammar can be licensed from Conexor. There is also a free demo available. Contacts: Tapanainen and Voutilainen.

Englex: http://www.sil.org/computing/catalog/englex.html

Englex is an English parsing description for PC-Kimmo by SIL Computing. Englex is a morphological parsing lexicon of English. It uses the standard orthography for English. It is intended for use with PC-Kimmo (or programs that use the PC-Kimmo parser, such as K-Text and K-Tagger). With such software and Englex, you can morphologically parse English words and text. Practical applications include morphologically preprocessing text for a syntactic parser and producing morphologically tagged text. Englex works under Win 3.1-98 + NT, DOS, MAC and Unix. Freely available under agreement to the SIL standard free license.

Ergo Linguistic Technologies: http://www.ergo-ling.com/

Ergo Linguistic Technologies (2800 Woodlawn Drive, Suite 175 - Honolulu, HI 96822 - tel:808+539-3920 - fax:808+539-3924) has developed NLP software that can analyze English sentences quickly and thoroughly enough to greatly improve grammar checkers, translation software and foreign language tutoring software. Most important, according to the company, this technology makes it possible to develop language interactivity that allows interactive dialogs with game characters, software applications and even household appliances. There are demos available which include grammar analysis, PennTreeBank style bracketings, first order predicate calculus, inferencing, and Q&A. Of particular interest are the Parser Demo online and the Parsing Contest pages.

Euralex Computing Tools: http://www.ims.uni-stuttgart.de/euralex/tools/

These Euralex pages, hosted by the IMS Stuttgart, offer a fair number of links and information on concordancing and lexical representation tools.

EuroWordNet: http://www.hum.uva.nl/~ewn/

The commercial version of Wordnet, for various European languages (Commercial software).

EUSLEM Basque Lemmatizer/Tagger: http://ixa.si.ehu.es/ingeles/dokument/EUSLEM.html

The EUSLEM automatic lemmatizer/tagger, developed by the Lengoaia Naturalaren Prozesamendurako IXA Taldea, is a basic tool for applications such as automatic indexing, document databases, syntactic and semantic analysis, analysis of text corpora, etc. Its job is to give the correct lemma of a text-word, as well as its grammatical category. The lemmatizer/tagger is proving of great help for the second phase of the EEBS project (Systematic Compilation of Modern Basque). A tagset system has also been developed for Basque: it is a three-level system which the user can parametrise when using the programme. The first level includes seventeen general categories (noun, adjective, verb, etc.). In the second, each category tag is further refined by subcategory tags. The last level includes other interesting morphological information (case, number, etc.). Information on availability is lacking. [2001 April 30].

EWN: http://www.lsi.upc.es/~nlp/descr-eines.html

The EWN top-ontology semantic analyzer accepts as input morphologically analyzed text (the output of MACO+) and adds to each lemma the nodes in the EWN top-ontology that subsume it. EWN is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ EWN can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.

Free Text

FreeText is a free MAC concordance program. It should be available from this site (but last time I tried I didn't succeed in logging in ...).

GATE (General Architecture for Text Engineering): http://www.dcs.shef.ac.uk/research/groups/nlp/gate/

GATE, developed over the last three years at the University of Sheffield, is a domain-specific software architecture and development environment that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems. To contact the GATE project, mail gate@dcs.shef.ac.uk. To access the GATE ftp (from this page) you have to register at the following link.

GATTO 2.3 (Gestione degli Archivi Testuali del Tesoro delle Origini): http://www.csovi.fi.cnr.it/pagweb12.htm

GATTO is a lexicographic tool made by Domenico Iorio-Fili, with the collaboration of Francesco Leoncino, at the OVI. GATTO was created in order to maintain, lemmatize and query the text corpora database of the Vocabolario Storico della Lingua Italiana, now in progress at the OVI. GATTO works under Windows 3.1x, Windows 95, Windows 98 or Windows NT. It is freely available (postal delivery charge only) for scholars: the GATTO CD-ROM contains the program itself, the manual in Word 6.0 format and a small sample corpus. Contact: Domenico Iorio-Fili.

Genesys: http://cirrus.dai.ed.ac.uk:8000/systemics/Software.html#communal

The Communal group (Robin Fawcett, Gordon Tucker, etc.) have developed a sentence generation system called Genesys, an integrated environment for developing Systemic grammars (cf. Systemic-Functional Linguistics). It doesn't generate from semantic input, but rather requires the user to traverse the system network, choosing a feature at each point. Large semantic-oriented network. For more information e-mail Robin Fawcett or Gordon Tucker.

Harald Klein's Text Analysis Sources: http://www.intext.de/TEXTANAE.HTM

A rich and detailed link page of text analysis (and some NLP as well) software. Software availability, distribution and authors' home pages are clearly stated. There are also some more general links.

HUM Concordance and Text Analysis software: http://www.ltg.ed.ac.uk/helpdesk/faq/Tools-html/0055.html

A package of programs for literary and linguistic computing is available, emphasizing the preparation of concordances and supporting documents. Both keyword in context and keyword and line generators are provided, as well as exclusion routines, a reverse concordance module, formatting programs, a dictionary maker, and lemmatization facilities. There are also word, character, and digraph frequency counting programs, word length tabulation routines, a cross reference generator, and other related utilities. The programs are written in the C programming language, and implemented on several Version 7 Unix systems at Berkeley. HUM, developed by William Tuthill, is freely available as the hum.tar.Z file by anonymous ftp; there is also a downloadable DOC manual. Contact: William Tuthill - Comparative Literature Department - University of California - Berkeley, California 94720.

ICETree 2: http://www.ucl.ac.uk/english-usage/ice-gb/icetree/download.htm

ICETree 2 is a dedicated software package written at the Survey of English Usage for developing ICE corpora. ICETree allows researchers to build and manipulate syntactic trees. Using ICETree, you can view, build or manipulate syntactic trees for sentences, phrases or other groups of words. ICETree has been used to build the parse trees for the ICE-GB Corpus. ICETree will be of use to researchers building language corpora. The software was written for the ICE corpora but, with some changes to data files that accompany the program, it can be used on other corpora. ICETree can be used to build trees from scratch but is best used to edit existing data, such as the output from an automatic parser.
The trial version is a full, working copy of the program but it is limited to 10 minute sessions. After each 10 minute session, the program will close itself. The trial version comes with a test corpus - a small collection of trees for you to practise on. Download from http://www.ucl.ac.uk/english-usage/ice-gb/icetree/form.htm
The full version of ICETree is available at a cost of 99 UK pounds from the Survey of English Usage. Please email the Survey to arrange your order.

ICECUP (the ICE Utility Program): http://www.ucl.ac.uk/english-usage/ice-gb/icecup.htm

ICECUP is a state-of-the-art corpus exploration program designed for parsed corpora such as ICE-GB. ICECUP is supplied with ICE-GB and is available now with the ICE-GB Sample Corpus for free download. ICECUP allows you to perform a variety of different queries, including using the parse analysis in the corpus to construct Fuzzy Tree Fragments to search the corpus.

Index of /~archive/linguistics/software/mac: http://www.umich.edu/~archive/linguistics/software/mac/

A useful free download page for freeware and shareware MAC NLP tools. There are Conc, K-Text, PC-Kimmo etc.

Index of Michigan Archives: gopher://nic.merit.edu:7055/11/linguistics/software

A few gopher free download pages for freeware and shareware MAC and DOS NLP tools.

InfoBlast: http://www.1source.com/~pollarda/textview/

InfoBlast is a text indexing tool that runs under Windows. InfoBlast lets you search all those text files and ebooks on your hard drive blindingly fast! No more scrolling top to bottom and then back up again. Search for it and you are there virtually instantly -- even if your text file is several hundred megabytes in size or more! InfoBlast takes text files and indexes them by building an index of where all of the words within the text file are located, much like the index of a book. This allows you to view and search the text in the same way that search engines search the internet.
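The "index of a book" idea can be sketched as a small inverted index that maps each word to the places where it occurs. This is only an illustration of the general technique, not InfoBlast's actual implementation: the function names `build_index` and `search`, the use of character offsets, and the case-insensitive matching are all assumptions.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lower-cased word to the character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        index[match.group().lower()].append(match.start())
    return index


def search(index, word):
    """Return the offsets of a word without rescanning the text."""
    return index.get(word.lower(), [])


sample = "The cat saw the other cat."
sample_index = build_index(sample)
```

Once the index is built, `search(sample_index, "cat")` is a single dictionary lookup, which is why such a tool can stay fast however long the indexed file is.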

INTEX

It is a finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein.

ISIP tools: http://www.isip.msstate.edu/projects/speech/software/

The main aim is a publically available speech recognition system, but along the way there are also toolkits for discrete HMMs and statistical decision trees (freely downloadable).

JavaScript: cf. the file in the References & Standards section.

JMA - Japanese Morphological Analyzer: http://www.basistech.com/products/japanese-analyzer.html

The Japanese Morphological Analyzer (JMA) from Basis Technology is a portable segmentation engine for Japanese text combined with Japanese dictionaries. It can index and search large collections of Japanese documents (also text fields in databases), generate word lists and verify consistency between kanji and yomi forms. Input texts must be in Unicode UCS-2 format. The price is not stated, and to request an evaluation version you have to send them an e-mail. There is however also a demo online. [2002 February 17].

KKC for DOS: http://www2d.biglobe.ne.jp/~htakashi/software/KKC_E.HTM

You have to jump through some frames to get to the Corpus Wizard page, where there are free download links but little information. The software is by Hamagushi Takahashi and is stored in the Takahashi Software Plaza.
KKC is a text search utility that produces KWIC output. A version for OS/2 is also available. Both versions are also available for Japanese as shareware. The European version instead is freeware.

KMA - Korean Morphological Analyzer: http://www.basistech.com/products/Korean-analyzer.html

The Korean Morphological Analyzer (KMA) from Basis Technology is a portable linguistic segmentation engine for Korean text. The Korean language presents challenges for morphological analysis, and recognition of word boundaries is often difficult. KMA analyzes and extracts keywords from Korean text based on linguistic characteristics and an optimized dictionary of essential modern Korean words. KMA performs morphological analysis on Korean words (Eojeol), including: segregation of morphemes according to POS, grammar or relational function of each morpheme; examination of likelihood of combination between morphemes; stemming (reducing to root form) of irregular verbs/adverbs/adjectives; presumption of compound nouns; recognition of unknown/unregistered words; support for a user-defined dictionary; support for multiple reference dictionaries; decomposition of compound nouns; generation of a list of words from Korean texts; identification of the root form and part-of-speech (POS) information for each morpheme that constitutes Eojeol; and recognition of patterns for morphological structures of Eojeol. Price is not stated, and to request an evaluation version you have to send them an e-mail. There is however also a demo online. [2002 February 17].

Korean morphological/lexical analyzer: http://nlp.kookmin.ac.kr/HAM/eng/download-e.html

The Korean morphological/lexical analyzer by Seung-Shik Kang (see his homepage in English or in Korean) from Kookmin University is a part of the Hangul Analysis Module (HAM). It has now reached Version 5.0.0a and it works on Win 98, 2000 and NT, Linux or Solaris. Freely downloadable directly from the site. An online demo (Korean page only!) is also available. [2002 February 17].

Korea University NLP Lab: http://nlp.korea.ac.kr/index_e.html

(Korean also)
This is the page of the Dept. of Computer Science and Engineering, Korea University (1, 5-ka, Anam-dong, SEOUL, 136-701, KOREA), established in 1991. They have developed various working systems for natural language processing (morphological analysis, part-of-speech tagging, word sense disambiguation, etc.) and for language-dependent applications (information retrieval, spelling correction, linguistic knowledge acquisition, etc.). Recently, their research interests have concentrated on syntactic analysis and multilingual applications like multilingual information retrieval and machine translation. They also have some online demos of their work, such as a Korean morphological analyser, a Korean POS tagger, a Korean-English cross-language information retrieval system, etc. [2001 April 26].

KPML (Komet-Penman Multi Lingual): http://cirrus.dai.ed.ac.uk:8000/systemics/Software.html#penman

A sentence-generation system developed at Information Sciences Institute (ISI) in Los Angeles. Principal architects: Bill Mann and Christian Matthiessen. Development has almost stopped at ISI, but is continuing in several places, most notably in John Bateman's current version called KPML (Komet-Penman Multi-Lingual). KPML is widely used, for instance in FAW (ULM), ITRI (Brighton), University of Waterloo, etc. KPML offers sentence generation from a semantic input (SPLs: cf. Systemic-Functional Linguistics). Graphing of system networks, systemic structures, etc. Can handle multiple grammars simultaneously. Platform: Sun/Unix for KPML. The pre-multi-lingual version is also available for Macs, TI and Symbolics.
+ Download it freely at this address.

K-Tagger: http://www.sil.org/computing/catalog/ktagger.html

K-Tagger is a POS tagger by SIL Computing based on PC-Kimmo. It works under Win 3.1-98 + NT, DOS, MAC and Unix, using the shell of the PC-Kimmo parser. Freely available under agreement to the SIL standard free license.

K-Text: http://www.sil.org/computing/catalog/ktext.html

K-Text is a text analyzer by SIL Computing based on PC-Kimmo. K-Text reads a text from a disk file, parses each word using the PC-Kimmo parser, and writes the results to a new disk file. This new file is in the form of a structured text file where each word of the original text is represented as a database record composed of several fields. Each word record contains a field for the original word, a field for the underlying or lexical form of the word, and a field for the gloss string. K-Text works under Win 3.1-98 + NT, DOS, MAC and Unix. Freely available under agreement to the SIL standard free license.

KWIC page: http://www.cs.cmu.edu/~Compose/html/ModProb/KWIC.html

Mary Shaw's page provides a good introduction to the KWIC (KeyWord In Context) algorithm, which provides the grounds for many concordance tools (e.g. Xkwic, Xconcord, etc.). A concise definition of the Keyword in Context problem is provided from [Parnas72]: "The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be "circularly shifted" by repeatedly removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order". Contextual indices have been used for many years. For example, Biblical concordances have approximately this form, except for the rotations. The usual source for the problem as now known, however, is the Parnas definition. In his paper of 1972, Parnas used the problem to contrast different criteria for decomposing a system into modules [Parnas72]. He describes two solutions, one based on functional decomposition with shared access to data representations, and a second based on a decomposition that hides design decisions. The latter was used to promote information hiding, a principle that underpins the use of abstract data types and of object-oriented design. Since its introduction, the problem has become well-known and is widely used as a teaching device in software engineering. Garlan, Kaiser, and Notkin also use the problem to illustrate modularization schemes based on data-driven tool invocation [Garlan92], sometimes referred to as reactive integration. While KWIC can be implemented as a relatively small system, it is not simply of pedagogical interest. Practical instances of it are widely used by computer scientists. For example, the "permuted" [sic] index for the Unix Man pages is essentially such a system.
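The Parnas definition quoted above translates almost word for word into code. The following is a minimal sketch, not one of the two decompositions Parnas discusses: the function names `circular_shifts` and `kwic_index` are hypothetical, words are assumed to be separated by whitespace, and "alphabetical order" is taken here to mean case-insensitive sorting.

```python
def circular_shifts(line):
    """Return every rotation of a line obtained by repeatedly moving
    the first word to the end (the line itself is the first rotation)."""
    words = line.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]


def kwic_index(lines):
    """Collect all circular shifts of all lines, in alphabetical order."""
    shifts = []
    for line in lines:
        shifts.extend(circular_shifts(line))
    # Assumption: compare case-insensitively so "The" sorts with "the".
    return sorted(shifts, key=str.lower)


index = kwic_index(["the cat sat", "a dog"])
```

Each entry of the result begins with a different keyword followed by its context, so looking a word up amounts to finding the entries that start with it.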

KWICFinder: http://miniappolis.com/KWiCFinder/KWiCFinderHome.html

KWICFinder is a brand new, revolutionary concordancer by William H. Fletcher. KWICFinder rides on the back of a standard search engine, enabling the whole WWW to be used as a text corpus. Pre-release version 5 (February 2002) of KWiCFinder is now freely available for download. It requires Windows 95/98/ME & Internet Explorer 5.0 or greater. [2002 February 17].

LDC Linguistic Annotation page: http://morph.ldc.upenn.edu/annotation/

A well annotated link page by Steven Bird on tools and formats for creating and managing linguistic annotations of Corpora. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases. There is also an Italian version by Piero Cosi.

LEXA Corpus Processing Software 7.0: http://nora.hd.uib.no/lexainf.html

LEXA is a set of programmes which puts at the disposal of the interested linguist the tools he or she would require in order to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups which perform typical functions. Of these the first, lexical analysis, will be of immediate concern. The main programme, Lexa, allows one to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify what (possible) words are to be assigned to what lemmas. The rest is taken care of by the programme. Lexa proper is the main programme of the current set. Its main built-in procedures allow one to automatically lemmatise any input ASCII texts, to create frequency lists of the types and tokens occurring in any loaded text, to generate lexical density tables, and to transfer textual data in a user-defined manner to a database environment. The results of all operations are stored as files and can be examined later. Each item of information used by Lexa when manipulating texts is specifiable by means of a setup file which is loaded after calling Lexa and used to initialize the programme in the manner desired by the user. For a description of the many other additional components cf. the page http://nora.hd.uib.no/lexainf.html. Lexa is not free, and availability is not documented: you have to contact Raymond Hickey, Universität GH Essen, FB 3 Literatur- und Sprachwissenschaften, FB 3 Anglistik / Linguistik, D - 45177 ESSEN, Germany. Tel. +49 201 183 3441. Fax. +49 201 183 3437. E-mail: lan300@vm.hrz.uni-essen.de.
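The core idea of lemmatising from a user-supplied word-to-lemma assignment and then counting frequencies can be sketched in a few lines. This is not Lexa itself and does not reproduce its setup-file format: the names `lemmatise` and `frequency_lists`, the dictionary used as the lemma table, and the simple lower-cased ASCII tokenisation are assumptions made for illustration.

```python
import re
from collections import Counter


def lemmatise(tokens, lemma_table):
    """Replace each token by its lemma; unlisted words stay unchanged."""
    return [lemma_table.get(token, token) for token in tokens]


def frequency_lists(text, lemma_table):
    """Return (word-form frequencies, lemma frequencies) for an ASCII text."""
    # Assumption: a token is a run of ASCII letters, compared in lower case.
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = lemmatise(tokens, lemma_table)
    return Counter(tokens), Counter(lemmas)


# Hypothetical user-specified assignment of word forms to lemmas.
table = {"goes": "go", "went": "go"}
form_freq, lemma_freq = frequency_lists("He goes. She went. They go.", table)
```

Here the three forms "goes", "went" and "go" each occur once as word forms but are counted together under the single lemma "go".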

Lingsoft: http://www.lingsoft.fi/

Lingsoft, Inc. is a linguistic software company based in Helsinki, Finland. They have (cf. the catalogue at http://www.lingsoft.fi/en/products.html) proofing tools (for Danish, Finnish, Swedish and Norwegian), a Swedish grammar checker, a Finnish CD-ROM dictionary, indexing and retrieval tools, the EngCG Parser, the English NPtool, and the Swedish Constraint Grammar system SweCG POS Disambiguator.
There are no prices on their site: you have to ask info@lingsoft.fi (it is commercial software!), but there are some free online demos.

Linguist List Software page: http://www.emich.edu/~linguist/software.html (European mirror site)

The Software page of the Linguist List (Eastern Michigan University - Wayne State University) has an imposing quantity of links to linguistic software on the web, ranging from text analysers to spelling utilities, teaching tools, electronic dictionaries and, yes, also NLP applications.

Linguist's Shoebox: http://www.sil.org/computing/shoebox/

The Linguist's Shoebox is an integrated data management and analysis tool for the field linguist by SIL Computing. The Shoebox is a computer program that helps field linguists and anthropologists integrate various kinds of text data: lexical, cultural, grammatical, etc. It has flexible options for selecting, sorting, and displaying data. It is especially useful for helping researchers build a dictionary as they use it to analyze and interlinearize text. The name Shoebox recalls the use of shoe boxes to hold note cards on which the definitions of words were written in the days before researchers could use computers in the field. The Shoebox works under Win 3.1-98 + NT, DOS, MAC. The software and the manual (PDF format) are both freely available: you can download them directly from the home page, or get them by ftp or as a CD-ROM by snail mail.

Link Grammar Parser: http://www.link.cs.cmu.edu/link/ (or this link).

The Link Grammar Parser, made by Davy Temperley, Daniel Sleator and John Lafferty, is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser has a dictionary of about 60000 word forms. It has coverage of a wide variety of syntactic constructions, including many rare and idiomatic ones. The parser is robust; it is able to skip over portions of the sentence that it cannot understand, and assign some structure to the rest of the sentence. It is able to handle unknown vocabulary, and make intelligent guesses from context about the syntactic categories of unknown words. It has knowledge of capitalization, numerical expressions, and a variety of punctuation symbols. In a test of 100 financial news sentences, version 3.0 of the parser identified 82% of constituents correctly, and it had an average speed of 2.7 seconds per sentence on a 266 MHz Pentium II. They have made the entire system freely available for download from their ftp. The system is written in generic C code, and runs on any platform with a C compiler. There is an application program interface (API) to make it easy to incorporate the parser into other applications.
+ Version 3.0 of the parser was released in April 1998, and was available by ftp.
+ Version 4.1 was released in August 2000. Among the new features of version 4.0 there is a system which derives a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.); version 4.1 is essentially the same as version 4.0 but fixes a few bugs.
+ Win Ver 4.1, a Windows version of version 4.1 of the parser, was released in April 2001.
+ There is also documentation of the parser, papers related to Link Grammar, and a small demo online.

Lisp (Martin Cracauer's Lisp Page): http://www.cons.org/cracauer/lisp.html

Lisp is a highly adaptable programming language, often used in NLP. Common Lisp is the Lisp dialect that is standardized with commercial users in mind, and CMUCL is an implementation of Common Lisp that focuses on its superb compiler. The CMUCL system produces overhead-free code (that means, it is as fast as or even faster than C) for a large number of computation constructs. Many implementations of advanced programming languages produce overhead-free code for their own preferred operations, like list processing and searching, but CMUCL is capable of doing the same for constructs that are usually not the focus of advanced languages, like computations on floating point arrays and intensive integer data processing (encryption, data compression, parsing of text and network protocols). Compared to other good Common Lisp compilers, the CMUCL compiler needs fewer declarations to reach the same speed, and it offers verbose messages to help the programmer in formulating the required declarations. The CMUCL design also features fast I/O and an interface to C libraries that requires no glue code, speeding up both using and implementing C libraries.
+ cf. also the Cons Org home page, which also has some useful links and tutorials.
+ the Common Lisp Hypermedia Server (CL-HTTP) is an essential reference point for Lisp resources. Release 70.33 is now available on the FTP site for most platforms. This includes significant performance improvements, particularly on the Lisp Machine and Macintosh platforms, many new features, and numerous bug fixes. Major components include a mature HTTP 1.1 server, a programmatic client, a constraint-guided Web Walker, a proxy server and full-text indexation & retrieval. CL-HTTP has a mature port to Microsoft Windows. Completely free systems are available running under FreeBSD UNIX & Linux on x86 hardware.

LTG Software: http://www.ltg.ed.ac.uk/software/index.html

The Edinburgh Language Technology Group (LTG) makes available various software packages. For research purposes, these are often available for free to academic research groups and for a small fee to industrial R&D groups (contact). Besides the MATE Workbench, these are the tools offered [2001 April 26]:
+ LT TTT, a text tokenization system and toolset which enables users to produce a swift and individually-tailored tokenisation of text.
+ LT XML, an integrated set of XML tools and a developers tool-kit, including a C-based API.
+ RXP, a validating XML parser written in C, available under the GNU Public Licence.
+ LT NSL, a library of normalised SGML tools and a developers tool-kit, including a C-based API.
+ LT POS, an HMM POS tagger; there is also an interactive demo.
+ LT Pleuk, a grammar development shell.
+ LT Chunk, a surface parser which identifies noun groups and verb groups.
+ LT Thistle, a parameterizable display engine and editor for linguistic diagrams.
+ LT Biblio, software for processing bibliographical references.
+ LT TCR, text categorization and text retrieval software.
+ XML Perl, a rule-based XML transformation language which uses Perl in the bodies of rules. This requires the LT XML parser and a Perl interpreter.

MACO+ (Morphological Analyzer Corpus-Oriented): http://www.lsi.upc.es/~nlp/descr-eines.html

The MACO+ Morphological Analyzer Corpus-Oriented accepts unrestricted text as input. The tool tokenizes the text and produces as output all the morphological interpretations possible for each token. It is able to recognize and deal with numbers, proper nouns, punctuation, dates, abbreviations, multiwords, etc. Spanish, Catalan and English versions are available. MACO+ is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ MACO+ can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A MACO+ only online tagging service is freely provided by UNED.
+ A MACO+ & Relax online tagging service is also freely provided by UNED.

Malaga System for Automatic Language Analysis: http://www.linguistik.uni-erlangen.de/Malaga.en.html

Malaga is a software package for linguistic applications within the framework of Left-Associative Grammar (LAG). It contains a programming language for the modelling of morphology and syntax grammars. Malaga was developed with the GNU-C-Compiler on HP 9000 Series 700 workstations, but it should be easy to port to any other Unix system with an ANSI-compliant C compiler. There are a number of installations on Intel 80x86 PCs running Linux. Malaga may be used and redistributed under the terms of the GNU Public License.
Malaga 4.3 sources and binaries for HP 9000/700 with HP-UX and PC with Linux can be freely downloaded directly from the site. The package includes a German toy syntax as well as a simple morphological parser for English number words and some grammars for formal languages. Some demos are available online, as well as the documentation in PS/HTML/DVI formats.

MATE Workbench: http://www.ltg.ed.ac.uk/software/mate

The MATE Workbench is a parametrisable XML editor made by the Edinburgh LTG, based around using stylesheets to specify the appearance and behaviour of the editor. The MATE Workbench is a program designed to aid in the display, editing and querying of annotated speech corpora. It can also be used for editing/displaying arbitrary sets of hyperlinked XML encoded files. The program was developed as part of the European Union funded research project MATE. The last release, viz. Ver. 0.17 (8 Aug. 2000), can be freely downloaded from the LTG site. Notice that the MATE Workbench can only be run with Java 1.2.x. Changes in Java 1.3 mean that the sound facilities of the workbench will not work, and the display may behave oddly. [2001 April 26].

MBT (Memory-based POS Tagger): http://ilk.kub.nl/~zavrel/tagtest.html

MBT is a POS tagger made by Jakub Zavrel and Walter Daelemans (from the ILK group, Catholic University Brabant). It has been trained for Dutch, English, Spanish, Swedish, Slovene and German. On the site there is a free online demo (working for all languages) and some downloadable publications about MBT. For more information (price, availability and so on) contact Jakub Zavrel (homepage) and Walter Daelemans (homepage).

Mem: http://wwwhome.cs.utwente.nl/~terdoest/mem/

A Perl implementation of Generalized and Improved Iterative Scaling by Hugo WL ter Doest. (Freely downloadable).

Micro Concord: http://www.liv.ac.uk/~ms2928/

Micro Concord, made by Mike Scott, the author of Wordsmith, was published in 1993 by OUP. It is a concordancer, operating on IBM PCs running DOS. DOS is faster than Windows, but the number of concordance lines is limited to around 1,500, and you can't save a concordance except as a text file. It is very useful for a quick analysis, and may be easier for students to use than Wordsmith. Freely downloadable from this site.

MonoConc: http://www.athel.com/mono.html#monopro

MonoConc by Athelstan is a concordance program for researchers, language teachers and language students (and anyone who works with texts). It is easy-to-use Windows software, available at a reasonable price from Athelstan.
+ MonoConc Pro Version 2.0 released March 1, 2000. New Features: Advanced Search: Full Regular Expression search; Tag Search; Meta-tag Definitions; Save Workspace. This professional concordance program is used in commercial and educational settings (the site licence version of the program is installed in several computer labs in universities in the U.S. and abroad). MonoConc Pro is stable and operates well under a variety of configurations. It has the intuitive interface of the simpler concordance program MonoConc 1.5 (see below), yet it offers a variety of options that make it capable of complex and extensive text searches. Available for an educational price of $85 for a single user licence (site licence for 15 computers is $550). For further information and any questions about licensing, send email to info@athel.com. There is also a freely downloadable demo to try the program.
+ MonoConc 1.5. Athelstan's best-selling Windows concordance program over the three years since the original (1.0) version was produced in 1996. A good choice for concordance novices and occasional users. Originally $89, now $69 (Educ. price); the upgrade path to MonoConc Pro 2.0 costs $45. A freely available demo exists for this older version too.

MORFEUS Basque Morphological Analyzer: http://ixa.si.ehu.es/ingeles/dokument/MORFEUS.html

MORFEUS, developed by the Lengoaia Naturalaren Prozesamendurako IXA Taldea, performs a basic task in the automatic processing of Basque. It assigns to each token in a text its lemma as well as all its possible morphological analyses. The rest of the modules make use of that output to accomplish disambiguation and identify lexical units. The output is given in text format, but the developers are currently working to provide it in SGML format. Information on availability is lacking.

Morphy Morphologiesystem: http://www-psycho.uni-paderborn.de/lezius/

Morphy, by Wolfgang Lezius, presents a German morphology and tagger in one package. Morphy 1.1 runs under Win 95-NT. The morphology comprises 50,000 lemmas for 350,000 forms; the tagger works either with a small tagset (51 tags) or with a large tagset (456 tags), reaching respectively 96% and 85% of correctly tagged words.
Morphy 1.1 is freely downloadable.

MSLR parser (Morphological and Syntactic LR parser): http://tanaka-www.cs.titech.ac.jp/pub/mslr/index.html (Japanese also)

The Morphological and Syntactic LR parser is a tool for simultaneous analysis of syntax and morphological form. Although it has been developed to analyze Japanese sentences, you can also use it for other languages. Unfortunately only the parser is provided, so you must prepare the grammar and the dictionary yourself to analyze sentences. The MSLR parser, however, is free: you can download Ver 0.92 plain or with the manual (Japanese only). The basic package includes the MSLR parser and related software with the following characteristics: it performs simultaneous analysis of syntax and morphological form; it allows input of limitation specifications inside brackets to improve the results of analysis; it can handle the Probabilistic Generalized LR Model. The MSLR parser only works on Unix. The lexicon look-up module that the MSLR parser uses has been developed at the Matsumoto research laboratory of the Nara Advanced Institute of Science and Technology as SUFARY (high-speed string search system). Contact. [2001 April 29].

MtRecode (the Multext character translation program): http://www.lpl.univ-aix.fr/projects/multext/MtRecode/

MtRecode by Claude de Loupy (CNRS & Université de Provence, Laboratoire Parole et Langage), is a program for translation between various character sets, developed in the framework of the MULTEXT project. It has some of the functionality of the GNU recode tool, but it is based on different principles and is oriented towards SGML text manipulation. ISO 10646 is used internally as a pivot in the character translation process. When exact translation into a character is not possible, MtRecode can use SGML entities as a fallback. Conversely, MtRecode understands SGML entities in the input and can recode them into characters of the target character sets, if they exist. MtRecode is completely customizable: the user can add new character sets and/or entities by providing tables that map characters and entities to ISO 10646.
MtRecode used to be freely downloadable directly from this page, but "has been disrupted because of technical problems. We regret the inconvenience and hope that the procedure will be restored shortly" – at least this is what they say. Contact the Multext mailbox or Claude de Loupy.
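The pivot-and-fallback idea described above can be illustrated with a minimal Python sketch. It is not MtRecode's code or table format: Python strings (Unicode, which shares its character repertoire with ISO 10646) play the role of the pivot, the entity names from the standard html.entities module stand in for SGML entity sets, and the function names recode and decode_entities are hypothetical.

```python
import re
from html.entities import codepoint2name, name2codepoint


def recode(text: str, target: str) -> str:
    """Keep characters the target set can encode; fall back to entities otherwise."""
    out = []
    for ch in text:
        try:
            ch.encode(target)          # representable in the target character set
            out.append(ch)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Named entity if one is known, numeric character reference otherwise.
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)


_ENTITY = re.compile(r"&(#\d+|\w+);")


def decode_entities(text: str, target: str) -> str:
    """Replace entities by characters, but only if the target set has them."""
    def replace(match: re.Match) -> str:
        ref = match.group(1)
        codepoint = int(ref[1:]) if ref.startswith("#") else name2codepoint.get(ref)
        if codepoint is None:
            return match.group(0)      # unknown entity: leave untouched
        ch = chr(codepoint)
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            return match.group(0)      # target cannot represent it: keep entity
        return ch
    return _ENTITY.sub(replace, text)
```

For example, recoding "café œuvre" for Latin-1 keeps "é" and writes the ligature as an entity, since Latin-1 has no code for it.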

MULTEXT Tools: http://www.lpl.univ-aix.fr/projects/multext/MUL7.html

Multext is developing a series of tools for accessing and manipulating corpora, including corpora encoded in SGML, and for accomplishing a series of corpus annotation tasks, including token and sentence boundary recognition, morphosyntactic tagging, parallel text alignment, and prosody markup. Annotation results may also be generated in SGML format. All Multext tools will ultimately follow the software specifications and data architecture developed within the project. However, the tools are in various stages of development and, in their current state, conform to the Multext specifications to varying degrees. Upon completion, all tools will be publicly available for non-commercial, non-military use; at present, only some tools are publicly available and all of them exist in test versions only. Contact: Jean Veronis. Cf. also the MULTEXT file in the Multilingual Corpora section.

MultiConcord: http://web.bham.ac.uk/johnstf/lingua.htm

Multiconcord is a Windows-based Multilingual Parallel Concordancer for classroom use developed at the University of Birmingham under the Lingua project. What is distinctive about the work at Birmingham is that the alignment at sentence level is made 'on the fly' when a concordance is requested, and that while most other work in this area has sought to elaborate the methods proposed by Gale and Church in order to achieve greater accuracy, the Birmingham approach has been to simplify those methods. The other distinctive feature of the Lingua project, in fact, is that its primary focus is practical: our primary aim has not been to invent new methods of text alignment (though that has been an incidental spin-off), but to develop a working program and a methodology for teachers and students to exploit the program in language-learning. Users should be able to add their own pairs of texts to the corpus, using simple and easily-learned mark-up conventions based on SGML.
The program is available from CFL Software Development, price £40. Downloadable parallel texts for Multiconcord without restrictions on distribution are available without extra charge from the Parallel Texts Library. There is also a free downloadable demo, which has all the features of the full version, except that it will work only with the three short texts in English, French and German supplied. [2001 April 23].

mu-TBL: http://www.ling.gu.se/~lager/mutbl.html

The mu-TBL system by Torbjörn Lager represents an attempt to use the search and database capabilities of the Prolog programming language to implement a generalized form of transformation-based learning; it can be used for POS tagging and other tasks. The μ-TBL system is designed to be: General (the system supports four types of transformational operators, i.e. four types of rules, by means of which not only traditional "Brill-taggers" but also Constraint Grammar disambiguators can be trained), Easily extensible (through its support of a compositional rule/template formalism and "pluggable" algorithms, the system can easily be tailored to different learning tasks) and Efficient (a number of benchmarks have been run which show that the system is fairly efficient: an order of magnitude faster than Brill's contextual-rule learner). You may download papers and software, and there are example applications to experiment with. Freely downloadable; send mail to Torbjorn.Lager@ling.uu.se if you want to be notified of further developments of the software.

MXPOST (Maximum Entropy POS Tagger): http://www.cis.upenn.edu/~adwait/statnlp.html

It is the well-known Maximum Entropy POS Tagger by Adwait Ratnaparkhi (homepage) from the Penn Tools. There is a freely downloadable Java version, which also includes MXTerminator, a sentence boundary detector. On the site there are also many freely downloadable papers by the author dealing with MXPOST. [2001 April 27].

NAIST Natural Language Tools: http://cactus.aist-nara.ac.jp/staff/matsu/misc/nlt.html (Japanese also)

NAIST-NLT (Nara Institute of Science and Technology Natural Language Tools) provides a flexible natural language processing environment. The system consists of JUMAN (a morphological analyser for Japanese and English), SAX (a compiler of a DCG to a bottom-up Chart parser), VisIPS (a visual interface for showing the partial results of the parsing process), and supporting programs for implementing natural language grammars. Modularity and extensibility are important features of the tools, and users can customize them in various ways. Although they are meant to be used in an integrated way, each of the components can be used as a stand-alone system. Japanese manuals are obtainable as NAIST technical reports (English manuals will be provided shortly). All the tools (approximately 5M bytes in total) are free with no limitation and are downloadable from the ftp site in UNIX file format or EUC for Japanese (Platform Sun OS 4.1.3; implementation languages SICStus Prolog, gcc, X11R5). Contact person: Yuji Matsumoto. [2001 April 28].
+ JUMAN Japanese Morphological Analyzer. The system is implemented in GCC and produces a lattice-like structure of Japanese morphemes given a Japanese sentence. It works as a UNIX filter. Besides, an interface to SICStus Prolog is provided, so that the system can be invoked from Prolog and returns a lattice of morphemes back to the Prolog program. The attached dictionary contains about 120,000 entries. The most important feature of the system is that the basic definition of the Japanese morphological grammar system, such as the set of parts of speech, inflection rules, and connection rules of morphemes, can be changed by the user. Since a number of Japanese grammars have been proposed, this feature is indispensable.
+ JUMAN English Morphological Analyzer. The English morphological analyzer deals with inflection of English nouns, verbs and adjectives. Since the treatment of the information given by inflection differs between systems, the detailed information is assumed to be written in grammar rules by the user.
+ SAX Concurrent Chart Parser. The SAX parsing system transforms a Definite Clause Grammar (DCG) [Pereira 80] into a Prolog program that realizes a bottom-up Chart parser. The system is implemented by a collection of Prolog clauses directly derived from DCG rules. They are called SAX clauses. A set of SAX clauses with the same predicate name corresponds to a grammatical phrase and defines a concurrent process. Parsing is performed through data communication between those processes. The system is implemented in two levels: the first consists of the transformed grammar rules, and the second works as an interface with other programs as well as the interface to the user. A number of supporting programs are provided to make it easy for users to implement their own grammars in the system, e.g., interface programs with the morphological analyzers, a visual interface described in the next item, and a unification program for feature structures.
+ VisIPS Visual Interface for Parsing Systems. The VisIPS system is a visual interface to parsing systems that shows partial parse results of the parser in an intuitive way. The SAX parsing system works as a black box for the users in that the user-defined DCG is transformed into a bottom-up concurrent Chart parser. When running the system, looking directly into the Prolog code is quite complicated, since the original grammar rules are transformed in a nontrivial way. However, in the development phase of a grammar or a dictionary, it is indispensable to have some way to figure out the system's behaviour. It should be noted that users are usually not interested in the transformation details of the system. The system, therefore, should report only the behaviour that is related to the user-defined grammar and dictionary. The VisIPS system was originally developed to monitor the behaviour of the SAX processes. It is, however, applicable to any phrase structure based parsing system. It shows occurrences of phrases in a triangular table. Two versions of the system have been developed: one is a batch system where the information on phrase structures is written out into a file in a predetermined format and VisIPS shows the results after the parsing process terminates. The other is an interactive mode where a newly constructed phrase structure is immediately presented. Both versions are implemented in C and the X11R5 system. The current system uses the socket I/O facility of SICStus Prolog for data communication.

Ngram: http://www.jaist.ac.jp/~shigeru/ngram.html (Japanese also)

Ngram computes N-gram statistics for a text file. "N-gram statistics" in this context means counting the co-occurrence frequency of N adjacent words (or characters) in text data. N-gram statistics reveal the expressions that appear frequently in the text. An N-gram with N=1 is called a unigram, with N=2 a bigram, and with N=3 a trigram. Ngram computes N-gram statistics for any N (by default N is at most 2048, but you can change the limit when compiling Ngram). Among the advantages of Ngram: it computes N-gram statistics for both words and characters; it handles any integer N, with no limit on the size of the input text file; and it requires very little memory. There are further advantages specific to Japanese text. Ngram accepts input text encoded in either EUC-JP or ISO-2022-JP and detects the encoding automatically. If letters from the ASCII character set (so-called "HANKAKU") and letters from the JIS X 0208 character set (so-called "ZENKAKU") are intermixed in the input file, Ngram treats them as the same. Likewise, if certain punctuation marks that also exist in JIS X 0208 are intermixed in the input file, Ngram treats them as the same. English documentation for the Ngram source package is not available yet, but the current release, ngram-0.6.1.tar.gz, is freely downloadable. [2001 April 29].
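As a rough illustration of what such statistics look like, here is a minimal Python sketch that counts word or character N-grams. It is not the Ngram program: the function name ngram_counts is hypothetical, Unicode NFKC normalization is used as a stand-in for the HANKAKU/ZENKAKU folding described above (the tool's own folding rules are not documented here), and the whole table is kept in memory, so it makes no claim to the low memory use of the original.

```python
import unicodedata
from collections import Counter


def ngram_counts(text: str, n: int, unit: str = "word") -> Counter:
    """Count N-grams of adjacent words (unit="word") or characters (unit="char")."""
    # NFKC maps full-width letters and punctuation to their ASCII counterparts.
    text = unicodedata.normalize("NFKC", text)
    if unit == "word":
        units = text.split()
    else:
        units = [ch for ch in text if not ch.isspace()]
    # Slide a window of length n over the sequence of units.
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


if __name__ == "__main__":
    counts = ngram_counts("the cat saw the cat", 2)
    for gram, freq in counts.most_common(3):
        print(" ".join(gram), freq)
```

Calling most_common on the result lists the most frequent expressions first, which is the typical use of such a table.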

Naive Bayes algorithm: http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text (freely downloadable).

NLSR (Natural Language Software Registry): http://www.dfki.de/lt/registry/

The Natural Language Software Registry, an initiative of the ACL hosted by the DFKI (German Research Center for Artificial Intelligence) at Saarbrücken, offers a rich summary of the sources of language processing software available to researchers. It comprises academic, commercial and proprietary software with detailed specifications and the terms on which it can be acquired; but links to software homepages are often missing. The categories available are the following: Speech Signal Analysis, Morphological Analysis, Syntactic Analysis, Formalism, Semantic and Pragmatic Analysis, Generation, Knowledge Representation Systems, Multicomponent Systems, NLP Tools, Data Sets, Application and Text Processing.

NPtool: http://www.lingsoft.fi/doc/nptool/

NPtool, by Atro Voutilainen, is a fast and accurate system for extracting noun phrases from English texts. It is sold by Lingsoft: for availability (it is commercial software!) contact info@lingsoft.fi.

PAPPI: http://www.neci.nj.nec.com/homepages/sandiway/pappi/index.html

PAPPI is a Multilingual Parsing System for the Principles-and-Parameters Framework by Sandiway Fong (NEC Research Institute). It works on Sun Sparcstations with Quintus PROLOG. Sample implementations of GB are already supplied for English, French, Spanish, Japanese, Korean, Dutch, German and some other languages. Contact: sandiway@research.nj.nec.com.
+ PAPPI 3.1 release is freely downloadable as a tar gzipped file from the PAPPI 3.1 page. Screenshots supplied for English, French, Turkish, Hungarian, Bengali.
+ PAPPI 2.0 release is freely downloadable as tar gzipped file from the PAPPI 2.0 page. Screenshots supplied for English, French, Spanish, Japanese, Korean, Dutch, German, Hindi and Chinese.

ParaConc: http://www.ruf.rice.edu/~barlow/parac.html

Michael Barlow's ParaConc is a bilingual/multilingual concordance program (in different formats) designed to be used for contrastive corpus-based language research.
+ The original parallel concordancer (programmed in HyperTalk) runs on a Mac. It's free and can be downloaded as a binhexed file from this site. You are asked to send an email message to barlow@ruf.rice.edu when you do this. The program is for individual, research use only and cannot be loaded onto a network without purchasing a site licence agreement.
+ Some sample aligned text files can also be downloaded via ftp (English, Spanish).
+ An html manual is available online.
+ A commercial Windows version is announced for Summer 2000. It will be available on the Athelstan site.
+ A Windows beta version is said to be available, but the link when I checked doesn't seem to work.

Parser Demo online (Ergo): http://www.ergo-ling.com/ (follow the link to Parser Demo).

It is an online demo tagging service for English by Ergo Linguistic Technologies. It is free, but you have to give your name and e-mail. Sentences of the size and complexity of the Wall Street Journal or the New York Times may not work because this demo is using a limited dictionary (approximately 40,000 words).

PC-Kimmo: http://www.sil.org/pckimmo/

PC-Kimmo, a two-level processor for morphological analysis by SIL Computing, is a new implementation for microcomputers of a program dubbed Kimmo after its inventor Kimmo Koskenniemi (see Koskenniemi 1983). It is of interest to computational linguists, descriptive linguists, and those developing natural language processing systems. The program is designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form. The PC-Kimmo program is actually a shell program that serves as an interactive user interface to the primitive PC-Kimmo functions. These functions are available as a C-language source code library that can be included in a program written by the user. The two functional components of PC-Kimmo are the generator and the recognizer. The generator accepts as input a lexical form, applies the phonological rules, and returns the corresponding surface form. It does not use the lexicon. The recognizer accepts as input a surface form, applies the phonological rules, consults the lexicon, and returns the corresponding lexical form with its gloss. Up until now, implementations of Koskenniemi's two-level model have been available only on large computers housed at academic or industrial research centers. As an implementation of the two-level model, PC-Kimmo is important because it makes the two-level processor available to individuals using personal computers. Computational linguists can use PC-Kimmo to investigate for themselves the properties of the two-level processor. Theoretical linguists can explore the implications of two-level phonology, while descriptive linguists can use PC-Kimmo as a field tool for developing and testing their phonological and morphological descriptions. Teachers of courses on computational linguistics can use PC-Kimmo to demonstrate the two-level approach to morphological parsing.
Finally, because the source code for PC-Kimmo's generator and recognizer functions is made available, those developing natural language processing applications (such as a syntactic parser) can use PC-Kimmo as a morphological front end to their own programs.
PC-Kimmo is freely available by FTP. For information cf. this link.
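The division of labour between generator and recognizer can be pictured with a toy Python sketch. It is only an illustration under strong simplifying assumptions: the rules are ordered regular-expression rewrites rather than true two-level rules (which constrain lexical/surface pairs in parallel), the recognizer simply tries every lexicon entry and compares generated surface forms, and the names RULES, ROOTS, SUFFIXES, generate and recognize, as well as the tiny English lexicon, are hypothetical and unrelated to PC-Kimmo's file formats or C library.

```python
import re

# Toy spelling rules over lexical strings; "+" marks a morpheme boundary.
RULES = [
    (re.compile(r"(?<=[^aeiou])y\+s$"), "ies"),  # spy+s -> spies
    (re.compile(r"(?<=[sxz])\+s$"), "es"),       # fox+s -> foxes
    (re.compile(r"\+"), ""),                     # drop any remaining boundary
]

# Toy lexicon: roots and suffixes, each with a gloss.
ROOTS = {"spy": "spy(N)", "fox": "fox(N)", "cat": "cat(N)"}
SUFFIXES = {"": "", "+s": "+PL"}


def generate(lexical: str) -> str:
    """Lexical form -> surface form. Uses only the rules, not the lexicon."""
    surface = lexical
    for pattern, replacement in RULES:
        surface = pattern.sub(replacement, surface)
    return surface


def recognize(surface: str) -> list[tuple[str, str]]:
    """Surface form -> list of (lexical form, gloss), consulting the lexicon."""
    analyses = []
    for root, root_gloss in ROOTS.items():
        for suffix, suffix_gloss in SUFFIXES.items():
            lexical = root + suffix
            if generate(lexical) == surface:
                analyses.append((lexical, root_gloss + suffix_gloss))
    return analyses
```

With these toy tables, generate("spy+s") yields the surface form "spies", and recognize("spies") returns the lexical form "spy+s" together with its gloss.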

PC-PATR: http://www.sil.org/pcpatr/

PC-PATR is a unification-based syntactic parser by SIL Computing. It is an implementation of the PATR-II computational linguistic formalism for personal computers. PC-PATR is still under development; however, it is already available for MS-DOS, Microsoft Windows, Macintosh, and Unix. (The Microsoft Windows implementation uses the Microsoft C QuickWin function, and the Macintosh implementation uses the MPW C SIOW function.) PC-PATR version 0.99 beta 5 was released on October 21, 1997. The PATR-II formalism can be viewed as a computer language for encoding linguistic information. It does not presuppose any particular theory of syntax. It was originally developed by Stuart M. Shieber at Stanford University in the early 1980's. A PATR-II grammar consists of a set of rules and a lexicon. Each rule consists of a context-free phrase structure rule and a set of feature constraints, that is, unifications on the feature structures associated with the constituents of the phrase structure rules. The lexicon provides the items that can replace the terminal symbols of the phrase structure rules, that is, the words of the language together with their relevant features.
+ There is a manual available.
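To make the notion of a rule with feature constraints concrete, here is a minimal Python sketch of unification over feature structures. It is an illustration only, not PC-PATR's grammar syntax or implementation: feature structures are plain nested dictionaries without structure sharing (reentrancy), and the names unify, LEXICON and parse_s, together with the four-word lexicon, are hypothetical. The single rule encoded is S -> NP VP with the constraint that the agreement features of NP and VP unify.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts or atoms); None means failure."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:      # clash somewhere below this feature
                    return None
                result[feature] = merged
            else:
                result[feature] = b_value
        return result
    return a if a == b else None        # atoms unify only if identical


# Hypothetical lexicon: words with their category and agreement features.
LEXICON = {
    "she":    {"cat": "NP", "agr": {"num": "sg", "per": 3}},
    "they":   {"cat": "NP", "agr": {"num": "pl", "per": 3}},
    "sleeps": {"cat": "VP", "agr": {"num": "sg", "per": 3}},
    "sleep":  {"cat": "VP", "agr": {"num": "pl"}},
}


def parse_s(subject: str, verb: str):
    """Apply S -> NP VP with the constraint <NP agr> = <VP agr>."""
    np, vp = LEXICON[subject], LEXICON[verb]
    if np["cat"] != "NP" or vp["cat"] != "VP":
        return None
    agr = unify(np["agr"], vp["agr"])
    if agr is None:
        return None                     # agreement clash: rule does not apply
    return {"cat": "S", "agr": agr}
```

Here "they sleep" is accepted, with the underspecified person feature of the verb filled in from the subject, while "they sleeps" is rejected because the number features clash.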

PennTools: http://www.cis.upenn.edu/~adwait/penntools.html

The page collects information on a variety of NLP systems, tools and resources located at Penn (i.e. the University of Pennsylvania), many of which are available externally. Cf. for example Penn Treebank, Middle English Treebank, Old English Treebank, Chinese Treebank, Dan Melamed's Tools, Maximum Entropy Tagger, SED Tokenizer, XTag Project and Tools, etc. There are also some smaller tools, such as Graphical Parse Tree Viewers (e.g. Viewtree, a small Lisp script) and LaTeX macros for tree writing; cf. also the page on Korean NLP.

Perl: cf. the file in the References & Standards section.

PHP: cf. the file in the References & Standards section.

PiSystem: http://www.ilc.pi.cnr.it/pisystem/intro.htm

PiSystem is a complex but well integrated suite of tools by Eugenio Picchi and his "DBTficio", specially designed for linguistic and lexicographic analysis of Italian literary texts. The core module of this system is the well-known DBT, but many other components are already available or under development for more specific tasks. The DBT system is patented by CNR and is also adopted by the renowned LIZ collection.

PiTagger (cf. PiSystem): http://www.ilc.pi.cnr.it/pisystem/PiTagger.htm

PiTagger is a tagging and automatic lemmatization procedure made by Eugenio Picchi, and is a part of the PiSystem suite. It relies mainly on three components. (1) A morphology of the Italian language, which assigns all possible lemmata to each form. (2) A statistical approach, which, relying on a previously prepared training corpus, allows disambiguation of the multiple tags. (3) An interactive post-editor, by which it is possible to check interactively the results of the automatic tagging procedure and to manually correct the mistakes. The annotated texts (whether produced by the automatic procedure alone or by the interactive post-editor) are compatible with the DBT corpus manager.

Portuguese tagger (Projecto Natura) on the web: http://alfa.di.uminho.pt/~mesric/portag.html

(Not free; e-mail or Web trials available).

POSPAR: http://nlp.postech.ac.kr/DownLoad/k_api.html

PosPar is a Korean Syntactic Analyzer using Korean Combinatory Categorial Grammar formalism (including POSTAG) by PosTech Laboratory (cf. also the related paper file in Korean). The N-Best POSPAR99 beta-0.9 demo version (0.01) (binary) (Dec/01/1999) is freely downloadable for non commercial use. There is also an online demo. [2002 February 20].

POSTAG: http://nlp.postech.ac.kr/DownLoad/k_api.html

PosTag is a Korean Morphological Analyzer and POS tagger with generalized unknown morpheme handler by PosTech Laboratory (cf. also the README file in Korean). The AG99 beta-1 demo version (binary) (including 100,000 full vocabulary) (Dec/06/1999) is freely downloadable for non commercial use. There is also an online demo. [2002 February 20].

PosTech NLP Laboratory: http://nlp.postech.ac.kr/

This Korean NLP group (located at San 31, Hyoja-Dong, Pohang, 790-784, Korea) researches mainly Korean TTS using prosody and phonetic analysis & NLP for speech recognition, as well as Korean morphological analysis and POS tagging. Some of the resources and tools developed at PosTech are available as free or open source for research communities (non-commercial use), such as:
+ POSTAG Korean Corpus, a POS tagged corpus of about 100,000 morphemes;
+ POSTAG, a Morphological Analyzer / POS tagger with generalized unknown morpheme handler;
+ POSPAR, a Syntactic Analyzer using Korean Combinatory Categorial Grammar formalism;
+ POSNIR, a Korean Natural Language Information Retrieval System (Search Engine, compound noun Indexer, NL query processing, including POSTAG and POSPAR) - cf. also the README file in Korean and the online demo.
+ POSTTS, a Korean text to speech system which converts general Korean text sentences into their corresponding phoneme sequences (cf. the README file in Korean). [2002 February 20].

Prototype Java Summarisation applet (System Quirk): http://www.mcs.surrey.ac.uk/SystemQ/

(Freely downloadable).

Python: cf. the file in the References & Standards section.

QTag (Birmingham Part of Speech Tagger): http://www-clg.bham.ac.uk/QTAG/

Qtag is an implementation of a probabilistic tagger, based roughly on HMM technology, by the Corpus Research Group - University of Birmingham (Oliver Mason). It is written in Java, which means it should run on any piece of computing equipment for which a Java implementation is available. Qtag is language independent, i.e. it should work easily with languages other than English (for which it was originally designed). All you need to start is a (manually) tagged training corpus in order to generate the language specific resource files needed. If you want to use Qtag in your own applications, please refer to the API Documentation.
+ The latest version is freely available (cf. this address). It is written in pure Java and should run on all platforms.
+ The older version of the tagger implements a client-server model, where a central server dishes out information on tag probabilities of words to tagger processes which are running as clients, connecting to the server over a network. This setup is advantageous if you have several people working with the tagger, as the resources can be shared. If you want to use the tagger at home, download the new version. The older version is freely downloadable from this page, but please read the License first.
+ An E-mail tagging service is also freely available at Birmingham.
+ The tagset used is available on the Web. [Rev. 2001 November 27].

QuickTag & Parse: http://www.cogilex.com/products.htm

Quick Tag & Quick Parser are two English Language tools by Cogilex. Both runs under Win, and have an online free version y