(5) - Tools.
This "Tools" section focuses mainly on corpus-oriented NLP software (especially taggers, parsers, chunkers, and corpus query systems), but text analysers (concordancers, etc.) are also taken into account, as are other applications insofar as they are of interest for NLP and for corpus maintenance and querying. As a rule, automatic speech recognition systems, translation tools, e-dictionaries, and typing facilities for exotic languages have been left aside. The software resources on the Web are truly vast, and any selection is bound to be somewhat arbitrary; moreover, important links may have escaped me: please e-mail me any addition you wish to suggest! Note that link sites are listed here only if they are mainly concerned with tools; you can find more general reference pages in section 2.4 "References, Standards & Educational Resources".
@nnotate is a tool for the efficient semi-automatic annotation of corpus data. It facilitates the generation of context-free structures and additionally allows crossing edges. Functions for the manipulation of such structures are provided. Terminal nodes, non-terminal nodes, and edges are labelled. It was used for the NEGRA project. @nnotate runs under Solaris and Linux. It needs the GNU C-Compiler, Tcl/Tk, and an installation of mSql.
@nnotate will be freely available for research purposes: contact Thorsten Brants.
The AGFL (Affix Grammar over Finite Lattices) Grammar Work Lab has a collection of software systems for Natural Language Processing, based on the AGFL formalism. The AGFL formalism has been developed by the Department of Software Engineering, University of Nijmegen. It is a formalism in which context-free grammars can be described by means of two-level grammars: a context-free grammar is augmented with a feature level for expressing agreement between syntactic categories. The formalism is suited for specifying both morphological and syntactic analysis. Grammar rules can be extended with transduction parts, which specify the output. The default output consists of parse trees. Furthermore, mechanisms are provided to handle open classes of words, which enables the construction of robust parsers.
There is a manual, both as online HTML and in downloadable PostScript format. AGFL is distributed under the GNU General Public License; registration is made at this page. The latest stable public release of AGFL is version 1.9.0. You can obtain it from their FTP site by following the present link. Versions for Linux, Solaris, and Windows 95/98/NT are available.
A workbench for the development of tagged corpora, including a tagger based on Brill's TBL approach. Basically, Alembic Workbench is an SGML-based annotation system. Apart from the usual kinds of textual annotations, the workbench enables various kinds of specialized annotations, including co-reference annotations (cf. the Message Understanding Conference markup rules), various kinds of user-defined inter-tag pointers, and (shortly) general template annotation (aka relations, frames, or events). The Alembic multilingual NLP system provides access to taggers for a wide variety of extraction levels, and applications have now been built for several languages. The software has a sophisticated visualisation component. It runs on Sun workstations and is freely distributed, but a license is required.
Align is a C++ freely downloadable package by Adam Berger for aligning, at the sentence level, a pair of text files which are translations of one another. The problem Align was designed to solve is this: you have a pair of text files which are translations of one another. Each file may contain "spurious" (extra) sentences, not appearing in the other file. The translations may also be impressionistic. Relying on dynamic programming and a user-provided routine for calculating the probability of a word-to-word translation between the two languages, Align will (ideally, anyway) weave an optimal sentence-to-sentence alignment between the two files. Align takes as input a pair of ascii files to be aligned. Each file contains one "sentence" per line, the words of which are space-delimited. That is, newlines delimit sentences, and spaces delimit words. I put the word "sentence" in quotes because Align doesn't actually care what syntactic units appear on each line; however, the output of Align will be an alignment between lines of the input files. (If you so desire, you may put paragraphs or just phrases on each line, to align at a coarser or finer level of granularity).
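To make the dynamic-programming idea above more concrete, here is a minimal Python sketch. It is not Berger's C++ code: the cost function, the `skip_cost` penalty, the toy `LEXICON`, and all names are my own assumptions for illustration. It takes one "sentence" per list item, a user-provided word-to-word translation probability, and allows a line to be left unaligned (the "spurious" sentences mentioned above).

```python
import math

def align(src, tgt, word_prob, skip_cost=2.0):
    """Return a list of (src_index, tgt_index) pairs; None marks a skipped line."""
    def pair_cost(a, b):
        # Average negative log-probability of the best translation of each word.
        wa, wb = a.split(), b.split()
        if not wa or not wb:
            return 2 * skip_cost
        total = 0.0
        for w in wa:
            best = max(word_prob(w, v) for v in wb)
            total += -math.log(max(best, 1e-6))
        return total / len(wa)

    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # align src[i-1] with tgt[j-1]
                c = cost[i - 1][j - 1] + pair_cost(src[i - 1], tgt[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i > 0:  # src[i-1] has no counterpart
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j > 0:  # tgt[j-1] has no counterpart
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)

    # Walk the back-pointers from the end to recover the alignment.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((i - 1 if pi < i else None, j - 1 if pj < j else None))
        i, j = pi, pj
    pairs.reverse()
    return pairs

# Toy data: a hypothetical lexicon standing in for the user-provided routine.
LEXICON = {("the", "le"), ("cat", "chat"), ("sleeps", "dort"),
           ("good", "bonne"), ("night", "nuit")}

def word_prob(a, b):
    return 0.9 if (a, b) in LEXICON else 0.001

src = ["the cat sleeps", "hello world", "good night"]
tgt = ["le chat dort", "bonne nuit"]
alignment = align(src, tgt, word_prob)
```

In this toy run the second source line has no counterpart, so it is paired with None while the other two lines are matched one to one.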
Altmann Fitter is an interactive program sponsored by the IQLA, running under Windows 95 and Windows NT, which provides more than 200 discrete distributions for all areas of empirical research. Its functions include a selection assistant and automatic fitting. Goodness of fit is tested by Chi-square, P(Chi-square), and the contingency coefficient C. Optimisation parameters can be configured. The software comes with comprehensive documentation (user manual: 15 pp.; distributions and bibliography: 141 pp.). For information or a demo version please contact: RAM-Verlag, Stüttinghauser Ringstr. 44, D-58515 Lüdenscheid - Germany; Fax: +49 2351 973071.
The AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora. The AMALGAM Multi-Tagger (actually a retrained version of the Brill tagger) has been developed to tag text with up to 8 annotation schemes. The tagger is intended for English text and will not work for other languages. The system allows you to enter plain text and have it tagged, choosing from the 8 different tagging schemes. It can be freely used via e-mail and, "shortly" (but nothing has happened since 20th August 1996 ...), by using a web browser.
For more information see AMALGAM (Reference, Standard etc. section).
AMPLE is a morphological parser for linguistic exploration developed by SIL Computing; it works under Win 3.1-98 + NT, DOS, MAC and Unix. When given the necessary information about the morphology of a language, AMPLE will analyze each word in a text and break it into morphemes. AMPLE is oriented to the "item and arrangement" approach to the description of morphological phenomena. It can handle nonconcatenative phenomena only indirectly. AMPLE can work together with other SIL applications.
The Apple Pie Parser (APP) by Satoshi Sekine (Department of Computer Science - New York University) is a bottom-up probabilistic chart parser which finds the parse tree with the best score by a best-first search algorithm. Its grammar (of English) in the distribution is a semi-context-sensitive grammar with two non-terminals, and it was automatically extracted from the Penn Treebank, a syntactically tagged corpus made at the University of Pennsylvania. The framework of the algorithm was reported at the International Workshop on Parsing Technologies 1995. The grammar is thus acquired fully automatically from a syntactically tagged corpus, instead of by human labor or statistically aided human labor, as in many conventional projects. Although there are some problems with this strategy, such as the availability of such a corpus and domain restrictions, the performance of the grammar is fairly good. The parser generates a syntactic tree just like the Penn Treebank (PTB) bracketing. Although PTB has argument structure labels, this parser does not produce such labels. Also, APP just tries to make a parse tree as accurate as possible for reasonable sentences. Here "reasonable sentences" means, for example, sentences in newspapers or well-written documents. Hence, it aims neither to parse reasonable ill-formed sentences (as in conversation) nor to reject absolutely ill-formed sentences. You may be surprised that the parser can make a parse tree for a sentence with number disagreement, or that it cannot correctly parse a very simple English sentence. But this is a result of how APP is designed. The author knows that the performance is not the best compared with the state-of-the-art parsers which have been reported recently. However, the author thinks that the main difference between his parser and these parsers is the use of lexical information. He is planning to incorporate this information into the parser and hopefully to release new versions (the latest one is Ver. 5.9 of April 1997). Don't be misled by the apples: APP runs, as usual, under Unix (and experimentally under Win NT), not on the Mac! An executable for Windows is now available as well.
The APP is freely downloadable by FTP as a TAR gzipped file.
Lexicons and morphological analysis for Spanish: the ARIES Natural Language Tools make up a lexical platform for the Spanish language. They include: a large Spanish lexicon, lexical maintenance and access tools and morphological analyser/generator. There is a free demo for single words, or you can submit a text by e-mail (following this link) for word morphological analysis and spelling check, but the real lexicons and C/C++ access tools cost money.
A system for Windows, by John Goldsmith, for automatically learning the morphological forms of words in a corpus (freely downloadable).
Bilingua Language Engineering provides solutions for translation, lexicography, terminology, and language for special purposes. The Bilingua team have been engaged in cross-cultural communication and training for over thirty years and in the development of language software since 1986. They produce the well-known Duplex system with Term Shuttle and Term Lists.
The Birmingham Corpus Research Group has been working with parts-of-speech tagging since 1994, and there are two main results of their work, both free, the Q-Tagger and the e-mail tagging service. If you have a (reasonably short) English text that you want to have tagged, you can send it by e-mail: this email will be automatically processed by the tagger, and the result will be sent back to you within a very short time. This service is fully automated and the tagger can only cope with plain text. Please do not send your text as an attachment or in a compressed format, as this will not be processed properly. The text should be in the main body of the mail.
+ The tagset used at Birmingham is also available on the Web. [Rev. 2001 November 27].
Brill's Transformation-based Learning Tagger is one of the best-known freely available taggers. It comes with a lexicon based on the Wall Street Journal. The whole thing is in Perl and C (the bits which need to be fast are in C).
Both program and manual are freely downloadable by anonymous FTP, in the /Programs (the software) and /Papers (the documentation) directories. The program alone is also directly downloadable via the Web from Eric Brill's Code Page.
+ There is also a DOS compilation of Brill's Tagger with djgpp made by Takahashi (cf. Takahashi Software Plaza). Freely downloadable (the original archive is NOT included: please get the original archive before you run the MS-DOS executables).
+ There is also an online version maintained by the UNED Grupo de Procesamiento del Lenguaje Natural. [Added 2001 April 30].
The Canterbury Corpus is a benchmark to enable researchers to evaluate lossless compression methods; it replaces the Calgary Corpus. This site includes freely downloadable test files and compression test results for many research compression methods. For a full description see under the English Section. [2001 April 28].
The CASS Partial Parser made by Steven Abney is a partial parser designed for use with large amounts of noisy text. Robustness and speed are primary design considerations. The package consisting of Cass and utility programs is called SCOL. The most recent bug fixes were made on 20 Jun 97 (version 0.1d). Version 0.1e (24 Sep 97) contains no substantive changes, only some minor modifications to make it compile more smoothly. It has been successfully compiled on a sun4m running SunOS 4.1.3_U1, a sun4u running SunOS 5.5.1, and an i686 running Linux 2.0.24 (architecture from uname -m, OS from uname -sr). It is freely downloadable with its manual. Contact.
Directly downloadable from the gopher above. It is a fully functional demo version of a program for writing and testing categorial grammars.
ChaSen is a free Japanese morphological analyser by the Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). Version 1.0 was officially released on 19 February 1997; the latest release is 2.5.1, 2002/1/30. It grew out of the development of JUMAN version 2.0 and has significantly improved system performance. The tool was christened with the Japanese name for the tea whisk because it was developed at NAIST, situated at Takayama (Nara), which is famous for producing the tea whisks used in the traditional Japanese tea ceremony. ChaSen can be freely downloaded from the site in UNIX and Linux versions; the dictionary (IPADIC) with which it works is also available for Windows. [2002 February 17].
The task of a chunker consists in roughly grouping words into rather simple phrases, such as non-recursive NPs, PPs, APs or verbal complexes. Approaches to chunking vary, but have a few features in common: (a) Since the output of a chunker usually serves as input for further processing, reasonable accuracy is more important than wide coverage. Thus, only rarely do chunkers go beyond the recognition of base (non-recursive) phrases. (b) Top-down information is not available, so that the chunker has to rely on hints provided by local lexical contexts, i.e., short sequences of words and/or parts of speech. Chunkie is based on a principle similar to that underlying standard POS tagging techniques. It assigns tree fragments to sequences of POS tags. The most likely sequence of tree segments is determined using Viterbi search on the basis of trigram frequencies (in other words, it is a second-order Markov model). For this kind of tagging, it uses the excellent HMM-based tagger TnT. The chunker is capable of recognising trees of depth <= 3, which means that it can be used for parsing more complex structures than just base NPs. Depending on parametrisation, 8 to 9 chunks out of 10 are assigned the correct internal structure. Training sample: 12,000 sentences from the NEGRA newspaper corpus. Release is scheduled for later this year, maybe in October.
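As a rough illustration of Viterbi search over a second-order (trigram) Markov model, here is a small Python sketch. It is not Chunkie or TnT code: the flat labels (B-NP, I-NP, O) are a simplification standing in for tree fragments, and the probability tables, the `floor` value for unseen events, and all names are invented for the example.

```python
import math

START = "<s>"  # padding symbol for the two positions before the sentence

def viterbi_trigram(obs, states, trans, emit, floor=1e-6):
    """Most likely label sequence for the POS tags in obs.

    trans[(s2, s1, s)] = P(s | s2, s1)   (trigram transition)
    emit[(s, o)]       = P(o | s)        (label emits POS tag o)
    Unseen events get the small probability `floor`.
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[(prev, cur)] = (log score, path) of the best path ending in prev, cur
    best = {(START, START): (0.0, [])}
    for o in obs:
        new = {}
        for (s2, s1), (score, path) in best.items():
            for s in states:
                cand = score + lp(trans, (s2, s1, s)) + lp(emit, (s, o))
                if (s1, s) not in new or cand > new[(s1, s)][0]:
                    new[(s1, s)] = (cand, path + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])[1]

# Toy model with made-up probabilities.
states = ["B-NP", "I-NP", "O"]
trans = {
    (START, START, "B-NP"): 0.6,
    (START, START, "O"): 0.4,
    (START, "B-NP", "I-NP"): 0.8,
    ("B-NP", "I-NP", "O"): 0.7,
}
emit = {
    ("B-NP", "DT"): 0.8,
    ("B-NP", "NN"): 0.2,
    ("I-NP", "NN"): 0.7,
    ("O", "VBZ"): 0.9,
}
labels = viterbi_trigram(["DT", "NN", "VBZ"], states, trans, emit)
```

Keeping the last two labels as the search state is what makes the model second-order: each new label is scored against the pair that precedes it.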
Cíbola and Oleada, developed by the Computing Research Laboratory (CRL) at New Mexico State University, are two related systems that provide multilingual text processing technology to language instructors, learners, translators, and analysts. The systems consist of a set of component tools that have been designed with a user-centered methodology. This methodology takes observations made of users in real-work environments as the starting point for interface design. Iterative refinements are made as a result of continued user observation and testing. These systems are developed using a Unicode text processing capability represented by the Multilingual Unicode Text Toolkit, or MUTT. The individual components comprise Multilingual Text, Concordance, Dictionaries, Custom Databases, Translation Memory, Glossaries, Document Management, and Word Frequencies. Learners use OLEADA to: (a) identify relevant texts; (b) view words/expressions in context; (c) discover the frequency of words in texts; (d) segment Chinese or Japanese texts; (e) retrieve parallel English translations; (f) examine real-life examples.
+ Cf. also the OLEADA page.
+ Various unsupported versions of these tools are freely available for download after you have signed the CRL Software License Agreement (it's easy: they don't ask you for money or embarrassing questions!) and obtained a username and password to log in.
(mirroring sites: Antwerp - Belgium, Chokyo - Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. They are based on the CHAT format which makes them easily analyzed by using the CLAN programs. In particular the CLAN concordancer is freely available; cf. also the manual.
UCREL POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the BNC. CLAWS has consistently achieved 96-97% accuracy (the precise degree of accuracy varying according to the type of text). Judged in terms of major categories, the system has an error-rate of only 1.5%, with c.3.3% ambiguities unresolved, within the BNC. [Rev. 2001 April 28].
+ A site licence for academic institutions for CLAWS4 may be purchased from UCREL for the huge fee of £750. This fee includes introductory assistance and an information pack which contains program details, relevant papers, a reference list and tagset examples. The system requires a UNIX (SPARC) workstation running SunOS 4.x, or a UNIX (SPARC) workstation running Solaris with binary compatibility installed.
+ Besides this, there is an in-house tagging service at Lancaster University. Text is sent to UCREL in electronic form (the submission form is on the web page); UCREL marks up POS tags using the CLAWS4 tagger and delivers the completed tagged text within an agreed time schedule. The cost of this service depends on the actual text itself, any new material, necessary adjustments, the size, and any specific individual requirements.
+ Free CLAWS WWW trial service. The trial service offers free access (you are asked only for e-mail) to the latest version of the tagger, CLAWS4, with either C5 or C7 tagsets. You can enter up to 300 words of English running text. If you enter more, it will be cut off after the 300th word.
The Chinese Morphological Analyzer (CMA) from Basis Technology is a portable engine that incorporates comprehensive Chinese dictionaries for segmenting Chinese texts in both Traditional Chinese and Simplified Chinese scripts. CMA can segment Chinese text into words, index and search large collections of Chinese documents (or text fields in databases), generate lists of words from free-running Chinese text, and identify parts of speech and word-formation processes. The price is not stated, and to request an evaluation version you have to send them an e-mail. There is, however, also an online demo, for both Simplified and Traditional Chinese. [2002 February 17].
COAL is a simple format for representing alignments (i.e. parallel segmentations of a pair of texts such that the Nth segment of the first text and the Nth segment of the second text are mutual translations) and mappings (i.e. sets of pairs of positions which mark exact correspondences between the two texts) in bi-texts (i.e. pairs of texts in two different languages that are mutual translations). At RALI (mainly for the BAF Corpus project) a number of small tools (Perl and Emacs Lisp) have been made for managing texts in COAL format. The full suite is freely downloadable as a single TAR gzipped file. [2001 April 23].
Cogilex, a company of computational linguists and knowledge engineers, offers expert tools and services for natural language processing. Besides providing software (e.g. QuickTag and QuickParse), they can also act as advisors, designers, developers or quality-control experts on your NLP-based projects.
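The difference between an alignment and a mapping, as defined above, can be shown with a few lines of Python. This sketch says nothing about the actual COAL file syntax, which is not described here; the cut-point representation, the function names, and the sample sentences are assumptions made only for illustration.

```python
def segments(text, cuts):
    """Split text at the given character offsets."""
    bounds = [0] + list(cuts) + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

def aligned_pairs(text1, cuts1, text2, cuts2):
    """Alignment: the Nth segment of text1 translates the Nth segment of text2."""
    s1, s2 = segments(text1, cuts1), segments(text2, cuts2)
    if len(s1) != len(s2):
        raise ValueError("an alignment needs the same number of segments on both sides")
    return [(a.strip(), b.strip()) for a, b in zip(s1, s2)]

english = "Hello. How are you?"
french = "Bonjour. Comment allez-vous ?"
cuts_en, cuts_fr = [7], [9]

# The alignment, as pairs of mutually translated segments.
pairs = aligned_pairs(english, cuts_en, french, cuts_fr)

# A mapping, as pairs of positions that correspond exactly; here it is
# simply derived from the segment starts of the alignment.
mapping = list(zip([0] + cuts_en, [0] + cuts_fr))
```

An alignment thus pairs whole segments, while a mapping only pins down individual points of correspondence between the two texts.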
Conc is a concordance generator for the Macintosh by SIL Computing. It produces keyword-in-context concordances of words in a text. It can handle both ordinary flat text and multiple-line interlinear text. In the case of interlinear text (produced by IT), it can concord morphemes and also correspondences between two annotation lines. Conc can also do letter concordances to facilitate phonological analysis. Conc permits the user to limit the concordance to just those words that match a specified pattern (GREP expression). Freely available under agreement to the SIL standard free license.
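For readers who have never seen one, a keyword-in-context concordance limited by a pattern can be sketched in a few lines of Python. This is only an illustration of the general idea, not of how Conc works internally; the function names, the character-based context width, and the sample text are my own choices.

```python
import re

def kwic(text, pattern, width=20):
    """Return (left context, keyword, right context) for every word
    that matches the regular expression `pattern` as a whole."""
    key_re = re.compile(pattern, re.IGNORECASE)
    hits = []
    for m in re.finditer(r"\w+", text):
        if key_re.fullmatch(m.group()):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((left, m.group(), right))
    return hits

def show(hits, width=20):
    """Print the hits with the keywords lined up in one column."""
    for left, kw, right in hits:
        print(f"{left:>{width}} [{kw}] {right}")

sample = "The cat sat on the mat. The dog chased the cat."
hits = kwic(sample, r"c.t", width=10)
show(hits, width=10)
```

Sorting the hits by their right or left context before printing gives the familiar sorted concordance views.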
ConcApp is a free concordancing program by the Virtual Language Centre of the Polytechnic University of Hong Kong. Version 1 runs on Win 3.1-95, Version 2 on Win 95, 98 and NT. Both can be freely downloaded with full setup in zip format (v.1 and v.2) or also in reduced packages. The Web Concordancer site presents some applications of ConcApp to a few small indexed corpora (English, French, Chinese, Japanese). [2002 February 17].
The Concordance software by R. J. C. Watt, although only recently released (1999), according to its author already has registered users in twenty-four countries. The program is proving valuable to anyone who needs to study texts closely or analyse them in depth. It is being used in: (a) Language teaching and learning; (b) Literary scholarship; (c) Translation and language engineering; (d) Corpus linguistics; (e) Natural language software development; (f) Lexicography; (g) Content analysis in many disciplines including accountancy, history, marketing, musicology, politics, geography, and media studies. It can make full concordances to texts of any size, limited only by available disk space and memory, or fast concordances, picking your selection of words from the text. It supports different Western languages and character sets (not just English) and user-definable alphabets. The built-in file viewer can display files of unlimited size, and the built-in editor allows fast editing of files up to 16MB. There are also file conversion tools, from OEM to ANSI character sets and from Unix to PC files. Concordances can be saved and exported as plain text, as a single HTML file, or as a Web Concordance (viz. linked HTML files, ready for publishing on the Web: cf. some samples on the Web Concordances site).
+ The new Version 2.0.0 was released on 18 December 2000; it also runs on Win 2000. Free upgrades are available for registered users.
+ Concordance can be ordered online. It has a reasonable price (89$, 40$ for each additional copy). You must first download the unregistered shareware version (which lasts only 30 days and is fully functional, but adds 'Unregistered' messages to the files you create with it), then order and pay for your registration. [Rev. 2001 Nov. 28].
Conexor is a linguistic software company based in Finland. Conexor was founded in 1997 to develop and sell linguistic software for use in applications related to information technology, speech processing, human-machine communication, lexicography, grammar and style improvement, and terminology. At present, they sell computer programs for linguistic analysis of English texts, e.g. the EngCG-2 tagger and the EngCG Parser (both with the Constraint Grammar of English).
You have to jump through some frames to reach the Corpus Wizard page, where there are free download links but little information. The description given here is instead taken from the Euralex Computing Tool at this page. Corpus Wizard for Win32-95 + 3.1 and NT by Hamagushi Takahashi is a kind of concordancer, which can produce and sort KWIC concordances. You can extract more selective results from the concordance, and you can use regular expressions. English, French and Japanese are officially supported. Corpus Wizard also has some other utilities, such as FREQ. Corpus Wizard is posted to WinSite.com, so you can get it from various FTP sites around the world.
The software is shareware and the fee is low (cf. this page):
+ Corpus Wizard for Win16: $6 (Single User).
+ Corpus Wizard Plus!: $6 (Single User), $30 (Site Licence).
+ Corpus Wizard for Win32: $35 (Single User), $50 (Site Licence).
+ KPL Text Processing Utils.: $6 (Single User), $30 (Site Licence).
+ DeHTML for Windows: $5 (Single User), $20 (Site Licence).
+ KPL Printing: $10 (Single User), $40 (Site Licence).
The CRCL (Center for Research in Computational Linguistics, Bangkok) pages produced by Doug Cooper for the SEASRC (South East Asian Computing And Linguistics Center) lie at the intersection of computing and linguistics in Southeast Asia. SEASRC publicizes and encourages cooperative research activity in and around Thailand, and provides data, tools, and contacts to scholars around the world. There are many valuable and usually free resources (especially for Thai) on this site (cf. the index), ranging from the TIE project (with the TOLL bilingual texts) to fonts and related tools. A great site!
CUE is a development system by the Corpus Research Group - University of Birmingham for applications that require access to corpus data. It provides a high level of abstraction that makes it easy to select corpora, to get concordance lines for a certain query, and also to compute collocational scores. A developer no longer has to deal with files and characters, but instead handles corpora, tokens and corpus positions. CUE is a collection of Java class libraries, which allows for platform independent development. It uses data compression to keep the space requirements of corpora at a minimum, and through the use of an inverted index the speed of retrieval is extremely fast.
+ Version 1.3, the first public release, is now freely downloadable. At present there is only an outline of the documentation, together with the API of the most important classes, but more documentation will be supplied as time permits. In order to access your corpus data through CUE you will need to index the data; documentation of this is available in a separate file.
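The following Python sketch shows, in miniature, why an inverted index makes retrieval fast: every token is looked up directly instead of scanning the text. It is emphatically not the CUE API (CUE is a set of Java class libraries whose classes are not described here); the class `TinyCorpus`, its methods, and the raw co-occurrence counts used in place of real collocational scores are all hypothetical.

```python
from collections import Counter, defaultdict

class TinyCorpus:
    """A toy corpus: a token list plus an inverted index token -> positions."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.index = defaultdict(list)
        for pos, tok in enumerate(self.tokens):
            self.index[tok.lower()].append(pos)

    def positions(self, word):
        """Corpus positions of `word`, read straight from the index."""
        return list(self.index.get(word.lower(), []))

    def concordance(self, word, span=2):
        """(left tokens, node, right tokens) for every occurrence of `word`."""
        return [
            (self.tokens[max(0, p - span):p],
             self.tokens[p],
             self.tokens[p + 1:p + 1 + span])
            for p in self.positions(word)
        ]

    def collocates(self, word, span=2):
        """Raw co-occurrence counts within `span` tokens of `word`."""
        counts = Counter()
        for left, _, right in self.concordance(word, span):
            counts.update(t.lower() for t in left + right)
        return counts

corpus = TinyCorpus("the cat sat on the mat".split())
```

For example, `corpus.concordance("cat", span=1)` yields the single line with "the" to the left and "sat" to the right of the node word.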
A set of powerful pre-compiled tools developed by Oliver Christ and Bruno Maximilian Schulze of the IMS for the purpose of querying large text corpora. Tokenized texts can be processed into an internal storage format, and single tokens or groups of tokens can be queried by using regular expressions. Results can be viewed, sorted, grouped, and printed. The core components of the suite are the CQP (Corpus Query Processor), the query language interpreter, and the Xkwic sorting utility. The corpus query processor CQP is a command-language-based query interpreter, which can be used independently or through Xkwic, which is an X Window graphical user interface. The CQP system supports a restricted set of annotations and parallel corpora, but not SGML syntax.
The software is available free of charge, after license, for research and educational purposes (for Solaris, Linux, and IRIX machines), cf. the present link. Contact: Arne Fitschen
On Dan Melamed's page at Penn (i.e. the University of Pennsylvania; cf. the Penn Tools file) there are a lot of miscellaneous tools developed by the author, nearly all freely downloadable by FTP, viz.: a statistical machine Translation Toolkit; a simple Perl Simulated Annealing Program; XTAG Morpholyzer Post-processors for English stemming; a GB to ASCII Converter for punctuation and numerals; Good-Turing Smoothing Software; 170 general text processing tools (mostly in Perl 5); 75 text statistics tools (mostly in Perl 5); 40 bitext geometry tools (mostly in Perl 5). There are also links to others' stuff and to a lot of freely downloadable PS papers by Melamed himself. Contact. [Last Rev. 2001 April 27].
DBT is the Textual Data Base component of the PiSystem suite made by Eugenio Picchi at the Pisa ILC. It is the most widespread software of this category in Italy (it is used, for example, by the popular LIZ text collection of Italian Literature).
+ Commercial versions of DBT are licensed by the CNR and can be bought from the LEXIS distribution house, via Acireale 19 - ROMA - Italia; Tel./Fax: +39-6-70302626; E-mail; cf. also the web page.
+ DBT is now (April 2001) available at discount price (480.000 IL) also from Libreria Chiari.
+ There is also a web version with a demo online.
+ For a detailed list of the modules of this procedure see the following page.
A fully functional demo version of a MAC program for writing and testing definite clause grammars.
DELIS (Descriptive Lexical Specifications and Tools for Corpus-based Lexicon building) aims to provide tools for efficient lexicographic corpus construction, exploration and selective retrieval of lexicographically relevant material. It provides an easy-to-use and well-documented descriptive scheme for lexical representation, improving consistency over manual and semi-automatic data entry. The tools also support import and export of lexical information. The user community of the DELIS tools includes: (a) Lexicographers in dictionary publishing houses. (b) Glossary builders and terminologists in translation and documentation companies. The objectives of the project focus on lexicon-based, syntactically oriented retrieval of corpus evidence from morphosyntactically and syntactically annotated text corpora (Search Condition Generator) and on exemplification with a fragment of semantically, syntactically and morphosyntactically described verbs of perception and communication in 5 languages: EN, FR, DK, NL, IT. As a formalism for lexical representation the project has employed a 'Frame Semantics' for lexical semantic description, HPSG-like syntax and Typed Feature Structures (TFS). Corpus tools are programmed in C (with an X/MOTIF GUI: Xkwic). The Search Condition Generator provides support for lexicon-driven corpus querying and TFS encoding of lexical descriptions based on a model as a starting point. The tool generates corpus queries in the format of English Constraint Grammar (ENGCG: Helsinki). The project, still in progress, has produced prototypes of a data entry facility: a TFS-mode for emacs; a hierarchy viewer; a TFS feature structure viewer (XmFed); and a more widely usable Search Condition Generator (e.g. for the BNC). Several hundred sentences have been encoded in detail for each language (20+ types of semantic, syntactic and morphosyntactic annotations). 
A TFS dictionary has been produced with entries for perception verbs of EN, FR, IT, DK, NL, related to the corpus sentences. In addition reports on the methodology with detailed examples are available. Contact: Ulrich Heid (Project Manager) - Universität Stuttgart, IMS-CL - Azenbergstr. 12 - 70174 Stuttgart - Germany -- Tel: +49-711-121-1373 - Fax: +49-711-121-1366 - Email: firstname.lastname@example.org.
This Demonstration page of Morphosyntactic analysis, tagging and parsing of unrestricted text allows you to freely submit some sentences in Spanish, Catalan or English to the full suite of tools developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). The components of the suite are MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). [2001 April 30; last checked 2001 October 28].
Emdros (Engine for MdF Database Retrieval, Organization, and Storage) by Ulrik Petersen (cf. his homepage) is a theory-neutral Open-Source text database engine for storage and retrieval of analyzed or annotated text. It use the MQL (that is a descendant of another query-language, QL, which was the fruit of Crist-Jan Doedens' labors in his PhD thesis. QL was an extremely powerful query-language to go with the MdF model) for asking relevant questions of the data. Emdros implements the EMdF text database model; the primary advantage of this particular model of text over XML's data model is that object types (such as pages and chapters) need not be hierarchically structured or embedded, but may overlap. In addition, objects (such as a clause or a phrase) need not be contiguous, but may have gaps. Emdros can output its results in XML. The XML carries its own standalone DTD and validates with a validating parser.
Emdros has wide applicability in fields that deal with analyzed or annotated text. Application domains include linguistics, publishing, text processing, and any other fields that deal with annotated text. Emdros is good both for corpus linguistics (large amounts of text) and for field linguistics (smaller amounts of data). MQL supports both simple and complex queries on the data. Queries on syntactic analyses are particularly well supported, but all other linguistic levels are supported as well. Queries for constructions like subject inversion, embedded relative clauses, elliptic clauses, PPs with postpositions, and DP phrases with pre-head complements can all be easily and intuitively formulated, provided the underlying data has the required categories.
Emdros is licensed under the GNU GPL, and the documentation as well as the Linux and Win32 binaries can be freely downloaded from this page. [2005 January 5].
EngCG, the Constraint Grammar Parser of English by Pasi Tapanainen (1993), performs morphosyntactic analysis (tagging) of English text. There is an online demo at the Lingsoft site. It is sold by Lingsoft: it is commercial software, so for availability you have to ask email@example.com.
EngCG-2, by Pasi Tapanainen and Atro Voutilainen, is a program that assigns morphological and part-of-speech tags to words in English text at a speed of about 3,000 words per second on a Pentium running Linux. It is an improved version of the original EngCG tagger, which is based on the Constraint Grammar framework advocated by a team of computational linguists in Helsinki, Finland. The documentation (and several articles) are available online.
EngCG-2 tagger with the English Constraint Grammar can be licensed from Conexor. There is also a free demo available. Contacts: Tapanainen and Voutilainen.
Englex is an English parsing description for PC-Kimmo by SIL Computing. Englex is a morphological parsing lexicon of English. It uses the standard orthography for English. It is intended for use with PC-Kimmo (or programs that use the PC-Kimmo parser, such as K-Text and K-Tagger). With such software and Englex, you can morphologically parse English words and text. Practical applications include morphologically preprocessing text for a syntactic parser and producing morphologically tagged text. Englex works under Win 3.1-98 + NT, DOS, MAC and Unix. Freely available under agreement to the SIL standard free license.
Ergo Linguistic Technologies (2800 Woodlawn Drive, Suite 175 - Honolulu, HI 96822 - tel:808+539-3920 - fax:808+539-3924) has developed NLP software that can analyze English sentences quickly and thoroughly enough to greatly improve grammar checkers, translation software and foreign language tutoring software. Most importantly, this technology makes it possible to develop language interactivity that allows interactive dialogs with game characters, software applications and even household appliances. There are demos available which include grammar analysis, PennTreeBank style bracketings, first order predicate calculus, inferencing, and Q&A. Of particular interest are the Parser Demo online and the Parsing Contest pages.
The commercial version of Wordnet, for various European languages (Commercial software).
The EUSLEM automatic lemmatizer/tagger, developed by the Lengoaia Naturalaren Prozesamendurako IXA Taldea, is a basic tool for applications such as automatic indexing, document databases, syntactic and semantic analysis, analysis of text corpora, etc. Its job is to give the correct lemma of a text-word, as well as its grammatical category. The lemmatizer/tagger is proving of great help in the second phase of the EEBS project (Systematic Compilation of Modern Basque). A tagset system has also been developed for Basque: it is a three-level system which the user can parametrise when using the programme. The first level includes seventeen general categories (noun, adjective, verb, etc.). In the second, each category tag is further refined by subcategory tags. The last level includes other relevant morphological information (case, number, etc.). Information on availability is lacking. [2001 April 30].
The EWN top-ontology semantic analyzer accepts as input morphologically analyzed text (the output of MACO+) and adds to each lemma the nodes in the EWN top-ontology that subsume it. EWN is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ EWN can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
FreeText is a free MAC concordance program. It should be available from this site (but the last time I tried I did not succeed in logging in ...).
GATE, developed over the last three years at the University of Sheffield, is a domain-specific software architecture and development environment that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems. To contact the GATE project, mail firstname.lastname@example.org. To access the GATE ftp (from this page) you have to register at the following link.
GATTO is a lexicographic tool made by Domenico Iorio-Fili, with the collaboration of Francesco Leoncino, at the OVI. GATTO was created in order to maintain, lemmatize and query the text corpora database of the Vocabolario Storico della Lingua Italiana, now in progress at the OVI. GATTO works under Windows 3.1x, Windows 95, Windows 98 or Windows NT. It is freely available (postal delivery charge only) for scholars: the GATTO CD-ROM contains the program itself, the manual in Word 6.0 format and a small sample corpus. Contact: Domenico Iorio-Fili.
The Communal group (Robin Fawcett, Gordon Tucker, etc.) have developed a sentence generation system called Genesys, an integrated environment for developing Systemic grammars (cf. Systemic-Functional Linguistics). It doesn't generate from semantic input, but rather requires the user to traverse the system network, choosing a feature at each point. It has a large, semantically oriented network. For more information e-mail Robin Fawcett or Gordon Tucker.
A rich and detailed link page of text analysis (and some NLP as well) software. Software availability, distribution and authors' home pages are clearly stated. There are also some more general links.
A package of programs for literary and linguistic computing is available, emphasizing the preparation of concordances and supporting documents. Both keyword in context and keyword and line generators are provided, as well as exclusion routines, a reverse concordance module, formatting programs, a dictionary maker, and lemmatization facilities. There are also word, character, and digraph frequency counting programs, word length tabulation routines, a cross reference generator, and other related utilities. The programs are written in the C programming language, and implemented on several Version 7 Unix systems at Berkeley. HUM, developed by William Tuthill, is freely available as a hum.tar.Z file by anonymous ftp; there is also a downloadable DOC manual. Contact: William Tuthill - Comparative Literature Department - University of California - Berkeley, California 94720.
ICETree 2 is a dedicated software package written at the Survey of English Usage for developing ICE corpora. ICETree allows researchers to build and manipulate syntactic trees. Using ICETree, you can view, build or manipulate syntactic trees for sentences, phrases or other groups of words. ICETree has been used to build the parse trees for the ICE-GB Corpus. ICETree will be of use to researchers building language corpora. The software was written for the ICE corpora but, with some changes to data files that accompany the program, it can be used on other corpora. ICETree can be used to build trees from scratch but is best used to edit existing data, such as the output from an automatic parser.
The trial version is a full, working copy of the program but it is limited to 10 minute sessions. After each 10 minute session, the program will close itself. The trial version comes with a test corpus - a small collection of trees for you to practise on. Download from http://www.ucl.ac.uk/english-usage/ice-gb/icetree/form.htm
The full version of ICETree is available at a cost of 99 UK pounds from the Survey of English Usage. Please email the Survey to arrange your order.
ICECUP is a state-of-the-art corpus exploration program designed for parsed corpora such as ICE-GB. ICECUP is supplied with ICE-GB and is available now with the ICE-GB Sample Corpus for free download. ICECUP allows you to perform a variety of different queries, including using the parse analysis in the corpus to construct Fuzzy Tree Fragments to search the corpus.
A few gopher free download pages for freeware and shareware MAC and DOS NLP tools.
InfoBlast is a text indexing tool that runs under Windows. InfoBlast lets you search all those text files and ebooks on your hard drive blindingly fast! No more scrolling top to bottom and then back up again. Search for it and you are there virtually instantly -- even if your text file is several hundred megabytes in size or more! InfoBlast takes text files and indexes them, building an index of where every word in the text file is located, much like the index of a book. This allows you to view and search the text in the same way that search engines search the internet.
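The book-index idea described above is commonly known as an inverted index. The following Python sketch shows the general technique only; it is not InfoBlast's actual implementation, and the function names and the choice of character offsets as locations are assumptions made for the illustration.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lower-cased word to the character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        index[match.group().lower()].append(match.start())
    return index


def search(index, word):
    """Look a word up in the index instead of scanning the whole text."""
    return index.get(word.lower(), [])


text = "The cat sat. The dog sat too."
index = build_index(text)
print(search(index, "sat"))   # offsets of both occurrences of "sat"
print(search(index, "bird"))  # empty list: the word does not occur
```

Building the index costs one pass over the text; after that each lookup is a single dictionary access, which is why searches stay fast even on very large files.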
It is a finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein.
The main aim is a publicly available speech recognition system, but along the way there are also toolkits for discrete HMMs and statistical decision trees (freely downloadable).
The Japanese Morphological Analyzer (JMA) from Basis Technology is a portable segmentation engine for Japanese text combined with Japanese dictionaries. It can index and search large collections of Japanese documents (also text fields in databases), generate word lists and verify consistency between kanji and yomi forms. Input texts must be in Unicode UCS-2 format. The price is not stated, and to request an evaluation version you have to send them an e-mail. There is however also a demo online. [2002 February 17].
You have to jump through some frames to reach the Corpus Wizard page, where there are free download links but little information. The software is by Hamagushi Takahashi and is stored in the Takahashi Software Plaza.
KKC is a text-finding utility that outputs its results in KWIC format. A version for OS/2 is also available. Both versions are also available for Japanese as shareware; the European version, instead, is freeware.
The Korean Morphological Analyzer (KMA) from Basis Technology is a portable linguistic segmentation engine for Korean text. The Korean language presents challenges for morphological analysis, and recognition of word boundaries is often difficult. KMA analyzes and extracts keywords from Korean text based on linguistic characteristics and an optimized dictionary of essential modern Korean words. KMA performs morphological analysis on Korean words (Eojeol), including: segregation of morphemes according to POS, grammar or relational function of each morpheme; examination of likelihood of combination between morphemes; stemming (reducing to root form) of irregular verbs/adverbs/adjectives; presumption of compound nouns; recognition of unknown/unregistered words; support for a user-defined dictionary; support for multiple reference dictionaries; decomposition of compound nouns; generation of a list of words from Korean texts; identification of the root form and part-of-speech (POS) information for each morpheme that constitutes Eojeol; and recognition of patterns for morphological structures of Eojeol. Price is not stated, and to request an evaluation version you have to send them an e-mail. There is however also a demo online. [2002 February 17].
The Korean morphological/lexical analyzer by Seung-Shik Kang (see his homepage in English or in Korean) from Kookmin University is a part of the Hangul Analysis Module (HAM). It has now reached Version 5.0.0a and it works on Win 98, 2000 and NT, Linux or Solaris. Freely downloadable directly from the site. An online demo (Korean page only!) is also available. [2002 February 17]
This is the page of the Dept. of Computer Science and Engineering, Korea University (1, 5-ka, Anam-dong, SEOUL, 136-701, KOREA), established in 1991. They have developed various working systems for natural language processing (morphological analysis, part-of-speech tagging, word sense disambiguation, etc.) and for language-dependent applications (information retrieval, spelling correction, linguistic knowledge acquisition, etc.). Recently, their research interests have concentrated on syntactic analysis and multilingual applications such as multilingual information retrieval and machine translation. They also have some online demos of their work, e.g. a Korean morphological analyser, a Korean POS tagger, a Korean-English cross-language information retrieval system, etc. [2001 April 26].
A sentence-generation system developed at the Information Sciences Institute (ISI) in Los Angeles; its principal architects were Bill Mann and Christian Matthiessen. Development has almost stopped at ISI, but it continues in several places, most notably in John Bateman's current version, called KPML (Komet-Penman Multi-Lingual). KPML is widely used, for instance at FAW (ULM), ITRI (Brighton), University of Waterloo, etc. KPML offers sentence generation from a semantic input (SPLs: cf. Systemic-Functional Linguistics). It also provides graphing of system networks, systemic structures, etc., and can handle multiple grammars simultaneously. Platform: Sun/Unix for KPML. The pre-multi-lingual version is also available for Macs, TI and Symbolics.
+ Download it freely at this address.
K-Tagger is a POS tagger by SIL Computing based on PC-Kimmo. It works under Win 3.1-98 + NT, DOS, MAC and Unix, using the shell of the PC-Kimmo parser. Freely available under agreement to the SIL standard free license.
K-Text is a text analyzer by SIL Computing based on PC-Kimmo. KTEXT reads a text from a disk file, parses each word using the PC-Kimmo parser, and writes the results to a new disk file. This new file is in the form of a structured text file where each word of the original text is represented as a database record composed of several fields. Each word record contains a field for the original word, a field for the underlying or lexical form of the word, and a field for the gloss string. K-Text works under Win 3.1-98 + NT, DOS, MAC and Unix. Freely available under agreement to the SIL standard free license.
Mary Shaw's page provides a good introduction to the KWIC (KeyWord In Context) algorithm, which provides the grounds for many concordance tools (e.g. Xkwic, Xconcord, etc.). A concise definition of the Keyword in Context problem is provided from [Parnas72]: "The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be 'circularly shifted' by repeatedly removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order". Contextual indices have been used for many years. For example, Biblical concordances have approximately this form, except for the rotations. The usual source for the problem as now known, however, is the Parnas definition. In his paper of 1972, Parnas used the problem to contrast different criteria for decomposing a system into modules [Parnas72]. He describes two solutions, one based on functional decomposition with shared access to data representations, and a second based on a decomposition that hides design decisions. The latter was used to promote information hiding, a principle that underpins the use of abstract data types and of object-oriented design. Since its introduction, the problem has become well known and is widely used as a teaching device in software engineering. Garlan, Kaiser, and Notkin also use the problem to illustrate modularization schemes based on data-driven tool invocation [Garlan92], sometimes referred to as reactive integration. While KWIC can be implemented as a relatively small system, it is not simply of pedagogical interest. Practical instances of it are widely used by computer scientists. For example, the "permuted" [sic] index for the Unix Man pages is essentially such a system.
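The definition quoted above translates almost directly into code. The following is a minimal Python sketch of it, not a reconstruction of either of Parnas's modular designs; the function names are illustrative, and sorting without regard to case is an assumption (the definition says only "alphabetical order").

```python
def circular_shifts(line):
    """Return every rotation of a line, moving the first word to the end each time."""
    words = line.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]


def kwic_index(lines):
    """List all circular shifts of all lines in alphabetical order."""
    shifts = [shift for line in lines for shift in circular_shifts(line)]
    return sorted(shifts, key=str.lower)  # case-insensitive order (an assumption)


for entry in kwic_index(["the quick fox", "a dog"]):
    print(entry)
```

Each output line begins with a different word of its source line, so every word ends up listed alphabetically together with its context. Concordancers typically centre the keyword in a fixed-width window instead of rotating the whole line.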
KWICFinder is a brand new, revolutionary concordancer by William H. Fletcher. KWICFinder rides on the back of a standard search engine, enabling the whole WWW to be used as a text corpus. Pre-release version 5 (February 2002) of KWiCFinder is now freely available for download. It requires Windows 95/98/ME & Internet Explorer 5.0 or greater. [2002 February 17].
A well annotated link page by Steven Bird on tools and formats for creating and managing linguistic annotations of Corpora. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases. There is also an Italian version by Piero Cosi.
LEXA is a set of programmes which puts at the disposal of the interested linguist the tools he or she would require in order to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups which perform typical functions. Of these the first, lexical analysis, will be of immediate concern. The main programme, Lexa, allows one to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify what (possible) words are to be assigned to what lemmas. The rest is taken care of by the programme. Lexa proper is the main programme of the current set. Among its main built-in procedures, it allows one to automatically lemmatise any input ASCII texts, to create frequency lists of the types and tokens occurring in any loaded text, to generate lexical density tables, and to transfer textual data in a user-defined manner to a database environment. The results of all operations are stored as files and can be examined later. Each item of information used by Lexa when manipulating texts is specifiable by means of a setup file which is loaded after calling Lexa and used to initialize the programme in the manner desired by the user. For a description of the many other additional components cf. the page http://nora.hd.uib.no/lexainf.html. Lexa is not free, and availability is not documented: you have to contact Raymond Hickey, Universität GH Essen, FB 3 Literatur- und Sprachwissenschaften, FB 3 Anglistik / Linguistik, D - 45177 ESSEN, Germany. Tel. +49 201 183 3441. Fax. +49 201 183 3437. E-mail: email@example.com.
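The workflow described above (the user specifies which word forms belong to which lemmas, and the program then lemmatises the text and counts types and tokens) can be sketched in Python as follows. This is only an illustration of the idea: the function names, the tokenisation rule and the dictionary format are assumptions, and bear no relation to Lexa's actual file formats or setup files.

```python
import re
from collections import Counter


def lemmatise(text, lemma_of):
    """Replace each word form by its user-assigned lemma; unknown forms stay as they are."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [lemma_of.get(token, token) for token in tokens]


def frequency_list(lemmas):
    """Return (lemma, count) pairs, most frequent first."""
    return Counter(lemmas).most_common()


# Hypothetical user-supplied assignment of word forms to lemmas.
lemma_of = {"goes": "go", "went": "go", "mice": "mouse"}

lemmas = lemmatise("He goes; she went. Mice!", lemma_of)
print(frequency_list(lemmas))
print(len(lemmas), "tokens,", len(set(lemmas)), "types")
```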
Lingsoft, Inc. is a linguistic software company based in Helsinki, Finland. They have (cf. the catalogue at http://www.lingsoft.fi/en/products.html) proofing tools (for Danish, Finnish, Swedish and Norwegian), a Swedish grammar checker, a Finnish CD-ROM dictionary, indexing and retrieval tools, the EngCG Parser, the English NPtool, and the Swedish Constraint Grammar system SweCG POS Disambiguator.
There are no prices on their site: you have to ask firstname.lastname@example.org (it is commercial software!), but there are some free online demos.
The Software page of the Linguist List (Eastern Michigan University - Wayne State University) has an imposing quantity of links to linguistic software on the web, ranging from text analysers to spelling utilities, teaching tools, electronic dictionaries and, yes, also NLP applications.
The Linguist's Shoebox is an integrated data management and analysis tool for the field linguist by SIL Computing. The Shoebox is a computer program that helps field linguists and anthropologists integrate various kinds of text data: lexical, cultural, grammatical, etc. It has flexible options for selecting, sorting, and displaying data. It is especially useful for helping researchers build a dictionary as they use it to analyze and interlinearize text. The name Shoebox recalls the use of shoe boxes to hold note cards on which the definitions of words were written in the days before researchers could use computers in the field. The Shoebox works under Win 3.1-98 + NT, DOS, MAC. The software and the manual (PDF format) are both freely available: you can download them directly from the home page, or get them by ftp or as a CD-ROM by snail mail.
The Link Grammar Parser, made by Davy Temperley, Daniel Sleator and John Lafferty, is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser has a dictionary of about 60000 word forms. It has coverage of a wide variety of syntactic constructions, including many rare and idiomatic ones. The parser is robust; it is able to skip over portions of the sentence that it cannot understand, and assign some structure to the rest of the sentence. It is able to handle unknown vocabulary, and make intelligent guesses from context about the syntactic categories of unknown words. It has knowledge of capitalization, numerical expressions, and a variety of punctuation symbols. In a test of 100 financial news sentences, version 3.0 of the parser identified 82% of constituents correctly, and it had an average speed of 2.7 seconds per sentence on a 266 MHz Pentium II. They have made the entire system freely available for download from their ftp. The system is written in generic C code, and runs on any platform with a C compiler. There is an application program interface (API) to make it easy to incorporate the parser into other applications.
+ Version 3.0 of the parser was released in April 1998, and was available by ftp.
+ Version 4.1 was released in August 2000. Among the new features of version 4.0 there is a system which derives a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.); version 4.1 is essentially similar to version 4.0 but has a few bugs fixed in 4.
+ Win Ver 4.1, a Windows version of version 4.1 of the parser, was released in April 2001.
+ There is also documentation of the parser, papers related to Link Grammar, and a small demo online.
Lisp is a highly adaptable programming language, often used in NLP. Common Lisp is the Lisp dialect that is standardized with commercial users in mind, and CMUCL is an implementation of Common Lisp that focuses on its superb compiler. The CMUCL system produces overhead-free code (that is, code as fast as or even faster than C) for a large number of computation constructs. Many implementations of advanced programming languages produce overhead-free code for their own preferred operations, like list processing and searching, but CMUCL is capable of doing the same for constructs that are usually not the focus of advanced languages, like computations on floating point arrays and intensive integer data processing (encryption, data compression, parsing of text and network protocols). Compared to other good Common Lisp compilers, the CMUCL compiler needs fewer declarations to reach the same speed, and it offers verbose messages to help the programmer in formulating the required declarations. The CMUCL design also features fast I/O and an interface to C libraries that requires no glue code, speeding up both using and implementing C libraries.
+ cf. also the Cons Org home page, both with some useful links and tutorials.
+ the Common Lisp Hypermedia Service is an essential reference point for Lisp resources. Release 70.33 is now available on the FTP site for most platforms. This includes significant performance improvements, particularly on the Lisp Machine and Macintosh platforms, many new features, and numerous bug fixes. Major components include a mature HTTP 1.1 server, a programmatic client, a constraint-guided Web Walker, a proxy server and full-text indexing & retrieval. CL-HTTP has a mature port to Microsoft Windows. Completely free systems are available running under FreeBSD UNIX & Linux on x86 hardware.
The Edinburgh Language Technology Group (LTG) makes available various software packages. For research purposes, these are often available for free to academic research groups and for a small fee to industrial R&D groups (contact). Besides the MATE Workbench, these are the tools offered [2001 April 26]:
+ LT TTT, a text tokenization system and toolset which enables users to produce a swift and individually-tailored tokenisation of text.
+ LT XML, an integrated set of XML tools and a developers tool-kit, including a C-based API.
+ RXP, a validating XML parser written in C, available under the GNU Public Licence.
+ LT NSL, a library of normalised SGML tools and a developers tool-kit, including a C-based API.
+ LT POS, an HMM POS tagger; there is also an interactive demo.
+ LT Pleuk, a grammar development shell.
+ LT Chunk, a surface parser which identifies noun groups and verb groups.
+ LT Thistle, a parameterizable display engine and editor for linguistic diagrams.
+ LT Biblio, software for processing bibliographical references.
+ LT TCR, text categorization and text retrieval software.
+ XML Perl, a rule-based XML transformation language which uses Perl in the bodies of rules. This requires the LT XML parser and a Perl interpreter.
The MACO+ Morphological Analyzer Corpus-Oriented accepts unrestricted text as input. The tool tokenizes the text and produces as output all possible morphological interpretations for each token. It is able to recognize and deal with numbers, proper nouns, punctuation, dates, abbreviations, multiwords, etc. Spanish, Catalan and English versions are available. MACO+ is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ MACO+ can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A MACO+ only online tagging service is freely provided by UNED.
+ A Maco+ & Relax online tagging service is also freely provided by UNED.
Malaga is a software package for linguistic applications within the framework of Left-Associative Grammar (LAG). It contains a programming language for the modelling of morphology and syntax grammars. Malaga was developed with the GNU-C-Compiler on HP 9000 Series 700 workstations, but it should be easy to port to any other Unix system with an ANSI-compliant C compiler. There are a number of installations on Intel 80x86 PCs running Linux. Malaga may be used and redistributed under the terms of the GNU Public License.
Malaga 4.3 sources and binaries for HP 9000/700 with HP-UX and PC with Linux can be freely downloaded directly from the site. The package includes a German toy syntax as well as a simple morphological parser for English number words and some grammars for formal languages. Some demos are available online, as well as the documentation in PS/HTML/DVI formats.
The MATE Workbench is a parametrisable XML editor made by the Edinburgh LTG, built around the use of stylesheets to specify the appearance and behaviour of the editor. The MATE Workbench is a program designed to aid in the display, editing and querying of annotated speech corpora. It can also be used for editing/displaying arbitrary sets of hyperlinked XML encoded files. The program was developed as part of the European Union funded research project MATE. The latest release, viz. Ver. 0.17 (8 Aug. 2000), can be freely downloaded from the LTG site. Notice that the MATE workbench can only be run with Java 1.2.x. Changes in Java 1.3 mean that the sound facilities of the workbench will not work, and the display may behave oddly. [2001 April 26].
MBT is a POS tagger made by Jakub Zavrel and Walter Daelemans (from the ILK group, Catholic University Brabant). It has been trained for Dutch, English, Spanish, Swedish, Slovene and German. On the site there is a free online demo (working for all languages) and some downloadable publications about MBT. For more information (price, availability and so on) contact Jakub Zavrel (homepage) and Walter Daelemans (homepage).
A Perl implementation of Generalized and Improved Iterative Scaling by Hugo WL ter Doest. (Freely downloadable).
Micro Concord, made by Mike Scott, also the author of Wordsmith, was published in 1993 by OUP. It is a concordancer, operating on IBM PCs running DOS. DOS is faster than Windows but the number of concordance lines is limited to around 1,500, and you can't save a concordance except as a text file. It is very useful for a quick analysis, and may be easier for students to use than. Freely downloadable from this site.
MonoConc by Athelstan is a concordance program for researchers, language teachers and language students (and anyone who works with texts). It is easy-to-use Windows software, available at a reasonable price from Athelstan.
+ MonoConc Pro Version 2.0 released March 1, 2000. New Features: Advanced Search: Full Regular Expression search; Tag Search; Meta-tag Definitions; Save Workspace. This professional concordance program is used in commercial and educational settings (the site licence version of the program is installed in several computer labs in universities in the U.S. and abroad). MonoConc Pro is stable and operates well under a variety of configurations. It has the intuitive interface of the simpler concordance program MonoConc 1.5 (see below), yet it offers a variety of options that make it capable of complex and extensive text searches. Available for an educational price of $85 for a single user licence (site licence for 15 computers is $550). For further information and any questions about licensing, send email to email@example.com. There is also a freely downloadable demo to try the program.
+ MonoConc 1.5. Athelstan's best-selling Windows concordance program over the three years since the original (1.0) version was produced in 1996. A good choice for concordance novices and occasional users. Originally $89, now $69 (Educ. price); the upgrade path to MonoConc Pro 2.0 costs $45. A freely available demo exists for this older version too.
MORFEUS, developed by the Lengoaia Naturalaren Prozesamendurako IXA Taldea, performs a basic task in the automatic processing of Basque. It assigns to each token in a text its lemma as well as all its possible morphological analyses. The rest of the modules will make use of that output so as to accomplish disambiguation and identify lexical units. The output is given in text format, but they are currently working to provide it in SGML format. Information on availability is lacking.
Morphy, by Wolfgang Lezius, presents a German morphology and tagger in one package. Morphy 1.1 runs under Win 95-NT. The morphology comprises 50,000 lemmas for 350,000 forms; the tagger works either with a small tagset (51 tags) or with a large tagset (456 tags), reaching resp. 96% and 85% of correctly tagged words.
Morphy 1.1 is freely downloadable.
http://tanaka-www.cs.titech.ac.jp/pub/mslr/index.html (Japanese also)
The Morphological and Syntactic LR parser is a tool for simultaneous analysis of syntax and morphological form. Although it has been developed to analyze Japanese sentences, you can also use it for other languages. Unfortunately only the parser is provided, so you must prepare the grammar and the dictionary yourself in order to analyze sentences. The MSLR parser, however, is free: you can download Ver 0.92 plain or with the manual (Japanese only). The basic package includes the MSLR parser and related software with the following characteristics: performs simultaneous analysis of syntax and morphological form; allows input of limitation specifications inside brackets to improve results of analysis; can handle the Probabilistic Generalized LR Model. The MSLR parser only works on Unix. The lexicon look-up module that the MSLR parser uses has been developed at the Matsumoto research laboratory of the Nara Advanced Institute of Science and Technology as SUFARY (high-speed string search system). Contact. [2001 April 29].
MtRecode by Claude de Loupy (CNRS & Université de Provence, Laboratoire Parole et Langage), is a program for translation between various character sets, developed in the framework of the MULTEXT project. It has some of the functionality of the GNU recode tool, but it is based on different principles and is oriented towards SGML text manipulation. ISO 10646 is used internally as a pivot in the character translation process. When exact translation into a character is not possible, MtRecode can use SGML entities as a fallback. Conversely, MtRecode understands SGML entities in the input and can recode them into characters of the target character sets, if they exist. MtRecode is completely customizable: the user can add new character sets and/or entities by providing tables that map characters and entities to ISO 10646.
MtRecode used to be freely downloadable directly from this page, but the download "has been disrupted because of technical problems. We regret the inconvenience and hope that the procedure will be restored shortly" (at least this is what they say). Contact the Multext mailbox or Claude de Loupy.
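The pivot-and-fallback idea described for MtRecode can be illustrated with a minimal Python sketch. It is not MtRecode's code or table format: Unicode strings stand in for the ISO 10646 pivot, Python's built-in codecs stand in for the user-supplied mapping tables, the standard `html.entities` table stands in for an SGML entity set, and the function names `recode` and `expand_entities` are hypothetical. It also assumes an ASCII-compatible target character set.

```python
import html
from html.entities import codepoint2name


def recode(data: bytes, source: str, target: str) -> bytes:
    """Recode bytes from `source` to `target`, using entities as a fallback."""
    text = data.decode(source)  # source charset -> Unicode pivot
    out = []
    for ch in text:
        try:
            out.append(ch.encode(target))  # an exact translation exists
        except UnicodeEncodeError:
            # No exact translation: fall back to a named entity if one is
            # known, otherwise to a numeric character reference.
            name = codepoint2name.get(ord(ch))
            ref = f"&{name};" if name else f"&#{ord(ch)};"
            out.append(ref.encode("ascii"))  # assumes ASCII-compatible target
    return b"".join(out)


def expand_entities(data: bytes, source: str, target: str) -> bytes:
    """Turn entities in the input into target characters where they exist."""
    text = html.unescape(data.decode(source))  # entities -> pivot characters
    return recode(text.encode("utf-8"), "utf-8", target)


if __name__ == "__main__":
    print(recode("café €5".encode("utf-8"), "utf-8", "latin-1"))
    print(expand_entities(b"caf&eacute; &euro;5", "ascii", "latin-1"))
```

In this sketch a character such as "é" is written directly in Latin-1, while the euro sign, which Latin-1 lacks, survives as an entity instead of being lost.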
Multext is developing a series of tools for accessing and manipulating corpora, including corpora encoded in SGML, and for accomplishing a series of corpus annotation tasks, including token and sentence boundary recognition, morphosyntactic tagging, parallel text alignment, and prosody markup. Annotation results may also be generated in SGML format. All Multext tools will ultimately follow the software specifications and data architecture developed within the project. However, the tools are in various stages of development and, in their current state, conform to the Multext specifications to varying degrees. Upon completion, all tools will be publicly available for non-commercial, non-military use; at present, only some tools are publicly available and all of them exist in test versions only. Contact: Jean Veronis. Cf. also the MULTEXT file in the Multilingual Corpora section.
Multiconcord is a Windows-based Multilingual Parallel Concordancer for classroom use developed at the University of Birmingham under the Lingua project. What is distinctive about the work at Birmingham is that the alignment at sentence level is made 'on the fly' when a concordance is requested; and that while most other work in this area has sought to elaborate the methods proposed by Gale and Church in order to achieve greater accuracy, the Birmingham approach has been to simplify those methods. The other distinctive feature of the Lingua project, in fact, is that its primary focus is practical: our primary aim has not been to invent new methods of text alignment (though that has been an incidental spin-off), but to develop a working program and a methodology for teachers and students to exploit the program in language-learning. Users should be able to add their own pairs of texts to the corpus, using simple and easily-learned mark-up conventions based on SGML.
The program is available from CFL Software Development, price £40. Downloadable parallel texts for Multiconcord without restrictions on distribution are available without extra charge from the Parallel Texts Library. There is also a free downloadable demo, which has all the features of the full version, except that it will work only with the three short texts in English, French and German supplied. [2001 April 23].
The mu-TBL system by Torbjörn Lager represents an attempt to use the search and database capabilities of the Prolog programming language to implement a generalized form of transformation-based learning; it can be used for POS tagging and other tasks. The μ-TBL system is designed to be: General (the system supports four types of transformational operators, i.e. four types of rules, by means of which not only traditional "Brill-taggers" but also Constraint Grammar disambiguators can be trained), Easily extensible (through its support of a compositional rule/template formalism and "pluggable" algorithms, the system can easily be tailored to different learning tasks) and Efficient (a number of benchmarks have been run which show that the system is fairly efficient, an order of magnitude faster than Brill's contextual-rule learner). You may download papers and software, and there are example applications to experiment with. Freely downloadable; send mail to Torbjorn.Lager@ling.uu.se if you want to be notified of further developments of the software.
MXPOST is the well-known Maximum Entropy POS Tagger by Adwait Ratnaparkhi (homepage) from the Penn Tools. There is a freely downloadable Java version, together with MXTerminator, a sentence boundary detector. On the site there are also many freely downloadable papers by the author dealing with MXPOST. [2001 April 27].
NAIST-NLT (Nara Institute of Science and Technology Natural Language Tools) provides a flexible natural language processing environment. The system consists of JUMAN (a morphological analyser for Japanese and English), SAX (a compiler of a DCG to a bottom-up Chart parser), VisIPS (a visual interface for showing the partial results of the parsing process), and supporting programs for implementing natural language grammars. Modularity and extensibility are important features of the tools, and users can customize them in various ways. Although they are meant to be used in an integrated way, each of the components can be used as a stand-alone system. Japanese manuals are obtainable as NAIST technical reports (English manuals will be provided shortly). All the tools (approximately 5M bytes in total) are free with no limitation and are downloadable from the ftp site in UNIX file format or EUC for Japanese (Platform Sun OS 4.1.3; implementation languages SICStus Prolog, gcc, X11R5). Contact person: Yuji Matsumoto. [2001 April 28].
+ JUMAN Japanese Morphological Analyzer. The system is implemented in GCC and produces a lattice-like structure of Japanese morphemes given a Japanese sentence. It works as a UNIX filter. In addition, an interface to SICStus Prolog is provided, so that the system can be invoked from Prolog and returns a lattice of morphemes to the Prolog program. The attached dictionary contains about 120,000 entries. The most important feature of the system is that the basic definition of the Japanese morphological grammar system, such as the set of parts of speech, the inflection rules, and the connection rules of morphemes, can be defined by the user. Since a number of Japanese grammars have been proposed, this feature is indispensable.
+ JUMAN English Morphological Analyzer. The English morphological analyzer deals with inflection of English nouns, verbs and adjectives. Since the treatment of the information given by inflection differs in systems, the detailed information is assumed to be written in grammar rules by the user.
+ SAX Concurrent Chart Parser. The SAX parsing system transforms a Definite Clause Grammar (DCG) [Pereira 80] into a Prolog program that realizes a bottom-up Chart parser. The system is implemented by a collection of Prolog clauses directly derived from DCG rules. They are called SAX clauses. A set of SAX clauses with the same predicate name corresponds to a grammatical phrase and defines a concurrent process. Parsing is performed through data communication between those processes. The system is implemented in two levels: the first consists of the transformed grammar rules, and the second works as an interface with other programs as well as the interface to the user. A number of supporting programs are provided to make it easy for users to implement their own grammars in the system, e.g., interface programs with the morphological analyzers, a visual interface described in the next section, and a unification program for feature structures.
+ VisIPS Visual Interface for Parsing Systems. The VisIPS system is a visual interface to parsing systems that shows the partial parse results of the parser in an intuitive way. The SAX parsing system works as a black box for its users, in that the user-defined DCG is transformed into a bottom-up concurrent Chart parser. Looking directly into the Prolog code while the system is running is quite complicated, since the original grammar rules are transformed in a nontrivial way. However, while developing a grammar or a dictionary, it is indispensable to have some way to figure out the system's behaviour. It should be noted that users are usually not interested in the transformation details of the system. The system should therefore report only the behaviour that relates to the user-defined grammar and dictionary. The VisIPS system was originally developed to monitor the behaviour of the SAX processes. It is, however, applicable to any phrase-structure-based parsing system. It shows occurrences of phrases in a triangular table. Two versions of the system have been developed: one is a batch system, where the information on phrase structures is written out to a file in a predetermined format and VisIPS shows the results after the parsing process terminates; the other is an interactive mode, where a newly constructed phrase structure is presented immediately. Both versions are implemented in C and the X11R5 system. The current system uses the socket I/O facility of SICStus Prolog for data communication.
Ngram computes N-gram statistics for a text file. "N-gram statistics" in this context means counting the co-occurrence frequency of N adjacent words (or characters) in text data. Using N-gram statistics, one can find the expressions that appear frequently in the text. In particular, an N-gram is called a unigram if N=1, a bigram if N=2, and a trigram if N=3. Ngram computes N-gram statistics for any N (in the default setting N must be less than or equal to 2048, but you can change this limit when compiling Ngram). Among the advantages of Ngram: it computes N-gram statistics for both words and characters; it does so for any integer N, with no limit on the size of the input text file; and the memory it requires is very small. There are, moreover, specific advantages for Japanese text. Ngram accepts input files encoded in either EUC-JP or ISO-2022-JP, and can detect the encoding of the input file automatically. If alphabetic characters from the ASCII character set (so-called "HANKAKU") and from the JIS X 0208 character set (so-called "ZENKAKU") are intermixed in the input file, Ngram treats them as the same. Likewise, if different variants of punctuation marks (JIS X 0208 contains several) are intermixed in the input file, Ngram treats them as the same. English documentation for the Ngram source package is not available yet, but the current release version, ngram-0.6.1.tar.gz, is freely downloadable. [2001 April 29].
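The counting that the Ngram entry describes can be sketched in a few lines of Python. This is only an illustration of N-gram statistics in general, not the Ngram program's algorithm: the function name `ngram_counts` is hypothetical, and unlike Ngram the sketch holds the whole input in memory and does nothing special for Japanese encodings.

```python
from collections import Counter


def ngram_counts(items, n):
    """Count every run of n adjacent items (words or characters)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


if __name__ == "__main__":
    text = "to be or not to be"
    word_bigrams = ngram_counts(text.split(), 2)   # N adjacent words
    char_trigrams = ngram_counts(text, 3)          # N adjacent characters
    print(word_bigrams.most_common(3))
    print(char_trigrams.most_common(3))
```

Passing a list of words gives word N-grams, while passing the raw string gives character N-grams, which mirrors the two modes mentioned in the entry.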
Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text (freely downloadable).
The Natural Language Software Registry, an initiative of the ACL hosted by the DFKI (German Research Center for Artificial Intelligence) at Saarbrücken, offers a rich summary of the sources of language processing software available to researchers. It comprises academic, commercial and proprietary software with detailed specifications and terms on which it can be acquired; but links to software homepages are often missing. The categories available are the following: Speech Signal Analysis, Morphological Analysis, Syntactic Analysis, Formalism, Semantic and Pragmatic Analysis, Generation, Knowledge Representation Systems, Multicomponent Systems, NLP Tools, Data Sets, Application and Text Processing.
NPtool, by Atro Voutilainen, is a fast and accurate system for extracting noun phrases from English texts. It is sold by Lingsoft: for availability (it is commercial software!) you have to ask firstname.lastname@example.org.
PAPPI is a Multilingual Parsing System for the Principles-and-Parameters Framework by Sandiway Fong (NEC Research Institute). It works on Sun Sparcstations with Quintus PROLOG. Sample implementations of GB are already supplied for English, French, Spanish, Japanese, Korean, Dutch, German and some other languages. Contact: email@example.com.
+ PAPPI 3.1 release is freely downloadable as a gzipped tar file from the PAPPI 3.1 page. Screenshots supplied for English, French, Turkish, Hungarian, Bengali.
+ PAPPI 2.0 release is freely downloadable as a gzipped tar file from the PAPPI 2.0 page. Screenshots supplied for English, French, Spanish, Japanese, Korean, Dutch, German, Hindi and Chinese.
Michael Barlow's ParaConc is a bilingual/multilingual concordance program (in different formats) designed to be used for contrastive corpus-based language research.
+ The original parallel concordancer (programmed in HyperTalk) runs on a Mac. It's free and can be downloaded as a BinHex file from this site. You are asked to send an email message to firstname.lastname@example.org when you do this. The program is for individual, research use only and cannot be loaded onto a network without purchasing a site licence agreement.
+ Some sample aligned text files can also be downloaded via ftp (English, Spanish).
+ An html manual is available online.
+ A commercial Windows version is announced for Summer 2000. It will be available on the Athelstan site.
+ A Windows beta version is said to be available, but the link when I checked doesn't seem to work.
It is an online demo tagging service for English by Ergo Linguistic Technologies. It is free, but you have to give your name and e-mail. Sentences of the size and complexity of the Wall Street Journal or the New York Times may not work because this demo is using a limited dictionary (approximately 40,000 words).
PC-Kimmo, a two-level processor for morphological analysis by SIL Computing, is a new implementation for microcomputers of a program dubbed Kimmo after its inventor Kimmo Koskenniemi (see Koskenniemi 1983). It is of interest to computational linguists, descriptive linguists, and those developing natural language processing systems. The program is designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form. The PC-Kimmo program is actually a shell program that serves as an interactive user interface to the primitive PC-Kimmo functions. These functions are available as a C-language source code library that can be included in a program written by the user. The two functional components of PC-Kimmo are the generator and the recognizer. The generator accepts as input a lexical form, applies the phonological rules, and returns the corresponding surface form. It does not use the lexicon. The recognizer accepts as input a surface form, applies the phonological rules, consults the lexicon, and returns the corresponding lexical form with its gloss. Up until now, implementations of Koskenniemi's two-level model have been available only on large computers housed at academic or industrial research centers. As an implementation of the two-level model, PC-Kimmo is important because it makes the two-level processor available to individuals using personal computers. Computational linguists can use PC-Kimmo to investigate for themselves the properties of the two-level processor. Theoretical linguists can explore the implications of two-level phonology, while descriptive linguists can use PC-Kimmo as a field tool for developing and testing their phonological and morphological descriptions. Teachers of courses on computational linguistics can use PC-Kimmo to demonstrate the two-level approach to morphological parsing.
Finally, because the source code for PC-Kimmo's generator and recognizer functions is made available, those developing natural language processing applications (such as a syntactic parser) can use PC-Kimmo as a morphological front end to their own programs.
PC-Kimmo is freely available by FTP. For information cf. this link.
PC-PATR is a unification-based syntactic parser by SIL Computing. It is an implementation of the PATR-II computational linguistic formalism for personal computers. PC-PATR is still under development; however, it is already available for MS-DOS, Microsoft Windows, Macintosh, and Unix. (The Microsoft Windows implementation uses the Microsoft C QuickWin function, and the Macintosh implementation uses the MPW C SIOW function.) PC-PATR version 0.99 beta 5 was released on October 21, 1997. The PATR-II formalism can be viewed as a computer language for encoding linguistic information. It does not presuppose any particular theory of syntax. It was originally developed by Stuart M. Shieber at Stanford University in the early 1980's. A PATR-II grammar consists of a set of rules and a lexicon. Each rule consists of a context-free phrase structure rule and a set of feature constraints, that is, unifications on the feature structures associated with the constituents of the phrase structure rules. The lexicon provides the items that can replace the terminal symbols of the phrase structure rules, that is, the words of the language together with their relevant features.
+ There is a manual available.
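The notion of unification on feature structures, which the PC-PATR entry relies on, can be sketched as follows. This is an illustration under simplifying assumptions, not PC-PATR's implementation or the PATR-II rule syntax: feature structures are modelled as nested Python dictionaries, there is no structure sharing (reentrancy), and the function name `unify` and the sample features are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures; return None when they conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy, so the inputs are left untouched
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else None  # atomic values must be identical


if __name__ == "__main__":
    # A constraint such as "the agreement of NP equals the agreement of VP"
    # succeeds only when the two agreement structures unify.
    np = {"cat": "NP", "agr": {"num": "sg", "per": "3"}}
    vp_ok = {"cat": "VP", "agr": {"num": "sg"}}
    vp_bad = {"cat": "VP", "agr": {"num": "pl"}}
    print(unify(np["agr"], vp_ok["agr"]))
    print(unify(np["agr"], vp_bad["agr"]))
```

Unification merges compatible information from both structures and fails on a clash, which is how a feature constraint can block a phrase structure rule from applying.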
The page collects information on a variety of NLP systems, tools and resources located at Penn (i.e. the University of Pennsylvania), many of which are available externally. Cf. for example Penn Treebank, Middle English Treebank, Old English Treebank, Chinese Treebank, Dan Melamed’s Tools, Maximum Entropy Tagger, SED Tokenizer, XTag Project and Tools, etc. There are also some smaller tools, such as Graphical Parse Tree Viewers (e.g. Viewtree, a small lisp script) and LaTeX macros for tree writing, and cf. the page on Korean NLP.
PiSystem is a complex but well integrated suite of tools by Eugenio Picchi and his "DBTficio", specially designed for linguistic and lexicographic analysis of Italian literary texts. The core module of this system is the well known DBT, but many other components are already available or under development for more specific tasks. The DBT system is patented by CNR and is also adopted by the renowned LIZ collection.
PiTagger is a tagging and automatic lemmatization procedure made by Eugenio Picchi, and is a part of the PiSystem suite. It relies mainly on three components. (1) A morphology of the Italian language, which assigns all possible lemmata to each form. (2) A statistical approach, which, relying on a previously prepared training corpus, allows disambiguation of the multiple tags. (3) An interactive post-editor, by which it is possible to check interactively the results of the automatic tagging procedure and to manually correct the mistakes. The annotated texts (whether produced by the automatic procedure alone or by the interactive post-editor) are compatible with the DBT corpus manager.
(Not free; e-mail or Web trials available).
PosPar is a Korean Syntactic Analyzer using Korean Combinatory Categorial Grammar formalism (including POSTAG) by PosTech Laboratory (cf. also the related paper file in Korean). The N-Best POSPAR99 beta-0.9 demo version (0.01) (binary) (Dec/01/1999) is freely downloadable for non commercial use. There is also an online demo. [2002 February 20].
PosTag is a Korean Morphological Analyzer and POS tagger with generalized unknown morpheme handler by PosTech Laboratory (cf. also the README file in Korean). The AG99 beta-1 demo version (binary) (including 100,000 full vocabulary) (Dec/06/1999) is freely downloadable for non commercial use. There is also an online demo. [2002 February 20].
This Korean NLP group (located at San 31, Hyoja-Dong, Pohang, 790-784, Korea) mainly does research on Korean TTS using prosody and phonetic analysis, NLP for speech recognition, and Korean morphological analysis and POS tagging. Some of the resources and tools developed at PosTech are available as free or open source software for research communities (non-commercial use), such as:
+ POSTAG Korean Corpus, a POS tagged corpus of about 100,000 morphemes;
+ POSTAG, a Morphological Analyzer / POS tagger with generalized unknown morpheme handler;
+ POSPAR, a Syntactic Analyzer using Korean Combinatory Categorial Grammar formalism;
+ POSNIR, a Korean Natural Language Information Retrieval System (Search Engine, compound noun Indexer, NL query processing (including POSTAG and POSPAR)) - cf. also the README file in Korean and the online demo;
+ POSTTS, a Korean text to speech system which converts general Korean text sentences into their corresponding phoneme sequences (cf. the README file in Korean). [2002 February 20].
Qtag is an implementation of a probabilistic tagger, based roughly on HMM technology, by the Corpus Research Group - University of Birmingham (Oliver Mason). It is written in Java, which means it should run on any piece of computing equipment for which a Java implementation is available. Qtag is language independent, i.e. it should work easily with languages other than English (for which it was originally designed). All you need to start is a (manually) tagged training corpus in order to generate the language specific resource files needed. If you want to use Qtag in your own applications, please refer to the API Documentation.
+ The latest version is freely available (cf. this address). It is written in pure Java and should run on all platforms.
+ The older version of the tagger implements a client-server model, where a central server dishes out information on tag probabilities of words to tagger processes which are running as clients, connecting to the server over a network. This setup is advantageous if you have several people working with the tagger, as the resources can be shared. If you want to use the tagger at home, download the new version. The older version is freely downloadable from this page, but please read the License first.
+ An E-mail tagging service is also freely available at Birmingham.
+ The tagset used is available on the Web. [Rev. 2001 November 27].
Quick Tag & Quick Parser are two English language tools by Cogilex. Both run under Windows, and have a free online version you can try.
+ QuickTag is a COM component for Win32 that can efficiently tag (identify the possible grammatical categories of words), lemmatize (identify the root form of words), disambiguate (indicate the actual grammatical category of words) and extract noun phrases from English text. There is a demo version you can download. The basic price (cf. on this page) for QuickTag 2.0 is US$1500 and the royalty-free perpetual distribution license is US$25000. You can purchase it (if you are a very rich man) directly from GetSoftware.com.
+ QuickParse is used internally by Cogilex to efficiently parse English text (i.e. identify the grammatical role of words and the grammatical links between words) and extract semantic content of different types. Cogilex can customize a version of QuickParse for your specific needs or can develop a complete solution for you. An independent general version of QuickParse as a COM component for Win32 is in development and will be offered soon.
Qwick is a corpus browser by the Corpus Research Group - University of Birmingham that allows you to build up your own working corpus, retrieve concordance lines using a simple but powerful query language, and to compute collocation statistics using a variety of adjustable parameters. It is implemented in Java, and is thus platform independent. It has been tested on Windows and Solaris, and (according to Lou Burnard from OUCS) also runs on the Apple Macintosh. Qwick can handle markup in XML format.
+ freely downloadable now is the version that is also distributed on the BNC sampler, but without the data. In order to use it on your own data you will have to index the data first; a description of the indexing procedure is available. Documentation is included with the release, but if you want to have a look before downloading it, there is also an online version.
The Relax POS tagger takes as input the output of the morphological analyzer MACO+, and selects the right POS and lemma for each word in the given context. Currently, it produces an output with over 97% precision. The language model may be easily improved with the addition of new context constraints expressed in CG formalism, either hand-written or statistically acquired. Spanish, Catalan and English versions are available. Relax is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ MACO+ can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A Maco+ & Relax online tagging service is also freely provided by UNED. [2001 April 30].
This is a tool, programmed by Mick O'Donnell, that can help a great deal in constructing diagrams of RST analyses (cf. RST (Rhetorical Structure Theory) Web Site). An earlier version of it was used to produce the diagrams on this website. Recently it has been extensively improved and debugged. In the last year over 100 changes have been made, and it is now more ready for widespread use: you can freely download it at this address.
SAM, made by the CRIBeCu (Centro di Ricerca per i BEni CUlturali), is the graphical interface of a library designed to implement a data structure known as a suffix array. A suffix array allows for the full text indexing of a collection of texts stored on disk. SAM has been designed exclusively for experimental use, even though the implemented library has been successfully exploited for the creation of TreSy, freely downloadable from this site.
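For readers unfamiliar with the data structure, a suffix array can be sketched in Python as below. This is not the CRIBeCu library: it is a naive in-memory construction (sorting the suffixes directly), whereas SAM indexes texts stored on disk, and the function names `build_suffix_array` and `find_all` are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return every position where `pattern` occurs, via binary search."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th suffix in sorted order.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    corpus = "banana"
    sa = build_suffix_array(corpus)
    print(sa)
    print(find_all(corpus, sa, "ana"))
```

Because all suffixes that begin with the same string are adjacent in the sorted order, any substring of the indexed text can be located with two binary searches, which is what makes the structure suitable for full-text indexing.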
The Simple Concordance Program (SCP) by Alan Reed is a free program that lets you create word lists and search natural language text files for words, phrases, and patterns. SCP is a concordance and word listing program that is able to read texts written in many languages. There are built-in alphabets for English, French, German, Greek, Russian, etc. SCP contains an alphabet editor which you can use to create alphabets for any other language. SCP uses XML syntax when writing its project files. Although they have and need the extension .scp to be picked up by the SCP program, they can, if given an .xml extension instead, be viewed with XML processing software (this capability is now available in recent releases of Internet Explorer; you can however save concordances in HTML to display on the web). [2002 February 17].
It is a simple SED script, from the Penn Tools, that does a decent enough job on most English corpora, once the corpus has been formatted into one-sentence-per-line. It is tailored to suit the Penn Treebank and the Maximum Entropy Tagger. It is freely downloadable directly from this link. [2001 April 27].
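As an illustration of what a concordancer such as SCP produces, here is a minimal keyword-in-context (KWIC) sketch in Python. It does not reflect SCP's internals or file formats: the function names `kwic` and `format_kwic`, the character-based context window, and the case-insensitive whole-word matching are all assumptions made for the example.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(), right))
    return hits


def format_kwic(hits, width=30):
    """Right-align the left context so that the keywords line up in a column."""
    return [f"{left:>{width}} {word} {right}" for left, word, right in hits]


if __name__ == "__main__":
    sample = "The cat sat on the mat while the dog slept by the door."
    for line in format_kwic(kwic(sample, "the", width=15), width=15):
        print(line)
```

Aligning the keyword in a fixed column is the conventional concordance layout, since it lets the reader scan the left and right contexts of every occurrence at a glance.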
SHAXICON by Donald Foster is a lexical database that indexes all of the words that appear in the canonical plays 12 times or fewer, including a line-citation and speaking character for each occurrence of each word. (These are called "rare words", though they are not rare in any absolute sense - "family [n.]" and "real [ad.]" are rare words in Shakespeare). All rare-word variants are indexed as well, including the entire "bad" quartos of H5, 2H6, 3H6, Ham, Shr, and Wiv; also the nondramatic works, canonical and otherwise (Ven, Luc, PP, PhT, Son, LC, FE, the Will, "Shall I die" et al.); the additions to Mucedorus and The Spanish Tragedy, the Prologue to Merry Devil of Edmonton, all of Edward III and Sir Thomas More (hands S and D); Ben Jonson's Every Man in His Humour (both Q1 and F1) and Sejanus (F1); and more; but these other texts have no effect on the 12-occurrence cutoff that sets the parameters for SHAXICON's lexical universe.
Address queries to Professor Donald Foster, for availability.
Cf. also the Shakespeare Authorship home page.
SIGNUM, a company founded in 1988 in Quito, Ecuador, provides Microsoft with proofing tools that support the Spanish language in Office 2000. The spell checker, hyphenator and thesaurus that MS has included in this product were rated by PC World magazine (April 1999) as one important advantage of the latest Office version. SIGNUM provides a range of consulting and Spanish language engineering services including specialized tagging, customized spell checkers, as well as marketing and distribution of language related products. There are no true corpus linguistics tools here, but it is still an interesting site.
The computing page of the SIL (Summer Institute of Linguistics) provides links to the SIL software catalogue, to the Fonts in Cyberspace page, to SIL software projects now in progress and to other resources on the Web. They develop mainly text analysis tools, fonts and keyboard utilities, but there are also some NLP applications (and their software is usually free), such as AMPLE, PC-Kimmo, K-Tagger, K-Text, Englex and PC-PATR.
One of the richest reference lists of NLP software available on the Web! It is a part of the Speech and Language Web Resources, the big reference archive by Kenji Kita, Tokushima University. [2001 April 28].
SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth). (License required)
Stamp is a morphological transfer and synthesis tool by SIL Computing for adapting text to a related language. Stamp works with AMPLE to adapt text from one language to another (closely) related language, a process known as "Computer Assisted Related Language Adaptation" (CARLA). Stamp provides morphological transfer and synthesis for words analyzed by AMPLE. The result is, at best, a rough draft suitable for editing by a competent speaker of the target language. Stamp works under Win 3.1-98 + NT, DOS, MAC and Unix. It is freely available under agreement to the SIL standard free license.
+ A printed manual can be ordered.
Swedish Constraint Grammar (SweCG) is a system for part-of-speech disambiguation and shallow syntactic analysis of running Swedish text, developed within the Constraint Grammar (CG) framework. For availability (it is commercial software!) you have to ask email@example.com.
This page provides information on Systemic-Functional Linguistics (SFL) software. Information and links to many SFL tools, some easily available and some not.
TACAT is a parser that takes as input the output of the morphological analyzer MACO+, or the output of any tagger, and produces a syntactic analysis. The tool is a chart-based parser, with some extensions for flexibility. It uses CFG grammars, which can produce either a complete sentence analysis or just partial parsing and chunk recognition. Spanish and Catalan versions are available. TACAT is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ TACAT can be queried freely online (Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo. [2001 April 30].
TACT is a text-analysis and retrieval system for MS-DOS that permits inquiries on text databases in European languages. It has been developed by a team of programmers, designers, and textual scholars including John Bradley, Ian Lancashire, Lidio Presutti, and Michael Stairs. It was begun under the IBM-University of Toronto Cooperative in the Humanities during 1986-89. TACT is a system of 15 programs for MS-DOS designed to do text-retrieval and analysis on literary works. Typically, researchers use TACT to retrieve occurrences of a word, word pattern, or word combination. Output takes the form of a concordance, a list, or a table. Programs also can do simple kinds of analysis, such as sorted frequencies of letters, words or phrases, type-token statistics, or ranking of collocates to a word by their strength of association. TACT is intended for individual literary texts, or small to mid-size groups of such texts, such as Chaucer's poetry, Francis Bacon's Essays, Shakespeare's plays, Jane Austen's Pride and Prejudice, John Irving's The Cider House Rules, similar works in French, German, Italian, Spanish, Latin, and other modern European languages or languages using a roman alphabet, and classical Greek. Using TACT for large corpora can raise problems.
+ An HTML manual is available.
+ The TACT suite is freeware and downloadable from Toronto FTP.
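The simple statistics mentioned for TACT (sorted word frequencies and type-token figures) can be illustrated with a minimal Python sketch. This is not TACT code: the function names and the ASCII-only tokenizer are assumptions made for illustration, and a real tool would need a tokenizer suited to the language of the text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: words consist of ASCII letters and apostrophes only.
    """
    return re.findall(r"[a-z']+", text.lower())

def frequency_profile(text):
    """Return (token count, type count, type-token ratio, ranked word list)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    ratio = len(counts) / len(tokens) if tokens else 0.0
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return len(tokens), len(counts), ratio, ranked

sample = "The cat sat on the mat. The dog sat on the log."
n_tokens, n_types, ratio, ranked = frequency_profile(sample)
for word, freq in ranked[:3]:
    print(f"{freq:3d}  {word}")
print(f"tokens={n_tokens} types={n_types} ratio={ratio:.2f}")
```

The type-token ratio depends strongly on text length, so it is only comparable between samples of similar size.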
TACTweb is experimental software developed as a part of a project by John Bradley (homepage) and Geoffrey Rockwell (homepage). TACTweb connects TACT to the World Wide Web, making a TACT TDB database accessible to the entire WWW community. Through WWW forms, users get access to some of the interactive services that TACT provides, without having to use TACT itself or keep a copy of the TACT database on their own machine. They can formulate queries against a TDB database using the same query languages used in TACT/UseBase, and get results that look something like those produced by UseBase in response. Because the WWW Forms language acts as the interface, the TACTweb user doesn't need to learn how to interact with TACT other than to learn how to formulate queries in the query language. Moreover, since the database is accessed via a WWW browser, the queries are system-independent (you can access TACTweb online whether you are a PC, Macintosh or Unix user). Currently, however, TACTweb requires a PC as its WWW server. TACTweb is freely downloadable by ftp; tutorials and demos, such as Italian, French and Spanish Online Korpusanalyse mit Hilfe von TactWeb (cf. details under Corpora general section), are also freely available on the site. [2001 April 23].
The Computerlinguistik-Gruppe an der Universität Zürich (Martin Volk, Gerold Schneider, Simon Clematide) trained Brill's Tagger for German. It now has a training corpus of 58,000 words (from the Jahresberichte der Universität Zürich). The Brill Tagger so equipped can be tested online on this page.
The page by Hamagushi Takahashi has a lot of frames: you have to follow the link to "Software Plaza", where you can find a dozen or more links to shareware or freeware software, mostly by Takahashi himself, such as the Corpus Wizard and KKC.
The Text Analysis Tool with Object Encoding for computer-assisted text analysis by Melina Alexa and Lothar Rostek offers functionality for text as well as coding exploration and on-line coding of text data. It supports a large variety of tasks related to computer-assisted and multi-layered text analysis. It works under Windows 95, 98 and NT. Some of the most important functions supported by TATOE are: (1) Importing raw-ASCII, HTML and XML text data; (2) Importing text data containing structure markers, e.g. dialogue texts with speaker turns; (3) Importing German texts which have been morphologically analysed with the Morphy tagger, and storing the tagging as a separate categorization scheme to be used for further analysis; (4) Creating as well as maintaining flat or hierarchically structured categorization schemes; (5) Interacting with already coded text in a number of different ways and presenting text and coding information in several different views; (6) Performing searches on text and coding; (7) Obtaining concordances and KWIC displays; (8) Obtaining co-occurrence distribution graphs; (9) Coding text semi-automatically or manually on the basis of different coding schemes; (10) Defining complex search patterns, allowing for combinations of scheme categories (of different schemes) as well as strings; (11) Defining and working with different styles for displaying existing coding; (12) Exporting the coding to an SPSS syntax file; (13) Exporting the complete data of a project to XML.
You can download the new beta version (V0.987) of TATOE: free after subscription.
TATOO is a tagger by Gilbert Robert of the ISSCO (University of Geneva - Institut Dalle Molle pour les Etudes Sémantiques et Cognitives). This program takes as input a sequence of words annotated with one or more tags and returns the most likely tag for each word in the text. Based on a Hidden Markov Model, this process is accomplished in two steps: (1) A training phase to estimate the parameters of the model; (2) A tagging phase to select the most probable tags according to this model, employing the Viterbi algorithm. The download is free, but registration is required.
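The two steps of an HMM tagger (training, then Viterbi decoding) can be sketched in a few lines of Python. This is a generic bigram HMM and not TATOO's actual implementation: the toy corpus, the function names, and the add-one smoothing are assumptions chosen only to keep the example self-contained, and every tag is considered for every word rather than a pre-assigned candidate set.

```python
import math
from collections import defaultdict

def train(tagged_sentences):
    """Training phase: count tag transitions and word emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(table, given, event, n_events):
    """Add-one smoothed log P(event | given); the smoothing is an assumption."""
    row = table.get(given, {})
    return math.log((row.get(event, 0) + 1) / (sum(row.values()) + n_events))

def viterbi(words, trans, emit):
    """Tagging phase: return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    vocab = {w for row in emit.values() for w in row}
    n_words = len(vocab) + 1  # one extra slot for unknown words
    # best[i][tag] = (log probability of the best path ending in tag, previous tag)
    best = [{t: (log_prob(trans, "<s>", t, n_tags)
                 + log_prob(emit, t, words[0], n_words), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit, t, words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans, p, t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit = train(corpus)
tagged = viterbi(["the", "cat", "barks"], trans, emit)
print(tagged)
```

Working in log space avoids numerical underflow on long sentences, which is why the probabilities are summed rather than multiplied.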
"Tex2asc" is a rudimentary free TeX-to-ASCII converter. It was written solely for the purpose of converting the TeX Latin documents of the Project Libellus (described in the Electronic Literary Texts Section) into ASCII form, so it probably won't work if you try to use it on any more complicated TeX document. Compiled versions of the program, for MS-DOS and VMS, are available in libellus/utils, and a Macintosh executable will be put there if it can be run without crashing the Mac. If you plan to use it on a Unix system, however, you will have to compile it first: machine% uncompress tex2asc10.tar.Z || machine% tar xvf tex2asc10.tar || machine% cd tex2asc10 || machine% make.
Between 1993/1994 and 1997 the "Textual corpora and tools for their exploration" project of the IMS Stuttgart collected textual material for German, French and Italian, and developed a representation for texts and markup, along with a query language and a corpus access system for linguistic exploration of the text material. Texts and analysis results are kept separate from each other, for reasons of flexibility and extensibility of the system; this is possible because of a particular approach to storage and representation. Tool components under development, language-specific and general, range from morphosyntactic analysis to partial parsing, and from mutual information, t-score, collocation extraction and clustering to HMM-based tagging and n-gram tagging. Research on statistical models for noun phrases, verb-object collocations, etc. was also undertaken. The main outcomes of this research are the TreeTagger and the CWB.
The Thai Internet Education Project (TIE) develops and distributes innovative, on-line resources for education, including tools for Thai students of English, and overseas students of Thai. Resources developed and distributed by TIE include on-line guides to pronunciation and transcription, the TOLL (Thai On-Line Library) of parallel English/Thai translations with its TOLL Toolkit, on-line dictionary tools, downloadable "Reader's Reference" cards, etc. TIE focuses on developing "enabling technology" -- software and systems that can be adapted and extended by teachers anywhere in the world. Systems developed for Thai can be applied to Lao, Khmer, and Burmese as well. All TIE Project resources may be freely downloaded and incorporated into lessons by teachers, or used by students directly.
TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words. TnT is not optimized for a particular language. Instead, it is optimized for training on a large variety of corpora. Adapting the tagger to a new language, new domain, or new tagset is very easy. Additionally, TnT is optimized for speed. Although trainable for various languages, it comes with pre-compiled models for English (Susanne and Penn TreeBank) and German (NEGRA). For each of the three models you can submit a sentence to be tagged online. It runs on Solaris and Linux. TnT is free of charge for non-commercial research purposes after signing a license agreement, to be sent by postal mail to Thorsten Brants / Universität des Saarlandes / FR 8.7 Computerlinguistik, Geb. 17 / P.O.Box 151150 / D-66041 Saarbrücken, Germany.
Tools for tokenization are very seldom available on the web, maybe because it is difficult to develop generic tools when each corpus has its own specific problems. This LTG FAQ page doesn't link to any downloadable software, but it can be a good starting point for understanding the problem, and it provides some good contact information.
All TOLL software is said to be freely available, but there are still no download instructions.
Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from a historical perspective, and for software sharing in academia more generally. LOB tag set. (Freely downloadable)
Transcriber is a tool for assisting the creation of speech corpora. It allows users to manually segment, label and transcribe speech signals, for later use in automatic speech processing. It is more specifically geared towards the transcription of long duration broadcast news recordings, with labelling of speech turns and topic changes. It provides a user-friendly interface which is configurable. Transcriber is developed with the scripting language Tcl/Tk and C extensions. It relies on the Snack sound extension, which allows support for most common audio formats, and on the tcLex Lexer generator. It has been tested on various Unix systems (Linux, Sun Solaris, Silicon Graphics) and Windows NT. Transcriber is distributed as free software under GNU License: follow this link (binary distribution for Linux, Solaris, SGI and Windows NT available).
Trees 2 (© Sean Crist and Tony Kroch) is a Macintosh program for displaying and manipulating syntactic trees and derivations. It has several uses for teachers and students of natural language syntax, including: building trees and pasting them into word processing documents, demonstrating syntactic structures and derivations in a computer-equipped classroom, constructing interactive exercises for use in homework assignments, modeling syntactic analyses to demonstrate and informally test their descriptive coverage. The program works with grammar files that specify words, grammatical structures, and processes of syntactic composition in lexicalized grammar fragments. These fragments can be written by program users to suit their purposes and in pedagogical applications are normally written by the teacher for use with an assignment. There is a Trees Grammar Archive with grammar fragments and assignments available for downloading.
The program is freely available to students and staff of the University of Pennsylvania. The program is also freely available to students and staff at universities that have purchased site licenses for the program. Other users may download for evaluation purposes a demonstration version of the program which will expire after 30 days. After 30 days the program must be registered ($25) in order to continue functioning. For further information on how to register the program or on the cost and conditions of a site license, please contact: Anthony Kroch - Department of Linguistics - University of Pennsylvania 19104-6305.
The decision-tree-based tagger from the IMS (Helmut Schmid) is a tool for annotating text with part-of-speech information. The tagger can be trained on manually annotated text for any language. Preprocessed tagger lexicons are available for English, German, French, and Italian. The software is obtainable free of charge for research and educational purposes. For a commercial license contact Helmut.Schmid@IMS.Uni-Stuttgart.de.
+ Giuseppe Attardi at the University of Pisa has implemented a free online tagging service based on the Tree Tagger. Unfortunately this page now seems to be down [2001 August 24].
This Part-of-Speech tagger takes as input the output of the morphological analyzer MACO+, and selects the right POS and lemma for each word in the given context. Currently, it produces an output with over 97% precision. The language model is based on decision trees acquired from tagged corpora. Spanish and English versions are available. This TreeTagger is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona), and is not to be confused with the more famous TreeTagger developed at IMS Stuttgart. Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ TreeTagger can be queried freely online (English, Spanish) as part of the DLSI-UPC/CLiC-UB Tools Demo.
TReSy is the CRIBeCu (Centro di Ricerca per i BEni CUlturali) engine for XML/SGML texts, designed to give the best combination of information retrieval and document management. The site gives no information on how to obtain it: ask Francesco Corti directly.
This freely downloadable toolkit by Adam Berger is designed to find pairs of words which co-occur with high frequency in text. Starting from a corpus of text---several million words is what we have in mind---the program automatically discovers ordered pairs of words where the occurrence of the first word in a pair makes the subsequent appearance of the second word much more likely than it otherwise would be. For example, the toolkit might discover that the word "patient" augurs the imminent appearance of "drug". To rank pairs of words, the toolkit uses mutual information. The software, written in C++, comes with a makefile.
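The idea of ranking ordered word pairs by mutual information can be sketched as follows. This is not the toolkit's code (which is written in C++): the function name, the window size, the frequency cutoff, and the use of pointwise mutual information are all assumptions made for illustration, and the toolkit's exact formula may differ.

```python
import math
from collections import Counter

def ordered_pair_pmi(tokens, window=3, min_count=2):
    """Rank ordered pairs (first, second) by pointwise mutual information.

    A pair is counted whenever `second` occurs within `window` tokens
    after `first`. Pairs seen fewer than `min_count` times are dropped,
    since PMI is unreliable for rare events.
    """
    pair_counts = Counter()
    first_counts = Counter()   # how often a word opens a pair
    second_counts = Counter()  # how often a word closes a pair
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 1 + window]:
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
            second_counts[second] += 1
    total = sum(pair_counts.values())
    scores = {}
    for (first, second), n in pair_counts.items():
        if n < min_count:
            continue
        # PMI = log2( P(first, second) / (P(first) * P(second)) )
        scores[(first, second)] = math.log2(
            n * total / (first_counts[first] * second_counts[second])
        )
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

tokens = "patient drug x y patient drug z w".split()
ranked = ordered_pair_pmi(tokens, window=1, min_count=2)
for (first, second), score in ranked:
    print(f"{first} -> {second}: {score:.3f}")
```

On a real corpus of several million words the window and cutoff would need tuning, and the counters would dominate memory use.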
This page in Italian is the standard reference (as a matter of fact it was the presentation of the tool made by the author) for Verbum Textual Analysis by Fernando La Greca (Dipartimento di Scienze dell’Educazione - Università di Salerno. E-mail). Verbum, basically, is a textual database or text processor (like DBT or Gatto) working under DOS but interacting with Windows 3.1-98 and Word.
+ Verbum v. 1.0 is described in the Concordanze.net site, where it can be freely downloaded.
+ Verbum v. 2.0 is freely downloadable directly from the main page, through Salerno University FTP site. [2002 February 22].
A Viterbi triclass tagger in SICStus Prolog, written by Torbjörn Lager and Joakim Nivre at the Department of Linguistics, Göteborg University, is freely available as a gzipped TAR archive.
Workbench for Analysis and Generation (WAG) is a system which offers various tools for developing Systemic (cf. Systemic-Functional Linguistics) resources (grammars, semantics, lexicons, etc.), maintaining these resources (lexical acquisition tools, network graphers, hypertext browsers, etc.), and processing (sentence analysis -- O'Donnell 1993, 1994; sentence generation -- O'Donnell 1995b; knowledge representation -- O'Donnell 1994; corpus tagging and exploration -- O'Donnell 1995a). Note: Sentence Analysis (parsing) is still experimental, so it is not supported in the current release. Semi-Automated analysis (The WAG Coder) is however supported. The Sentence Generation component of this system generates single sentences from a semantic input. This semantic input could be supplied by a human user. Alternatively, the semantic input can be generated as the output of a multi-sentential text generation system, allowing such a system to use the WAG system as its realisation component. The sentence generator can thus be treated as a black-box unit. Taking this approach, the designer of the multi-sentential generation system can focus on multi-sentential concerns without worrying about sentential issues. Platform: Macintosh with interfaces. Sun with text interface. Cf. also this http.
The WordCruncher Server of the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) project makes text corpora of many ancient Indo-European as well as non-Indo-European languages (see the TITUS file, or the general Index, but beware that it is very large) available for all kinds of scholarly investigations. It can be used and accessed by anybody equipped with the WordCruncher Viewer program provided by WordCruncher Company (but beware that the last time I tried to follow this link I was brought instead to BuyDomains.com!). The program can be freely downloaded here in version 5.2 (running under MS Windows 3.11, MS Windows 95/98, MS Windows NT 4.0 and MS Windows 2000). In the WordCruncher Viewer program, you will have to enter the address of the TITUS WordCruncher Server, which is titus.fkidg1.uni-frankfurt.de or, numerically, 188.8.131.52, as the "host name" in the "remote library" dialogue (accessed from the "File" menu of the start window). You will also have to install the TITUS WordCruncher font package. This can be downloaded free of charge here, provided you declare that you will use both the texts and the fonts for non-commercial purposes only. [2001 August 30].
MicroConcord, made by Mike Scott, the author of Wordsmith, (Windows 95, 98, NT only) is a file splitter which enables you to cut a large original text file into numerous small ones, e.g. of 500 words each. Options include cutting out punctuation and tags, and making a list in alphabetical order. Freely downloadable.
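What such a file splitter does can be shown with a short Python sketch. It is not Mike Scott's program: the function names, the output file naming scheme, and the regular expressions used to strip tags and punctuation are assumptions made for illustration.

```python
import re
from pathlib import Path

def split_into_chunks(text, words_per_chunk=500,
                      strip_tags=False, strip_punctuation=False):
    """Cut `text` into consecutive chunks of at most `words_per_chunk` words."""
    if strip_tags:
        # Assumption: tags are anything between angle brackets.
        text = re.sub(r"<[^>]*>", " ", text)
    if strip_punctuation:
        # Replace every character that is neither a word character nor whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def write_chunks(chunks, out_dir, stem="part"):
    """Write each chunk to its own numbered file; returns the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, chunk in enumerate(chunks, start=1):
        path = out / f"{stem}_{n:04d}.txt"  # hypothetical naming scheme
        path.write_text(chunk + "\n", encoding="utf-8")
        paths.append(path)
    return paths

chunks = split_into_chunks("<p>one, two three four five</p>",
                           words_per_chunk=2,
                           strip_tags=True, strip_punctuation=True)
print(chunks)
```

Splitting on whitespace is the simplest definition of a word; a corpus with its own tokenization conventions would need a different rule.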
Wordsmith, made by Mike Scott, is a powerful suite of lexical analysis tools for data-driven learning and research (mainly concordancing and sorting). It runs under Windows 3.x, 95, 98, NT. You can download it for free directly from Mike Scott's website. The download includes an extensive help system and a printable manual of about 130 pages; there's also an introduction online.
XConcord, developed by the Computing Research Laboratory (CRL) at New Mexico State University, is a concordance tool that allows Key Word In Context (KWIC) searches to be performed on Unicode text. Currently there are over 17 languages supported through input and display methods. The tool has been designed to be easy to work with so that teachers, students and other users can use XConcord to identify relevant texts by viewing words and expressions in context. Searching is quick and the size of the corpus is limited only by available disk space. Searching is also flexible. Users can match any string with any part of a word or phrase. Users can also limit the search to display only those texts either containing or missing specified strings in the context to the left or right of the keyword. Searches can also be performed using regular expressions. The results of an XConcord search are shown in a KWIC-style display. The complete sentence for the selected KWIC line is shown beneath the KWIC display. The complete document can be displayed in another window. XConcord provides easy methods for saving individual sentences or complete documents to new text files. XConcord has no built-in support for SGML/TEI markup, but it is possible to search for such tags by specifying them as a part of the search string expression.
+ XConcord is freely downloadable from this page after you have signed the CRL Software License Agreement and obtained a username and password to log in (it's easy: they don't ask you for money or embarrassing questions!).
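A KWIC search of the kind described for XConcord, with a regular-expression keyword and optional filters on the left and right context, might be sketched like this in Python. The function names and parameters are hypothetical and do not reflect XConcord's own interface; context is measured in characters here, which is an assumption.

```python
import re

def kwic(text, pattern, width=30, left_has=None, right_lacks=None):
    """Return (left context, keyword, right context) for each regex match.

    left_has:    keep a hit only if this string occurs in its left context.
    right_lacks: keep a hit only if this string is absent from its right context.
    """
    hits = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        if left_has is not None and left_has not in left:
            continue
        if right_lacks is not None and right_lacks in right:
            continue
        hits.append((left, m.group(0), right))
    return hits

def format_kwic(hits, width=30):
    """Align the keywords in a column, as in a KWIC display."""
    return "\n".join(f"{left:>{width}} [{key}] {right}"
                     for left, key, right in hits)

text = "The cat sat on the mat. A cat slept."
hits = kwic(text, r"\bcat\b", width=10)
print(format_kwic(hits, width=10))
```

Python strings are Unicode, so the same sketch works on non-Latin scripts as long as the pattern is written accordingly.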
A corpus management tool by Laurent Romary developed by CRIN (Centre de Recherche en Informatique de Nancy). This is not so much a system for browsing corpora as a system to manipulate, annotate and transform TEI/SGML encoded corpora. It embodies up-translation modules, a corpus editor, an alignment module and an exporter into HTML format, which makes results suitable for viewing on the WWW. Availability information is lacking.
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relations between them. Currently under development are phrasal parsers for French and German, an LFG grammar for French and projects on multilingual information retrieval, translation and generation. Only free web demos of some of their tools are available, for Arabic, Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Russian and Spanish; the Xerox Tagger is freeware. See the Xerox Contact page for more information. The PARC anonymous FTP, however, provides free access to a lot of resources.
The Xerox Tagger is a well-known Lisp HMM tagger freely available via the PARC anonymous FTP.
These are tools for parsing and grammar development which were created at Penn (i.e. the University of Pennsylvania; cf. the Penn Tools file) for the XTAG Project, viz. a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism. All are freely available under the GNU Licence. They consist of: SuperTagger, a tool for tagging sentences with SuperTags (i.e. elementary trees; the package also contains a lightweight dependency analyser which uses the SuperTagger); XTAG Parser and grammar development interface (Initial release of the newer Common Lisp version. Pre-Release date 3.31.2000. Only tested with Lucid Common Lisp; it will require experience in porting software to make it run on Allegro Common Lisp and CMU Common Lisp compilers); XTAG morphology database with Berkeley db interface, X11 maintenance tool, and a separately downloadable Readme file; Synedit, a tool for editing the XTAG syntax flat file; Bungee, a Tcl/Tk tree/family viewer for the English grammar. There are also a lot of freely downloadable User Manuals and selected papers dealing with the various components of XTAG. [2001 April 27].