Linguistic Specifications for NLP 

The aim of this project was to provide coherent, large-coverage linguistic specifications, expressed in such a way that they can be used with a range of formal frameworks. Together with the REFMAN project, a number of related small projects created very detailed descriptions for Dutch, Danish, Italian, German, Greek, and French on the basis of the REFMAN specifications.

The rationale of the REFMAN project was to investigate whether existing linguistic specifications, especially those of the EUROTRA Reference Manual (RM) 7.0, are generally useful for NL processing, and to reuse them.

The EUROTRA Reference Manual was assessed as a useful and valuable resource for future projects and industrial developments in the area of MT and other applications (Oakley EUROTRA Evaluation Report). To make it more accessible and usable, the linguistic knowledge had to be presented in a more widely used notation.

A first goal of the project was to rewrite the useful and valuable linguistic specifications in a standard notation, Typed Feature Structures (henceforth TFS). TFSs are the major data type for the representation of linguistic knowledge in current computational linguistic frameworks. The REFMAN project, however, is not restricted to a specific formalism, so that the linguistic specifications do not again become too dependent on one. Instead, the project uses a set of formal devices which are defined in a dedicated chapter of the manual. They were chosen according to specific criteria following the rationale of the project, which is to provide a linguistic basis for a broad range of applications.
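For readers unfamiliar with TFSs, the following is a minimal sketch of the general idea, not the notation or the formal devices defined in the manual. It assumes a toy, tree-shaped type hierarchy (the type names and the AGR/NUM features are hypothetical), and it leaves out reentrancy and appropriateness conditions, which real TFS formalisms normally include. Unification succeeds when the types are compatible and all shared features unify.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical toy type hierarchy (child -> parent); not taken from the manual.
PARENT: Dict[str, Optional[str]] = {
    "sign": None,
    "word": "sign",
    "phrase": "sign",
    "agr": None,
    "num": None,
    "sg": "num",
    "pl": "num",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    t: Optional[str] = specific
    while t is not None:
        if t == general:
            return True
        t = PARENT[t]
    return False


def meet(a: str, b: str) -> Optional[str]:
    """More specific of two comparable types, or None if they are incompatible."""
    if subsumes(a, b):
        return b
    if subsumes(b, a):
        return a
    return None


@dataclass
class TFS:
    """A typed feature structure: a type plus feature-value pairs (no reentrancy)."""
    type: str
    feats: Dict[str, "TFS"] = field(default_factory=dict)


def unify(x: TFS, y: TFS) -> Optional[TFS]:
    """Return the unification of x and y, or None on failure. Inputs are not modified."""
    t = meet(x.type, y.type)
    if t is None:
        return None
    feats = dict(x.feats)  # shallow copy, so x keeps its own feature table
    for name, val in y.feats.items():
        if name in feats:
            sub = unify(feats[name], val)
            if sub is None:
                return None
            feats[name] = sub
        else:
            feats[name] = val
    return TFS(t, feats)


# Usage: an underspecified sign unifies with a word whose number is left open.
a = TFS("sign", {"AGR": TFS("agr", {"NUM": TFS("sg")})})
b = TFS("word", {"AGR": TFS("agr", {"NUM": TFS("num")})})
c = TFS("word", {"AGR": TFS("agr", {"NUM": TFS("pl")})})
result = unify(a, b)   # type "word", NUM resolved to "sg"
clash = unify(a, c)    # None, because "sg" and "pl" are incompatible
```

Representing the hierarchy as a tree keeps the type meet trivial; a formalism with multiple inheritance would need a proper greatest-lower-bound computation instead.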

The results in summary: 

  • Broad and detailed descriptions of phenomena according to common principles, on the basis of common assumptions with abundant exemplification from various languages. These descriptions are available in a first version under THE MANUAL, CHAPTER BY CHAPTER.
  • Monolingual manuals for six languages which specify in great detail a core NL-system including TLM, word structure, phrase structure and predicate-argument-structure. For these manuals, see the following links:  
  • The REFMAN not only gives linguistic analyses, but also discusses the consequences of a large number of formal devices. There is information about (dis)advantages of lexical rules, underspecification, flat vs. binary syntactic structures etc.  

The project provided a resource that is useful for the following reasons:

  • It allows an extended NL-core system to be built easily, without losing too much time, because the REFMAN avoids starting from scratch. For an implementation it is extremely useful to see, for a great many phenomena in one place: which approaches exist for different fields (morphotactics: word-and-paradigm, structure-based; PS: `lean syntax', vacuous projections, argument inheritance, gap threading, `flat' vs. `binary' branching, split subcat list, predicate-argument-structures for ALL categories); where the problems are for a specific language; where the basic challenges are; how different approaches fare; and how different approaches play out across languages.
  • Formalism independence and discussion of formal tools. It has been shown how things can be done by different formal devices and at what cost (e.g. lexical generalizations).
  • For some languages it is the sole description of this kind. Here the resource is especially important if a system is to be implemented.
  • The specifications are relevant for multilingual MT, though not especially tuned for this purpose.
  • In an `academic' context it is a source of inspiration (problem description, alternative approaches discussed, new areas addressed).


January 1994 - December 1995 


  • University of Essex, Dept. of Language and Linguistics, Wivenhoe Park, UK
  • Gruppo DIMA, Torino, Italy
  • University of Manchester, Institute of Science and Technology, UMIST, Manchester, UK
  • Center for Computational Linguistics, K.U. Leuven, Leuven, Belgium
  • Universitat Pompeu Fabra (UPF), Barcelona, Spain
  • TALANA, Paris, France
  • Anite Systems (Former Cray Systems), Luxembourg


 Technical Annex

  • Contents Table
  • Introduction
  • Formalism
  • Morphology
  • Phrase Structure
  • Lexicon
  • Predicate Argument Structure
  • Verb Alternations
  • Support Verb Constructions
  • Determination and Quantification
  • Coordination
  • Mood, Tense and Aspect
  • Negation
  • Bibliography