OLiA ontologies

This page enumerates the ontologies that are currently available. Officially, none of them has been released. They will be released under a Creative Commons Attribution-ShareAlike licence as soon as a reference publication has appeared. Until then, feel free to use them, but I would appreciate being notified if you do. Besides the ontologies listed below, there are a number of experimental ontologies, e.g., concerning further annotation schemes, the linking with GOLD and the ISO TC37/SC4 Data Category Registry, and additional phenomena (discourse, coreference).

The OLiA architecture is a set of modular OWL/DL ontologies with ontological models of annotation schemes (Annotation Models) on the one hand, an ontology of reference terms (Reference Model) on the other hand, and ontologies (Linking Models) that implement subClassOf relationships between them.

For browsing the OLiA ontologies, I recommend:

  • OwlSight, a light-weight online browser for ontologies (recommended only for taking a first look at the ontologies), or
  • Protégé, a Java-based ontology browser and editor (recommended for browsing, requires installation)

Both ontology browsers accept the URLs given below (insert by copy and paste).


Overview


Background

A generalization over the different terminologies applied in the annotation of the corpora hosted by three collaborative research centers, SFB 441 (Tübingen), SFB 538 (Hamburg) and SFB 632 (Potsdam/Berlin), was developed, concentrating on the more elementary levels of linguistic analysis such as parts of speech and morphology. It was later extended to NLP tools and corpora beyond these resources. The result is an ontology that specifies reference terminology, with the tags of the original annotated data linked to this reference terminology. Besides its function in annotation documentation, the ontology can be used to formulate tag-set neutral corpus queries. For this purpose, I developed the OntoClient, a Java-based query pre-processor which translates formal ontology-based specifications into disjunctions of concrete tags. The OntoClient serves as a pre-processor for corpus query languages such as ANNIS-QL and CQP; it has also been applied in the specification of tag-set independent corpus processing scripts.

The OLiA ontologies were initially developed in the context of the project "Sustainability of Linguistic Resources", a collaborative project between three German Collaborative Research Centres (SFBs). The Collaborative Research Centres involved in the project are the SFB 538 'Multilingualism' at the University of Hamburg, the SFB 632 'Information Structure' at the University of Potsdam and the Humboldt University Berlin, and the SFB 441 'Linguistic Data Structures' at the Eberhard Karls University Tübingen.

The project aimed at preparing language resources to assure an accessible dissemination and sustainable storage of linguistic corpora. One of the main goals of the project was a practical one: resources acquired in long-term projects situated in the three Collaborative Research Centres have to be converted into one or more formats so that they remain usable by researchers and applications in the long term. Furthermore, the project developed unified methods of access for the heterogeneous data acquired in the projects.

The linguistic resources dealt with by the project are highly heterogeneous:

  • the primary data itself is heterogeneous in size (e.g., single sentences vs. entire articles)
  • text types / data types (e.g. newspaper texts, diachronic texts, dialogues, treebanks, ...)
  • modality (monologue vs. dialogue)
  • categories of information covered by the annotation / annotation levels (e.g. layout, textual structure, morpho-syntax, syntax, ...)
  • underlying linguistic theories
  • language
  • the annotations require data structures of various types (attribute-value pairs, trees, pointers, etc.)
  • data is annotated by means of different, task-specific annotation tools

Integration of linguistic terminologies

One of the tasks addressed by the sustainability project was the integration of heterogeneous terminologies, especially those applied in the annotation of existing corpora. The differences range from minor variation in the choice of tag names (which often goes unnoticed and thus affects the reliability of broad-scale corpus studies) to fundamental conceptual differences.

  • Different abbreviations for the same annotations
    • E.g. pronominal adverbs in the German de-facto standard tag set STTS, annotated PROAV (Stuttgart variant of STTS), PAV (Tiger variant of STTS), or PROP (Tübingen variant of STTS) without any change in meaning.
  • Same abbreviation for different annotations
    • E.g. the indefinite article in STTS. In Tiger-STTS, PIAT is applied to the indefinite article in attributive use throughout; in Stuttgart-STTS, PIAT is restricted to "proper" indefinite articles, i.e. those which appear as articles of indefinite descriptions, while the indefinite article after a definite article is tagged as PIDAT.
  • Same annotation, but different interpretation
    • E.g. the concept "auxiliary verb". In STTS, the tag VAFIN, explained as "auxiliary verb", is used for German haben "to have; to own" and sein "to be; to be defined by; to exist" in all uses. In the SFB632 annotation standard, however, VAUX is restricted to German haben and sein in auxiliary use only, while the copula sein "to be equal to" and the lexical uses of haben "to own" and sein "to exist" are tagged separately.
  • Different granularity of tag sets
    • The SFB538/E2 tag set assigns all nouns (proper nouns and common nouns) the same tag; the SFB632 annotation standard, designed for typological research, differentiates 2 types of nouns (common nouns and proper nouns); the Penn Treebank differentiates 4 types of nouns (common and proper nouns in singular and plural); the SUSANNE tag set for English differentiates approximately 63 types of nouns based on semantic and morphosyntactic properties; and in the Russian Uppsala corpus, we find 111 different tags for common and proper nouns according to morphological features.
  • Conceptual overlap
    • In languages with grammaticalized determiners, attributive possessive pronouns can be regarded as determiners, as they, like an article, fulfil the function of marking a nominal as a noun phrase (resp. determiner phrase). However, in the literal sense (and in traditional grammar), attributive possessive pronouns are "pro-nouns", i.e. replacements of names: they are characterized by their referentiality and are hence pronouns. Tag sets vary freely in whether attributive possessive pronouns are regarded as determiners (according to their syntactic function) or as pronouns (according to their semantic characterization).
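
The first case in the list above can be made concrete with a small lookup table. This is only an illustrative sketch: the concept label `PronominalAdverb` and the variant keys are names chosen for this example, not identifiers taken from the OLiA ontologies. It shows that a tag can only be interpreted relative to the tag set variant it comes from.

```python
from typing import Dict, Optional

# Labels for pronominal adverbs in three STTS variants (taken from the list above).
# "PronominalAdverb" and the variant keys are hypothetical names for this sketch.
TAG_TO_CONCEPT = {
    ("STTS-Stuttgart", "PROAV"): "PronominalAdverb",
    ("STTS-Tiger", "PAV"): "PronominalAdverb",
    ("STTS-Tuebingen", "PROP"): "PronominalAdverb",
}


def concept_of(variant: str, tag: str) -> Optional[str]:
    """Resolve a tag relative to the tag set variant it comes from."""
    return TAG_TO_CONCEPT.get((variant, tag))


def tags_for(concept: str) -> Dict[str, str]:
    """Return the label that each variant uses for the given concept."""
    return {
        variant: tag
        for (variant, tag), mapped in TAG_TO_CONCEPT.items()
        if mapped == concept
    }


print(concept_of("STTS-Tiger", "PAV"))    # the Tiger label resolves to the shared concept
print(concept_of("STTS-Tiger", "PROAV"))  # the Stuttgart label is unknown in the Tiger variant
print(tags_for("PronominalAdverb"))
```

Keying the table by variant and tag together, rather than by tag alone, is what keeps the "same abbreviation, different annotation" case from silently producing wrong matches.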

All these problems are taken from the seemingly most elementary domain, that of part-of-speech tags. More problems arise as soon as morphology, syntax, or discourse phenomena are addressed.

In order to overcome such problems, terminological integration is necessary, i.e.

  • documentation of terminological differences
  • harmonization between different terminologies

To provide an integrated access to terminologically heterogeneous resources, it is also necessary to provide an abstract model of linguistic reference terminology to which individual annotations refer, a so-called "terminological backbone".

Classical solutions are the standardization approach and the interlingua approach:

  • Standardization (cf. the EAGLES recommendations on morphosyntactic annotation)
    • Definition of a reference inventory of terms which must or may be considered by a standard-conformant annotation scheme. Concrete annotations are directly mapped onto reference terms or a disjunction of reference terms. (Leech and Wilson 1996)
  • Interlingua (cf. the AMALGAM project)
    • From different annotation schemes, or tag sets, an abstracted representation is derived which subsumes all possible differences between the participating tag sets. Whenever no direct mapping of annotations (e.g. X and Y) from different annotation schemes (e.g. A and B) is possible, all possible combinations must be represented in the interlingua, i.e. (A:X,B:X), (A:X,B:Y), (A:Y,B:X), (A:Y,B:Y).
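
The combinatorics of the interlingua approach can be sketched in a few lines of Python. The sketch assumes the worst case described above, in which no annotation of one scheme maps directly onto an annotation of another; the function name and the third scheme `C` are hypothetical and serve only to show how the number of entries grows.

```python
from itertools import product


def interlingua_entries(schemes):
    """Enumerate every combination of annotations across the given schemes.

    `schemes` maps a scheme name to its list of annotations. This models the
    worst case only, where no direct mapping between annotations exists.
    """
    names = sorted(schemes)
    combinations = product(*(schemes[name] for name in names))
    return [
        tuple(f"{name}:{annotation}" for name, annotation in zip(names, combo))
        for combo in combinations
    ]


two_schemes = interlingua_entries({"A": ["X", "Y"], "B": ["X", "Y"]})
three_schemes = interlingua_entries(
    {"A": ["X", "Y"], "B": ["X", "Y"], "C": ["X", "Y"]}  # "C" is a hypothetical third scheme
)

print(two_schemes)
print(len(two_schemes), len(three_schemes))
```

With two annotations per scheme, the two-scheme case yields the four entries listed above, and adding a third scheme doubles that number.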

Both solutions are limited in flexibility and scalability, and hence both are applicable only within a limited domain. The standardization approach relies on the existence of common grammatical categories and features in the languages for which standard-conformant tag sets are to be developed. Otherwise, it results in a projection of complexity (e.g. the standard predicts grammatical categories for a standard-conformant tag set which are absent in the language). However, even the mere existence of universal morphosyntactic categories has been questioned in typological research, and hence the EAGLES-based standardization approach is unlikely to extend beyond "Standard Average European" languages.

The interlingua approach, by contrast, constructs an interlingua from existing schemes and is thus less static than the standardization approach. However, the complexity of the interlingua grows monotonically with every new language or tag set considered, and hence the general applicability of the interlingua approach is restricted by its limited scalability.

Therefore, the project is currently developing an ontology of linguistic annotations as a more flexible representation of a "terminological backbone".

An ontology-based approach

So far, we have developed an ontology of linguistic annotations with special consideration of the part-of-speech and morphological annotations existing in the participating Collaborative Research Centers (Schmidt et al. 2006, Chiarcos 2006c, Chiarcos 2006d, Chiarcos 2007).

The approach relies on the ontological reconstruction of annotation schemes based on guidelines and additional documentation in so-called "annotation models" (or "domain models").

Every annotation model represents one tag set or annotation scheme, with nonterminal nodes (concepts) representing conceptual categories as mentioned in the documentation or indicated in the document structure of the annotation guidelines, and terminal nodes (instances) representing concrete annotation values, or tags.
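
The distinction between concepts and instances can be pictured with a minimal Python sketch. It is not the OWL representation itself: the class `Concept` and the concept names are invented for illustration. VAFIN is the STTS tag discussed earlier on this page, and VVFIN (the STTS tag for finite full verbs) is added here only for contrast.

```python
class Concept:
    """Nonterminal node: a conceptual category named in the annotation guidelines."""

    def __init__(self, name, tags=()):
        self.name = name
        self.tags = list(tags)  # terminal nodes: concrete tags (instances)
        self.children = []      # more specific sub-concepts

    def add(self, name, tags=()):
        """Create a sub-concept of this concept and return it."""
        child = Concept(name, tags)
        self.children.append(child)
        return child

    def all_tags(self):
        """Tags attached to this concept or to any of its sub-concepts."""
        found = list(self.tags)
        for child in self.children:
            found.extend(child.all_tags())
        return found


# Hypothetical fragment of an annotation model; the concept names are invented.
verb = Concept("Verb")
auxiliary = verb.add("AuxiliaryVerb", tags=["VAFIN"])
full_verb = verb.add("FullVerb", tags=["VVFIN"])

print(verb.all_tags())
print(auxiliary.all_tags())
```

Asking a general concept for its tags returns the tags of all its sub-concepts, which is the property that tag-set neutral queries rely on.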

As an illustration, prototypes for the following annotation models are available in an HTML serialization:

  • STTS (POS tags, German) [owl] (Stuttgart, Tübingen and Tiger-Variant)
  • Tiger-Morphology (Morphology, POS tags inherited from STTS, German) [owl]
  • SUSANNE (POS tags with partial information about morphosyntax and lexical semantics, English) [owl]
  • Uppsala (POS tags and morphology, Russian) [owl]

With respect to morphosyntactic annotations, the OLiA annotation models currently comprise 16 annotation schemes applied to 42 languages (5 annotation models for English, 5 annotation models for German, 2 annotation models for Russian, one annotation model for Tibetan, one for Old High German, the Connexor annotation model for 10 European languages, one annotation model for a typologically-oriented annotation scheme applied to 29 languages). Annotation models for syntax and information structure/anaphora are currently under construction.

The concepts of these annotation models are linked to a common "reference model" which is based on the EAGLES recommendations for morphosyntax, and extended according to the needs of the participating annotation models, hence it is also referred to as "E(xtended)-EAGLES" ontology.

The annotation models are then mapped onto the categories specified in the reference model by means of conceptual subsumption (rdfs:subClassOf, rdfs:subPropertyOf). This mapping is specified in separate "linking files", thus making both the reference model and the annotation models independent and self-contained ontologies.
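
The effect of keeping the linking in a separate file can be sketched as follows. The three dictionaries stand in for three independent ontology files, and every prefixed name (`penn:`, `ref:`) is hypothetical; the real models are OWL/DL ontologies, and the subsumption would be computed by a reasoner rather than by the simple traversal shown here.

```python
def ancestors(node, *models):
    """All nodes reachable from `node` via instance-of / subClassOf edges.

    Each model maps a node to the set of its direct parents. Only the models
    passed in are consulted, so each one can be used with or without the others.
    """
    edges = {}
    for model in models:
        for child, parents in model.items():
            edges.setdefault(child, set()).update(parents)

    seen, stack = set(), [node]
    while stack:
        for parent in edges.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Three self-contained "files"; all names are invented for this sketch.
annotation_model = {"penn:NN": {"penn:CommonNoun"}}       # tag -> annotation concept
reference_model = {"ref:CommonNoun": {"ref:Noun"}}        # reference hierarchy
linking_model = {"penn:CommonNoun": {"ref:CommonNoun"}}   # annotation concept -> reference concept

print(ancestors("penn:NN", annotation_model))
print(ancestors("penn:NN", annotation_model, linking_model, reference_model))
```

Without the linking model, the tag is only known within its own annotation model; once the linking model is loaded, it also counts as an instance of the reference concepts.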

The "reference model", however, does not specify authoritative definitions for existing terminology, but only a fairly traditional view on it. Hence, its primary function is not to provide prescriptive definitions of terms, but only to provide a reference point for the participating annotation models. Whenever a more reliable ontology of linguistic terminology is developed (e.g. revised versions of the General Ontology of Linguistic Description (GOLD) or the grammis ontology), the reference model can be linked with it in the same way as the annotation models are linked with the reference model, and thus mediate between such an external reference model and the annotation models. In this sense, the reference model serves as an interface to the annotation models, and it might be better termed an "interface model".

  • an exemplary implementation of the linking of E-EAGLES with an extended version of GOLD, v.0.3, as an external reference model [owl]

Ontology-based corpus querying

Besides their purely documentary function, the specifications in the ontologies can be used for tag-set neutral corpus querying. In essence, this means that expressions from the ontology can be used directly in corpus queries. As an example, a user may enter the query

  • PossessivePronoun and hasNumber(Singular) and hasGender(Neuter) and hasCase(Genitive)

instead of the SUSANNE tag

  • APPGh1

Of course, APPGh1 is shorter, but it is a cryptic and idiosyncratic abbreviation, and knowing the function of APPGh1 in SUSANNE does not help when searching for the corresponding items in, say, the Uppsala corpus, where the same query expands to

  • pronomen_pos_1p_gen_sg_neut_opl | pronomen_pos_2p_gen_sg_neut_opl | ...

In particular, this kind of ontology-based corpus querying allows researchers unfamiliar with a resource to take a first look at a corpus with an unknown tag set without having to spend too much effort on locating and reading the annotation documentation. Hence, the barrier to re-using existing resources is substantially lowered.

Ontology-based corpus querying is supported by the OntoClient, a Java package that works as a pre-processor for corpus queries. Given a query string, the OntoClient replaces each ontology-sensitive sub-string with the disjunction of the tags retrieved as instances that satisfy the criteria specified in that sub-string.
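
The replacement step can be illustrated with a short Python sketch. This is not the OntoClient code (which is written in Java) and makes several assumptions: the curly-brace delimiter for ontology-sensitive sub-strings, the `pos="..."` query wrapper, the function names and the toy inventories are all hypothetical, and concepts are matched by simple equality instead of ontological subsumption. The tags and their readings follow the SUSANNE and Uppsala example above; `dummy_nominative_tag` is an invented non-matching entry.

```python
import re

GENITIVE_NEUTER_SINGULAR = {
    "hasNumber": "Singular", "hasGender": "Neuter", "hasCase": "Genitive",
}

# Toy inventories: tag -> (concept, {property: value}).
UPPSALA = {
    "pronomen_pos_1p_gen_sg_neut_opl": ("PossessivePronoun", GENITIVE_NEUTER_SINGULAR),
    "pronomen_pos_2p_gen_sg_neut_opl": ("PossessivePronoun", GENITIVE_NEUTER_SINGULAR),
    # Invented entry that must not match the genitive query below.
    "dummy_nominative_tag": ("PossessivePronoun", {
        "hasNumber": "Singular", "hasGender": "Neuter", "hasCase": "Nominative",
    }),
}
SUSANNE = {"APPGh1": ("PossessivePronoun", GENITIVE_NEUTER_SINGULAR)}

PROPERTY_TERM = re.compile(r"(\w+)\((\w+)\)$")


def matching_tags(expression, inventory):
    """Tags whose description satisfies every conjunct of the expression."""
    concepts, constraints = [], {}
    for term in expression.split(" and "):
        term = term.strip()
        match = PROPERTY_TERM.match(term)
        if match:
            constraints[match.group(1)] = match.group(2)
        else:
            concepts.append(term)
    return [
        tag
        for tag, (concept, features) in inventory.items()
        if all(c == concept for c in concepts)
        and all(features.get(prop) == value for prop, value in constraints.items())
    ]


def expand(query, inventory, separator=" | "):
    """Replace each {...} sub-string with a disjunction of concrete tags."""
    def to_disjunction(match):
        return "(" + separator.join(matching_tags(match.group(1), inventory)) + ")"
    return re.sub(r"\{([^{}]*)\}", to_disjunction, query)


QUERY = ('pos="{PossessivePronoun and hasNumber(Singular) '
         'and hasGender(Neuter) and hasCase(Genitive)}"')
print(expand(QUERY, UPPSALA))
print(expand(QUERY, SUSANNE))
```

The same query string is expanded differently depending on the inventory it is run against, and the `separator` argument hints at how the output could be adapted to the syntax of a particular query language.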

The output of the OntoClient is highly configurable, and thus, it can be easily applied to practically any kind of existing corpus query interface.

  • Currently, we have implemented a prototype for an ontology-sensitive CQP interface.
  • At the GLDV Frühjahrstagung 2007, Christian Chiarcos and Michael Götze presented the integration of the OntoClient with ANNIS.
  • At RANLP 2007, Georg Rehm, Richard Eckart and Christian Chiarcos will present the application of the OntoClient as a pre-processor for XQuery templates.

OLiA Reference Model and system ontologies

Module | Phenomenon | OWL/DL models
OLiA Reference Model for morphosyntax, morphology and syntax | morphosyntax, morphology and syntax | http://purl.org/olia/olia.owl
OLiA Reference Model for discourse structure | discourse structure, discourse relations | t.b.a
OLiA Reference Model for information structure | information structure, information status, coreference | t.b.a
OLiA System Ontology | basic annotation data structures | http://purl.org/olia/system.owl
OLiA Top-Level Ontology | top-level concepts of the OLiA Reference Model for morphosyntax, morphology and syntax | http://purl.org/olia/olia-top.owl

Annotation Models

Multilingual Annotation Models for morphological, morphosyntactic and syntactic annotation

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | > 30 typologically different languages, including many African languages | Annotation Model, Linking Model
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | 11 EU languages, incl. Romance, Germanic, Greek and Irish | Annotation Model, Linking Model
Connexor dependency parser | morphosyntax, morphology, dependency syntax | 10 European languages, incl. Romance, Germanic and Uralic languages | Annotation Model, Linking Model
MULTEXT-East | morphosyntax, morphology | 15 mostly Eastern European languages, incl. Slavic, Romance, Uralic languages and Persian | Annotation Model (common specifications), Linking Model; Annotation Model (all languages), see project page and below for individual languages
IL-POSTS tagset Baskaran et al. (2008) | morphosyntax | languages of the Indian subcontinent | Annotation Model, Linking Model
AnnCorra Bharati et al. (2006) | morphosyntax, chunks | languages of the Indian subcontinent | Annotation Model, Linking Model
IIIT tagset IIIT (2007) | morphosyntax | languages of the Indian subcontinent | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of English

Tagset / NLP tool | Phenomenon | OWL/DL models
Brown corpus tagset | morphosyntax | Annotation Model, Linking Model
Connexor dependency parser | morphosyntax, morphology, dependency syntax | Annotation Model, Linking Model
EAGLES recommendations (English) (Leech and Wilson 1996) | morphosyntax | Annotation Model, Linking Model
GENIA corpus | morphosyntax | Annotation Model, Linking Model
MULTEXT-East (English) | morphosyntax | Annotation Model, Linking Model
Penn Treebank | morphosyntax | Annotation Model, Linking Model
Penn Treebank | syntax | Annotation Model, Linking Model
QTag | morphosyntax | Annotation Model, Linking Model
Stanford dependency parser | dependency syntax | Annotation Model, Linking Model
Susanne corpus | morphosyntax | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of German

Tagset / NLP tool | Phenomenon | OWL/DL models
Connexor dependency parser | morphosyntax, morphology, dependency syntax | Annotation Model, Linking Model
EAGLES recommendations (German) (Leech and Wilson 1996) | morphosyntax | Annotation Model, Linking Model
Morphisto | morphology | Annotation Model, Linking Model
STTS | morphosyntax | Annotation Model, Linking Model
TIGER/NEGRA | morphology | Annotation Model, Linking Model
TIGER/NEGRA | constituent syntax | Annotation Model, Linking Model
TreeTagger Chunker | chunk labels | Linking Model
RFTagger | morphosyntax, morphology | t.b.a

Annotation Models for the morphological, morphosyntactic and syntactic annotation of other Germanic languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax; inflectional morphology | Danish, Dutch, Swedish (and several non-Germanic languages) | Annotation Model, Linking Model
Connexor | morphosyntax, morphology, dependency syntax | Dutch, Swedish, Danish, Norwegian | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Dutch (among other languages) | Annotation Model, Linking Model
MENOTA (incomplete) | morphosyntax | Old Norse | Annotation Model, Linking Model
T-CODEX | morphosyntax, syntax, information structure | Old High German | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of Russian

Tagset / NLP tool | Phenomenon | OWL/DL models
Uppsala corpus tagset | morphosyntax, morphology | Annotation Model, Linking Model
Russian TreeTagger (Serge Sharoff) | morphosyntax | Annotation Model, Linking Model
MULTEXT-East for Russian | morphosyntax, morphology | Annotation Model, Linking Model

Annotation Models for the morphosyntactic annotation of other Slavic languages

Tagset / NLP tool | Languages | OWL/DL models
MULTEXT-East | Bulgarian | Annotation Model, Linking Model
MULTEXT-East | Czech | Annotation Model, Linking Model
MULTEXT-East | Macedonian | Annotation Model, Linking Model
MULTEXT-East | Polish | Annotation Model, Linking Model
MULTEXT-East | Slovak | Annotation Model, Linking Model
MULTEXT-East | Slovene | Annotation Model, Linking Model
MULTEXT-East | Resian (Slovene spoken in Italy) | Annotation Model, Linking Model
MULTEXT-East | Serbian | Annotation Model, Linking Model
MULTEXT-East | Ukrainian | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of French

Tagset / NLP tool | Phenomenon | OWL/DL models
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | Annotation Model, Linking Model
French TreeTagger (Achim Stein) | morphosyntax | Annotation Model
Le Monde corpus (Abeillé et al. 2000) | morphosyntax | Annotation Model
Connexor | morphosyntax, morphology, dependency syntax | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) for Canadian French (among other languages) | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of other Romance languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | Catalan, Portuguese, Spanish | Annotation Model, Linking Model
Connexor | morphosyntax, morphology, dependency syntax | Spanish, Italian | Annotation Model, Linking Model
PAROLE Spanish/Catalan (http://nlp.lsi.upc.edu/freeling) | morphosyntax, inflectional morphology | Spanish, Italian | Annotation Model
MULTEXT-East | morphosyntax, morphology | Romanian | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of Uralic and Altaic languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
Connexor | morphosyntax, morphology, dependency syntax | Finnish | Annotation Model, Linking Model
MULTEXT-East | morphosyntax, morphology | Estonian | Annotation Model, Linking Model
MULTEXT-East | morphosyntax, morphology | Hungarian | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Hungarian (among other languages) | Annotation Model, Linking Model
Turkish POS tagset (Oflazer et al. 2003) | morphosyntax | Turkish | Annotation Model

Annotation Models for the morphosyntactic annotation of other European languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | Greek, Irish (among other EU languages) | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Georgian, Greek (among other languages) | Annotation Model, Linking Model
EUSTagger (Ezeiza et al. 1998) | morphosyntax | Basque | Annotation Model

Annotation Models for the morphosyntactic annotation of Indo-Iranian languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
Urdu EMILLE tagset Hardie (2003, 2004) | morphosyntax, inflectional morphology | Urdu | Annotation Model, Linking Model
Urdu tagset Sajjad (2007) | morphosyntax | Urdu | Annotation Model, Linking Model
IL-POSTS tagset Baskaran et al. (2008) | morphosyntax, inflectional morphology | Bangla, Hindi, Marathi, Sanskrit | Annotation Model, Linking Model
AnnCorra Bharati et al. (2006) | morphosyntax, chunks | Bangla, Hindi | Annotation Model, Linking Model
IIIT tagset IIIT (2007) | morphosyntax | Hindi, Marathi | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Konkani (among other, unrelated languages) | Annotation Model, Linking Model
MULTEXT-East | morphosyntax | Farsi (Persian) | Annotation Model, Linking Model

Annotation Models for the morphosyntactic annotation of Dravidian languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
IL-POSTS tagset Baskaran et al. (2008) | morphosyntax | Kannada, Malayalam, Tamil, Telugu | Annotation Model, Linking Model
AnnCorra Bharati et al. (2006) | morphosyntax, chunks | Telugu, Tamil | Annotation Model, Linking Model
IIIT tagset IIIT (2007) | morphosyntax | Telugu | Annotation Model, Linking Model

Annotation Models for the morphological, morphosyntactic and syntactic annotation of Tibeto-Burman languages

Tagset / NLP tool | Phenomenon | Languages | OWL/DL models
Dzongkha tagset (Chungku et al. 2010) | morphosyntax | Dzongkha | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Prinmi (among other, unrelated languages) | Annotation Model, Linking Model
Tübingen Tibetan Corpora (Wagner & Zeisler 2004) | morphosyntax, morphology, syntax | Tibetan (Old Tibetan, Classical Tibetan, Balti, Ladakh) | Annotation Model

Annotation Models for East Asian languages

Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models
Penn Chinese Treebank (Xia 2000) | morphosyntax | Chinese | Annotation Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Japanese (among other, unrelated languages) | Annotation Model, Linking Model

Annotation Models for Afroasiatic languages

Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models
Arabic tagset (Khoja 2001) | morphosyntax | Arabic | Annotation Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Chadic languages (including Guruntum, Tangale, Hausa) | Annotation Model, Linking Model
Hausa Internet Corpus (Chiarcos et al. 2011) | morphosyntax | Hausa | t.b.a

Annotation Models for the languages of Sub-Saharan Africa

Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Gur and Kwa languages (including Aja, Dagbani, Buli, Byali, Ditammari, Fon, Foodo, Konni, Nateni, Waamma, Yom) | Annotation Model, Linking Model
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Chadic languages (including Guruntum, Tangale, Hausa) | Annotation Model, Linking Model
Hausa Internet Corpus (Chiarcos et al. 2011) | morphosyntax | Hausa | t.b.a

Annotation Models for indigenous languages of the Americas, Australia and the Pacific

Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Teribe, Yucatec Maya, Mawng, Niue | Annotation Model, Linking Model

Annotation Models for discourse annotations

Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models
ARRAU corpus | coreference | English | t.b.a
CRC 732, A3 annotations of the Stuttgarter Radio News Corpus | information status, pronominal coreference | German | t.b.a
OntoNotes | coreference | English | t.b.a
Penn Discourse Graphbank | discourse relations | English | t.b.a
Penn Discourse Treebank | connectives, discourse relations | English | t.b.a
Potsdam Coreference Scheme | coreference | English, German | t.b.a
RST Discourse Treebank | RST discourse relations and discourse segments | English | t.b.a

External Reference Models

Terminological repository | Original url | Local url | Linking Model
ISO TC37/SC4 Data Category Registry | http://www.isocat.org | t.b.a | t.b.a
GOLD | http://linguistics-ontology.org | t.b.a | t.b.a