OLiA ontologies
This page lists the ontologies that are currently available. None of them has been officially released yet. They will be released under a Creative Commons Attribution Sharealike licence as soon as a reference publication has appeared. Until then, feel free to use them, but please let me know if you do. Besides the ontologies listed below, there are a number of experimental ontologies, e.g., concerning further annotation schemes, the linking with GOLD and the ISO TC37/SC4 Data Category Registry, and additional phenomena (discourse, coreference).
The OLiA architecture is a set of modular OWL/DL ontologies with ontological models of annotation schemes (Annotation Models) on the one hand, an ontology of reference terms (Reference Model) on the other hand, and ontologies (Linking Models) that implement subClassOf relationships between them.
For viewing and browsing the OLiA ontologies, I recommend:
- OwlSight, a light-weight online browser for ontologies (recommended only for taking a first look at the ontologies), or
- Protégé, a Java-based ontology browser and editor (recommended for browsing, requires installation)
Both ontology browsers accept the URLs given below (insert by copy and paste).
Overview
- Background
- OLiA Reference Model
- OLiA Annotation Models
- Multilingual
- English
- German
- other Germanic languages (Danish, Dutch, Norwegian, Swedish; Old Norse, Old High German)
- Russian
- other Slavic languages (Bulgarian, Czech, Macedonian, Polish, Resian, Slovak, Slovene, Ukrainian)
- French
- other Romance languages (Catalan, Italian, Portuguese, Romanian, Spanish)
- Uralic and Altaic languages (Estonian, Finnish, Hungarian, Turkish)
- other European languages (Basque, Georgian, Greek, Irish)
- Indoiranian languages (Bangla, Farsi, Hindi, Konkani, Marathi, Sanskrit, Urdu)
- Dravidian languages (Kannada, Malayalam, Tamil, Telugu)
- Tibeto-Burman languages (Old Tibetan, Classical Tibetan, Balti, Ladakh; Dzongkha, Prinmi)
- Eastern Asian languages (Chinese, Japanese)
- Afroasiatic languages (Arabic, Guruntum, Hausa, Tangale)
- Subsaharic African languages (Aja, Buli, Byali, Dagbani, Ditammari, Fon, Foodo, Guruntum, Hausa, Konni, Nateni, Tangale, Waamma, Yom)
- Indigenous languages of the Americas, Australia and the Pacific (Teribe, Yucatec Maya, Mawng, Niue)
- Annotation Models for discourse phenomena
- External Reference Models
Background
Concentrating on the more elementary levels of linguistic analysis such as parts of speech and morphology, a generalization over the different terminologies applied for the annotation of the corpora hosted by three collaborative research centers, SFB 441 (Tübingen), SFB 538 (Hamburg) and SFB 632 (Potsdam/Berlin), was developed and later extended to NLP tools and corpora beyond these resources. The result is an ontology that specifies reference terminology; the tags of the original annotated data are linked with this reference terminology. Besides its function in annotation documentation, the ontology can be applied to formulate tag-set neutral corpus queries. For this purpose, I developed the OntoClient, a Java-based query pre-processor which translates formal ontology-based specifications into disjunctions of concrete tags. The OntoClient serves as a pre-processor for corpus query languages such as ANNIS-QL and CQP. Furthermore, it has been applied in the specification of tag-set independent corpus processing scripts.
The OLiA ontologies were initially developed in the context of the project "Sustainability of Linguistic Resources", a collaborative project between three German Collaborative Research Centres (SFBs): the SFB 538 'Multilingualism' at the University of Hamburg, the SFB 632 'Information Structure' at the University of Potsdam and the Humboldt University Berlin, and the SFB 441 'Linguistic Data Structures' at the Eberhard Karls University Tübingen.
The project aimed at preparing language resources to assure an accessible dissemination and sustainable storage of linguistic corpora. One of the main goals of the project was a practical one: resources acquired in long-term projects situated in the three Collaborative Research Centres have to be converted in either one or multiple formats to be sustainably usable by researchers and applications. Furthermore, the project developed unified methods of access for the heterogeneous data acquired in the projects.
The linguistic resources dealt with by the project are highly heterogeneous:
- the primary data itself is heterogeneous:
- size (e.g., single sentences vs. entire articles)
- text types / data types (e.g. newspaper texts, diachronic texts, dialogues, treebanks, ...)
- modality (monologue vs. dialogue)
- categories of information covered by the annotation / annotation levels (e.g. layout, textual structure, morpho-syntax, syntax, ...)
- underlying linguistic theories
- language
- the annotations require data structures of various types (attribute-value pairs, trees, pointers, etc.)
- data is annotated by means of different, task-specific annotation tools
Integration of linguistic terminologies
One of the tasks addressed by the sustainability project was the integration of heterogeneous terminologies, especially those applied for the annotation of existing corpora. The differences range from minor variation in the choice of tag names (which often goes unnoticed and thus affects the reliability of broad-scale corpus studies) to fundamental conceptual differences.
- Different abbreviations for the same annotations
- E.g. pronominal adverbs in the German de-facto standard tag set STTS, annotated PROAV (Stuttgart variant of STTS), PAV (Tiger variant of STTS), or PROP (Tübingen variant of STTS) without any change in meaning.
- Same abbreviation for different annotations
- E.g. the indefinite article in STTS. In Tiger-STTS, PIAT is applied to the indefinite article in attributive use throughout. In Stuttgart-STTS, PIAT is restricted to "proper" indefinite articles, i.e. those which appear as articles of indefinite descriptions, while the indefinite article after a definite article is tagged as PIDAT.
- Same annotation, but different interpretation
- E.g. the concept "auxiliary verb". In STTS, the tag VAFIN, explained as "auxiliary verb", is used for German haben "to have; to own" and sein "to be; to be defined by; to exist" in all uses. In the SFB632 annotation standard, however, VAUX is restricted to German haben and sein in auxiliary use only, while the copula sein "to be equal to" and the lexical uses of haben "to own" and sein "to exist" are tagged separately.
- Different granularity of tag sets
- The SFB538/E2 tag set assigns all nouns (proper nouns and common nouns) the same tag; the SFB632 annotation standard, designed for typological research, differentiates 2 types of nouns (common nouns and proper nouns); the Penn Treebank differentiates 4 types of nouns (common and proper nouns in singular and plural); the SUSANNE tag set for English differentiates approximately 63 types of nouns based on semantic and morphosyntactic properties; and in the Russian Uppsala corpus, we find 111 different tags for common and proper nouns according to morphological features.
- Conceptual overlap
- In languages with grammaticalized determiners, attributive possessive pronouns can be regarded as determiners, as they, like an article, fulfil the function of marking a nominal as a noun phrase (resp. determiner phrase). However, in the literal sense (and in traditional grammar), attributive possessive pronouns are "pro-nouns", i.e. replacements of names: they are characterized by their referentiality and are, hence, pronouns. Tag sets vary freely in whether attributive possessive pronouns are regarded as determiners (according to their syntactic function) or as pronouns (according to their semantic characterization).
All these problems are taken from the seemingly most elementary domain, the domain of part of speech tags. More problems arise as soon as morphology, syntax, or discourse phenomena are addressed.
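The first two kinds of mismatch can be made concrete with a small sketch. The tag names and variants are the ones quoted above; the lookup table, the category labels and the helper `same_category` are hypothetical and serve only as an illustration.

```python
# (tag set variant, tag) -> documented category.
# Tags and variants are taken from the examples above; the category
# labels and this table layout are illustrative assumptions.
CATEGORY_OF = {
    ("Stuttgart-STTS", "PROAV"): "pronominal adverb",
    ("Tiger-STTS", "PAV"): "pronominal adverb",
    ("Tübingen-STTS", "PROP"): "pronominal adverb",
    ("Tiger-STTS", "PIAT"): "indefinite article in attributive use (all cases)",
    ("Stuttgart-STTS", "PIAT"): "indefinite article of an indefinite description",
}


def same_tag(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """Naive comparison: two annotations match if the tag strings are equal."""
    return a[1] == b[1]


def same_category(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """Comparison via the documented category behind each (variant, tag) pair."""
    cat_a = CATEGORY_OF.get(a)
    cat_b = CATEGORY_OF.get(b)
    return cat_a is not None and cat_a == cat_b


proav = ("Stuttgart-STTS", "PROAV")
pav = ("Tiger-STTS", "PAV")
piat_tiger = ("Tiger-STTS", "PIAT")
piat_stuttgart = ("Stuttgart-STTS", "PIAT")

# Different abbreviations, same annotation: string comparison misses the match.
print(same_tag(proav, pav), same_category(proav, pav))
# Same abbreviation, different annotations: string comparison reports a false match.
print(same_tag(piat_tiger, piat_stuttgart), same_category(piat_tiger, piat_stuttgart))
```

A flat table like this documents differences, but it does not yet capture granularity or conceptual overlap, which is where a hierarchy of concepts becomes useful.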
In order to overcome such problems, terminological integration is necessary, i.e.
- documentation of terminological differences
- harmonization between different terminologies
To provide an integrated access to terminologically heterogeneous resources, it is also necessary to provide an abstract model of linguistic reference terminology to which individual annotations refer, a so-called "terminological backbone".
Classical solutions are the standardization approach and the interlingua approach:
- Standardization (cf. the EAGLES recommendations on morphosyntactic annotation)
- Definition of a reference inventory of terms which must or may be considered by a standard-conformant annotation scheme. Concrete annotations are directly mapped onto reference terms or a disjunction of reference terms. (Wilson and Leech 1996)
- Interlingua (cf. the AMALGAM project)
- From different annotation schemes, or tag sets, an abstracted representation is derived which subsumes all possible differences between the participating tag sets. Whenever no direct mapping of annotations (e.g. X and Y) from different annotation schemes (e.g. A and B) is possible, all possible combinations must be represented in the interlingua, i.e. (A:X,B:X), (A:X,B:Y), (A:Y,B:X), (A:Y,B:Y).
Both solutions are limited in flexibility and scalability, and hence, both approaches are applicable only within a limited domain. The standardization approach relies on the existence of common grammatical categories and features found in the languages for which standard-conformant tag sets are to be developed. Otherwise, it results in a projection of complexity (e.g. the standard predicts grammatical categories for a standard-conformant tag set that are absent in the language). However, even the very existence of universal morphosyntactic categories has been questioned in typological research, and hence, the EAGLES-based standardization approach is unlikely to extend beyond "Standard Average European" languages.
The interlingua approach, in contrast, constructs an interlingua from the existing schemes and is thus less static than the standardization approach. However, the complexity of the interlingua grows monotonically with every new language or tag set considered, and, hence, the general applicability of the interlingua approach is restricted by its limited scalability.
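The scalability problem can be illustrated with a minimal sketch of the worst case described above, in which no direct mapping exists and every combination of tags has to be represented. The function name and the third scheme `C` are hypothetical; schemes `A` and `B` with annotations `X` and `Y` follow the example in the text.

```python
from itertools import product


def interlingua_categories(tag_sets: dict[str, list[str]]) -> list[tuple[str, ...]]:
    """Worst case: every combination of one tag per scheme is its own category."""
    schemes = sorted(tag_sets)
    combinations = product(*(tag_sets[scheme] for scheme in schemes))
    return [
        tuple(f"{scheme}:{tag}" for scheme, tag in zip(schemes, combination))
        for combination in combinations
    ]


# Two schemes with two unmappable annotations each, as in the example above.
two_schemes = interlingua_categories({"A": ["X", "Y"], "B": ["X", "Y"]})
# Adding a (hypothetical) third scheme C doubles the number of categories.
three_schemes = interlingua_categories(
    {"A": ["X", "Y"], "B": ["X", "Y"], "C": ["X", "Y"]}
)

print(two_schemes)
print(len(two_schemes), len(three_schemes))
```

In this worst case, the number of categories is the product of the tag set sizes, so each additional scheme multiplies the size of the interlingua rather than adding to it.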
Therefore, the project is currently developing an ontology of linguistic annotations as a more flexible representation of a "terminological backbone".
An ontology-based approach
So far, we have developed an ontology of linguistic annotations with special consideration of the part of speech and morphological annotations existing in the participating Collaborative Research Centers (Schmidt et al. 2006, Chiarcos 2006c, Chiarcos 2006d, Chiarcos 2007).
The approach relies on the ontological reconstruction of annotation schemes based on guidelines and additional documentation in so-called "annotation models" (or "domain models").
Every annotation model represents one tag set or annotation scheme, with nonterminal nodes (concepts) representing conceptual categories as mentioned in the documentation or indicated in the document structure of the annotation guidelines, and terminal nodes (instances) representing concrete annotation values, or tags.
As an illustration, prototypes for the following annotation models are available in an HTML serialization:
- STTS (POS tags, German) [owl] (Stuttgart, Tübingen and Tiger-Variant)
- Tiger-Morphology (Morphology, POS tags inherited from STTS, German) [owl]
- SUSANNE (POS tags with partial information about morphosyntax and lexical semantics, English) [owl]
- Uppsala (POS tags and morphology, Russian) [owl]
With respect to morphosyntactic annotations, the OLiA annotation models currently comprise 16 annotation schemes applied to 42 languages (5 annotation models for English, 5 annotation models for German, 2 annotation models for Russian, one annotation model for Tibetan, one for Old High German, the Connexor annotation model for 10 European languages, one annotation model for a typologically-oriented annotation scheme applied to 29 languages). Annotation models for syntax and information structure/anaphora are currently under construction.
The concepts of these annotation models are linked to a common "reference model" which is based on the EAGLES recommendations for morphosyntax, and extended according to the needs of the participating annotation models, hence it is also referred to as "E(xtended)-EAGLES" ontology.
The annotation models are then mapped onto the categories specified in the reference model by means of conceptual subsumption (rdfs:subClassOf, rdfs:subPropertyOf). This mapping is specified in separate "linking files", thus making both the reference model and the annotation models independent and self-contained ontologies.
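The division of labour between the three kinds of models can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the actual models are OWL/DL ontologies, and all concept names and prefixes below (`stts:PronominalAdverb`, `olia:PronominalAdverb`, `olia:Adverb`) are hypothetical. Only the tag `PAV` is taken from the STTS example above.

```python
# Annotation model: tags are instances of annotation concepts.
TAG_TO_CONCEPT = {"PAV": "stts:PronominalAdverb"}  # concept name is hypothetical
ANNOTATION_SUBCLASS_OF: dict[str, set[str]] = {"stts:PronominalAdverb": set()}

# Reference model: a self-contained concept hierarchy (hypothetical excerpt).
REFERENCE_SUBCLASS_OF = {
    "olia:PronominalAdverb": {"olia:Adverb"},
    "olia:Adverb": set(),
}

# Linking model: the only place where the two models refer to each other.
LINKING_SUBCLASS_OF = {"stts:PronominalAdverb": {"olia:PronominalAdverb"}}


def superconcepts(concept: str, *hierarchies: dict[str, set[str]]) -> set[str]:
    """Collect all concepts subsuming `concept`, following subclass edges transitively."""
    seen: set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for hierarchy in hierarchies:
            for parent in hierarchy.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
    return seen


def reference_concepts_for_tag(tag: str) -> set[str]:
    """Reference model concepts that a concrete tag is subsumed by."""
    all_supers = superconcepts(
        TAG_TO_CONCEPT[tag],
        ANNOTATION_SUBCLASS_OF,
        LINKING_SUBCLASS_OF,
        REFERENCE_SUBCLASS_OF,
    )
    return {concept for concept in all_supers if concept.startswith("olia:")}


print(sorted(reference_concepts_for_tag("PAV")))
```

Because the subsumption edges between the models live in a separate dictionary, either model can be replaced or extended without touching the other, which mirrors the role of the linking files.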
The "reference model", however, does not specify authoritative definitions for existing terminology, but only a fairly traditional view on it. Hence, its primary function is not to provide prescriptive definitions of terms, but only to provide a reference point for the participating annotation models. Whenever a more reliable ontology of linguistic terminology is developed (e.g. revised versions of the General Ontology of Linguistic Description (GOLD) or the grammis ontology), the reference model can be linked with it in the same way as the annotation models are linked with the reference model, and thus mediate between such an external reference model and the annotation models. In this sense, the reference model serves as an interface to the annotation models, and it might be better termed an "interface model".
- an exemplary implementation of the linking of E-EAGLES with an extended version of GOLD, v.0.3, as an external reference model [owl]
Ontology-based corpus querying
Besides the purely documentary function of the ontologies, the specifications in the ontology can be used for tag-set neutral corpus querying. In essence, this means that expressions from the ontology can be directly used for corpus queries. As an example, a user may enter the query
- PossessivePronoun and hasNumber(Singular) and hasGender(Neuter) and hasCase(Genitive)
instead of the SUSANNE tag
- APPGh1
Of course, APPGh1 is shorter, but it is a cryptic and idiosyncratic abbreviation, and knowing the function of APPGh1 in SUSANNE does not help when searching for the corresponding items in, say, the Uppsala corpus, where the same query expands to
- pronomen_pos_1p_gen_sg_neut_opl | pronomen_pos_2p_gen_sg_neut_opl | ...
In particular, this kind of ontology-based corpus querying allows researchers unfamiliar with a certain resource to take a first look at a corpus with an unknown tag set without having to spend too much effort on locating and reading the annotation documentation. Hence, the barrier to re-using existing resources is substantially lowered.
For ontology-based corpus querying, the OntoClient was developed, a Java package that works as a pre-processor for corpus queries. Given a query string, the OntoClient replaces ontology-sensitive sub-strings with the disjunction of those tags that are retrieved as instances satisfying the criteria specified in the ontology-sensitive sub-string.
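The pre-processing step can be sketched as follows. This is not the OntoClient itself (which is written in Java) but a minimal Python illustration under several assumptions: the `{{...}}` delimiters for ontology-sensitive sub-strings, the CQP-style `[pos="..."]` wrapper, the tag index layout and the entry `HYPOTHETICAL_NOUN_TAG` are all made up for this example. The two Uppsala tags and the query terms are the ones quoted above.

```python
import re

# Hypothetical index: which ontology terms each tag satisfies.
# The first two tags are quoted in the text; the third is a made-up non-match.
POSSESSIVE_GEN_SG_NEUT = {
    "PossessivePronoun",
    "hasNumber(Singular)",
    "hasGender(Neuter)",
    "hasCase(Genitive)",
}
UPPSALA_TAGS = {
    "pronomen_pos_1p_gen_sg_neut_opl": POSSESSIVE_GEN_SG_NEUT,
    "pronomen_pos_2p_gen_sg_neut_opl": POSSESSIVE_GEN_SG_NEUT,
    "HYPOTHETICAL_NOUN_TAG": {"CommonNoun"},
}

# Assumed delimiter for ontology-sensitive sub-strings: {{ ... }}
ONTOLOGY_EXPR = re.compile(r"\{\{(.*?)\}\}")


def matching_tags(expression: str, tag_index: dict[str, set[str]]) -> list[str]:
    """Return all tags that satisfy every term of a conjunctive expression."""
    terms = {term.strip() for term in expression.split(" and ")}
    return sorted(tag for tag, described in tag_index.items() if terms <= described)


def expand_query(query: str, tag_index: dict[str, set[str]], separator: str = "|") -> str:
    """Replace each ontology-sensitive sub-string with a disjunction of tags."""

    def replace(match: re.Match) -> str:
        return "(" + separator.join(matching_tags(match.group(1), tag_index)) + ")"

    return ONTOLOGY_EXPR.sub(replace, query)


query = (
    '[pos="{{PossessivePronoun and hasNumber(Singular) '
    'and hasGender(Neuter) and hasCase(Genitive)}}"]'
)
print(expand_query(query, UPPSALA_TAGS))
```

The same ontology-based expression can be expanded against a different tag index (for example one built for SUSANNE) without changing the query, and the `separator` parameter hints at how the output could be adapted to the disjunction syntax of another query language.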
The output of the OntoClient is highly configurable, and thus, it can be easily applied to practically any kind of existing corpus query interface.
- Currently, we have implemented a prototype for an ontology-sensitive CQP interface.
- At the GLDV Frühjahrstagung 2007, Christian Chiarcos and Michael Götze presented the integration of the OntoClient with ANNIS.
- At RANLP 2007, Georg Rehm, Richard Eckart and Christian Chiarcos will present the application of the OntoClient as a pre-processor for XQuery templates.
OLiA Reference Model and system ontologies
Module | Phenomenon | OWL/DL models |
---|---|---|
OLiA Reference Model for morphosyntax, morphology and syntax | morphosyntax, morphology and syntax | http://purl.org/olia/olia.owl |
OLiA Reference Model for discourse structure | discourse structure, discourse relations | t.b.a |
OLiA Reference Model for information structure | information structure, information status, coreference | t.b.a |
OLiA System Ontology | basic annotation data structures | http://purl.org/olia/system.owl |
OLiA Top-Level Ontology | top-level concepts of the OLiA Reference Model for morphosyntax, morphology and syntax | http://purl.org/olia/olia-top.owl |
Annotation Models
Multilingual Annotation Models for morphological, morphosyntactic and syntactic annotation
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | > 30 typologically different languages, including many African languages | Annotation Model, Linking Model |
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | 11 EU languages, incl. Romance, Germanic, Greek and Irish | Annotation Model, Linking Model |
Connexor dependency parser | morphosyntax, morphology, dependency syntax | 10 European languages, incl. Romance, Germanic and Uralic languages | Annotation Model, Linking Model |
MULTEXT-East | morphosyntax, morphology | 15 mostly Eastern European languages, incl. Slavic, Romance, Uralic languages and Persian | Annotation Model (common specifications), Linking Model; Annotation Model (all languages), see project page and below for individual languages |
IL-POSTS tagset Baskaran et al. (2008) | morphosyntax | languages of the Indian subcontinent | Annotation Model, Linking Model |
AnnCorra Bharati et al. (2006) | morphosyntax, chunks | languages of the Indian subcontinent | Annotation Model, Linking Model |
IIIT tagset IIIT (2007) | morphosyntax | languages of the Indian subcontinent | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of English
Tagset / NLP tool | Phenomenon | OWL/DL models |
---|---|---|
Brown corpus tagset | morphosyntax | Annotation Model, Linking Model |
Connexor dependency parser | morphosyntax, morphology, dependency syntax | Annotation Model, Linking Model |
EAGLES recommendations (English) (Leech and Wilson 1996) | morphosyntax | Annotation Model, Linking Model |
GENIA corpus | morphosyntax | Annotation Model, Linking Model |
MULTEXT-East (English) | morphosyntax | Annotation Model, Linking Model |
Penn Treebank | morphosyntax | Annotation Model, Linking Model |
Penn Treebank | syntax | Annotation Model, Linking Model |
QTag | morphosyntax | Annotation Model, Linking Model |
Stanford dependency parser | dependency syntax | Annotation Model, Linking Model |
Susanne corpus | morphosyntax | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of German
Tagset / NLP tool | Phenomenon | OWL/DL models |
---|---|---|
Connexor dependency parser | morphosyntax, morphology, dependency syntax | Annotation Model, Linking Model |
EAGLES recommendations (German) (Leech and Wilson 1996) | morphosyntax | Annotation Model, Linking Model |
Morphisto | morphology | Annotation Model, Linking Model |
STTS | morphosyntax | Annotation Model, Linking Model |
TIGER/NEGRA | morphology | Annotation Model, Linking Model |
TIGER/NEGRA | constituent syntax | Annotation Model, Linking Model |
TreeTagger Chunker | chunk labels | Linking Model |
RFTagger | morphosyntax, morphology | t.b.a |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of other Germanic languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax; inflectional morphology | Danish, Dutch, Swedish (and several non-Germanic languages) | Annotation Model, Linking Model |
Connexor | morphosyntax, morphology, dependency syntax | Dutch, Swedish, Danish, Norwegian | Annotation Model, Linking Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Dutch (among other languages) | Annotation Model, Linking Model |
MENOTA (incomplete) | morphosyntax | Old Norse | Annotation Model, Linking Model |
T-CODEX | morphosyntax, syntax, information structure | Old High German | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of Russian
Tagset / NLP tool | Phenomenon | OWL/DL models |
---|---|---|
Uppsala corpus tagset | morphosyntax, morphology | Annotation Model, Linking Model |
Russian TreeTagger (Serge Sharoff) | morphosyntax | Annotation Model, Linking Model |
MULTEXT-East for Russian | morphosyntax, morphology | Annotation Model, Linking Model |
Annotation Models for the morphosyntactic annotation of other Slavic languages
Tagset / NLP tool | Languages | OWL/DL models |
---|---|---|
MULTEXT-East | Bulgarian | Annotation Model, Linking Model |
MULTEXT-East | Czech | Annotation Model, Linking Model |
MULTEXT-East | Macedonian | Annotation Model, Linking Model |
MULTEXT-East | Polish | Annotation Model, Linking Model |
MULTEXT-East | Slovak | Annotation Model, Linking Model |
MULTEXT-East | Slovene | Annotation Model, Linking Model |
MULTEXT-East | Resian (Slovene spoken in Italy) | Annotation Model, Linking Model |
MULTEXT-East | Serbian | Annotation Model, Linking Model |
MULTEXT-East | Ukrainian | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of French
Tagset / NLP tool | Phenomenon | OWL/DL models |
---|---|---|
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | Annotation Model, Linking Model |
French TreeTagger (Achim Stein) | morphosyntax | Annotation Model |
Le Monde corpus (Abeillé et al. 2000) | morphosyntax | Annotation Model |
Connexor | morphosyntax, morphology, dependency syntax | Annotation Model, Linking Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) for Canadian French (among other languages) | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of other Romance languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | Catalan, Portuguese, Spanish | Annotation Model, Linking Model |
Connexor | morphosyntax, morphology, dependency syntax | Spanish, Italian | Annotation Model, Linking Model |
PAROLE Spanish/Catalan (http://nlp.lsi.upc.edu/freeling) | morphosyntax, inflectional morphology | Spanish, Catalan | Annotation Model |
MULTEXT-East | morphosyntax, morphology | Romanian | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of Uralic and Altaic languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
Connexor | morphosyntax, morphology, dependency syntax | Finnish | Annotation Model, Linking Model |
MULTEXT-East | morphosyntax, morphology | Estonian | Annotation Model, Linking Model |
MULTEXT-East | morphosyntax, morphology | Hungarian | Annotation Model, Linking Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Hungarian (among other languages) | Annotation Model, Linking Model |
Turkish POS tagset (Oflazer et al. 2003) | morphosyntax | Turkish | Annotation Model |
Annotation Models for the morphosyntactic annotation of other European languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
EAGLES recommendations (Leech and Wilson 1996) | morphosyntax | Greek, Irish (among other EU languages) | Annotation Model, Linking Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Georgian, Greek (among other languages) | Annotation Model, Linking Model |
EUSTagger (Ezeiza et al. 1998) | morphosyntax | Basque | Annotation Model |
Annotation Models for the morphosyntactic annotation of Indoiranian languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
Urdu EMILLE tagset Hardie (2003, 2004) | morphosyntax, inflectional morphology | Urdu | Annotation Model, Linking Model |
Urdu tagset Sajjad (2007) | morphosyntax | Urdu | Annotation Model, Linking Model |
IL-POSTS tagset Baskaran et al. (2008) | morphosyntax, inflectional morphology | Bangla, Hindi, Marathi, Sanskrit | Annotation Model, Linking Model |
AnnCorra Bharati et al. (2006) | morphosyntax, chunks | Bangla, Hindi | Annotation Model, Linking Model |
IIIT tagset IIIT (2007) | morphosyntax | Hindi, Marathi | Annotation Model, Linking Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Konkani (among other, unrelated languages) | Annotation Model, Linking Model |
MULTEXT-East | morphosyntax | Farsi (Persian) | Annotation Model, Linking Model |
Annotation Models for the morphosyntactic annotation of Dravidian languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
IL-POSTS tagset Baskaran et al. (2008) | morphosyntax | Kannada, Malayalam, Tamil, Telugu | Annotation Model, Linking Model |
AnnCorra Bharati et al. (2006) | morphosyntax, chunks | Telugu, Tamil | Annotation Model, Linking Model |
IIIT tagset IIIT (2007) | morphosyntax | Telugu | Annotation Model, Linking Model |
Annotation Models for the morphological, morphosyntactic and syntactic annotation of Tibeto-Burman languages
Tagset / NLP tool | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
Dzongkha tagset (Chungku et al. 2010) | morphosyntax | Dzongkha | Annotation Model, Linking Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Prinmi (among other, unrelated languages) | Annotation Model, Linking Model |
Tübingen Tibetan Corpora (Wagner & Zeisler 2004) | morphosyntax, morphology, syntax | Tibetan (Old Tibetan, Classical Tibetan, Balti, Ladakh) | Annotation Model |
Annotation Models for East Asian languages
Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
Penn Chinese Treebank (Xia 2000) | morphosyntax | Chinese | Annotation Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Japanese (among other, unrelated languages) | Annotation Model, Linking Model |
Annotation Models for Afroasiatic languages
Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
Arabic tagset (Khoja 2001) | morphosyntax | Arabic | Annotation Model |
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Chadic languages (including Guruntum, Tangale, Hausa) | Annotation Model, Linking Model |
Hausa Internet Corpus (Chiarcos et al. 2011) | morphosyntax | Hausa | t.b.a |
Annotation Models for the languages of Subsaharic Africa
Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Gur and Kwa languages (including Aja, Dagbani, Buli, Byali, Ditammari, Fon, Foodo, Konni, Nateni, Waamma, Yom) | Annotation Model, Linking Model |
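SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Chadic languages (including Guruntum, Tangale, Hausa) | Annotation Model, Linking Model |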
Hausa Internet Corpus (Chiarcos et al. 2011) | morphosyntax | Hausa | t.b.a |
Annotation Models for indigenous languages of the Americas, Australia and the Pacific
Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
SFB632 annotation standard (Dipper et al. 2008) | parts of speech, glosses, chunk labels, grammatical functions (phonology, information structure) | Teribe, Yucatec Maya, Mawng, Niue | Annotation Model, Linking Model |
Annotation Models for discourse annotations
Annotation scheme / Corpus | Phenomenon | Languages | OWL/DL models |
---|---|---|---|
ARRAU corpus | coreference | English | t.b.a |
CRC 732, A3 annotations of the Stuttgarter Radio News Corpus | information status, pronominal coreference | German | t.b.a |
OntoNotes | coreference | English | t.b.a |
Penn Discourse Graphbank | discourse relations | English | t.b.a |
Penn Discourse Treebank | connectives, discourse relations | English | t.b.a |
Potsdam Coreference Scheme | coreference | English, German | t.b.a |
RST Discourse Treebank | RST discourse relations and discourse segments | English | t.b.a |
External Reference Models
Terminological repository | Original url | Local url | Linking Model |
---|---|---|---|
ISO TC37/SC4 Data Category Registry | http://www.isocat.org | t.b.a | t.b.a |
GOLD | http://linguistics-ontology.org | t.b.a | t.b.a |