Bergen Language Design Laboratory (BLDL)
BLDL has an internal meeting series. Some of the meetings have content that may be of interest to a larger audience. The program for these meetings is announced here.
Contact Magne Haveraaen for more information.
- Friday 2019-04-26 1215-1400,
510N3 (the yoga room, informatikk), Høyteknologisenteret
Paul Meurer (University of Bergen Library, Department of Linguistic, Literary and Aesthetic Studies, University of Bergen, NO):
Querying linguistic databases – corpora and treebanks
In my lecture, I will give an introduction to databases of linguistic structures – corpora and treebanks, and show how such databases can be searched in a linguistically meaningful way.
Corpora are searchable collections of texts that are linguistically annotated, typically with part of speech, lemma and morphology information, and text-related metadata. Treebanks are special corpora of syntactically annotated texts; the name treebank derives from the fact that syntactic annotations are often tree structures (or in some cases more general directed graphs).
From a formal point of view, a corpus is a sequence of tokens (comprising the sequence of words of the texts), with attribute values attached to each one of the tokens, for a given set of attributes. A search language that is best suited to querying such a structure is a regular language; formally, it can be characterized as a regular expression calculus over the alphabet of constraints on corpus positions (i.e., tokens plus attribute values). I will show how such a query language can be efficiently implemented.
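As a rough illustration of a regular expression calculus over token constraints, the following minimal sketch matches a pattern of attribute constraints against a small annotated token sequence. The attribute names (`word`, `pos`), the quantifier notation, and the functions `satisfies`, `match_at` and `search` are assumptions made for this example; this is not the query syntax or the implementation of any actual corpus tool, and the naive backtracking used here makes no claim to efficiency.

```python
from typing import Dict, List, Optional, Tuple

Token = Dict[str, str]                   # attribute -> value for one corpus position
Constraint = Dict[str, str]              # attribute -> required value
Pattern = List[Tuple[Constraint, str]]   # quantifier is "1", "?" or "*"


def satisfies(token: Token, constraint: Constraint) -> bool:
    """A token satisfies a constraint if every required attribute value matches."""
    return all(token.get(attr) == value for attr, value in constraint.items())


def match_at(corpus: List[Token], pattern: Pattern, pos: int, k: int = 0) -> Optional[int]:
    """Return the end position of a greedy match of pattern[k:] starting at pos, or None."""
    if k == len(pattern):
        return pos
    constraint, quant = pattern[k]
    ok = pos < len(corpus) and satisfies(corpus[pos], constraint)
    if quant == "1":
        return match_at(corpus, pattern, pos + 1, k + 1) if ok else None
    if quant in ("?", "*"):
        if ok:
            # "*" stays on the same pattern element, "?" moves on to the next one.
            next_k = k if quant == "*" else k + 1
            end = match_at(corpus, pattern, pos + 1, next_k)
            if end is not None:
                return end
        # Backtrack: skip this optional element without consuming a token.
        return match_at(corpus, pattern, pos, k + 1)
    raise ValueError(f"unknown quantifier: {quant}")


def search(corpus: List[Token], pattern: Pattern) -> List[Tuple[int, int]]:
    """Return (start, end) spans of all matches, trying every start position."""
    spans = []
    for start in range(len(corpus)):
        end = match_at(corpus, pattern, start)
        if end is not None:
            spans.append((start, end))
    return spans


corpus = [
    {"word": "the", "pos": "DET"}, {"word": "old", "pos": "ADJ"},
    {"word": "dog", "pos": "NOUN"}, {"word": "barks", "pos": "VERB"},
    {"word": "a", "pos": "DET"}, {"word": "cat", "pos": "NOUN"},
    {"word": "sleeps", "pos": "VERB"},
]

# A determiner, any number of adjectives, then a noun.
noun_phrase = [({"pos": "DET"}, "1"), ({"pos": "ADJ"}, "*"), ({"pos": "NOUN"}, "1")]
print(search(corpus, noun_phrase))
```

The alphabet of the pattern is made up of constraints rather than characters, so the same pattern matches "the old dog" and "a cat" regardless of the actual word forms.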
In treebanks, the search domain is the sequence of analyzed sentences, which formally are directed graphs. Such structures can be queried using a calculus based on first-order predicate logic. The variables of the calculus are node variables, and a sentence (directed graph) matches a logical form if there is a set of graph nodes on which the form evaluates to true.
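The matching condition for treebank queries can be sketched in the same spirit: a sentence graph matches if some assignment of graph nodes to the node variables makes the formula true. The graph encoding (a node dictionary plus labelled edges), the relation labels `subj` and `det`, and the function `find_assignments` are assumptions for illustration only. The brute-force enumeration below is a sketch of the semantics, not how an actual treebank search engine is implemented.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Set, Tuple

Nodes = Dict[int, Dict[str, str]]    # node id -> attributes
Edges = Set[Tuple[int, str, int]]    # (source, relation label, target)
Env = Dict[str, int]                 # node variable -> node id


def find_assignments(
    nodes: Nodes,
    edges: Edges,
    variables: List[str],
    formula: Callable[[Env, Nodes, Edges], bool],
) -> Iterator[Env]:
    """Yield every assignment of nodes to variables on which the formula is true."""
    for combo in product(nodes, repeat=len(variables)):
        env = dict(zip(variables, combo))
        if formula(env, nodes, edges):
            yield env


# Hypothetical dependency-style analysis of "the dog barks".
nodes = {
    1: {"word": "barks", "pos": "VERB"},
    2: {"word": "dog", "pos": "NOUN"},
    3: {"word": "the", "pos": "DET"},
}
edges = {(1, "subj", 2), (2, "det", 3)}


def verb_with_noun_subject(env: Env, nodes: Nodes, edges: Edges) -> bool:
    """pos(x) = VERB and pos(y) = NOUN and x has a subj edge to y."""
    x, y = env["x"], env["y"]
    return (
        nodes[x]["pos"] == "VERB"
        and nodes[y]["pos"] == "NOUN"
        and (x, "subj", y) in edges
    )


results = list(find_assignments(nodes, edges, ["x", "y"], verb_with_noun_subject))
print(results)
```

The sentence matches the query because at least one assignment exists; here the only one binds `x` to the verb node and `y` to its subject.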
To illustrate these two types of query languages, I will demonstrate the corpus tool Corpuscle and the treebanking infrastructure INESS, both hosted at the CLARINO Bergen Centre.
Short bio: Paul Meurer is a senior consultant and researcher at the University of Bergen Library and the Department of Linguistic, Literary and Aesthetic Studies. His research interests lie in the fields of theoretical and computational linguistics. Within theoretical linguistics, he has focused on morphology, syntax, and language typology. In computational linguistics, he has contributed to the research and development of language resources and tools in diverse fields, such as morphological and syntactic parsing for Norwegian, Georgian and Abkhaz, treebanking, visualization, corpus management and search, terminology, and metadata curation.
VilVite auditorium is in VilVite, Thormøhlensgt 51.
Conference room D is in VilVite, Thormøhlensgt 51.
Lille auditorium is in Datablokken, Høyteknologisenteret, Thormøhlensgt 55.
Stort auditorium is in Datablokken, Høyteknologisenteret, Thormøhlensgt 55.