Almannarómur: An Open Icelandic Speech Corpus
The purpose of the "Almannarómur" project is to collect data for an Icelandic speech corpus. Its main aim is to create an open-source speech project that enables research and development in Icelandic language technology. The database is particularly suitable for acoustic modelling for speech recognition, but it could also be used for other purposes, such as developing a speaker recognition system or analyzing prosody. The project is run by Reykjavik University and the ICLT in cooperation with Google. Further information can be found here.
A System Architecture for Intelligent Computer-Assisted Language Learning
The aim of this project is to develop an open-source system architecture for supporting ICALL systems, with an emphasis on the Nordic languages. The main goal is to make the architecture as language-independent as possible, so that the relevant underlying NLP tools can easily be plugged in as external modules depending on the language in question. The same holds for the types of activities (exercises) that the architecture will support: new activities will be "plug and play" to as large an extent as possible.
Baltic and Nordic Parts of the European Open Linguistic Infrastructure (META-NORD)
The META-NORD project aims to establish an open linguistic infrastructure in the Baltic and Nordic countries.
The project will focus on European languages with fewer than 10 million speakers: the official EU languages Danish, Finnish and Swedish; the languages of the countries that recently acceded to the EU, Estonian, Latvian and Lithuanian; and the languages of the European Economic Area, Icelandic and Norwegian. Further information can be found here.
Viable Language Technology Beyond English
In January 2009, ICLT received a Grant of Excellence, "Viable Language Technology Beyond English", from the Icelandic Research Fund. The primary objective is to make it realistic to develop three particular types of LT modules with limited resources without sacrificing the quality of the work. The three types of modules are a database of semantic relations, a machine translation system, and a treebank.
Further information can be found here.
Construction of a new gold standard corpus
In this project, we build a new gold standard, a PoS-tagged corpus of modern Icelandic text. The corpus consists of a sample, about 1 million tokens, from Mörkuð íslensk málheild [PoS-tagged Icelandic corpus]. This sample will become a new gold standard to be used to train and evaluate taggers on Icelandic text.
In addition to the corpus itself, the main product of this project is a tool which automates the tagging process. This process consists of i) sentence segmentation and word tokenisation; ii) PoS tagging using five different taggers and a combination method; and iii) the detection of tagging errors.
Improved tagging accuracy of Icelandic text
Icelandic is a morphologically complex language for which the task of part-of-speech (PoS) tagging has turned out to be difficult, both for data-driven and linguistic rule-based taggers. This project aimed to improve the tagging accuracy using several methods.
First, we used data from a large morphological database to extend the dictionaries used by the taggers. Second, we searched for cost-effective ways to increase the tagging accuracy of our linguistic rule-based tagger IceTagger. Third, we designed an external tagset (the tagset used for evaluation) by removing from the internal tagset (the tagset used by a tagger) information which reflects distinctions that are not morphologically based. Lastly, we used six different taggers to build a combined tagger for the purpose of further increasing the tagging accuracy.
Detecting errors in a PoS-tagged corpus
Part-of-speech (PoS) tagged corpora are valuable resources for developing PoS taggers. Corpora in various languages have been used to train (in the case of data-driven methods) and develop (in the case of linguistic rule-based methods) different taggers, and to evaluate their accuracy. Consequently, the quality of the PoS annotation in a corpus (the gold standard annotation) is crucial.
In this project, we experimented with three different methods of PoS error detection using the Icelandic Frequency Dictionary (IFD) corpus.
First, we used the variation n-gram method proposed by Dickinson and Meurers (2003).
Secondly, we ran five different taggers on the corpus and examined those cases where all the taggers agreed on a tag, but, at the same time, disagreed with the gold standard annotation.
Lastly, we used IceParser to generate shallow parses of sentences in the corpus and then developed various patterns, based on feature agreement, for finding candidates for annotation errors.
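The first two methods can be illustrated with a minimal Python sketch. It is not the code used in the project: the first function is a simplified, fixed-length (trigram) reading of the variation n-gram idea of Dickinson and Meurers (2003), and the function names, the data layout (sentences as lists of (word, tag) pairs) and any tags passed in are assumptions made for illustration.

```python
from collections import defaultdict


def variation_trigrams(sentences):
    """Return word trigrams whose middle word carries more than one tag.

    sentences: list of sentences, each a list of (word, tag) pairs.
    An identical word context with differing tags is a candidate error.
    """
    seen = defaultdict(set)
    for sent in sentences:
        for i in range(1, len(sent) - 1):
            context = (sent[i - 1][0], sent[i][0], sent[i + 1][0])
            seen[context].add(sent[i][1])
    return {ctx: tags for ctx, tags in seen.items() if len(tags) > 1}


def unanimous_disagreements(gold_tags, tagger_outputs):
    """Return (position, gold tag, proposed tag) where all taggers agree
    with each other but not with the gold standard.

    gold_tags: list of gold tags, one per token.
    tagger_outputs: list of tag lists, one list per tagger.
    """
    flagged = []
    for i, gold in enumerate(gold_tags):
        proposed = {output[i] for output in tagger_outputs}
        if len(proposed) == 1 and gold not in proposed:
            flagged.append((i, gold, proposed.pop()))
    return flagged
```

Both functions only produce candidates; each flagged position still has to be inspected by an annotator before the gold standard is changed.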
The tagging accuracy of a particular text in a given language can usually be increased by combining taggers which are based on different tagging methods. In most cases, each combined tagger has been written from scratch, i.e. each developer has written the necessary program code to build the combined tagger. This is unfortunate because, generally, it entails the reproduction of code already written.
To tackle this problem, we developed CombiTagger, a language and tagset independent system for developing and evaluating combined taggers.
The system provides algorithms for simple and weighted voting, but it is extensible so that other combination algorithms can be added easily.
CombiTagger is an open source system which can be obtained here.
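As an illustration of simple and weighted voting, the following Python sketch picks one tag for a single token from the tags proposed by several taggers. It is not CombiTagger's code or API: the function names, the tie-breaking rule (the earliest tagger in the list wins) and the idea of passing one weight per tagger are assumptions.

```python
from collections import Counter, defaultdict


def simple_vote(proposals):
    """Pick the tag proposed by the most taggers.

    proposals: list of tags for one token, one per tagger.
    Ties are broken in favour of the earliest tagger in the list.
    """
    counts = Counter(proposals)
    best = max(counts.values())
    return next(tag for tag in proposals if counts[tag] == best)


def weighted_vote(proposals, weights):
    """Pick the tag with the highest summed weight.

    weights: one number per tagger, for example its accuracy on
    held-out data (an assumption; any non-negative weights work).
    """
    scores = defaultdict(float)
    for tag, weight in zip(proposals, weights):
        scores[tag] += weight
    best = max(scores.values())
    return next(tag for tag in proposals if scores[tag] == best)
```

With weighted voting, a single strong tagger can outvote two weaker ones that agree with each other, which simple voting does not allow.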
Lemmald: A lemmatizer for Icelandic
In this project, we developed a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger for tagging and The Icelandic Frequency Dictionary corpus for training. We combined the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we made use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge. Accuracy of the lemmatization is further improved using an add-on which connects to the Database of Modern Icelandic Inflections.
Context-Sensitive Spelling Correction
Context-sensitive spelling correction is the task of correcting spelling errors which result in valid words. In this project, we adapted established methods from English to a morphologically rich language (Icelandic) and concluded that the rich morphology negatively affects performance. However, our system is still good enough to be useful in regular word processing.
IceParser: A finite-state parser for Icelandic
In this project, we developed an incremental finite-state parser for Icelandic - the first parser published for the language. Input to the parser is PoS-tagged text and it generates output according to a shallow syntactic annotation scheme, specifically designed for this project.
The parser consists of a phrase structure module and a syntactic functions module. Both modules comprise a sequence of finite-state transducers, each of which adds syntactic information into substrings of the input text.
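The idea of a cascade, in which each stage adds annotation to the output of the previous one, can be sketched in Python with regular-expression substitutions standing in for finite-state transducers. This is only an illustration under stated assumptions: the word/TAG input format, the toy tags (DET, ADJ, NOUN, VERB), the bracket labels and the three rules are hypothetical and do not reproduce IceParser's transducers, tagset or annotation scheme.

```python
import re

# Each stage is a (pattern, replacement) pair applied to the output of the
# previous stage. Tokens are written word/TAG; tags and labels are toy values.
STAGES = [
    # Stage 1: wrap each adjective in an adjective phrase.
    (re.compile(r"(\S+/ADJ)"), r"[AP \1 AP]"),
    # Stage 2: a noun phrase is an optional determiner, an optional
    # adjective phrase from stage 1, and a noun.
    (re.compile(r"((?:\S+/DET )?(?:\[AP [^\]]+ AP\] )?\S+/NOUN)"), r"[NP \1 NP]"),
    # Stage 3: wrap each verb in a verb phrase.
    (re.compile(r"(\S+/VERB)"), r"[VP \1 VP]"),
]


def parse(tagged_text):
    """Run the stages in order, each one inserting brackets into the string."""
    for pattern, replacement in STAGES:
        tagged_text = pattern.sub(replacement, tagged_text)
    return tagged_text
```

Because stage 2 refers to the brackets inserted by stage 1, the order of the stages matters, which is what makes the approach incremental.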
In its early years, ICLT spent considerable effort on experimenting with and developing tagging methods for Icelandic. We have experimented with data-driven taggers and various combination methods, and developed a linguistic rule-based tagger named IceTagger.