This summer, I had the pleasure of attending the 33rd European Summer School in Logic, Language, and Information, a fascinating series of multidisciplinary courses focusing on the intersection of computational linguistics, semantics, and logic. Today I’d like to discuss one of the classes I participated in, called Creating and Maintaining High-quality Language Resources, which did a fantastic job of explaining best practices for creating high-quality research material and also introduced an interesting new perspective on the idea of training on a gold standard. The course was taught by Professor Valerio Basile, to whom I express my sincere thanks for leading the incredible class.
So, in order to discuss language resources, we first need to clarify what language resources actually are. In this case, the term language resource refers to a set of speech or language data, commonly paired with descriptions of the data referred to as tags or annotations. These resources are used for building, improving, and evaluating natural language and speech algorithms, along with serving as data for linguistic analysis.
There are many types of language resources, all of which have their own use cases. These types include corpora (large collections of speech or text with metadata), treebanks (collections of sentences and their corresponding syntactic structures), meaning banks (collections of sentences and representations of their meanings), lexicons (groups of words or multi-word expressions), and ontologies (structured representations of concepts and the relations between them). These resources all serve as training data for neural networks or material for statistical analysis in a multitude of linguistic tasks, and a large part of language study would be entirely impossible without them. Their quality makes the difference between successful research with meaningful results and null or unpublishable findings.
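To make this a little more concrete, here is a tiny, purely illustrative Python sketch of what entries in two of these resource types might look like; the tags and fields are hypothetical examples of mine, not taken from any specific published resource.

```python
# Illustrative only: a POS-tagged corpus sentence and a small lexicon entry.
tagged_corpus_sentence = [  # one sentence from a part-of-speech-tagged corpus
    ("The", "DET"), ("annotators", "NOUN"), ("disagree", "VERB"), (".", "PUNCT"),
]

lexicon_entry = {           # one entry from a lexicon
    "lemma": "disagree",
    "pos": "VERB",
    "gloss": "to hold a differing opinion",
}
```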
In order to create these vital resources, text or speech must be collected and then annotated. While the former task is self-explanatory, the latter requires a little more explanation. We must answer a few big questions: how do we choose our annotators, how do we ensure our annotators know how to annotate well, and what do we do when our annotators disagree?
There are a few viable answers to the first question, including calling upon experts, gathering small groups in controlled environments, hiring annotators, and crowdsourcing. More recently, automatic annotators have been incorporated as a first pass, whose output can then be verified by human judgment. While all of these methods have advantages and disadvantages, the main factors to consider are your funding, the size of your dataset, the number of annotators you want (diversity, efficiency, accuracy, etc.), and the difficulty of the annotation process.
The second question is extremely important and has a pretty straightforward answer, though it can be tricky to implement and often requires many revisions: annotators are given annotation guidelines, later published with the language resource, that explain how they should tag the data. The better the guidelines, the better the annotation, but because the aim of tagging varies between tasks, it’s difficult to determine what optimal guidelines actually look like. A few attempts at standardization are starting to have widespread influence, making this task less daunting. As for ensuring annotators are accurate and efficient, incentives can be given via pay, gamification, or simply making sure your annotators are interested or invested in the task.
The last question is also vital, and it doesn’t have one clear answer. Many statistical measures have been proposed for quantifying annotator agreement, ranging from simple percentage agreement to more sophisticated ones like Cohen’s kappa, Fleiss’ kappa, and split-half reliability. No matter which agreement measure is used, disagreement can be resolved by majority rule or by discussion resulting in a revision of the guidelines. It is always best to ensure that disagreements are resolved without bias and, whenever possible, without escalation.
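To give a feel for how these measures differ, here is a small Python sketch of my own (not course material) that computes simple percentage agreement and Cohen’s kappa for two hypothetical annotators labelling the same items:

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement expected from each annotator's own label distribution
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators labelling ten items for sentiment
ann1 = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg"]
ann2 = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg", "pos", "neg"]
print(percent_agreement(ann1, ann2))  # 0.8
print(cohens_kappa(ann1, ann2))       # roughly 0.68, lower once chance is accounted for
```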
Language resources can be published online with research papers or by themselves. A number of large databases for such resources exist, including GitHub, OSF, and the European Language Grid. Resources can also be self-published on personal websites, but this usually requires constant maintenance and considerable programming skill. Publishing thorough explanations of the methodology and the annotation guidelines alongside a resource is crucial to its effective use.
What has been discussed up to this point has been a summary of the standard approach to annotating language resources for many years, but recently, some researchers have started to call for a change in perspective. For more information, I recommend reading Professor Basile’s Perspectivist Data Manifesto, as it explains the issue in great detail. The abridged version is as follows: using a gold standard to train machine learning models actually causes inaccuracies instead of improving results. A gold standard is created when annotations from multiple annotators are combined in some way (majority rule, discussion, etc.) in order to have one final set of ‘ideal’ annotations to feed to a model for training. Lowering annotator disagreement as much as possible was thought to improve the quality of the data.
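As a concrete illustration of how such a gold standard is typically assembled, here is a small Python sketch of my own, using hypothetical labels, that collapses each item’s annotations into a single majority-vote label:

```python
from collections import Counter

# Hypothetical raw annotations: each item was labelled by three annotators.
raw_annotations = {
    "item_1": ["offensive", "offensive", "not_offensive"],
    "item_2": ["not_offensive", "offensive", "not_offensive"],
}

def gold_standard(annotations):
    """Classic aggregation: keep only the majority label for each item."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in annotations.items()}

print(gold_standard(raw_annotations))
# {'item_1': 'offensive', 'item_2': 'not_offensive'}
```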
Professor Basile and many of his colleagues argue that the opposite is true: training models on distributions of annotations instead of a single aggregated set creates more accurate, less biased models. The forced agreement generated during the creation of gold-standard data strengthens bias and takes away from a neural network’s ability to replicate the cognitive function of a representative sample of humans. Initial studies suggest this is indeed the case: the same models performed better when trained on the distributed annotations than when trained on the aggregated gold standard. While still very new, this approach seems quite promising for the future, and I’m excited to see where it goes!
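To sketch what this alternative could look like in practice, here is a toy example of my own (not the exact setup from the studies mentioned) that keeps the full annotation distribution as a soft training target; a simple cross-entropy loss then shows that a model mirroring the annotators’ disagreement is penalised less than one that is overconfident in the majority label alone:

```python
import math
from collections import Counter

def label_distribution(labels):
    """Keep the full annotation distribution instead of a single label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

def cross_entropy(soft_target, predicted):
    """Loss between the annotation distribution and a model's predicted probabilities."""
    return -sum(p * math.log(predicted[label])
                for label, p in soft_target.items() if p > 0)

# The same hypothetical item as above: 2 of 3 annotators said "offensive".
soft_target = label_distribution(["offensive", "offensive", "not_offensive"])

# A prediction that mirrors the disagreement is penalised less than one
# that is fully confident in the majority label.
print(cross_entropy(soft_target, {"offensive": 0.67, "not_offensive": 0.33}))  # ~0.64
print(cross_entropy(soft_target, {"offensive": 0.99, "not_offensive": 0.01}))  # ~1.54
```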