This weekend, I had the privilege of virtually attending the 60th Annual Meeting of the Association for Computational Linguistics (ACL). I was able to attend two fantastic lectures, and I will be discussing the first today. Its topic was one of great importance to modern computational linguistics: how to train neural networks accurately with limited access to labeled text data.
The presentation began with a quote relatable to anyone who’s tried to complete a computational linguistics task with machine learning: “I have an extremely large collection of clean labeled data” – no one. The truth of the matter is that creating labeled data sets is expensive and time-consuming, so to reach the hundreds of thousands, if not millions, of data points needed for accurate training, unlabeled data will almost always be needed, gathered from sources like Wikipedia or review websites. The problem gets even worse with low-resource languages, which often have little to no labeled data. Since I am currently working on a task with very few resources, the methods discussed in this lecture are extremely important to my work.
I’d like to focus on one of the major solutions the presenters highlighted: semi-supervised learning, which combines labeled and unlabeled data in various ways to train an optimal model. One semi-supervised method the presenters discussed was entropy minimization. Entropy is the degree of a model’s uncertainty, so minimizing a model’s entropy tells it to be as confident as possible about its predictions. This works great when the model’s decision boundary does not run through any large clusters of data, but it breaks down when the boundary does cut through such a cluster: the model’s habit of being confident in all its predictions is then much more likely to produce overconfident, less accurate results for the points in that cluster, leading to a flawed outcome.
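To make the idea concrete, here is a minimal sketch (my own illustration, not code from the tutorial) of the entropy term that gets minimized: the model's softmax output over unlabeled examples is scored by its Shannon entropy, and that average entropy is added to the training loss so the model is pushed toward confident predictions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each predicted distribution.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

# A confident prediction has low entropy; a near-uniform one has high entropy.
confident = softmax(np.array([[4.0, 0.0, 0.0]]))
uncertain = softmax(np.array([[0.1, 0.0, 0.1]]))

# Entropy minimization adds the mean entropy over unlabeled data as an
# auxiliary loss term, nudging the model toward low-entropy predictions.
unsup_loss = entropy(uncertain).mean()
```

Note that nothing in this term checks whether the confidence is justified, which is exactly why it misbehaves when the decision boundary sits inside a dense cluster.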
Another, similar method that was discussed is self-training with argmax, in which the model treats its own predictions as 100 percent certain, making the decision for each data point a hard label. The result is usually quite similar to entropy minimization, if perhaps a bit more accurate.
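A rough sketch of what that argmax step looks like (my own hypothetical helper, with a confidence threshold that is a common addition rather than something stated in the lecture): each prediction over unlabeled data is collapsed to its most probable class and treated as a real label for the next round of training.

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    # Collapse each predicted distribution to a hard argmax label,
    # keeping only predictions whose top probability clears the threshold.
    labels = probs.argmax(axis=-1)
    keep = probs.max(axis=-1) >= threshold
    return labels[keep], keep

probs = np.array([
    [0.95, 0.03, 0.02],  # confident: kept, labeled as class 0
    [0.40, 0.35, 0.25],  # uncertain: discarded this round
])
labels, keep = pseudo_label(probs)
```

The kept pseudo-labeled points are then mixed into the labeled set and the model is retrained, typically for several rounds.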
One of the last methods discussed was, in my opinion, the most interesting: SentAugment. It involves training a teacher model on a small set of labeled data; the teacher then trains a student model on the original labeled data plus the unlabeled data that most closely resembles it, selected through a series of checks as simple as finding the nearest neighbor of each labeled data point. This combination of a strong base model, whatever labeled data you do have, and the most relevant unlabeled data points produces a strong, accurate model.
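The nearest-neighbor check can be sketched like this (a simplified illustration under my own assumptions: I use cosine similarity over sentence embeddings, which is a standard choice, and made-up two-dimensional vectors in place of real embeddings):

```python
import numpy as np

def nearest_unlabeled(labeled_emb, unlabeled_emb, k=2):
    # Cosine similarity between each labeled sentence embedding and the
    # unlabeled pool; return the indices of the k most similar unlabeled
    # sentences for each labeled one.
    a = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)
    b = unlabeled_emb / np.linalg.norm(unlabeled_emb, axis=1, keepdims=True)
    sims = a @ b.T
    return np.argsort(-sims, axis=1)[:, :k]

labeled = np.array([[1.0, 0.0]])            # one labeled sentence embedding
unlabeled = np.array([[0.9, 0.1],           # close to the labeled point
                      [0.0, 1.0],           # far away
                      [1.0, 0.05]])         # closest of all
picks = nearest_unlabeled(labeled, unlabeled)
```

The retrieved sentences are then pseudo-labeled by the teacher and used to train the student, so only unlabeled text relevant to the task enters training.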
A big thank you to Diyi Yang, Ankur Parikh, and Colin Raffel for the incredible lecture!