Evaluating Computational Models of Language Acquisition

How do children learn language? This question is one of the most important and mysterious challenges facing modern linguistics. If answered, language-teaching curricula could be optimized, neural networks could be designed to learn human language more effectively, corpora could be built to better reflect human language comprehension, and our understanding of the complex brain would be revolutionized.

There are two competing theories that attempt to answer this question: nativism and emergentism. The former is Chomsky’s theory of a universal grammar, popularized by Pinker, which claims humans possess a biological component responsible for the acquisition of language. The latter claims that language is learned like all other human skills – from stimuli in the environment. The question of which theory is correct is still fiercely debated to this day.

Today I would like to discuss a relatively new development in the search for an explanation of language learning: computational models of language acquisition. In particular, I’d like to focus on the issue of evaluating these models, as evaluation is key to accurate results, and these models have proven rather difficult to evaluate.

First of all, why would we want to use a computer to better our understanding of the complex brain? In short, it is possible (not necessarily easy) to examine the learning process used by a neural network, and the nicely formatted data that such networks output is perfect for examination and statistical evaluation. Furthermore, research has shown that large language models, and to some extent their smaller neural network counterparts, can actually be rather accurate representations of the human mind and its learning processes under the right conditions.

So, we create a complex neural network that is ready to learn language from its environment. We simulate that environment by annotating a corpus of data (e.g. CHILDES) in a format that accurately represents the information given to a child and feeding it to the model. The model learns to form grammatical sentences… or does it? The problem is that it’s rather difficult to define the level of grammaticality expected from a child at a given point in the learning process. Moreover, it’s difficult to define well-formedness in general, especially when the utterances are generated by a model and there is no annotated data set to which they can be compared. Many research teams have tried to solve this problem, and I’m going to highlight some of the most interesting solutions below.
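To make the setup above concrete, here is a minimal sketch in Python of a model learning word order from child-directed input. Everything here is an assumption for illustration: the utterances are invented stand-ins for real CHILDES data, and a simple bigram language model stands in for the far more complex networks actually used.

```python
import math
from collections import defaultdict

# Hypothetical child-directed utterances standing in for a CHILDES-style corpus.
CORPUS = [
    "you want the ball",
    "the ball is red",
    "you want more juice",
    "the dog is big",
]

def train_bigram_lm(sentences):
    """Count bigram transitions, including sentence-boundary markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def log_prob(counts, sentence, alpha=1.0, vocab_size=50):
    """Add-alpha smoothed log-probability of a sentence under the bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        lp += math.log((counts[prev][cur] + alpha) / (total + alpha * vocab_size))
    return lp

model = train_bigram_lm(CORPUS)
# A sentence built from attested transitions should outscore a scrambled one.
print(log_prob(model, "you want the dog") > log_prob(model, "dog the want you"))  # → True
```

Even this toy model illustrates the evaluation problem: it assigns every sentence *some* probability, so deciding where grammatical ends and ungrammatical begins still requires a threshold or a comparison, which is exactly what the methods below try to supply.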

In 2008, van Zaanen et al. proposed four methods of evaluation. The first, nicely titled “looks good to me,” is rather subjective but easy to implement. Another, “rebuilding a priori known grammars,” feeds author-created sentences into the model for evaluation, allowing for a predicted successful output but heavily limiting the grammar. The “language membership” method tests a system’s precision and recall against a test corpus, but that limits the so-called correct output to what the corpus contains, eliminating room for ambiguity. The fourth approach is similar to the third but uses treebanks, which is rather problematic for child language. All of these are good ideas, but none are perfect.
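The language membership idea can be sketched in a few lines. The test corpus and generated sentences below are invented for illustration; the mechanics are just set intersection.

```python
# Hypothetical held-out test corpus and model output for a "language membership" check:
# a generated sentence counts as correct only if it appears verbatim in the test corpus.
TEST_CORPUS = {"the dog runs", "a cat sleeps", "the cat runs"}
GENERATED = {"the dog runs", "the cat runs", "dog the runs"}

true_positives = len(GENERATED & TEST_CORPUS)
precision = true_positives / len(GENERATED)   # share of generated sentences attested in the corpus
recall = true_positives / len(TEST_CORPUS)    # share of corpus sentences the model produced

print(precision, recall)  # → 0.666... 0.666...
```

Note the limitation discussed above: a perfectly grammatical but unattested sentence would lower precision just as much as the genuinely ill-formed "dog the runs" does, since membership in the corpus is the only notion of correctness.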

Another solution was presented by Chang et al. in 2008. Referred to as sentence prediction accuracy (SPA), the method measures a model’s ability to correctly order the words of an unordered sentence. This overcomes some problems: it separates the evaluation from any particular language and focuses solely on grammar while ignoring underlying linguistic theory. Where the method breaks down, however, is with more complex sentences (its tests involved only short utterances with little ambiguity). As complexity increases, multiple sentence structures could be correct, yet all but one will be ruled out and marked wrong, rendering the method too strict.
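A minimal sketch of the SPA idea: the model under evaluation supplies some scoring function over word orders, and a sentence counts as correct only if the highest-scoring permutation of its bag of words matches the reference order exactly. The scorer below is an invented toy that just counts known bigrams, not anything from Chang et al.

```python
from itertools import permutations

def spa(score_fn, test_sentences):
    """Sentence prediction accuracy: fraction of sentences whose bag of words
    the scorer reassembles into exactly the reference order."""
    correct = 0
    for sent in test_sentences:
        words = sent.split()
        best = max(permutations(words), key=score_fn)
        correct += (list(best) == words)
    return correct / len(test_sentences)

# Hypothetical stand-in for a trained model's preferences.
GOOD_BIGRAMS = {("the", "dog"), ("dog", "barks"), ("a", "cat")}

def toy_score(order):
    return sum((a, b) in GOOD_BIGRAMS for a, b in zip(order, order[1:]))

print(spa(toy_score, ["the dog barks"]))  # → 1.0
```

The strictness critique is visible right in the code: `list(best) == words` admits exactly one ordering per bag of words, so any equally grammatical alternative order is scored as an error.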

Two other interesting methods were proposed, one by Brodskey et al. in 2007 and another by Kol et al. in 2009. The former trains two constrained models on different corpora under the assumption that they will learn to evaluate grammaticality differently, and then uses those models to probabilistically analyze the accuracy of the test sentences. The latter trains on the CHILDES data in a specific manner that more accurately replicates child language acquisition and then evaluates on a subset of the data. To account for over-generation, the model is also asked to evaluate sentences presented in reverse word order, a task at which it should ideally perform poorly.
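The reversed-sentence check can be sketched as follows. This is a hypothetical illustration, not the actual Kol et al. procedure: the scorer and bigram set are invented, and the idea is simply that a model which has genuinely learned word order should prefer each utterance read forward over the same utterance read backward.

```python
def reversal_check(score_fn, sentences):
    """Fraction of sentences the model scores higher forward than reversed.
    A model that over-generates (accepts everything) would score near 0.5 here."""
    prefers_forward = sum(
        score_fn(s.split()) > score_fn(s.split()[::-1]) for s in sentences
    )
    return prefers_forward / len(sentences)

# Hypothetical stand-in for a trained model's learned word-order preferences.
KNOWN_BIGRAMS = {("you", "want"), ("want", "juice"), ("the", "ball")}

def score(tokens):
    return sum((a, b) in KNOWN_BIGRAMS for a, b in zip(tokens, tokens[1:]))

print(reversal_check(score, ["you want juice", "the ball"]))  # → 1.0
```

The appeal of this design is that it needs no extra annotation: the reversed sentences are a free source of (near-certainly) ill-formed input, so good performance on them is evidence of over-generation rather than of learning.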

In summary, the challenge of evaluating computational models of language acquisition remains an open area for research, but many interesting methods have been proposed, and I am excited to see where this unique approach brings the study of language learning in the future.


Wintner, S. (2010, March). Computational models of language acquisition. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 86-99). Springer, Berlin, Heidelberg.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
