Modeling Language: Regular Expressions

Spring is finally here, and the warm weather and clear skies are a welcome change from the harsh winter snow. Today, in the spirit of overcoming the winter’s challenges, I’d like to discuss a common challenge faced by computational linguists, along with a fascinating solution.

One of the most common programming languages in the modern era is Python, as it needs no separate compilation step, it reads similarly to English, and it has many easily accessible machine learning libraries. One of the few downsides to using Python for computational linguistics is that its strings are immutable. This means that a string, a value representing a sequence of characters (e.g., “hello”), cannot be changed after it is created. Naturally, this causes inconveniences for linguists, as most analysis, translation, and manipulation tasks involve changing the contents of strings, and in Python every such change has to produce a new string.
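To make immutability concrete, here is a minimal sketch of what happens when you try to change a character in place. The variable names are my own, chosen only for illustration.

```python
word = "hello"

try:
    # Strings do not support item assignment, so this raises TypeError.
    word[0] = "j"
except TypeError as err:
    error_message = str(err)

# Building a new string works; the original is left untouched.
new_word = "j" + word[1:]

print(error_message)
print(word, new_word)
```

The original string never changes; the only option is to create a new one from pieces of the old.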

There are multiple ways to get around this limitation. The first and perhaps the simplest solution is to do your string manipulations in a list, an ordered collection of items that can be changed. However, this requires extra lines of code, takes up additional memory, and can slow down run times. These costs are negligible in small programs, but in large ones they add up, and the code becomes very difficult to read.
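As a rough sketch of the list approach, the hypothetical helper below (the name replace_char_with_list is mine, not a built-in) swaps one character for another by converting the string to a list, editing the list, and joining it back together.

```python
def replace_char_with_list(text, old, new):
    """Replace every occurrence of one character using a mutable list."""
    chars = list(text)               # copy the characters into a list
    for i, ch in enumerate(chars):
        if ch == old:
            chars[i] = new           # lists can be changed in place
    return "".join(chars)            # build a new string from the list

result = replace_char_with_list("banana", "a", "o")
print(result)
```

Even for a task this small, the conversion to a list and back adds several lines and an extra copy of the text.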

Thankfully, there is a better solution: regular expressions. Regular expressions are patterns, written as special strings, that describe the text you want to find in ordinary strings. They allow linguists to search for occurrences of a certain sequence of characters inside a larger string, substitute and remove certain characters, and split strings into lists of words or characters. Each of these operations returns a new string or list rather than altering the original. Furthermore, simple patterns are relatively easy to use and read. For a more detailed explanation of how they work, I highly recommend you check out the user-friendly documentation here: https://docs.python.org/3/howto/regex.html#regex-howto.
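Here is a short sketch of those three tasks (searching, substituting or removing, and splitting) using Python's built-in re module. The example sentence and variable names are made up for illustration.

```python
import re

sentence = "The cat sat on the mat, and the cat slept."

# Search: find every occurrence of the whole word "cat".
cats = re.findall(r"\bcat\b", sentence)

# Substitute: swap each "cat" for "dog" (this returns a new string).
swapped = re.sub(r"\bcat\b", "dog", sentence)

# Remove: delete anything that is not a word character or whitespace.
no_punct = re.sub(r"[^\w\s]", "", sentence)

# Split: break the cleaned sentence into a list of words.
words = re.split(r"\s+", no_punct)

print(cats)
print(swapped)
print(words)
```

Each task takes a single line, and the original sentence is left unchanged throughout.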

Using regular expressions reduces the length of programs, makes them easier to read, and makes computational modeling challenges in Python far easier to tackle. They are a wonderful tool and a vital part of any computational linguist’s arsenal.
