Using regular expressions to learn a language

Regular expressions are cool creatures, but I've mostly been avoiding anything but the most basic ones, thinking they are more trouble than they are worth. Last week I found myself needing them when I was going through a largish code base. I ended up solving two problems, as well as ending up with two new problems.

The first problem came up when I was learning about traits (similar to interfaces, protocols or abstract classes in many other languages) in Scala. I wanted to find all traits in a code base, but my search also printed a lot of "sealed trait"s (traits that can only be extended within the file where they are defined). After some googling I found out about regular expression look-behind, which checks the text immediately before a candidate match. The negative look-behind used here rejects every match that is preceded by "sealed". This turns up only the non-sealed " trait"s, which was exactly what I wanted.

ack '(?<!sealed) trait'
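To see what the look-behind does, here is a minimal sketch of the same pattern in Python's `re` module. The sample lines and the helper name `find_traits` are made up for illustration. It also shows one limitation of the pattern as written: it requires a space before "trait", so a trait declared at the very start of a line is skipped. The word-boundary variant is just one possible way around that, not something from the ack manual.

```python
import re

# Hypothetical sample lines, invented for this illustration.
SAMPLE_LINES = [
    "sealed trait Shape",
    "trait Drawable {",
    "private trait Helper",
    "  sealed trait Color",
    "  trait Inner",
    "final case class Circle(r: Double) extends Shape",
]

# The pattern from the ack command: " trait" not immediately preceded by "sealed".
ORIGINAL = re.compile(r"(?<!sealed) trait")

# Assumed variant: drop the required leading space and use word boundaries,
# so that a line starting with "trait" is found as well.
WORD_BOUNDARY = re.compile(r"(?<!sealed )\btrait\b")


def find_traits(pattern, lines):
    """Return the lines in which `pattern` matches somewhere."""
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    for name, pattern in [("original", ORIGINAL), ("word boundary", WORD_BOUNDARY)]:
        print(name)
        for line in find_traits(pattern, SAMPLE_LINES):
            print("   ", line)
```

Both patterns skip the two sealed traits; only the variant also reports `trait Drawable {`.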

The second problem was learning about pattern matching in Scala. It was less about regular expressions per se and more about useful flags to ack (like grep but with a better interface for code search). I wanted to find some examples of `match` being used, but the results showed no context beyond the matching line, and it was getting annoying to constantly open the file in question. Looking in the manual of ack (I used grep to find what I wanted, out of habit: `man ack | grep context`!) I found the neat flags -A (after lines), -B (before lines) and -C (context, both before and after) that print the surrounding n lines. All I had to do was write something like the following to get a good idea of how "match" worked in several instances.

ack --scala -C 5 "match "

Pretty useful stuff.
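For the curious, the idea behind context flags is simple enough to sketch in a few lines of Python. This is not how ack implements it, just an illustration of "print the n lines around each hit"; the function name `grep_with_context` and the sample Scala snippet are invented for the example.

```python
def grep_with_context(lines, needle, context=2):
    """Return (line_number, text) pairs for each line containing `needle`,
    plus `context` lines before and after it. Overlapping windows are merged."""
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    # Line numbers are reported 1-based, as grep-like tools do.
    return [(i + 1, lines[i]) for i in sorted(keep)]


# Hypothetical Scala snippet, split into lines.
SAMPLE = [
    "def describe(x: Any): String =",
    "  x match {",
    '    case 0 => "zero"',
    '    case _ => "other"',
    "  }",
    "",
    "val unrelated = 1",
]

if __name__ == "__main__":
    for number, text in grep_with_context(SAMPLE, "match ", context=1):
        print(f"{number}: {text}")
```

With `context=1` this keeps the line above and the line below the hit, which is the same kind of window `-C 1` asks for.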

Enamored with my newfound interest in searching through code bases, I immediately started thinking of other things I wanted to do. Here are two that I didn't find any obvious way to solve, mostly because I couldn't figure out how to match across multiple lines. Maybe someone reading this knows how to do them?

  1. How would you find all functions longer than 10 lines? (Defined as either (defn foo ...) or just as a block with surrounding empty lines).
  2. How would you find all instances of `match` with everything between { and } being shown? (i.e. variable context)

1 would be useful in finding functions likely to be in need of refactoring, at least in Clojure code. For 2 I was hoping something like this was going to work:

ack --scala "match {(.|[\r\n])*\}"

but it and a couple of permutations thereof produced no matches.
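A likely reason is that ack, as far as I can tell, matches one line at a time, so a pattern that has to span line breaks never gets a chance. One possible workaround is to read the whole file and walk the braces yourself. The following is a minimal sketch in Python, not an ack feature. The function name `extract_match_blocks` and the sample source are made up, and it assumes braces inside strings or comments don't throw the count off.

```python
import re

# Finds the start of a block: the word "match" followed by an opening brace.
MATCH_OPEN = re.compile(r"\bmatch\s*\{")


def extract_match_blocks(source):
    """Return every `match { ... }` block in `source`, nested braces included.

    Assumption: braces inside string literals or comments are rare enough
    to ignore; they would confuse the depth counter.
    """
    blocks = []
    for m in MATCH_OPEN.finditer(source):
        depth = 0
        # m.end() - 1 is the position of the opening brace itself.
        for pos in range(m.end() - 1, len(source)):
            ch = source[pos]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    blocks.append(source[m.start():pos + 1])
                    break
    return blocks


# Hypothetical Scala source used only for the demonstration.
SAMPLE = """\
def describe(x: Any): String = x match {
  case n: Int if n > 0 => { "positive" }
  case _ => "other"
}
val unrelated = 1
"""

if __name__ == "__main__":
    for block in extract_match_blocks(SAMPLE):
        print(block)
        print("----")
```

Counting braces rather than using a single regular expression means the nested `{ "positive" }` does not end the block early, which is where a non-greedy `.*?\}` would go wrong.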

I have three takeaways from this weekend's foray into regular expressions.

  1. Regular expressions can be extremely useful.
  2. They are a rabbit hole.
  3. Real code with context is great for understanding both a language and a code base.

Read about Regex look-behind: http://www.regular-expressions.info/lookaround.html

Get Ack: http://beyondgrep.com/

Discuss this post on Hacker News: https://news.ycombinator.com/item?id=7830722