Regex Lookahead Party | Victor Bodell

There is a universe of design ideas and concepts in computer programming that I wish I knew and understood better. Some of them I indulge in through reading, trying to apply previous experience. Others I’m exposed to in the code of peers and colleagues, and still others I look forward to getting acquainted with. But there’s the occasional concept that I feel like I’m fairly good at. And one of them is Regex.

Granted, I cheat often, primarily via regex101. But a lot of the magical characters are known to me, I would assume through having used the vim substitute command for a number of years now. But this week gave me a new lesson: There’s nothing like being forced to apply the concepts in the wild. Nothing will force you to learn more than having an actual implementation to consider. Which we did this week at work!

Consider the following: At SR we have numerous journalists producing articles every day. Sometimes an extraordinary event happens and we want to consolidate those reports into a single article to track updates in close to real time. Consider the day of the Russian invasion of Ukraine. Or a tragic storm hitting the country. This is known as a Directreport. And we use them frequently at SR. The problem is that they’re reported in a system that doesn’t have innate support for the format. So journalists do what any good tinkerer would: They shoehorn. They reuse the formatting toolbox they have at hand, styling the text to make it look like multiple articles are embedded in a single one. A simplified UML diagram for a supported solution could look something like this:

    ---
    title: Directreport example
    ---
    classDiagram
        Directreport "1" --> "*" Article
        Article : +String text
        Directreport : +Article[] articles

But since the reality is closer to

    ---
    title: Directreport reality
    ---
    classDiagram
        class Directreport {
            +String content
        }

there’s only one way to actually structure the content into the grouping structure we want: Regex!

Admittedly, this stuff is not as scary as it used to be now that LLMs write most of the clunky syntax. But since the model made assumptions about a structure that turned out to be more complex (journalists aren’t exactly rigorous about jotting down the text according to a single defined structure) we had to put on the surgeon’s mask and get into the nitty gritty. Specifically, since we’re working with unfiltered HTML from the source system’s rich text editor, we’d see a lot of this stuff:

const r = /<p>([^<]+)<\/p>/;
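To make that pattern concrete, here is a small sketch of what it captures and where it gives up. The sample strings are made up for illustration and are not taken from the real editor output.

```javascript
// Same pattern as above: a <p> whose content contains no "<" at all.
const paragraph = /<p>([^<]+)<\/p>/;

// Plain text inside the tag: group 1 holds the paragraph text.
const plain = "<p>Storm warning issued</p>".match(paragraph);
const plainText = plain ? plain[1] : null;

// A nested tag breaks the match: [^<]+ stops at "<strong>",
// and the pattern then expects "</p>" right there.
const nested = "<p>Storm <strong>warning</strong></p>".match(paragraph);

console.log(plainText); // "Storm warning issued"
console.log(nested);    // null
```

The second case is the reason the simple pattern was not enough for us: any inline formatting inside the paragraph makes it fail.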

This is all pretty familiar territory. Verbatim HTML tags, capture groups, character classes, negated character classes. I.e. match the tag along with whatever’s inside it, as long as the content contains no < and therefore no nested tag.

Not to brag, but I also know that -E in grep enables “extended” regex so you don’t have to escape | or + (I guess they weren’t part of the original definition). Or the \v qualifier in vim for “very magic”, achieving the same effect.

But I also encountered this for the first time:

const r = /<p>([^<]+(?:<[^>]+\/>))<\/p>/;

And apparently the question mark after an opening parenthesis acts as a sort of qualifier, so that the next character defines specific behavior for the group. (?: specifically defines a non-capturing group. I.e. a group that still takes part in the match but isn’t stored as a numbered capture of its own. Useful when you need the parentheses for grouping or for a quantifier, but don’t want the extra entry in the results.

Another interesting feature is (?= denoting positive lookahead, i.e. it checks for the text but doesn’t consume it in the regex match, so the same text is still available to whatever comes next. This turned out to be useful for us because the initial delimiter wasn’t used that consistently, so we had to group the articles by their headline and consequently split the texts on the regex match for them while not actually consuming the headline itself in the match.
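A rough sketch of both ideas, under the assumption that every embedded article opens with an h2 headline. That markup is hypothetical; the real delimiter in our content is messier. The first half shows splitting with and without a lookahead, the second half shows what (?: changes in the result.

```javascript
// Hypothetical content: two embedded articles, each opening with an <h2> headline.
const directreport =
  "<h2>09:15 Roads closed</h2><p>Several routes are shut.</p>" +
  "<h2>10:40 Power restored</h2><p>Most homes have electricity again.</p>";

// Splitting on the tag itself consumes it, so "<h2>" vanishes from the pieces
// and an empty string appears in front of the first one.
const consumed = directreport.split(/<h2>/);

// A lookahead only asserts that "<h2>" comes next, so every piece keeps its headline.
const articles = directreport.split(/(?=<h2>)/);

// Capturing vs. non-capturing: the overall match is identical,
// only the list of numbered groups differs.
const sample = "<p>Intro<br/></p>";
const capturing = sample.match(/<p>([^<]+(<[^>]+\/>))<\/p>/);
const nonCapturing = sample.match(/<p>([^<]+(?:<[^>]+\/>))<\/p>/);

console.log(consumed.length, articles.length);       // 3 2
console.log(capturing.length, nonCapturing.length);  // 3 2
```

Note that the text matched by the non-capturing group is still there, both in the full match and inside group 1. It just doesn’t get a slot of its own.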

Inversely there’s the negative lookahead, (?!, which only succeeds when the text that follows does not match. That way you can trigger some alert based on the regex no longer matching because a keyword has been introduced. The website example for this makes the case that you don’t want the keyword “Error” to match on your website. I don’t really see why you can’t just use a normal regex and flag on the error match for this, but I guess if you want to be more fancy you would have other groups in the negative lookahead regex that you do want to match on and maintain. Actually this site explains it better.

Negative lookahead is indispensable if you want to match something not followed by something else.
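A small sketch of that sentence in practice. It assumes, purely for illustration, that headlines are paragraphs opening with a strong tag and that everything else is body text.

```javascript
// Hypothetical markup: the first paragraph is a bold "headline", the rest is body text.
const html =
  "<p><strong>09:15 Roads closed</strong></p>" +
  "<p>Several routes are shut.</p>" +
  "<p>More to follow.</p>";

// Match a <p> that is NOT followed by <strong>, then capture its plain text.
const bodyParagraph = /<p>(?!<strong>)([^<]+)<\/p>/g;

// matchAll needs the g flag; group 1 is the paragraph text.
const bodyTexts = [...html.matchAll(bodyParagraph)].map((m) => m[1]);

console.log(bodyTexts); // ["Several routes are shut.", "More to follow."]
```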

The same site further explains that lookaheads (and lookbehinds! Using the same syntax as above but adding < between ? and = or !) aren’t actually capture groups at all but simply what their names say: assertions that look ahead or behind without adding anything to the match.
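Here is a minimal sketch of the lookbehind variants, using made-up strings. It also shows that a lookaround adds no capture group to the result.

```javascript
// Positive lookbehind: a time that is preceded by "Updated ".
const line = "Updated 10:40 by the news desk";
const time = line.match(/(?<=Updated )\d{2}:\d{2}/);

// The prefix is checked but is neither part of the match nor captured.
console.log(time[0]);     // "10:40"
console.log(time.length); // 1, i.e. only the full match and no groups

// Negative lookbehind: digits that are NOT preceded by "y".
const digits = "x1 y2 z3".match(/(?<!y)\d/g);
console.log(digits); // ["1", "3"]
```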

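Final bonus point: (?s) enables DOTALL mode in flavors that support inline flags (PCRE and Python, for example); in JavaScript the same thing is done with the s flag after the closing slash. Consequently . will match even a newline in the pattern that follows. Man, regex is so much wilder than I thought. It’s like when you start to realize that folks are programming stuff in SQL…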
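Since the snippets in this post are JavaScript, here is a small sketch using its s flag rather than the inline (?s) form. The sample string is invented.

```javascript
// A paragraph whose text spans two lines.
const multiline = "<p>first line\nsecond line</p>";

// Without the s flag, "." refuses to cross the newline, so nothing matches.
const withoutFlag = /<p>(.+)<\/p>/.test(multiline);

// With the s (dotAll) flag, "." matches the newline as well.
const withFlag = multiline.match(/<p>(.+)<\/p>/s);

console.log(withoutFlag); // false
console.log(withFlag[1]); // "first line\nsecond line"
```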