Understanding regular expression parsing

05 Jul 2015

I was reading a wonderfull article on python regular expression. And I was playing with some example code.

import re

text = "aabbbaaababaab"
pattern = "abb"

re.findall(pattern, text)
# output: ['abb']
# pattern 'abb' found at index 1:3

then I got stuck with following code. I couldn’t understand why the output is the way it is.

import re

text = "aabbbaaababaab"
pattern = "ab+" # a followed by one or more b

re.findall(pattern, text)
# output: ['abbb', 'ab', 'ab', 'ab']
# wait.. Where is "abb"?

The pattern ab+ means ‘a’ followed by one or more ‘b’. So “abb” should also be present in output. But it was not!!

So I tried to come up with simple logic to understand how regular expression parsing takes place to produce the output. And here it is!

How regular expression parses the pattern

text - Actual text to search for(input)

pattern - What to search for

(say, in a give log file(text) find all ip address(pattern))

Steps

Start from the beginning of the ‘text’
Start looking for ‘pattern’ in ‘text’
If fails at initially, move to next char in ‘text’ and proceed from step 2
If passes, then parse character by character until fails. print the passed characters. Move to next char and proceed from step 2
If reached end of ‘text’, print the passed characters and stop

Good news!

Now I know why the output doesn’t contain “abb”.

Its because of step 4 of parsing. When the parsing “abb” is done, its still ‘passing’ the pattern rules so it continous to “abbb”. And now next character is ‘a’ and pattern ab+ fails. So parsing stops and prints “abbb” then continue its parsing from next character ‘a’