Understanding regular expression parsing
I was reading a wonderful article on Python regular expressions and playing with some of its example code.
Then I got stuck on the following code. I couldn’t understand why the output was the way it was.
The pattern ab+ means ‘a’ followed by one or more ‘b’. So “abb” should also be present in the output. But it was not!
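The snippet I was playing with isn't shown here, so this is a minimal sketch that reproduces the puzzle. The input string "abbbaab" is my own assumption, chosen only because it shows the same behaviour; re.findall is the standard library call.

```python
import re

# Assumed sample input (not the text from the article I was reading).
text = "abbbaab"

# 'a' followed by one or more 'b'
matches = re.findall(r"ab+", text)
print(matches)  # ['abbb', 'ab']
```

Note that “abb” is nowhere in the list, even though it fits the pattern.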
So I tried to come up with a simple logic to understand how regular expression parsing produces this output. And here it is!
How a regular expression looks for the pattern
text - The actual text to search in (input)
pattern - What to search for
(Say, in a given log file (text), find all IP addresses (pattern).)
Steps
- Start from the beginning of the ‘text’.
- Start looking for the ‘pattern’ in the ‘text’ at the current character.
- If it fails right away, move to the next character in the ‘text’ and proceed from step 2.
- If it passes, keep parsing character by character until it fails. Print the characters that passed, then continue from the character right after them and proceed from step 2.
- If the end of the ‘text’ is reached, print any characters that passed and stop.
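The steps above can be sketched in a few lines of Python. This is only my simplified model, hard-coded for the pattern ab+. The function name find_ab_plus and the sample input are made up for illustration, and a real regex engine is far more general than this.

```python
def find_ab_plus(text):
    """Hand-rolled scan for the pattern ab+ (illustration only)."""
    matches = []
    i = 0  # step 1: start from the beginning of the text
    while i < len(text):
        # Step 2: the pattern needs an 'a' followed by at least one 'b'.
        if text[i] == "a" and i + 1 < len(text) and text[i + 1] == "b":
            j = i + 1
            # Step 4: keep consuming 'b' for as long as the pattern still passes.
            while j < len(text) and text[j] == "b":
                j += 1
            matches.append(text[i:j])  # "print" the passed characters
            i = j  # continue right after the matched characters
        else:
            # Step 3: failed right away, so move to the next character.
            i += 1
    # Step 5: the end of the text is reached.
    return matches


print(find_ab_plus("abbbaab"))  # ['abbb', 'ab']
```

For this input the sketch returns the same list as re.findall(r"ab+", "abbbaab").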
Good news!
Now I know why the output doesn’t contain “abb”.
It's because of step 4 of the parsing. When “abb” has been parsed, the text is still ‘passing’ the pattern rules, so the parsing continues to “abbb”. Now the next character is ‘a’, and the pattern ab+ fails. So the parsing stops, prints “abbb”, and then continues from the next character, ‘a’. (In regex terms, + is greedy: it takes as many ‘b’ characters as it can.)
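To see where each match starts and stops, re.finditer gives the positions as well as the matched text. Again, the input string "abbbaab" is an assumed example, not the one from the article.

```python
import re

text = "abbbaab"  # assumed sample input

# Collect (start, end, matched text) for every match of ab+.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"ab+", text)]

for start, end, matched in spans:
    print(start, end, matched)
# 0 4 abbb
# 5 7 ab
```

The first match runs from index 0 up to index 4, where the next ‘a’ makes the pattern fail, and the search picks up again from that ‘a’.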