Thursday, 2 December 2010

Operator "or" in Python's regular expressions is not greedy

This is what I learned today:

>>> import re
>>> re.findall(r'(a|ab)', 'ab')
['a']

I always thought that regular expressions in Python are greedy, but it seems I was mistaken. From the Python manual (quite hidden, but still explicitly written):

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. [...] As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.

This means that reversing the order of items in parentheses produces the desired effect:

>>> re.findall(r'(ab|a)', 'ab')
['ab']

So if you have a list of strings and you want to find all occurrences of strings in this list in a (potentially large) text, you could easily create a regular expression from them and use it for searching:

regexp = re.compile(
ur'(%s)' % u'|'.join(re.escape(name) for name in names),
re.UNICODE | re.MULTILINE)

The problem is that if you have two names, e.g. "John" and "John Smith" and "John" comes before "John Smith" in the names list, "John Smith" will never be found, only "John".

To fix this, you can sort names in reverse order based on their length:

names.sort(key=lambda n: len(n), reverse=True)

This would work, but looks like a hack and anyway, regular expressions in the form (John|John Smith|...) could be improved. How about making a simple prefix tree and generating the regular expression from it? Like this one:

class PrefixTree(object):
'''
Prefix tree for generating optimal regular expressions.
'''
def __init__(self):
self.children = {}
self.terminal = False

def add(self, word):
if not word:
self.terminal = True
return
firstLetter, rest = word[0], word[1:]
if firstLetter not in self.children:
self.children[firstLetter] = PrefixTree()
self.children[firstLetter].add(rest)

def toRegex(self):
children = self._getRegexOfChildren()
if not self.terminal:
assert self.children
if self.terminal:
if children:
return '(%s)?' % children
else:
return ''
elif len(self.children) > 1:
return '(%s)' % children
else:
return children

def _getRegexOfChildren(self):
childkeys = sorted(self.children.keys())
return '|'.join(k + self.children[k].toRegex()
for k in childkeys)

Basic usage:

>>> pt = PrefixTree()
>>> pt.add('John')
>>> pt.add('John Smith')
>>> print pt.toRegex()
John( Smith)?

On a side note, regular expressions created from the prefix tree are twice as fast as the previous ones, created simply by joining all possible names.

0 comments: