Thursday, 2 December 2010

Operator "or" in Python's regular expressions is not greedy

This is what I learned today:

>>> import re
>>> re.findall(r'(a|ab)', 'ab')
['a']

I always thought that regular expressions in Python are greedy, but it seems I was mistaken. From the Python manual (quite hidden, but still explicitly written):

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. [...] As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.

This means that reversing the order of items in parentheses produces the desired effect:

>>> re.findall(r'(ab|a)', 'ab')
['ab']

So if you have a list of strings and you want to find all occurrences of strings in this list in a (potentially large) text, you could easily create a regular expression from them and use it for searching:

regexp = re.compile(
ur'(%s)' % u'|'.join(re.escape(name) for name in names),
re.UNICODE | re.MULTILINE)

The problem is that if you have two names, e.g. "John" and "John Smith" and "John" comes before "John Smith" in the names list, "John Smith" will never be found, only "John".

To fix this, you can sort names in reverse order based on their length:

names.sort(key=lambda n: len(n), reverse=True)

This would work, but looks like a hack and anyway, regular expressions in the form (John|John Smith|...) could be improved. How about making a simple prefix tree and generating the regular expression from it? Like this one:

class PrefixTree(object):
'''
Prefix tree for generating optimal regular expressions.
'''
def __init__(self):
self.children = {}
self.terminal = False

def add(self, word):
if not word:
self.terminal = True
return
firstLetter, rest = word[0], word[1:]
if firstLetter not in self.children:
self.children[firstLetter] = PrefixTree()
self.children[firstLetter].add(rest)

def toRegex(self):
children = self._getRegexOfChildren()
if not self.terminal:
assert self.children
if self.terminal:
if children:
return '(%s)?' % children
else:
return ''
elif len(self.children) > 1:
return '(%s)' % children
else:
return children

def _getRegexOfChildren(self):
childkeys = sorted(self.children.keys())
return '|'.join(k + self.children[k].toRegex()
for k in childkeys)

Basic usage:

>>> pt = PrefixTree()
>>> pt.add('John')
>>> pt.add('John Smith')
>>> print pt.toRegex()
John( Smith)?

On a side note, regular expressions created from the prefix tree are twice as fast as the previous ones, created simply by joining all possible names.

Monday, 15 November 2010

iBooks "Dictionary not available for this language" fix

There is an irritating bug in the newest version of iBooks for iPhone (and probably also for iPad): if you try to use the built-in dictionary on English books that were converted (e.g. by Calibre) without the proper language set, you'll get an error:
Dictionary not available for this language
The culprit here is the metadata file of the ePub archive, content.opf. More specifically, it's an XML document that has a line corresponding to the language the book is supposed to be written in. The line should look like this:
<dc:language>en</dc:language>
But it often looks like this:
<dc:language>UND</dc:language>
I wrote a small script, epub-dict-fix, to batch-fix ePub files and assign a language (English by default) to them. You can get it from its Google Code homepage. I wrote it mainly for myself, but I hope it will be useful to someone else, too. Oh, and you need Python installed for it to work.

Tuesday, 12 October 2010

Subclassing frozenset

I encountered a strange "feature" of Python standard library a few days ago. I needed to subclass frozenset (an immutable set that is a part of Python standard library) to add some validation check on the contents in the constructor. My first approach looked like this:
def validate(iterable):
print 'validation:', iterable # complex validation here


class MyFrozenset(frozenset):

def __init__(self, iterable):
listed = list(iterable)
validate(listed)
frozenset.__init__(self, listed) # ***

if __name__ == '__main__':
fs = MyFrozenset([0, 1, 2, 3, 4])
print fs
Note that I took precautions and converted iterable to list to avoid consuming a generator before it reached frozenset's __init__. The code worked, but I received a strange warning:
validation: [0, 1, 2, 3, 4]
blog.py:17: DeprecationWarning: object.__init__() takes no parameters
frozenset.__init__(self, listed)
MyFrozenset([0, 1, 2, 3, 4])
It seemed as if frozenset didn't override __init__ at all! And true enough, removing the call to the parent constructor (marked with *** in the code) removed the warning while the code still worked!

I started wondering how it is possible for the set to be filled without passing the iterable to the parent. I made another check, but used a generator as argument instead:
def get_five():
for i in xrange(5):
yield i

if __name__ == '__main__':
fs = MyFrozenset(get_five())
print fs
Based on the previous warning, the results were not very surprising:
validation: []
MyFrozenset([0, 1, 2, 3, 4])
It seems that the generator had already been consumed before entering my __init__ method! I then remembered that there is another object initialization method you can call beside __init__: the __new__ method! I quickly checked Python help and found:
__new__() is intended mainly to allow subclasses of immutable types (like int, str, or tuple) to customize instance creation.

It seems this is exactly what I was looking for! I quickly changed __init__ to __new__:
class ProperFrozenset(frozenset):

def __new__(cls, iterable):
listed = list(iterable)
validate(listed)
return frozenset.__new__(cls, listed)

if __name__ == '__main__':
fs = ProperFrozenset(get_five())
print fs
And it worked:
validation: [0, 1, 2, 3, 4]
ProperFrozenset([0, 1, 2, 3, 4])
I'm not sure why it's done this way in Python, but I guess it's for consistency: __init__ is a method that is called after an object has been created, so if the object is immutable, no method should be able to modify it. __new__, on the other hand, is called in the class context before the desired instance is created, so manipulation of arguments is still possible.

One mystery remains: if initialization of frozenset is done in its __new__ method, why do you still receive the iterable argument if you override __init__ and why not having an overridden __init__ doesn't raise a warning?

Sunday, 21 June 2009

Blog Name Change

The previous post wasn't exactly about Design Patterns or Python, so I decided to change the blog name. user_entitties is a name of a MySQL table that existed in a large project I was working on. It was there for about two years before someone noticed the funny spelling mistake. So, welcome to user_entitties! I hope you like it.

Show Photos of Two People in Picasa Web Albums

If you're a user of Picasa Web Albums, you probably know about the awesome feature of showing all photos that include a selected person. What you may not know is that, with a little tweak, you can show photos that include two (or more) people you chose. This is great e.g. for creating an album with only pictures of you and your friend, then creating a nice collage and making a great birthday gift!


So, how do you create such a filter? Notice that, when you view photos of one person, the URL of the webpage looks like this:



http://picasaweb.google.com/lh/view?uname=<username>&isOwner=true&subjectids=<long hash>#.

I don't think this is documented anywhere, but the subjectids parameter is actually a comma-separated list -- you just don't see it, because it has only one element. So, all you have to do is copy the <long hash> part from views of two (or more) different persons and combine it into one URL:



http://picasaweb.google.com/lh/view?uname=<username>&isOwner=true&subjectids=<long hash 1>,<long hash 2>#

And that's it! You can then easily download all photos in the album by opening its RSS feed (there's a small RSS link in the lower-right corner of the website) and downloading all photos, e.g. with FlashGot Firefox Extension. Here's how the header of three-people filter looks like:


Picasa Web Albums three persons filter

Monday, 15 October 2007

Global Quiz

For a break, let's make a quiz. Look at the following code and try to guess what the output will be:
queue = ['one']

def f():
print queue
queue = ['one', 'two']
print queue

f()

Select the text below for a correct answer:
Traceback (most recent call last):
File "quiz.py", line 8, in
f()
File "quiz.py", line 4, in f
print queue
UnboundLocalError: local variable 'queue' referenced before assignment

Surprised? I certainly was.
Well, what about this one:
queue = ['one']

def f():
print queue
queue.append('two')
print queue

f()

This time, the script behaves as expected:
['one']
['one', 'two']

To achieve the same result in the first example, a global statement must be added at the beginning of f():
queue = ['one']

def f():
global queue
print queue
queue = ['one', 'two']
print queue

f()

The most interesting thing is what happens now:
queue = ['one']

def f():
print queue

f()

We get:
['one']
without the global statement!
For the explanation of this strange behaviour, let's get back to the first example:
queue = ['one']

def f():
print queue
queue = ['one', 'two']
print queue

f()
It seems that f() treats the queue variable in the first print queue statement already as a local variable, although assignment to it is done no earlier than in the second line. This becomes reasonable if we realise that a Python method "knows" a list of its local variables before execution. This list is prepared when parsing the function definition and before the function is called. If you want to test it, you can write:
print f.func_code.co_varnames
before calling f() in the preceding examples. You'll notice that the output is () if we use the global statement or don't assign to queue inside the function, and (queue,) otherwise. An assignment to a variable which is not an argument of the function is treated as initialisation of a local variable -- hence the UnboundLocalError.

I encountered this error today while searching for a cause of a very strange bug. It would have been very hard to find if I hadn't run my favourite static code-checker tool: PyLint. I encourage everyone to use it, and it integrates well with PyDev, too. PyLint, when encounters the buggy code from the first example, throws a very nice warning:
W0621:  5:f: Redefining name 'queue' from outer scope (line 1)
...and everything becomes obvious.

Thursday, 11 October 2007

Reflection: hasattr

This is my first of, I hope, many articles about reflection in Python. In general, we are speaking of reflection when the program has the ability to "observe" and possibly to modify its own structure and behaviour (from Wikipedia). Scripting languages often offer many ways the program code can be observed and modified. Reflection gives the programmer a very powerful tool, but it doesn't come without a price. If reflection is used in a careless way, it can lead to creation of code that is unreadable and unmanageable. It quickly turns out that adding a feature or fixing a bug in such a code requires adding more and more "bad" reflection. In my articles, I'll try to summarise positive and negative impacts of various reflection uses in Python.

Let's start from the most basic reflection function of Python that almost everyone uses: the hasattr() built-in. The most common usage of this function I encounter is something like:

if hasattr(obj, "hello_world"):
obj.hello_world()
else:
# do nothing or throw exception

From the readability's point of view, this is a perfectly valid piece of code. However, if you look at Python's source code, you'll notice that hasattr() is implemented in a peculiar way (code from Python-2.5.1\Python\bltinmodule.c):

static PyObject * builtin_hasattr(PyObject *self, PyObject *args) {
// [CUT]
v = PyObject_GetAttr(v, name);
if (v == NULL) {
// [CUT]
return Py_False;
}
// [CUT]
return Py_True;
}

Can you see it? hasattr() simply calls getattr() and checks if it didn't throw an exception! So the presented Python code could be written without using hasattr() at all:

try:
hello_world = obj.hello_world
except AttributeError:
# do nothing or throw exception
else:
hello_world()

The main benefit is that we don't have to use the string literal (the second parameter of hasattr()) anymore, which we always forget to change when we rename our methods.

The local variable looks a bit ugly, but you can get rid of it at the cost of running a second getattr() -- a small price for readability:

try:
obj.hello_world
except AttributeError:
# do nothing or throw exception
else:
obj.hello_world()


Some may ask: Why would I use a complicated try..except..else clause when I could simply write:

try:
obj.hello_world()
except AttributeError:
# do nothing or throw exception

Well, you could, if you are sure that hello_world() will never throw AttributeError, because when it does, you will not notice it -- your code will understand that obj doesn't have the hello_world attribute. Still, this is a good solution in many cases.

The thing I really dislike about hasattr() is that you can pass anything as a second argument. Even if you use a constant, things start to look ugly:

if hasattr(obj, HELLO_WORLD):
obj.hello_world()

Here, if you would like to change the name of hello_world method, it would be even harder for you to remember to also change the constant.

If you have to use constants, a slightly better idea would be:

if hasattr(obj, HELLO_WORLD):
getattr(obj, HELLO_WORLD)()

But then again, more elegant solution is:

try:
hello_world = getattr(obj, HELLO_WORLD)
except AttributeError:
pass
else:
hello_world()

or, if you are sure that inner AttributeError won't be thrown:

try:
getattr(obj, HELLO_WORLD)()
except AttributeError:
pass


But the most horrible examples I've seen look like this:

method_name = calculate_method_name_or_take_it_from_some_place()
if hasattr(obj, method_name):
getattr(obj, method_name)()

Apart from very few special cases when this kind of behaviour is acceptable, you should avoid it at all costs. I found myself reverting repository commits too many times after having removed a method I was sure no one used.

To summarise, I think that hasattr(), when passed a string literally written in code, is an example of "good" reflection, although it can be replaced with try..except clauses. Passing anything but a literal string or an easy-accessible constant to it should be usually considered improper and avoided or refactorised.