Showing posts with label python. Show all posts

Wednesday, November 21, 2007

Frequency Counting in Python

Let's say you need to keep track of the frequency count of arbitrary items, for instance when parsing a log file, or when doing a word count to see which terms come up most often.

Python lets you subclass the built-in dict class (a hash table). An extension to aid in frequency counting is listed below. It's nothing particularly special, but it's simple, it works, and it's fast. Enjoy!

class dictcount(dict):

    def add(self, key, value=1):
        self[key] = self.get(key,0) + value

    def sum(self):
        return sum(self.itervalues())

    def sortByValue(self, reverse=True):
        return sorted(self.iteritems(),
                      key=lambda (k,v): (v,k),
                      reverse=reverse)

Sample run:

$ python
>>> d = dictcount()
>>> d.add('foo')
>>> d.add('foo')
>>> d.add('bar', 2)
>>> d.add('bar', 3)
>>> d.sum()
7
>>> d.sortByValue()
[('bar', 5), ('foo', 2)]

Of course all the regular dict methods are available too.
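
The class above is Python 2 code (itervalues, iteritems, and the tuple-unpacking lambda are gone in Python 3). If you are on Python 3, a rough port might look like the sketch below. It keeps the same method names; the Python 3 target is my assumption, not something the code above was written for.

```python
# A Python 3 port of the dictcount idea (a sketch, same method names as above).
class dictcount(dict):

    def add(self, key, value=1):
        # missing keys start at 0
        self[key] = self.get(key, 0) + value

    def sum(self):
        # inside the method, the bare name sum is still the builtin
        return sum(self.values())

    def sortByValue(self, reverse=True):
        # sort on (value, key) so ties on value are broken by key
        return sorted(self.items(),
                      key=lambda kv: (kv[1], kv[0]),
                      reverse=reverse)


if __name__ == "__main__":
    d = dictcount()
    d.add('foo')
    d.add('foo')
    d.add('bar', 2)
    d.add('bar', 3)
    print(d.sum())
    print(d.sortByValue())
```

Newer Pythons also ship collections.Counter, which covers much of the same ground (its most_common method returns items sorted by count), so a subclass like this is mostly useful when you want the exact tie-breaking shown here.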

Sunday, November 18, 2007

Whoops, I forgot the optimization flag

So it turns out that if you use MacPorts on Mac OS X 10.5, python2.5 is compiled without any optimization. This makes it run at least 2x slower than the version in /usr/bin (the bug report is here).

I only noticed this because I'm a nerd: I turned on the verbose/debug flags in MacPorts and thought it was weird that I didn't see the usual -O2. I then wrote a minor benchmark to confirm that the MacPorts version is indeed slower.

Which reminds me of a major screwup of mine: pushing debug code live instead of the normal optimized version (in that case it was C++ code). I think it took 6 hours of many people's time to "undo", since all the servers were on fire and end customers were pissed off.

Besides turning on all compiler warnings, besides running your unit tests, besides running your integration tests, you also need some type of minor performance test to catch these problems. Even the simplest test will catch these types of problems. And it's not just for C++: even if you are writing in a scripting language, the packager of the language can screw up, or use a different optimization level between builds, or use a different compiler, or just not use the best optimization possible.

In the MacPorts example, it appears that some tweaks were needed to make python compile on the new OS. In the process of fixing that, the optimizer flags were silently ignored. Nothing was "removed". I've done the same thing fiddling with Makefiles: I didn't remove optimization, I just screwed up some rules, so the wrong flags were used. In other words, it's really easy to do this.
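
You don't have to read build logs to spot this. On a Unix-style build, Python records the compiler flags it was built with, so a rough check is possible from inside the interpreter. The sketch below is Python 3 and uses sysconfig.get_config_var("CFLAGS"); the helper names are made up for illustration, and the "last -O flag wins" rule is how GCC treats repeated -O options. Treat it as a heuristic, not proof.

```python
import re
import sysconfig


def optimization_flags(cflags):
    """Return the -O style flags found in a compiler flag string, in order."""
    return re.findall(r"(?<!\S)-O(?:fast|[0-3sgz])?(?!\S)", cflags or "")


def looks_optimized(cflags):
    """True if the last -O flag (the one GCC honors) is something other than -O0."""
    flags = optimization_flags(cflags)
    return bool(flags) and flags[-1] != "-O0"


if __name__ == "__main__":
    # CFLAGS can be None on platforms without a Makefile-based build
    cflags = sysconfig.get_config_var("CFLAGS")
    print(cflags)
    print(looks_optimized(cflags))
```

A flag check like this only tells you what the build was asked to do, which is why a timing test is still worth having.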

Have a performance smoke test
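
Here is roughly what "the simplest test" could look like, as a Python 3 sketch built on the standard timeit module. The function names, the stand-in workload, and the 5 second budget are all placeholders I picked for illustration; you would substitute a hot path from your own application and a budget a few times larger than a known-good run.

```python
import timeit


def perf_smoke_test(func, budget_seconds, number=1000, repeat=3):
    """Time `number` calls of func, `repeat` rounds; pass if the best round fits the budget."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best, best <= budget_seconds


def workload():
    # stand-in for the hot path of your application
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    best, ok = perf_smoke_test(workload, budget_seconds=5.0)
    print("best round: %.4fs, %s" % (best, "ok" if ok else "TOO SLOW"))
```

The budget is deliberately loose. The point is not to measure small regressions but to catch the "2x slower because the optimizer was off" class of mistake before it goes live.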

Fast datetime parsing in python

So you need to parse a date-time string in a log file a zillion times? For instance, the Apache log has this lovely time format: 08/Nov/2007:04:05:07

The python docs say use datetime and strptime like this:

import datetime
import time
def apachetime_slow(line):
    return datetime.datetime(*time.strptime(line, "%d/%b/%Y:%H:%M:%S")[0:6])

Short, ugly, and slow. In my application, 25% of my runtime was this function! The Apache time format is gross, but it is fixed length. This means you can use string slices to get the individual bits. The only "trick" is mapping the month abbreviations to a number.

import datetime

# month abbreviation -> month number
month_map = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def apachetime(s):
    # fixed-width format, e.g. 08/Nov/2007:04:05:07
    return datetime.datetime(int(s[7:11]), month_map[s[3:6]], int(s[0:2]),
         int(s[12:14]), int(s[15:17]), int(s[18:20]))

On my box this is a full 10x faster! Enjoy!
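
With a hand-rolled parser it is easy to get one table entry or one slice offset wrong, so it is worth checking it against strptime on a few samples before trusting it. The sketch below is Python 3; the helper names and sample strings are mine, the month table is built from a list so no month can be skipped, and the strptime comparison assumes an English/C locale for %b.

```python
import datetime
import timeit

# month abbreviation -> month number, built from an ordered list
MONTHS = dict((name, i) for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1))


def apachetime(s):
    # fixed-width format: DD/Mon/YYYY:HH:MM:SS
    return datetime.datetime(int(s[7:11]), MONTHS[s[3:6]], int(s[0:2]),
                             int(s[12:14]), int(s[15:17]), int(s[18:20]))


def apachetime_strptime(s):
    # reference implementation; %b depends on the current locale
    return datetime.datetime.strptime(s, "%d/%b/%Y:%H:%M:%S")


def agree(samples):
    """True if both parsers return the same datetime for every sample."""
    return all(apachetime(s) == apachetime_strptime(s) for s in samples)


if __name__ == "__main__":
    samples = ["08/Nov/2007:04:05:07", "30/Sep/2007:23:59:59", "01/Oct/2007:00:00:00"]
    print(agree(samples))
    for f in (apachetime, apachetime_strptime):
        t = timeit.timeit(lambda: f(samples[0]), number=20000)
        print("%s: %.3fs" % (f.__name__, t))
```

The timing numbers will vary by machine and Python version, so measure on your own box rather than relying on my 10x.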

Monday, November 12, 2007

Sorting a python dict by value

Don't search for "python dict sort by value" since you'll get outdated answers. As of python 2.4, the "right" way to do this is:

alist = sorted(adict.iteritems(), key=lambda (k,v): (v,k))

To get the reverse order, add a reverse=True argument.

This is the fastest way to do this and it uses the least amount of memory. Enjoy.

>>> adict = {'first':1, 'second':2,'third':3, 'fourth': 4}
>>> adict
{'second': 2, 'fourth': 4, 'third': 3, 'first': 1}
>>> sorted(adict.iteritems(), key=lambda (k,v):(v,k))
[('first', 1), ('second', 2), ('third', 3), ('fourth', 4)]
>>> sorted(adict.iteritems(), key=lambda (k,v):(v,k), reverse=True)
[('fourth', 4), ('third', 3), ('second', 2), ('first', 1)]
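
The one-liner above is Python 2 only: Python 3 dropped iteritems and tuple unpacking in lambda arguments. A Python 3 equivalent, as a sketch, indexes into the (key, value) pair instead, or uses operator.itemgetter to build the same (value, key) sort key.

```python
from operator import itemgetter

adict = {'first': 1, 'second': 2, 'third': 3, 'fourth': 4}

# sort on (value, key); kv is a (key, value) tuple from items()
by_value = sorted(adict.items(), key=lambda kv: (kv[1], kv[0]))

# itemgetter(1, 0) builds the same (value, key) tuple without a lambda
by_value_desc = sorted(adict.items(), key=itemgetter(1, 0), reverse=True)

if __name__ == "__main__":
    print(by_value)
    print(by_value_desc)
```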

Wednesday, August 29, 2007

PNG metadata from the command line, again

You may have noticed that ImageMagick's identify -verbose output is huge, and if the key or data part is a bit long, the formatting is all wacky. Here's a quick and dirty python script that uses PIL to print PNG metadata to the command line. It's simple enough that even if you don't know python you should be able to hack it to do what you want. Save this file as pngmeta, and then do a chmod a+x pngmeta. Then it should work just like any other shell command, e.g. ./pngmeta file1 file2 file3.... This might work with other image types as well.

#!/usr/bin/env python

# public domain, nick galbreath
# http://blog.modp.com/2007/08/png-metadata-from-command-line-again.html

import sys
from PIL import Image

# These are not user-added meta data, skip
reserved = ('interlace', 'gamma', 'dpi', 'transparency', 'aspect')

# sys.argv[0] is the name of the program.. skip it
# for each file on the command line
for file in sys.argv[1:]:
    print file
    im = Image.open(file)
    for k,v in im.info.iteritems():
        # if auto-generated metadata, skip it
        if k in reserved: continue
        print k + " = " + v

Oh great, I'm becoming a PNG metadata expert. Just what I always wanted to be.

Tuesday, August 28, 2007

Python, PIL and PNG metadata, take 2

Contrary to my original post, the Python PIL library does have support for reading and writing PNG metadata. This is based on version 1.1.6 and the devel snapshot, as of 28-Aug-2007.

The Short Story

The short story is that Image.open reads most PNG metadata into the Image.info dict. But Image.save ignores Image.info and will erase all metadata! Use this wrapper function instead:

#
# wrapper around PIL 1.1.6 Image.save to preserve PNG metadata
#
# public domain, Nick Galbreath
# http://blog.modp.com/2007/08/python-pil-and-png-metadata-take-2.html
#
def pngsave(im, file):
    # these can be automatically added to Image.info dict
    # they are not user-added metadata
    reserved = ('interlace', 'gamma', 'dpi', 'transparency', 'aspect')

    # undocumented class
    from PIL import PngImagePlugin
    meta = PngImagePlugin.PngInfo()

    # copy metadata into new object
    for k,v in im.info.iteritems():
        if k in reserved: continue
        meta.add_text(k, v, 0)

    # and save
    im.save(file, "PNG", pnginfo=meta)

Just edit the Image.info as you like and it will get written out.

from PIL import Image
im = Image.new("RGB", (128,128), "Black")
im.info["foo"] = "bar"
pngsave(im, "foo.png")

You can see that it worked by running strings foo.png or, if you use ImageMagick, identify -verbose foo.png.

The Long Story

The next section is mostly for PNG nerds and developers of PIL.

Reading

On read, PIL dumps the metadata key/value pairs into the standard Image.info field. So far so good.

What's not so good is that it also puts transparency, gamma, aspect, and dpi into the same dict. While I guess this is metadata, it is rendering metadata, which is treated differently in the PNG file than user-added metadata. I'm not sure what PIL does for other image types; there may be other keywords. This is really only a problem when it comes to writing metadata, in the next section.

Another issue is that PIL reads only one of the three different types of metadata chunks that PNG supports (tEXt: yes, zTXt: no, iTXt: no). This post provides a patch for zTXt.

Writing

By default PIL will erase any user-metadata with Image.save. I would think this is a bug. Editing the Image.info dictionary does not result in changes either. It is completely ignored on write.

Oddly PIL has support for writing metadata, as either uncompressed tEXt or compressed zTXt data (which it isn't able to read!). Here's what you do:

>>> from PIL import Image
>>> from PIL import PngImagePlugin

>>> # let's make an image
>>> im = Image.new("RGB", (128,128), "Black")
>>> 
>>> # HERE'S THE SECRET
>>> meta = PngImagePlugin.PngInfo()
>>> meta.add_text("foo", "bar")
>>> im.save("foo.png", "png", pnginfo=meta)
>>>
>>> 
>>> # But im.info is not modified
>>> im.info
{}
>>> # but if we re-open the image, we get our
>>> # metadata back
>>> im2 = Image.open("foo.png")
>>> im2.info
{'foo': 'bar'}
>>> #
>>> # but remember if we save it without
>>> # explicitly adding the metadata, we lose it
>>> im2.save("foo3.png")
>>> im3 = Image.open("foo3.png")
>>> im3.info
{}
>>> # whoops

The secret is making a PngImagePlugin.PngInfo() object, and then adding key/value pairs using the add_text method. An optional third argument says whether to compress the value text (true/false).

Technically, the PNG spec says that tEXt and zTXt should only contain Latin-1 characters. I don't see the writer code enforcing this rule, but it's doubtful it matters at all. There is also no support for the iTXt block, which is for UTF-8 data. This doesn't seem to be a big deal since few (if any) image programs support it.
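
For comparison, here is what enforcing the Latin-1 rule could look like in a writer. This is a standard-library Python 3 sketch, not PIL's implementation, and the function names are hypothetical. It builds a tEXt chunk by hand and splices it in right after the IHDR chunk, which is always first and always 13 data bytes long. Encoding with "latin-1" makes Python raise UnicodeEncodeError for anything outside that character set.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_END = 8 + 4 + 4 + 13 + 4  # signature + IHDR length, type, data, CRC


def _chunk(ctype, data):
    # length + type + data + CRC computed over type and data
    crc = zlib.crc32(ctype + data) & 0xffffffff
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def text_chunk(key, value):
    """Build a tEXt chunk; raises UnicodeEncodeError for non-Latin-1 text."""
    k = key.encode("latin-1")
    if not 1 <= len(k) <= 79 or b"\x00" in k:
        raise ValueError("keyword must be 1-79 bytes with no NUL")
    return _chunk(b"tEXt", k + b"\x00" + value.encode("latin-1"))


def add_text(png_bytes, key, value):
    """Return a copy of png_bytes with a tEXt chunk inserted right after IHDR."""
    if png_bytes[:8] != PNG_SIGNATURE or png_bytes[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return png_bytes[:IHDR_END] + text_chunk(key, value) + png_bytes[IHDR_END:]


def tiny_png():
    """A 1x1 8-bit grayscale PNG, just enough to demonstrate on."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))


if __name__ == "__main__":
    out = add_text(tiny_png(), "foo", "bar")
    with open("foo-text.png", "wb") as fp:
        fp.write(out)
```

Because the rest of the file is copied byte for byte, existing chunks (including any other metadata) survive untouched.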

Ideas

Adding read support for zTXt seems like a no-brainer, especially since the writer exists.

Adding support for iTXt would be nice, but it appears nobody really uses it.

Adding support for the tIME chunk (last modified time) seems like another no-brainer. It is currently not read or written.

Lumping together rendering metadata and user metadata in the same dict is not great. In an ideal world it would be nice to store the metadata in a special dict that records what type of chunk each entry came from. Loading and saving a file would then result in a near-identical file. You could also specify whether a metadatum needs compressing or not. This is a bonus. I'd be happy with any interface that allowed one to write plain ol' tEXt chunks.

The hard part is making a uniform system of metadata that can work between different image types.