Sunday, September 30, 2007

RFC 1123 Date Time Format

Using strftime format specifiers (used by every scripting language), the RFC 1123 time format is
%a, %d %b %Y %H:%M:%S GMT
(assuming you are using the "C" or a english local, and UTC time). In python this would be:
$ python
>>> import datetime
>>> datetime.datetime.utcnow().strftime("%a, %d %b %Y %H:%M:%S GMT")
'Sun, 30 Sep 2007 14:58:05 GMT'
>>> 

Note, the "%a, " is optional.

Good luck looking that up directly. It's on page 55 of RFC 1123, which it's just a change to RFC 822 to be "year 2000 compliant" so you have go track down the original spec (it's in section 5). Which is in BNC format.

I should really make a master page listing standard time formats and their equivalent strftime format.

Friday, September 14, 2007

Processing Fixed-Length Records

Here's a trick to save memory when reading very large files of fixed-length-records into memory when using a scripting language. Especially those that use garbage collection. For an overview of how scripting language use memory, see my last post

For an example, we'll use the most excellent OpenStreetMap project to convert US Census TIGER data

In this case they have to read several different fix-length-format files into memory and index them (meaning, everything stays in memory). For the Tiger RT1 file, each line is 230 characters containing 45 fields! The current converter is in ruby. The code snippet below is the string offset (+1) and the length of the field. Eeeck.

    RT1_fields = {
      :rs => [1,1],
      :version => [2,5],
      :tlid => [6, 10],
      :side1 => [16, 1],
      :source => [17, 1],
      :fedirp => [18, 2],
      :fename => [20, 30],
      :fetype => [50, 4],
      :fedirs => [54, 2],
      :cfcc => [56, 3],
      :fraddl => [59, 11],
      :toaddl => [70, 11],
      :fraddr => [81, 11],
      :toaddr => [92, 11],
      :friaddl => [103, 1],
      :toaiddl => [104, 1],
      :friaddr => [105, 1],
      :toiaddr => [106, 1],
      :zipl => [107, 5],
      :zipr => [112, 5],
      :aianhhfpl => [117, 5],
      :aianhhfpr => [122, 5],
      :aihhtlil => [127, 1],
      :aihhtlir => [128, 1],
      :census1 => [129, 1],
      :census2 => [130, 1],
      :statel => [131, 2],
      :stater => [133, 2],
      :countyl => [135, 3],
      :countyr => [138, 3],
      :cousubl => [141, 5],
      :cousubr => [146, 5],
      :submcdl => [141, 5],
      :submcdr => [151, 5],
      :placel => [161, 5],
      :placer => [166, 5],
      :tractl => [171, 6],
      :tractr => [177, 6],
      :blockl => [183, 4],
      :blockr => [187, 4],
      :frlong => [191, 10],
      :frlat => [201, 9],
      :tolong => [210, 10],
      :tolat => [220, 9]
    }

Immediate Lookup

To read an entire file of this and index by the :tlid you could do something like this where you chop up each line and put into a hash table immediately:

class RT1
    def initialize(line)
      RT1_fields.each |key, slice|
         start = slice[0] -1
         len = slice[1]
         @data[key] = line[start, len] 
    end

    def [] key
      @data[key]
    end
end

# ...
fp.each_line do |line|
  record = RT1.new(line)
  id = record[:tlid]
  @database[id] = record
end

Uhhh, I'm sure there is a more elegant way to do this in ruby by using object properties, but you get the idea.

There is nothing wrong this this approach. It only becomes a problem when you run out of memory and start swapping

The issue here is that each record causes 45 objects to be created. With a large file, you'll be creating zillions of objects. No matter what scripting language you use, the overhead is going to be pretty large. I don't know the exact value, but I'm sure each object has at least 4 bytes of book keeping. In our case that's almost as large as the actual data!

And in a garbage collected language such as ruby, java, or javascript, it's going to have pause, and touch every one of these objects to figure out if they can deleted or not. That means slow.

(NOTE: This is just an example. The OSM project actually used a more sophicated version of this.)

Deferred Lookup

The trick to save memory is to be lazy and just store the raw string, and extract the right field when requested.

class RT1
    def initialize(line)
        @line = line
    end

    def [] key
      sliceval = RT1_fields[key]
      # minus one since the s
      @line[sliceval[0]-1, sliceval[1]]
    end
end

# ...
fp.each_line do |line|
  record = RT1.new(line)

  # lazy lookup of id
  id = record[:tlid]

  @database[id] = record
end

The cost of doing a single hash lookup and string slice is pretty close to 0, so performance shouldn't be much of an issue. It will be a bit slower, but the big benefit is we dropped the number of long-lived objects per record from 46 to 2. That's going to make the garbage collect much happer, and since we have fewer objects, we have less overhead, which means we use less memory.

For large TIGER datasets, this technique drops memory usage by hundreds of megabytes!

Tuesday, September 11, 2007

Scripting Languages and Memory Management

One of the great benefits of using a scripting language (and Java) is that memory management is done for you. You make new objects, and magically they get zapped when they are no longer needed.

Most of the time.

While normally everything works well, there are times when you scale up or have complicated data structures when the scripting language needs a little help.

To aid in diagnosing memory problems, the follow is a crude description of how popular scripting languages deal with memory management.

Scripting Objects

By an "object" I mean something that uses memory. It could be a Class-Object object, or it could be an array, a string or a hash table. Anything.

For just about every language besides C/C++, when you create an object, you actually create two things. The actual raw object in memory. You don't get to touch that directly. The other is a reference to the object. You do get to play with that. You use the reference to manipulate the underlying memory.

Reference Counting

Python, Perl, and PHP use reference counting to manage memory. It's just what you think it is.

  • When you create an object, it's reference count is 1.
  • When it goes out of scope the reference count is decremented.
  • If the count is 0, the object is destroyed and memory is released.
  • When another object contains or refers to your object, the reference count is increased.
  • Then an object is destroyed, all it's children get their reference count decremented.

The good news is that it's conceptually simple to implement, and it's very clear what-is-happening-when (as compared the next method). Memory usage is more or less predictable.

The bad news is that just about everything you do involves constantly incrementing and decrementing reference counts. This eats CPU time.

The other bad news is that you can have circular references where two objects point to each other so their reference count never goes to 0.

More bad news: it can be hard for the low level developers to "get it right" (in the raw C/C++ code, they have to manually add those increments and decrements, so you don't have to.)

Garbage Collection

Java, Javascript/Actionscript and Ruby use a different system called "mark and sweep" to do "garbage collection". These systems have a "root" object in the global scope where everything else is some how connected to it. The interconnections between objects creates a "graph." Somewhere else is a list of every object allocated in the system.

To delete objects that are no longer used, the following process is used:

  • Set a flag to 0 on every object in existence
  • Starting at the root node, do a depth or breadth first descent in the graph marking each object as you go
  • Scan every object and see what was not marked: these are dead objects and can be deleted

While the garbage collection process sounds gross, in practice all memory management code is one place so it can be optimized. In general its very fast.

Good news: circular references can't hide. If an object isn't part of the main graph, it gets killed.

The bad news is that when garbage collection happens, the program pauses (although multithreaded garbage collectors do exist).

Memory usage is hard to predict. Typically GC programs can have large swings in the amount of memory used.

In Practice

The above descriptions were very simple. In practice, some use hybrids. Some defer releasing memory in order to prevent "holes." Some create different pools or categories of objects. Some use a different technique for strings.

What's better?

Well it all depends and it's devil in the details.

In theory garbage collection systems have the potential for being more stable since nothing can leak and higher performance system since you aren't constantly adding and subtracting. And since the deallocation is deferred it can in theory make smart choices to prevent memory fragmentation. However a good collector is hard to write.

For short-lived applications, such as PHP, (where at the end of the HTTP response Apache deletes all memory PHP allocated anyways) reference counting is very fine and probably optimal.

In most cases it doesn't matter.

The Big Problem with Reference Counting

You can't really measure the CPU cost of doing referencing counting since it's embedded deeply in the code. Either you are happy with the performance of your scripting language or not.

The big problem is circular references, where memory never gets released. Normally it's not too hard to spot where the problem is, it's typically data structures that point to each other. To fix, you can alter the datastructure or explicitly null or zero out some of it's members to break the circular chain

Do a search on "your scripting languge" with "circular reference" and you'll find plenty of tips.

The Big Problem with Garbage Collection

In most "normal" programs you can't even notice the garbage collection. GC works well for a few hundred thousand concurrent objects. But once you get into the millions, it can be a real problem.

Generating millions of object is easy to do if you are implementing a large in-memory database, caches, or in-memory data processing. In these cases, you might experience a lot of garbage collection and very little work getting done.

To fix that you have to be clever so you make less objects, or you might want to try using a reference-counted language (or go C/C++ where you have direct control, but that's another topic).

Drawing circles and stars on Google Earth with KML

So I'm sure you were wondering what the last post was about. Here are the fruits of my labor.

I created a new googlecode project kmlcircle that will generate the appropriate Google Earth KML snippet for circles, regular polygons and stars. Check out the examples:

circles circles

Monday, September 10, 2007

Rotating a point around a vector

Well, you'd think it would be easy: rotate a point (x,y,z) around a vector (u,v,w), by a certain angle λ. You'd think this would be in every elementary graphics textbook, but no.

For true 3-D applications, you'd probably want to use a transformation 4x4 matrix and perhaps quaternions (seach) This is for quick and dirty, "I need to rotate-something-now" type of work where you don't have a matrix library laying around. Like in javascript, or python. An application of this is to be able to draw circles on Google Earth (posting to come).

Glen Murray solved this in the most general case with the new point being:

Which is not so bad. However if we assume the vector is a unit vector, (and there is no reason not to assume this since we always do a little translation and scaling), the result is much simpler:

I'll post the derivation shortly.

Saturday, September 8, 2007

Slides from NYCBUG crypto talk posted

Hi all,

The nice people at NYCBUG invited me to speak on cryptography on 05-Sept-2007. About 20 (?) people showed up ranging from people who had no exposure to crypto to some people who worked on VPNs, developers to sysadmins -- all types with lots of excellent questions. Thanks everyone for coming -- it was really great to meet you.

I reworked the slides a bit based on feedback from that day and posted them here.

What's my problem? Why didn't I take any photos? Why didn't I post here before I spoke? Oh well. Next time! There are rumors I'll do something similar for the NYC php group.

Sunday, September 2, 2007

ruby-mode for emacs

I'm not really a ruby guy, but recently I had to do some ruby hacking, which means I needed ruby-mode for emacs. Between the InterWeb, RubyGarden, and EmacsWiki, I couldn't find a clear, correct description of how to get ruby-mode. Here's my take.

Step 1: Why not upgrade emacs?

Did you know emacs finally released a new version?!

If you are using MacPorts, do this for a terminal (non-gui version). Do port variants emacs for other options.

sudo port install -v emacs

Step 2: Download the code

We are going to install our lisp code in ~/.emacs.d/site-lisp which I think is more-or-less standard, but you could use any directory.

export SITE_LISP=~/.emacs.d/site-lisp
mkdir -p $SITE_LISP
svn export http://svn.ruby-lang.org/repos/ruby/trunk/misc ${SITE_LISP}/ruby
emacs -batch -f batch-byte-compile ${SITE_LISP}/ruby

Don't forget that last step! That compiles the lisp code, and will make emacs run faster (handy if you re-indent a large file).

Step 3: Configure your .emacs

Add this to your .emacs. This is a complete rip from RubyGardens's Installing Emacs Extensions, with the exception of the first line.

;; ruby                                                                         
;; based on http://www.rubygarden.org/Ruby/page/show/InstallingEmacsExtensions  
;;                                                                              

(add-to-list 'load-path "~/.emacs.d/site-lisp/ruby")

 (autoload 'ruby-mode "ruby-mode"
     "Mode for editing ruby source files")
 (add-to-list 'auto-mode-alist '("\\.rb$" . ruby-mode))
 (add-to-list 'interpreter-mode-alist '("ruby" . ruby-mode))
 (autoload 'run-ruby "inf-ruby"
     "Run an inferior Ruby process")
 (autoload 'inf-ruby-keys "inf-ruby"
     "Set local key defs for inf-ruby in ruby-mode")
 (add-hook 'ruby-mode-hook
     '(lambda ()
         (inf-ruby-keys)))
 ;; If you have Emacs 19.2x or older, use rubydb2x                              
 (autoload 'rubydb "rubydb3x" "Ruby debugger" t)
 ;; uncomment the next line if you want syntax highlighting                     
 (add-hook 'ruby-mode-hook 'turn-on-font-lock)

Step 4: Get back to work

EOF

Emacs 22.1

Did you know emacs finally popped out a new release? The last release was in Feb 6, 2005! Yes 2.5 years ago. You probably have been using a snapshotted version, but now it's time to upgrade to the real version. The list of changes is kinda long: 230K of raw text. I mostly use emacs in the terminal (not GUI), on Mac OSX, and mostly for programming. Here's my greatest hits.

Mac OS X Support

Finally official and supported. This is more a big deal if you use a GUI. For terminal users, the following key remapping command may be useful

** The variable `mac-command-key-is-meta' is obsolete.  Use
`mac-command-modifier' and `mac-option-modifier' instead.

Tidbits

ruler-mode is neato.

display-battery-mode -- more toys!

;; turn off the splash screen on startup
(setq inhibit-splash-screen t)

New major modes

python-mode was finally added. ruby-mode, javascript/emacs-script-mode still absent. Boo!

dns master files, and cfengine config file however get their own mode.

conf-file major mode

conf-mode is a new major mode for editing conf file. Apparently it can recongize .cf, .config.properties, .desktop, .ini, and "many others".

If emacs doesn't recognize your conf file, I suppose you need to add something like this to your .emacs

(add-to-list 'auto-mode-alist '("\\.XXX$" . conf-mode))

cc-mode improvements

all sorts of parsing improvements to C++

Doc-comment highlighting. For you javadocs can now be color coded. Oddly, no support Doxygen! They they the system is plug-inable. So maybe this can be fixed. More here

Doxygen does support javadoc style comments in C++ code, so if you do that, then add this to your .emacs

(add-hook 'c++-mode-hook
    (function (lambda ()
        (add-to-list 'c-doc-comment-style '(c++-mode . javadoc))))

Other programming stuff improved

More improvements to M-x compile to support more messages

Lots of improvements and additions in etags for: C, C++, HTML, PHP, Lua, perl, prolog

flymake

Quote:
** The new package flymake.el does on-the-fly syntax checking of program
source files.  See the Flymake's Info manual for more details.

The idea is thatin the background, your source file will be get compiled and proactively tell you where you made mistakes. I haven't tried it yet, but it sounds interesting.

Wow, there is a movie demo-ing this feature. A movie on emacs. Amazing. It's oddly effective. Oh yeah, flymake is cool too. I may yet need to upgrade to GUI version of emacs.

File Operations

*** The new hook 'before-save-hook' is invoked by `basic-save-buffer'
before saving buffers.  This allows packages to perform various final
tasks.  For example, it can be used by the copyright package to make
sure saved files have the current year in any copyright headers.

I don't remember where I saw this, but I'm told this is also Emacs 22

;;
;; Trick for emacs 22
;; if the script has a first line of "#!" then do chmod a+x
;;
(add-hook 'after-save-hook 'executable-make-buffer-file-executable-if-script-p)