Showing posts with label ruby. Show all posts
Showing posts with label ruby. Show all posts

Friday, September 14, 2007

Processing Fixed-Length Records

Here's a trick to save memory when reading very large files of fixed-length-records into memory when using a scripting language. Especially those that use garbage collection. For an overview of how scripting language use memory, see my last post

For an example, we'll use the most excellent OpenStreetMap project to convert US Census TIGER data

In this case they have to read several different fix-length-format files into memory and index them (meaning, everything stays in memory). For the Tiger RT1 file, each line is 230 characters containing 45 fields! The current converter is in ruby. The code snippet below is the string offset (+1) and the length of the field. Eeeck.

    RT1_fields = {
      :rs => [1,1],
      :version => [2,5],
      :tlid => [6, 10],
      :side1 => [16, 1],
      :source => [17, 1],
      :fedirp => [18, 2],
      :fename => [20, 30],
      :fetype => [50, 4],
      :fedirs => [54, 2],
      :cfcc => [56, 3],
      :fraddl => [59, 11],
      :toaddl => [70, 11],
      :fraddr => [81, 11],
      :toaddr => [92, 11],
      :friaddl => [103, 1],
      :toaiddl => [104, 1],
      :friaddr => [105, 1],
      :toiaddr => [106, 1],
      :zipl => [107, 5],
      :zipr => [112, 5],
      :aianhhfpl => [117, 5],
      :aianhhfpr => [122, 5],
      :aihhtlil => [127, 1],
      :aihhtlir => [128, 1],
      :census1 => [129, 1],
      :census2 => [130, 1],
      :statel => [131, 2],
      :stater => [133, 2],
      :countyl => [135, 3],
      :countyr => [138, 3],
      :cousubl => [141, 5],
      :cousubr => [146, 5],
      :submcdl => [141, 5],
      :submcdr => [151, 5],
      :placel => [161, 5],
      :placer => [166, 5],
      :tractl => [171, 6],
      :tractr => [177, 6],
      :blockl => [183, 4],
      :blockr => [187, 4],
      :frlong => [191, 10],
      :frlat => [201, 9],
      :tolong => [210, 10],
      :tolat => [220, 9]
    }

Immediate Lookup

To read an entire file of this and index by the :tlid you could do something like this where you chop up each line and put into a hash table immediately:

class RT1
    def initialize(line)
      RT1_fields.each |key, slice|
         start = slice[0] -1
         len = slice[1]
         @data[key] = line[start, len] 
    end

    def [] key
      @data[key]
    end
end

# ...
fp.each_line do |line|
  record = RT1.new(line)
  id = record[:tlid]
  @database[id] = record
end

Uhhh, I'm sure there is a more elegant way to do this in ruby by using object properties, but you get the idea.

There is nothing wrong this this approach. It only becomes a problem when you run out of memory and start swapping

The issue here is that each record causes 45 objects to be created. With a large file, you'll be creating zillions of objects. No matter what scripting language you use, the overhead is going to be pretty large. I don't know the exact value, but I'm sure each object has at least 4 bytes of book keeping. In our case that's almost as large as the actual data!

And in a garbage collected language such as ruby, java, or javascript, it's going to have pause, and touch every one of these objects to figure out if they can deleted or not. That means slow.

(NOTE: This is just an example. The OSM project actually used a more sophicated version of this.)

Deferred Lookup

The trick to save memory is to be lazy and just store the raw string, and extract the right field when requested.

class RT1
    def initialize(line)
        @line = line
    end

    def [] key
      sliceval = RT1_fields[key]
      # minus one since the s
      @line[sliceval[0]-1, sliceval[1]]
    end
end

# ...
fp.each_line do |line|
  record = RT1.new(line)

  # lazy lookup of id
  id = record[:tlid]

  @database[id] = record
end

The cost of doing a single hash lookup and string slice is pretty close to 0, so performance shouldn't be much of an issue. It will be a bit slower, but the big benefit is we dropped the number of long-lived objects per record from 46 to 2. That's going to make the garbage collect much happer, and since we have fewer objects, we have less overhead, which means we use less memory.

For large TIGER datasets, this technique drops memory usage by hundreds of megabytes!

Sunday, September 2, 2007

ruby-mode for emacs

I'm not really a ruby guy, but recently I had to do some ruby hacking, which means I needed ruby-mode for emacs. Between the InterWeb, RubyGarden, and EmacsWiki, I couldn't find a clear, correct description of how to get ruby-mode. Here's my take.

Step 1: Why not upgrade emacs?

Did you know emacs finally released a new version?!

If you are using MacPorts, do this for a terminal (non-gui version). Do port variants emacs for other options.

sudo port install -v emacs

Step 2: Download the code

We are going to install our lisp code in ~/.emacs.d/site-lisp which I think is more-or-less standard, but you could use any directory.

export SITE_LISP=~/.emacs.d/site-lisp
mkdir -p $SITE_LISP
svn export http://svn.ruby-lang.org/repos/ruby/trunk/misc ${SITE_LISP}/ruby
emacs -batch -f batch-byte-compile ${SITE_LISP}/ruby

Don't forget that last step! That compiles the lisp code, and will make emacs run faster (handy if you re-indent a large file).

Step 3: Configure your .emacs

Add this to your .emacs. This is a complete rip from RubyGardens's Installing Emacs Extensions, with the exception of the first line.

;; ruby                                                                         
;; based on http://www.rubygarden.org/Ruby/page/show/InstallingEmacsExtensions  
;;                                                                              

(add-to-list 'load-path "~/.emacs.d/site-lisp/ruby")

 (autoload 'ruby-mode "ruby-mode"
     "Mode for editing ruby source files")
 (add-to-list 'auto-mode-alist '("\\.rb$" . ruby-mode))
 (add-to-list 'interpreter-mode-alist '("ruby" . ruby-mode))
 (autoload 'run-ruby "inf-ruby"
     "Run an inferior Ruby process")
 (autoload 'inf-ruby-keys "inf-ruby"
     "Set local key defs for inf-ruby in ruby-mode")
 (add-hook 'ruby-mode-hook
     '(lambda ()
         (inf-ruby-keys)))
 ;; If you have Emacs 19.2x or older, use rubydb2x                              
 (autoload 'rubydb "rubydb3x" "Ruby debugger" t)
 ;; uncomment the next line if you want syntax highlighting                     
 (add-hook 'ruby-mode-hook 'turn-on-font-lock)

Step 4: Get back to work

EOF