Sunday, November 25, 2007

SSI Secure Software Test for C

As mentioned in the last post, the Software Security Institute is granting Secure Software Programmers certification. I took the C sample test online. While I can't cut-n-paste from the sample test, the questions for C/C++ are similar to this:

What line contains a security issue:

1   #include <stdio.h>
2   int main(int argc, char** argv) {
3       printf("%d\n", argc);
4       printf(argv[1]);
5       return 0;
6  }

The answer is line 4, since the "format string" is coming from the user. If the string contains formats such as "%d" or "%s", printf will start pulling values off the stack, which can lead to nastiness. You want to change line 4 to printf("%s\n", argv[1]);, or, since this bad example never checks argc, if (argc > 1) { printf("%s\n", argv[1]); } OK, so now you know. The other questions are more obscure.

So are you supposed to go through all your code and make these nit-picky changes? Like that's going to work. Like you're going to have time to even do that. Even if you did "fix" your existing code, new code, patches, and changes are constantly coming in. And humans aren't so good with details -- they'll make mistakes. Certification in secure programming is just a start in security.

For this example, the issue can be caught automatically with gcc, by adding -Wformat-security or -Wformat=2. It's not caught with -Wall -Wextra -pedantic.

$ gcc -v
... gcc version 4.0.1 ...

$ gcc -Wformat=2 -Wall -Wextra -Werror -pedantic foo.c
cc1: warnings being treated as errors
foo.c: In function ‘main’:
foo.c:4: warning: format not a string literal and no format arguments

Ok, that does it. I'm writing a new book on C/C++ and software engineering. Really. Stay tuned for details.

SSI Secure Software Programmers Certificates

The Software Security Institute is granting "Secure Software Programmers certification". Right now, tests exist for Java/J2EE and C, but other tests are coming for C++, PHP, perl, and .NET/ASP.

The C/C++ one mostly covers low-level programming and bad system calls (more on this in another post). The Java one looks at correct configuration of J2EE and threading. I think the php or perl tests will cover Cross Site Scripting and SQL injection issues once they come out.

Normally I think programming certificates are worthless since most of them cover stuff you have to know just to do your regular job. But security is a little different since it's about issues around the edges, and if you've never seen the issue it's unlikely you'll think about it. Even better, it's a sure thing your employer will pay for it ($500)!

Check it out and let me know what you think.

Thursday, November 22, 2007

Remembering Java 1.0

In November 2007, Java is the #1 programming language. Yet for all its popularity, there are probably an equal number of malcontents. You've heard the issues: that it's excessively verbose, bloated, neither fish nor fowl (i.e. what is Java's goal? Or what is Java? A language? A runtime?), causes developers to focus on secondary issues and excessive abstraction, etc. Regardless of whether you agree or not, for now let's pretend all the complaints are valid. If so, then how did this clunky language climb to the top?

It's easy to forget what a revolution Java was when it came out in 1995. Back then C++ barely worked and was slow, you had perl4, a million Unix variants, and you were lucky if you had a 90 MHz computer. The number of high quality open source libraries was a lot smaller, and it was always an issue getting them to compile.

The Promise of Java

Here's what I recall from when Java came out, and how those advantages compare nowadays.

Compiler - Features

Pre-Java Each unix platform had a different compiler that implemented C++ differently, so you needed either lots of #ifdefs or custom libraries working around compiler bugs.

Java One compiler on every platform. Worked.

Now C/C++ compilers are relatively mature. Many scripting languages have replaced C/C++ code.

Compiler - Speed

Pre-Java Compiling the entire application could take hours. ClearCase (source repository) provided a system of pre-compiled source files on a central server to help speed up compilation.

Java When Java came out it was a godsend. The entire application was compiled in seconds or minutes.

Now Compilers are a lot better and the time to compile is "reasonable".

Basic Types

Pre-Java Different platforms had different sized integer and floating point types, signed and unsigned, with strange casting rules. And different endianness, so long term storage was problematic.

Java Double, Int, Long, Byte. Done.

Now Everyone uses x86 now so endianness isn't an issue (haha). Interestingly the big Java champions are IBM and Sun, whose chips are big endian, not little endian like x86. Types are still a problem in C/C++ but can be mitigated with good practices.

Basic Data Structures

Pre-Java STL did not work or exist or could not be compiled or was slow. Had to buy libraries since open source versions didn't exist or were not mature.

Java Rich types: strings, vectors, hash tables, etc.

Now STL works and works well, numerous platform libraries (APR, NSPR, GLib), scripting languages.

Basic OS libraries

Pre-Java Every OS had different implementations of libc and posix calls

Java Standardized and worked.

Now posix compliance is pretty good. Autotools works well. Numerous platform libraries (APR, NSPR, GLib), scripting languages.

Program Structure

Pre-Java one header file, one source file, each could contain multiple classes or gasp, functions

Java one class, one source file. Everything is in a class. At the time, this was viewed as a great simplification

Now C++ is still the same. Java now supports "inner classes". Not so sure that Java is a better solution (it's different, but it has pluses and minuses).

Object Serialization

Pre-Java Custom, or had to buy 3rd party library (if it existed)

Java Built-in, with versioning!

Now For C++ there is Boost. Most scripting languages have a native way of doing this and/or JSON.

Remote Method Invocation

Pre-Java Horrible CORBA, which barely worked or different versions were incompatible.

Java RMI! While nobody uses it now (?), at the time it was considered amazing compared to CORBA.

Now Use HTTP and/or XMLRPC.

Exceptions

Pre-Java As I recall, even C++ exceptions were "experimental" or had performance issues due to crappy compilers.

Java Built in, worked. You could get a stack trace!

Now They certainly work in C++, but are not as rich as Java's.

Threading

Pre-Java

Java Built in, with nice syntax. Current concurrency libraries are really excellent.

Now posix libraries work. C++/OS abstraction libraries (boost) make it even easier. Atomic operations can be provided by other libraries. Many scripting languages don't have true threads.

Robustness

Pre-Java Delicate. Easy to core dump.

Java Solid (crashes are a very rare event).

Now Easy to make mistakes, but they can be greatly reduced by using STL and references, not raw pointers. But of course the risk exists. Of course scripting languages provide the same safety.

Memory Management

Pre-Java DIY. Easy to leak. Had to buy tools to find them.

Java Garbage collector!

Now Leaks still exist in C/C++, but with careful engineering, using constructors/destructors and avoiding raw pointers, you can make leaks go away. Plenty of free tools to help catch them. Scripting languages of course do this too.

Simplicity

Pre-Java C++ is a complicated language. Perl4's concept of OO was very primitive

Java When it came out, Java was billed as being simple and self-contained. I seem to recall some demo they gave where they printed out the complete java spec and compared it to a 3 meter stack of books for Win32. Also I seem to recall that 1 line of Java was like the equivalent of 3 lines in C++ or something. At the time, perhaps it was true.

Now Well, C++ is still complicated, no doubt. The STL and other standard libraries are greatly improved too, improving productivity. But Java has exploded, not just in libraries but the language itself. I don't think anyone can say java is a simple language anymore.

Modular

Pre-Java Shared libraries are tricky.

Java JAR files are simple. You can give someone a JAR file and they can use it instantly.

Now Not as much of an advantage. Setting CLASSPATH is still annoying.

Documentation

Pre-Java What documentation?

Java javadoc!

Now Nowadays, every scripting language has some type of autodoc system, frequently inspired by Java. For C/C++ Doxygen works well.

GUI

Pre-Java Back in the 90s, writing for Unix meant writing for X. The widgets looked horrible. Or you could buy a Motif license, which looked a lot better but was buggy. And then there was the question of how to write something cross platform so it worked on Windows too. And Win32 was terrifying.

Java Java's AWT allowed mere mortals to write cross platform GUIs.

Now Of course now for Java there is AWT, Swing, and SWT. And on the other side there is wxWidgets and others.

Java Today

As you can see when it came out, Java was quite compelling, but those advantages don't always hold anymore.

So what is Java good for now? This is an easy question to answer for just about any other language, but for some reason it's hard to answer with Java. It's not that I'm saying Java is bad, and it certainly has some good features, libraries, and applications, but what does it excel at that other languages have a hard time doing? What type of project would you use it for if starting from scratch (besides the fact that you know it already)?

Wednesday, November 21, 2007

sqlite3 and "ON DUPLICATE KEY UPDATE"

MySQL has a great SQL extension, "INSERT ... ON DUPLICATE KEY UPDATE" (doco here). As you might guess, it inserts a new row, but if the row already exists, you can specify an update instead. It's particularly great for doing frequency counts:

INSERT INTO atable SET name='foo', count=4 ON DUPLICATE KEY UPDATE count=count+4

sqlite3 doesn't have this functionality, but it's easy to fake with a little programming. I'm going to use python as an example, but I'm sure it applies to other languages.

import sqlite3

# setup code here

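try:
    # sqlite3 doesn't accept MySQL's INSERT ... SET form, so spell out the columns
    cursor.execute("INSERT INTO atable (name, count) VALUES ('foo', 4)")
except sqlite3.IntegrityError:
    # the unique index on name rejected the insert, so bump the existing row
    cursor.execute("UPDATE atable SET count = count + 4 WHERE name = 'foo'")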

# more

With Sqlite3 you'll need to make sure a unique index exists (in this example, for the 'name' field).
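
To make the pieces above concrete, here is a minimal, self-contained sketch of the same trick in Python 3 using an in-memory database. The helper name add_count and the index name atable_name are my own inventions for illustration, and the table layout is just a guess at what "atable" would look like.

```python
import sqlite3

def add_count(conn, name, amount):
    """Insert a new row, or add to the count if the name already exists."""
    cur = conn.cursor()
    try:
        cur.execute("INSERT INTO atable (name, count) VALUES (?, ?)",
                    (name, amount))
    except sqlite3.IntegrityError:
        # The unique index on name rejected the insert: update instead.
        cur.execute("UPDATE atable SET count = count + ? WHERE name = ?",
                    (amount, name))
    conn.commit()

# Hypothetical setup: an in-memory database with a unique index on name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (name TEXT NOT NULL, count INTEGER NOT NULL)")
conn.execute("CREATE UNIQUE INDEX atable_name ON atable (name)")

add_count(conn, "foo", 4)   # first call inserts the row
add_count(conn, "foo", 4)   # second call hits the index and updates
rows = conn.execute("SELECT name, count FROM atable").fetchall()
print(rows)
```

Using ? placeholders instead of pasting values into the SQL string also keeps odd characters in the name from breaking the statement.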

The Joys of Rounding

Quick! What is the output of this:

#include <stdio.h>
int main() {
    printf("%.0f %.0f %.0f\n", 1.5, 2.5, 3.5);
    return 0;
}

It turns out Unix-based systems (at least Linux with glibc and BSD/Mac systems) use Round-To-Even rules (this is good):

2 2 4

But on Windows, it's Round-Half-Up (this is bad)

2 3 4

For truly portable programs, you'll need to use a 3rd party implementation of printf (see APR, NSPR or GLib).

Unfortunately I can't test this for C++ and its formatting styles to see if there is a difference between Unix and Microsoft systems.
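
If you just want to see the two rounding rules side by side without hunting down two operating systems, Python's decimal module lets you pick the rule explicitly. This is only an illustration of the rules themselves, not of what any particular C library does; round_to_int is a made-up helper name.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_to_int(text, mode):
    """Round a decimal string to a whole number using the given rule."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=mode))

values = ["1.5", "2.5", "3.5"]
half_even = [round_to_int(v, ROUND_HALF_EVEN) for v in values]
half_up = [round_to_int(v, ROUND_HALF_UP) for v in values]

print(half_even)  # ties go to the nearest even number
print(half_up)    # ties always go up
```

The only value where the two rules disagree here is 2.5, which is exactly the kind of thing that slips through a test suite run on one platform.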

Frequency Counting in Python

Let's say you need to keep track of frequency count of arbitrary items. For instance parsing a log file or doing a word count and seeing what terms come up most often.

Python lets you subclass the built-in dict class (hash table). An extension to aid in frequency counting is listed below. It's nothing particularly special, but it's simple, it works, and it's fast. Enjoy!

class dictcount(dict):

    def add(self, key, value=1):
        self[key] = self.get(key,0) + value

    def sum(self):
        return sum(self.itervalues())

    def sortByValue(self, reverse=True):
        return sorted(self.iteritems(),
                      key=lambda (k,v): (v,k),
                      reverse=reverse)

Sample run:

$ python
>>> d = dictcount()
>>> d.add('foo')
>>> d.add('foo')
>>> d.add('bar', 2)
>>> d.add('bar', 3)
>>> d.sum()
7
>>> d.sortByValue()
[('bar', 5), ('foo', 2)]

Of course all the regular dict methods are available too.
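
The class above is written for Python 2. On Python 3, itervalues and iteritems are gone and lambdas can no longer unpack a tuple argument, so a rough equivalent, assuming you want to keep the same method names, might look like this:

```python
class dictcount(dict):
    """dict subclass for frequency counting (Python 3 sketch)."""

    def add(self, key, value=1):
        # Missing keys start at zero.
        self[key] = self.get(key, 0) + value

    def sum(self):
        # Inside the method body, sum still refers to the builtin.
        return sum(self.values())

    def sortByValue(self, reverse=True):
        # Sort on (value, key) so ties are broken by key.
        return sorted(self.items(),
                      key=lambda kv: (kv[1], kv[0]),
                      reverse=reverse)

d = dictcount()
d.add('foo')
d.add('foo')
d.add('bar', 2)
d.add('bar', 3)
print(d.sum())
print(d.sortByValue())
```

Newer Pythons (2.7 and 3.1 onward) also ship collections.Counter, which covers much of the same ground with its most_common method.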

Sunday, November 18, 2007

Performance Tests and Sqlite3

Performance "unit tests" present a bit of a problem to the usual QA tools since the result isn't the usual "pass/fail". Typically you'll have to write your own harness, store the results, and apply some extra logic to determine if the build was OK or not.

I can't help you write the harness, but I can help with where to store the results.

RRDtool is great. However it has a lot of options and I've found that unless the engineer is already familiar with it, it's a bit of overkill for a performance database. It also works best when the test is run regularly. (I should actually write up a HOWTO for this.) MySQL is also excellent, but it requires setting up a server, and permissions, and all that.

sqlite is perfect. No server. No config. One File. Standard SQL. Here's the SQL for a sample metrics db:

DROP TABLE IF EXISTS metric;
CREATE TABLE metric (
       date       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
       name       TEXT NOT NULL,
       value      REAL NOT NULL
);

DROP INDEX IF EXISTS datename;
CREATE UNIQUE INDEX datename ON metric (date, name);

You might want to jazz up this table and add a build number, SVN/CVS id or product version. The 'name' field is just the name of the test.

To create the database, just do sqlite3 DBNAME < FILE, where FILE contains the SQL above.

python and php now ship with sqlite3 out of the box and you can use the fancy APIs. However, you can also use the sqlite3 command directly: just add the SQL statement as the last argument, sqlite3 db 'sql':

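# python example
import os
sql = "INSERT INTO metric (name, value) VALUES ('%s', %f)" % (name, value)
# spawnlp wants the program name twice: once to find it, once as argv[0]
os.spawnlp(os.P_WAIT, 'sqlite3', 'sqlite3', databasename, sql)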

In php, see exec, in perl see exec.

Sleazy, yes. By all means, actually use the real API. But if it's not available, this works. It also means you can use sqlite3 via bash.

Now go make your intern make pretty graphs for you and do alerts if the last run is 20% slower than the previous one.
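
If you don't have an intern handy, here is a rough sketch of the "real API" route in Python 3, including the 20% check. The function names record and is_regression, the test name parse_log, and the sample timings are all made up for illustration, and it assumes the stored value is a run time where bigger means slower.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS metric (
    date  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    name  TEXT NOT NULL,
    value REAL NOT NULL
);
CREATE UNIQUE INDEX IF NOT EXISTS datename ON metric (date, name);
"""

def record(conn, name, value, date=None):
    """Store one measurement; the date defaults to 'now' in the database."""
    if date is None:
        conn.execute("INSERT INTO metric (name, value) VALUES (?, ?)",
                     (name, value))
    else:
        conn.execute("INSERT INTO metric (date, name, value) VALUES (?, ?, ?)",
                     (date, name, value))
    conn.commit()

def is_regression(conn, name, threshold=0.20):
    """True if the latest run is more than threshold slower than the one before."""
    rows = conn.execute(
        "SELECT value FROM metric WHERE name = ? "
        "ORDER BY date DESC, rowid DESC LIMIT 2", (name,)).fetchall()
    if len(rows) < 2:
        return False  # nothing to compare against yet
    latest, previous = rows[0][0], rows[1][0]
    return latest > previous * (1.0 + threshold)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record(conn, "parse_log", 1.00, "2007-11-17 12:00:00")
record(conn, "parse_log", 1.30, "2007-11-18 12:00:00")
print(is_regression(conn, "parse_log"))
```

Comparing only against the previous run is the simplest possible rule; comparing against an average of the last few runs would be less jumpy, but that is a design choice for your harness.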

Whoops, I forgot the optimization flag

So it turns out if you use macports on Mac OS 10.5, python2.5 is compiled without any optimization. This makes it run at least 2x slower than the version in /usr/bin (the bug report is here).

I only noticed this since I'm a nerd and turned on verbose/debug flags on macports and then thought it was weird that I didn't see the usual -O2. I then wrote a minor benchmark to confirm that indeed, the macports version is slower.

Which reminds me of major screwups I have done involving pushing debug code live instead of the normal optimized version (in that case it was C++ code). In this case I think it took 6 hours of many people's time to "undo" since all the servers were on fire and end customers were pissed off.

Besides turning on all compiler warnings, besides running your unit tests, besides running your integration tests, you also need to have some type of minor performance test to catch these problems. Even the simplest test will catch these types of problems. And it's not just for C++: even if you are writing in a scripting language, the packager of the language can screw up, or use a different optimization between builds, or use a different compiler, or just not use the best optimization possible.

In the macports example, it appears that some tweaks needed to be done to make python compile on the new OS. In the process of fixing that, the optimizer flags were ignored silently. Nothing was "removed". I've done the same thing fiddling with Makefiles. I didn't remove optimization, I just screwed up some rules, so the wrong flags were used. In other words, it's really easy to do this.

Have a performance smoke test.
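
A smoke test like this doesn't need to be fancy. Here is one possible shape in Python 3, using the standard timeit module; the workload function and the time budget are placeholders I made up, and you would swap in something representative of your own code and hardware.

```python
import timeit

def best_time(func, repeats=5, number=1000):
    """Run func `number` times per trial and return the fastest trial, in seconds."""
    return min(timeit.repeat(func, repeat=repeats, number=number))

def workload():
    # Placeholder workload: replace with a call into the code you care about.
    return sum(i * i for i in range(100))

# Hypothetical budget, set deliberately loose so only gross slowdowns trip it.
BUDGET_SECONDS = 5.0

elapsed = best_time(workload)
print("best of 5: %.4f s" % elapsed)
if elapsed > BUDGET_SECONDS:
    raise SystemExit("performance smoke test failed")
```

Taking the minimum over several trials filters out noise from whatever else the machine is doing, which matters more than precision for a test whose job is to catch a missing -O2.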

Fast datetime parsing in python

So you need to parse a date time string in a log file a zillion times? For instance the Apache log has this lovely time format: 08/Nov/2007:04:05:07

The python docs say use datetime and strptime like this:

import datetime
import time
def apachetime_slow(line):
    return datetime.datetime(*time.strptime(line, "%d/%b/%Y:%H:%M:%S")[0:6])

Short, ugly, and slow. In my application, 25% of my runtime was this function! The Apache time format is gross, but it is fixed length. This means you can use string slices to get the individual bits. The only "trick" is mapping the month abbreviations to a number.

import datetime

# month abbreviation -> month number
month_map = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def apachetime(s):
    # fixed-width fields: DD/Mon/YYYY:HH:MM:SS
    return datetime.datetime(int(s[7:11]), month_map[s[3:6]], int(s[0:2]),
         int(s[12:14]), int(s[15:17]), int(s[18:20]))

On my box this is a full 10x faster! Enjoy!
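
Since a hand-typed month table is an easy place for a typo to hide, it's worth cross-checking the slicing version against strptime for all twelve months. A sketch in Python 3, assuming the default English month abbreviations (the MONTHS name and the check loop are mine):

```python
import datetime

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def apachetime(s):
    """Parse DD/Mon/YYYY:HH:MM:SS using fixed-position slices."""
    return datetime.datetime(int(s[7:11]), MONTHS[s[3:6]], int(s[0:2]),
                             int(s[12:14]), int(s[15:17]), int(s[18:20]))

def apachetime_slow(s):
    """Reference implementation using strptime."""
    return datetime.datetime.strptime(s, "%d/%b/%Y:%H:%M:%S")

# Check every month so a bad or missing table entry is caught right away.
for month in range(1, 13):
    expected = datetime.datetime(2007, month, 8, 4, 5, 7)
    stamp = expected.strftime("%d/%b/%Y:%H:%M:%S")
    assert apachetime(stamp) == expected
    assert apachetime_slow(stamp) == expected

print(apachetime("08/Nov/2007:04:05:07"))
```

The fast version trades away all error checking, so it only makes sense when you trust the input format, as you usually can with a server's own log.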

Monday, November 12, 2007

Sorting a python dict by value

Don't search for "python dict sort by value" since you'll get outdated answers. As of python 2.4, the "right" way to do this is:

alist = sorted(adict.iteritems(), key=lambda (k,v): (v,k))

To get the reverse order, add on a ,reverse=True

This is the fastest way to do this and it uses the least amount of memory. Enjoy.

>>> adict = {'first':1, 'second':2,'third':3, 'fourth': 4}
>>> adict
{'second': 2, 'fourth': 4, 'third': 3, 'first': 1}
>>> sorted(adict.iteritems(), key=lambda (k,v):(v,k))
[('first', 1), ('second', 2), ('third', 3), ('fourth', 4)]
>>> sorted(adict.iteritems(), key=lambda (k,v):(v,k), reverse=True)
[('fourth', 4), ('third', 3), ('second', 2), ('first', 1)]
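
The one-liner above is Python 2 only. On Python 3, iteritems is gone and a lambda can't unpack (k,v) in its argument list, so a rough equivalent indexes into the pair instead:

```python
adict = {'first': 1, 'second': 2, 'third': 3, 'fourth': 4}

# Sort on (value, key): by value first, with the key breaking ties.
ascending = sorted(adict.items(), key=lambda kv: (kv[1], kv[0]))
descending = sorted(adict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

print(ascending)
print(descending)
```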

Tuesday, November 6, 2007

String Compare, Whitespace Insensitive

In the process of writing some unit tests, I needed to compare the output string with the expected output. Except the strings are XML snippets, and I didn't want the tests to break if I changed the formatting. So I needed to write up a simple "string compare, whitespace insensitive" function (e.g. " f o o " == "foo"). Very handy, but not built into any programming language that I can recall.

Writing this function is a good simple programming, weed-out question for interviewing. It works for any programming language, too. The higher level languages will have a few ways to do it. If the interviewee can't come up with something, well, it will be a short interview.
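
For the record, one of the shorter answers in Python looks something like the following. The function name is mine, and this is just one of several reasonable ways to do it.

```python
def equal_ignoring_whitespace(a, b):
    """Compare two strings after removing every whitespace character."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces back together drops spaces, tabs, and newlines.
    return "".join(a.split()) == "".join(b.split())

print(equal_ignoring_whitespace(" f o o ", "foo"))
print(equal_ignoring_whitespace("<a>\n  <b/>\n</a>", "<a><b/></a>"))
print(equal_ignoring_whitespace("foo", "bar"))
```

Note that this also ignores whitespace inside text content and attribute values, which is fine for a formatting-tolerant unit test but is not a real XML comparison.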

Fun with iChat 4.0

In Mac OS 10.5, ichat 4.0 has some fun, nerdy new features. I'm not sure if AIM supports these directly and ichat is just enabling them or if they are ichat-proprietary. Either way, it's great. I don't use all the wacky ichat features - audio, video, screen sharing - yet.

Tabs

Finally! And they work very well. There are vertical "tabs" on a left sidebar. To enable them you have to go to ichat preferences, messages, and check "Collect chats into a single window." It rules.

Invisible

By going invisible, you can see the status of your friends, but they can't see you and you can still send and receive messages. It's like a more aggressive "Away" status.

Oh yeah, I found a super minor "bug" too - ichat menubar icon missing 'invisible' status.

/me status messages

This only works when both parties are using ichat, but if you type /me something it sends an IRC-like status message. The best way to understand is to try it. I first read about it at tuaw.com.

Async Message Handling

I'm not sure how this really works, but it appears that if you are offline and someone sends you a message, the next time you log in you will get the old messages that were sent while you were away (just like cell phone SMS). Yahoo IM has had this for a long time.

Anyone got any details on this? Or is this just the sender resending the message? Either way, it's a good addition.

Invoking Applescript on Messages

Ok I haven't tried this, but you can now invoke applescript in response to messages. So you could "auto-accept" new chats. Even better, if everything is set up correctly, you can make a "poor man's remote control" for your computer.

You can read more about this at MacWorld and TUAW.