I'm trying out blogging clients for Mac OS X, so posts may be oddly formatted for a little while.


There used to be a company named Digital Archaeology (may still be; they got absorbed in October 2000). Their line of business was data mining -- sifting through all the myriad data that a company collects and generates, looking for interesting and useful patterns.

I'm finding out the hard way that they didn't know what real digital archaeology is like.

I bought a new computer recently. Okay, new for me; it's an HP Vectra VL, with a Pentium II processor and 64MB of memory. I picked it up for $25 from a company that upgraded all of their desktops. It's replacing my old desktop machine, a Compaq 486. You can see why I'm happy about this.

In the process of transferring all of the data from the old disk to the new one, I managed to wipe the old disk. Completely. No backups. (Please, no smart-aleck comments. Yes, I screwed up. Keep reading.) All of my personal email going back a couple of years was gone, as was my GPG key. I still haven't figured out what else was lost; I keep thinking of things.

Fortunately, what I lack in better judgement, I make up for in tenacity. After a few attempts to "repair" the disk (I didn't expect much success, since the disk wasn't really damaged to start with), I downloaded The Coroner's Toolkit. It copied all of the unallocated blocks (basically, the entire disk) to another disk, classifying them (as text, email, HTML, programs, etc.) as it went. It took the better part of two days to run, primarily because either it keeps a lot of history in memory as it works or it has a memory leak -- I set up 384MB of swap space, and about 300MB of it was in use at the end.

Now I have 2GB worth of raw disk blocks to try to string back together.

I'm working on my mailbox file first, and to be honest it's more tedious than heartbreaking. My email address is unique enough that I was able to find most of the appropriate blocks by grepping for it. Now I'm filling in the gaps, caused mainly by TCT classifying blocks as HTML rather than email when they contain mail from certain mail programs (*cough*Outlook*cough*) that send HTML by default. It helps immensely that ext2 keeps file fragmentation to a minimum, so usually all I have to do is look at the blocks between this one and that one, sequentially -- and it's usually just one or two.
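The grep-and-fill-the-gaps routine can be sketched in a few lines of Python. This is only an illustration under some assumptions: each recovered block is taken to be a separate file in one directory, with a file name that starts with its block number (something like `1234.m.txt`; that naming is hypothetical, so adjust it to whatever your recovery tool writes), and `find_hits` and `gaps_to_inspect` are names made up for this sketch.

```python
import re
from pathlib import Path


def block_number(path):
    """Return the leading integer of a file name such as '1234.m.txt', or None.

    The naming scheme is an assumption made for this sketch.
    """
    match = re.match(r"(\d+)", path.name)
    return int(match.group(1)) if match else None


def find_hits(block_dir, needle):
    """Return the sorted block numbers of files whose contents contain `needle` (bytes)."""
    hits = set()
    for path in Path(block_dir).iterdir():
        number = block_number(path)
        if number is None or not path.is_file():
            continue
        if needle in path.read_bytes():
            hits.add(number)
    return sorted(hits)


def gaps_to_inspect(hits, max_gap=4):
    """Return (first, last) block ranges lying strictly between nearby hits.

    Because fragmentation is low, a short run of unmatched blocks between two
    matching ones is a good candidate for the missing middle of a message.
    """
    gaps = []
    for lower, upper in zip(hits, hits[1:]):
        missing = upper - lower - 1
        if 0 < missing <= max_gap:
            gaps.append((lower + 1, upper - 1))
    return gaps
```

Calling `find_hits("blocks", b"me@example.org")` and passing the result to `gaps_to_inspect` gives a short list of block ranges to read by hand. The `max_gap` cutoff of 4 is arbitrary; it just keeps the list focused on small holes rather than the long stretches between unrelated files.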

That's the good news. The bad news is that I have a lot more of this to do. The good news is that most of the contents of the disk were part of Linux rather than data that needs to be found. The bad news is that the data that does need to be found is buried in the haystack of other data.

I finally read Paul Graham's seminal paper on Bayesian spam filtering yesterday (thanks to Jason Kottke for pointing me to it). I have to admit that my feelings are mixed.

On the one hand, I like the idea of an adaptive spam filter. To paraphrase somebody (was it a Supreme Court justice?), "I know spam when I see it." Besides, one man's spam is another man's portable, canned, processed meat product. So to speak.

On the other hand, it feels like closing the pantry door after the cans have gotten out. So to speak. By the time spam gets to the mail client, it's already done its primary damage, wasting bandwidth. If you try to push the spam filter further upstream, you lose the ability to define what constitutes spam; it becomes a collaborative definition or (worse) someone else (like your bandwidth provider) defines it for you.


Take two.