It’s Not Like You Care About Your Documents

Recently, as part of the many antitrust and anti-competition legal actions it faces, Microsoft released specifications for the old Office binary file formats. As expected, they’re big and complex. Joel Spolsky (a former member of the Excel team) had some thoughts on their size and complexity:

With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The digging turns up reasons that make some sense: the limitations of older computers, feature creep, a complete lack of attention to the future. But it’s hard to see some of these reasons as “why it doesn’t reflect bad programming on Microsoft’s part”. Carelessness is common, sure, but we don’t call it a virtue just because everybody does it.

And these are problems that should have been on someone’s radar at Microsoft. It’s one thing for a grunt programmer to hack a feature to meet a deadline; it’s another for management to simply go along with it, or to fail to order a rethink when the problems come to light. When you read about hacks like the following, everything sounds nice and reasonable, until you remember what the end result is: that Microsoft Excel doesn’t have a standard format for storing and manipulating dates!

There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904. Excel supports both because the first version of Excel, for the Mac, just used that operating system’s epoch because that was easy, but Excel for Windows had to be able to import 1-2-3 files, which used 1/1/1900 for the epoch. It’s enough to bring you to tears. At no point in history did a programmer ever not do the right thing, but there you have it.
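To make the consequence concrete, here is a minimal Python sketch of what any reader of these files has to do just to interpret a date. The function name and the `date1904` flag are my own, chosen for illustration, and the sketch assumes whole-number, non-negative serials with no time-of-day fraction. It is not how Excel itself is implemented.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial: int, date1904: bool = False) -> date:
    """Convert a whole-number Excel date serial to a calendar date.

    Hypothetical helper for illustration only; assumes serial >= 0.
    """
    if date1904:
        # 1904 system: serial 0 is 1904-01-01.
        return date(1904, 1, 1) + timedelta(days=serial)

    # 1900 system: serial 1 is 1900-01-01, and serial 60 is the
    # fictitious 1900-02-29 kept for 1-2-3 compatibility.
    if serial == 60:
        raise ValueError("serial 60 is 1900-02-29, a date that never existed")
    if serial < 60:
        return date(1899, 12, 31) + timedelta(days=serial)
    # Past the phantom leap day, every serial is one day "ahead",
    # so the effective epoch shifts back to 1899-12-30.
    return date(1899, 12, 30) + timedelta(days=serial)


if __name__ == "__main__":
    stored = 39448  # the same number, as stored in two different workbooks
    print(excel_serial_to_date(stored))                 # 1900 system
    print(excel_serial_to_date(stored, date1904=True))  # 1904 system
```

The same stored number lands 1,462 days apart depending on a flag kept elsewhere in the workbook, so a converter that ignores the flag silently shifts every date by about four years.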

It may not have been the wrong decision, in the sense that it enabled them to ship, and shipping is everything in some circles. But as a design decision, how can anyone defend such inconsistency?

Business information technology was able to move forward in the early ’90s because older document formats like 1-2-3 and WordPerfect were simple enough to import easily into Microsoft Office. Today, when we talk about moving to open-source suites like OpenOffice or online systems like Google Docs, detractors left and right cite the pain of document conversion as a reason to hold back. But if Joel is right about the old binary formats, the pain of transition is like the pain of changing your oil: you can pay now, or you can pay a lot more later. Even Microsoft is having trouble opening its own files from long ago, with “long ago” being a period measured in years, not decades.

Maybe you didn’t write anything a decade ago you’d care to read again today; maybe you can’t imagine any of your stuff being worth reading a decade from now. Do you want to take that chance?

Thankfully, I was a geek, and kept most of my documents in plain text. Today, I take care to save important documents in formats and encodings designed for the long haul, like Unicode, ODF, and PDF. It helps that I avoid Microsoft software like the plague. (If you think they’ve changed since the bad old days, just surf the web in Firefox on Linux sometime, and see how many badly-rendered pages look much better when you switch their text encoding from Unicode to “Windows-1252”.)
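For the curious, the encoding mismatch is easy to reproduce. The following is a small Python sketch, assuming a page whose bytes were written as Windows-1252 but which the browser reads as UTF-8; the sample string is my own, and `cp1252` is Python’s name for the Windows-1252 codec.

```python
# A curly apostrophe (U+2019), the kind Word inserts automatically.
text = "It\u2019s"

# Bytes as a Windows-1252 authoring tool would write them.
raw = text.encode("cp1252")

# Read as UTF-8, the apostrophe byte is invalid and gets replaced
# with U+FFFD, the "unknown character" mark.
as_utf8 = raw.decode("utf-8", errors="replace")

# Read as Windows-1252, the original text comes back intact.
as_cp1252 = raw.decode("cp1252")

# The reverse mistake: genuine UTF-8 bytes read as Windows-1252
# turn one apostrophe into three junk characters.
mojibake = text.encode("utf-8").decode("cp1252")

if __name__ == "__main__":
    print(repr(raw))
    print(as_utf8)
    print(as_cp1252)
    print(mojibake)
```

Plain ASCII survives either way, which is why the damage shows up only around “smart” quotes, dashes, and accented letters.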

If you have a lot of Office documents, even if you’re happy with Office, you might consider whether you care about opening those documents ten years from now, and whether you’d rather take the time to future-proof them while you still can.