Friday, June 19, 2009

Of course I spoke too soon

I just knew if I said my interpreter was self-hosting that I'd have to take it back. It really can self-host, but I need to rebuild from the original MIT Scheme binaries again.

As I mentioned in a far earlier post, my interpreter works by walking a tree-shaped data structure called SCode. SCode is essentially an abstract syntax tree for Scheme. Both MIT Scheme and PLT Scheme interpret code by walking an abstract syntax tree (although PLT Scheme has been using a JIT compiler lately, and I'm not sure where that fits in to the picture).

The SCode for MIT Scheme comes from two sources. The runtime system will create SCode on the fly for when the user evaluates code at the prompt or loads a Scheme source file. The other source of SCode is a program called ‘sf’ which translates Scheme source code off-line into SCode and dumps the SCode into a binary file. Sf is written in Scheme, so there is a bootstrapping problem if you don't have a previous version of sf.

To solve the bootstrapping problem, I wrote a loader that can read the data format that the standard MIT Scheme distribution uses. The format is basically a memory dump with some base and offset information so you can relocate the pointers. The loader builds C# data structures using the memory dump as sort of a template.

I can read the binaries produced by the standard MIT Scheme, but I cannot write them. I never intended to be bi-directionally compatible in this way. The uni-directional compatibility is just for bootstrapping, anyway. Instead, I'm writing binary files by using the serialization facilities built in to C#.

As of yesterday I was able to load the old format binaries of sf and then run it to create new format binaries for sf and the rest of the runtime system. Then I was able to restart my interpreter using the new binaries only and perform the cycle again.

So what's the catch?

MIT Scheme SCode only has about a dozen node types. These are sufficient to describe the language, but they are also fairly generic. By adding more node types we can greatly improve performance by using specialized versions of the evaluation code. My interpreter does this on the fly at SCode creation time. The specializations and optimizations are controlled by configuration flags. In theory, any combination of the configuration flags would work, the only difference being the performance of the interpreter. The Scheme runtime system has primitives to manipulate SCode, and it doesn't expect to see these specialized nodes. But the API to the SCode hides the specialization and Scheme is none the wiser.

The problem comes in serializing the SCode to the binary format. My intent was to serialize in an unspecialized format. This would enable me to toggle the configuration flags before starting Scheme and have the system re-specialize (or not) on the fly as it deserialized the binaries. My plans, however, “gang aglee” as they are wont to do and I ended up with the specialized objects in my serialized binaries. This isn't a big deal, but when I changed how the specialized code is serialized, it broke the deserializer. So now I have to back to the original MIT distribution and read its binaries.

No comments: