Thin Air

Essential Code

Posted by Colin Putney on April 28, 2004

Not long ago, Avi Bryant posed an interesting question on the squeak-dev list: "which is more authoritative, the source code or the bytecode?" Or to put it another way, what is the essence of a program, and how can we represent that in the machine?

From an information-content standpoint, the two forms are nearly equivalent. Source code is compiled to bytecode which can be decompiled back into source code. But there are subtle differences - bytecode looses the temporary names and formatting of the original source code, and with it something of the author's intent. On the other hand, the bytecode is better connected to the runtime system - variables are bound, selectors have been interned, etc. But neither is really suitable as an "authoritative" form a method, at least not from a tools perspective.

There are two problems with source code. The first is that it's out of date. It represents the method at the time the author compiled it, but (as Avi mentioned in his original post) that same string might not compile now, because of changes elsewhere in the system. At the same time, source code is really difficult for tools to work with. It has to be parsed for even such "simple" operations as selecting a message send or variable, to say nothing of an operation like "browser senders."

The problem with bytecode, on the other hand, is that it's an implementation detail. It's meant to be executed by the VM, and the performance of the system depends on the VM's ability to do that efficiently. So a CompiledMethod's ability to represent the abstract structure of the method is held hostage by the need to optimize its execution.

Now, for a lot of purposes, a method would be ideally be represented as an abstract syntax tree. If it were carefully designed, the AST could carry enough information to reconstruct the original source code with the author's formatting intact, and would be equally easy to convert to optimized bytecode or even native code for execution. Best of all, it would be easier to write tools such as the Refactoring Browser or SmallLint which take advantage of an easy-to-examine-and-manipulate representation of methods.

In all the systems I'm familiar with, ASTs are very transient things - produced during compilation but immediately thrown away. We'd have to rearrange quite a few of the basic assumptions of the system in order to use ASTs as the canonical representation of source code. There would be practical considerations as well - how much space does an AST require compared to a CompiledMethod or a chunk in the .changes file? How long would it take to generate bytecode from an AST?

Serialization and compression may help overcome some of these problems. This paper by Stork and Haldar presents a way of encoding ASTs based on their grammar, and is designed for fast decoding by a Just-In-Time compiler.

I'm not planning on writing a new VM for Squeak anytime soon, but I can't help thinking that some of these ideas will find their way into Monticello and OmniBrowser, one way or another. OB already uses Squeak parse nodes to do syntax-based selection in code browsers, and I'm very interested in Andy Tween's Shout package, which was released today.

Posted in monticello