Thin Air

Everything about compilers

Essential Code Implemented

A while back I posted on the question of which should be considered more authoritative, source code or byte code. The conclusion I came to was that neither is ideal as a "canonical" representation of a program; an abstract syntax tree would better fill that role.

Well, that notion stuck with me and I've started working on a simple tree-shaped representation for Smalltalk code. The idea driving this project is fairly simple: there's no "one true representation" of a program. It's really quite an abstract thing and it needs to be represented differently in different contexts. However, the most abstract representation is the AST, which can be easily converted to other forms as needed.

The AST form is most natural for manipulation by tools such as browsers, debuggers, type inferencers, version control, translators etc. ASTs can be executed directly - this is how the Ruby interpreter works, for example, but Smalltalk traditionally compiles to bytecode, which is can be more efficiently interpreted by the VM.

For presentation to the programmer, you want yet another form - class browsers and source code. And there may be other representations that are useful for presenting to the programmer: class diagrams, pattern summaries etc. (This is one of the core concepts of Intentional Programming as Darius Clarke commented on my last post on this topic.)

So the idea is to shift between these representations as fluidly as possible, and preserve as much of the available information as possible. So the AST form preserves much of the formatting information that the programmer originally entered with the source code, and can reconstruct that source faithfully.

However, that goal shouldn't get in the way of optimizing a particular representation for its context, which is the whole point of multiple representations in the first place. I'm really interested in and excited by projects for optimizing Smalltalk execution, such as Eliot Miranda's AOStA or Bryce Kampjes Exupery. In optimizing bytecodes or native code for fast execution, we may loose the information-equivalence between compiled methods and their ASTs, and that's OK.

In fact it's a good thing, because decoupling the representations used by the tools and the VM can make each more flexible and more powerful. Take the "senders" button in the browser, for example. If we optimize away certain messages sends by inlining the methods they call, we interfere with the browser's ability to trace the senders. If the browser is operating the AST, however, we don't have that problem. We are free to optimize the compiled methods for fast execution, the AST for ease of analysis, and the programmer's representation for clarity.

The first application of this new representation will be in OmniBrowser, which I'm in the process of adapting to operate on syntax trees rather than directly on the runtime. (Actually, OmniBrowser already has a layer of indirection between it and the runtime - this is what makes things like the Package Browser possible - so this is will actually be a simplification of that layer.)

Further down the road, I'd also like to use the same package representations in Monticello, since they provide a much richer model of the package, and could allow versioning and merging at a finer grain than the current model allows.

Posted in compilers

More Static on Types

There's another thread on Lambda the Ultimate which I've been following with some interest. Unfortunately it seems to have degenerated into a static- vs. dynamic-typing flame war, which happens all-too-frequently for my taste.

It's quite a shame really, because I think the functional languages community and the dynamic languages community have a lot to learn from each other. (And since we're all outcasts from the mainstream computing world, we ought to stick together.) These are a few thoughts I've had about why the two groups communicate so poorly.

First, I think there's often confusion around terminology. In particular, the static/dynamic and strong/weak typing dichotomies are often conflated.

Labelling a language as statically- or dynamically- typed refers to the way variables are treated during compilation. In statically-typed languages, the compiler attaches type information to variables, and uses that information to catch type errors and perform optimizations. Compilers in dynamically-typed languages treat variables simply as named references to values, and leave it until runtime to determine how to perform operations on them.

The strong/weak dichotomy refers to how values are treated at runtime. Strongly typed languages attach type information to values, and programs cannot alter those types, while weakly typed languages treat values as "bits in memory" and how those bits are treated is largely a convention.

Programming languages are often either weakly- and statically-typed (eg, C++) or strongly- and dynamically-typed (Smalltalk), which may be one of the reasons for confusion.

Another thing that seems to enter into these debates is perspective. They often have the feel of "mathematicians vs. engineers" or "theory vs. practice". The exchanges end up being endless repetitions of, "Static typing guarantees that you won't introduce certain kinds of errors," which is rebutted with, "Yes, but it also guarantees that I can't get my work done easily."

Dynamic folks like to take a gardening approach to programming. They're up to their knees in the mud, hands dirty, planting and pruning, swatting bugs as they appear and composting weeds for fertilizer. They view the system as a living, evolving thing, and value testing, feedback and iterative development for figuring out what works and what doesn't. They don't worry about ensuring that everything goes right from the beginning, because a little pruning or landscaping can fix any problems that come up.

Static folks, on the other hand, take the architecture approach. They sit at a drafting table, and design structures of concrete and steel. They view the products of their work as monuments which must withstand the pressures of time and work hard to imbue them with mathematical grace and harmony. They know that structural failures can be catastrophic, so they build safety into their designs from the beginning.

Ok, maybe I went a bit overboard with the metaphor, but I hope this illustrates the different perspectives I'm talking about. Being a Smalltalker, I tend to prefer the iterative approach to development. Or rather, having learned the hard way that I can't plan for every eventuality, I appreciate that Smalltalk lets me quickly develop possible solutions, gather feedback and begin the cycle again. That said, I also appreciate the utility of the mathematical tools used by the functional programming community. What I want is a system that combines the agility of Smalltalk with the robustness of, say, Haskell.

And there's no reason the two need to be placed in opposition. Strongtalk was a great example of dynamic system that allowed rapid development, yet also provided tools for performing static type analysis in order to catch errors. In addition, the Strongtalk VM did optimizations based on type information gathered at runtime, which eventually wound up in Sun's HotSpot Java VM.

Frank Atanassow has promised to write up his take on the whole issue, in a paper called "How to argue about types," which I figure will be quite interesting as a reasonable view from "the other side."

Posted in compilers

Assembling Turtles

Lambda the Ultimate recently had a post on High Level Assembly with a link to an example of Object Oriented HLA. Reading that code is just creepy. There's something very attractive about being able to create powerful OO abstractions, while at the same time being able to control the machine at a low level. This is one of the things I like about Smalltalk, actually, although in that case it's low-level control of the virtual machine. On the other hand, I shudder at the thought of writing real software in assembly. (I guess I'm showing my age with that statement.)

This reminds me of a point made by Ian Piumarta about "language levels" in Squeak. At the top is eToys, a prototype language used in education. It lets children create graphical objects using paint tools and attach scripts to them to provide behaviour. Despite its simplicity, eToys is quite powerful; it can be used to create complex animations and simulations.

At some point though, kids grow up, and some of them might like to peek behind the curtain and see what's going on at the next level down, in Smalltalk. At this level, objects belong to classes. We can create instances and send messages to them, or create new classes and methods for responding to those messages.

Below that is a level that might be called meta-Smalltalk. It's the part of Smalltalk that deals with its own implementation - Metaclasses, the Compiler, MethodContexts, etc.

Below meta-Smalltalk is the virtual machine, which is implemented in a Smalltalk-like language called Slang. Slang is syntactically valid Smalltalk, and in fact, the Squeak VM can run a copy of its self, but Slang's semantics are such that it can be translated into C and compiled into native machine code for fast execution.

But here there's a bit of a disconnect, because the VM level is different from all the levels above. The higher levels of Squeak are all integrated into the same environment - to move from eToys to Smalltalk, for example, one need only click a button in the script viewer to see the Smalltalk code for a particular script. To move from Smalltalk to meta-Smalltalk, one need only click the "class" button in a browser or bring up a debugger.

To make changes at the VM level, though, one has to generate and build a separate VM; the Slang implementation is not accessible from Smalltalk. And this is what is appealing about HLA. Its lowest level of execution is accessible from the level above. Of course, HLA only has two levels, and neither of them are accessible at runtime - as in Slang, the program has to be reassembled and launched again before the changes can take effect.

Still though, it makes me wonder about the possibility of creating a system that really is turtles all the way down. It seems like Exupery might be a component of such a system, as might Ian's VVM project. Perhaps a more dynamic conception of HLA might even play a role.

Posted in compilers