Thin Air

Everything about monticello

Questions on the versioning model

Bruce Badger posted a comment in response to my post on the versioning model used in Monticello 2. He has some questions about methods:

  • What is the identity (or primary key) of a method?
  • Within what scope is the identity unique?
  • If I wanted to use a particular version of a particular method in two classes, could I (setting aside the question of whether this is a good idea or not)?

The short answer is that Monticello 2 uses the same semantics that the Smalltalk runtime uses. The identity of a MethodElement is its class name and selector; it's only guaranteed to be unique within a given image. You couldn't put the same method in two classes; it would have to be copied.
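To make the short answer concrete, here is a minimal sketch, written in Python rather than Smalltalk. MethodKey and the dictionary standing in for an image are names invented for illustration; they are not part of Monticello 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodKey:
    """Hypothetical identity of a method element: class name plus selector."""
    class_name: str
    selector: str


# One image, modelled (as an assumption) as a mapping from identity to source.
image = {}
original = MethodKey("OrderedCollection", "isEmpty")
image[original] = "isEmpty\n    ^ self size = 0"

# "Using" the same method in a second class means storing a copy under a new key.
copy_key = MethodKey("MyCollection", "isEmpty")
image[copy_key] = image[original]

# Two distinct elements that happen to share the same source text.
print(original == copy_key)
print(image[original] == image[copy_key])
```

The key is only meaningful inside one dictionary, which mirrors the point that the identity is only unique within a given image.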

Now, Avi and I have kicked around ideas for a deeper model of Smalltalk code. Rather than identifying elements by name, they'd each have UUIDs. Method sources would be versioned as an AST. The nodes for variable references would have the UUIDs of the elements the variables are bound to in the compiled method.

This would have two advantages:

First, it would help with platform independence. Rather than depending on names to bind variables during compilation, we'd be relying on UUIDs. This would make it easier to transform the names when moving code back and forth between dialects. This would make it easier to handle Namespaces in VW, for example, or differences in platform libraries.

Second, it would allow us to provide a more accurate reproduction of code between images. We'd be restoring methods to their compiled states rather than just their source code. This is one of the things that's so compelling about Spoon, and it would allow Bruce's scenario of the same method version being used in two different classes.

On the other hand, it's that much more code and complexity. It would require a custom parser, an AST able to handle all the syntactic quirks of the various dialects of Smalltalk where Monticello will run, and a compiler back end for each platform. Monticello 2 is already an ambitious project, and a significant improvement over Monticello 1. Our goal for now is to get the current version up to production quality so we can start using it. Maybe some of these ideas will be part of Monticello 3.

Posted in monticello versioning smalltalk

Slicing the image

In my last post, I mentioned that version history in Monticello 2 isn't tied to packages. Instead, it introduces the concept of slices.

A slice is, quite simply, a set of elements - an arbitrary slice of the code in the image. We can define several different kinds of slices:

Packages

In Squeak, we can use PackageInfoSlice to get packages identical to those used by Monticello 1. In other dialects we'd create slices to interface with the native packaging code - PackageSlice and BundleSlice in VisualWorks for example.

Change Sets

A ChangeSet also defines an interesting slice of the image, and by implementing ChangeSetSlice, we can make them versionable and mergeable, just like packages. I'm really looking forward to this one, actually. It'll make the lives of package maintainers easier, since contributors can just send them change sets rather than full packages.

Modules

Lately, I've become interested in combining Monticello with Spoon. One of the keys to that integration would be to create a NaiadSlice. This would define a slice based on the elements involved in executing a given Smalltalk expression.

Explicit

Probably the simplest kind of slice is defined with a collection of elements. At some point, I'd like to create a UI for easily creating an ExplicitSlice. I'm imagining a window which lists the contents of the slice, and accepts new elements via drag and drop from OmniBrowser. For now, though, ExplicitSlices can be created programmatically, and are really handy for testing.

Others

Although they're probably not useful for everyday development, there are other kinds of slice one might want. A FileOutSlice would enumerate all the elements in a particular chunk file. We could do the same thing with the sources and changes files. We could create a slice that scanned the changes file and included all elements modified between a pair of snapshot markers. When demoing Monticello 2 I sometimes joke about creating a slice that includes all the elements that match a given rewrite rule. I don't know how useful it would be, but why not?

For the moment, I've only implemented ExplicitSlice and PackageInfoSlice, since they're needed to achieve feature parity with Monticello 1.
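The idea that a slice is just a set of elements can be sketched in a few lines. This is a Python illustration, not Monticello 2 code: Element, the elements(image) method and PredicateSlice are assumptions made for the example, and only the name ExplicitSlice comes from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a versionable element."""
    kind: str          # e.g. "class" or "method"
    class_name: str
    name: str = ""     # selector or variable name, empty for the class itself


class ExplicitSlice:
    """A slice defined by listing its elements directly."""

    def __init__(self, elements):
        self._elements = frozenset(elements)

    def elements(self, image):
        # Only report the listed elements that actually exist in the image.
        return self._elements & frozenset(image)


class PredicateSlice:
    """A slice computed by a rule, as a package or change set slice might be."""

    def __init__(self, predicate):
        self._predicate = predicate

    def elements(self, image):
        return frozenset(e for e in image if self._predicate(e))


image = {
    Element("class", "Account"),
    Element("method", "Account", "deposit:"),
    Element("method", "Account", "balance"),
    Element("method", "String", "asAccountNumber"),
}

explicit = ExplicitSlice([Element("method", "Account", "balance")])
account_code = PredicateSlice(lambda e: e.class_name == "Account")

print(len(explicit.elements(image)), len(account_code.elements(image)))
```

Because both kinds answer the same question (which elements are in the slice?), versioning and merging code would not need to care how a slice was defined.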

Posted in monticello

Monticello's versioning model

Although Monticello has proven very useful for developing applications that run in Squeak, it hasn't been very helpful in supporting the development of Squeak itself. The problem is that the versioning model used in Monticello 1 is based on the assumption of packages with well-defined and relatively stable boundaries. In Squeak, the well-defined packages have already been removed, and what remains is a large chunk of tangled and inter-dependent code.

Monticello 2 adopts a new versioning model, one that's not tied to packages as the fundamental unit of versioning. Instead, Monticello 2 divides the system into its fundamental elements. In Squeak, Smalltalk code is made up of the following elements:

  • Classes
  • Methods
  • Class comments
  • Instance variables
  • Class variables
  • Class instance variables
  • Pool Imports

Rather than maintaining the version history of packages, Monticello 2 keeps version history for each element.

Right off the bat, this makes it easy to implement a feature that Monticello has never had before: the ability to view previous versions of a given method. More importantly, though, it makes it much easier to deal with fluid package boundaries. Packages can be created, renamed or destroyed, elements can move back and forth between packages, elements can even belong to more than one package at a time. Since the version history is attached to the element, it's not affected.

Another consequence of element-level version history is that merges can be performed on individual elements. Although Monticello 1 supports cherry-picking, it does so in an awkward and non-intuitive way. In Monticello 2, cherry picking is the norm, and merging an entire package is just a special case.
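A rough sketch of what merging element by element could look like, in Python. It uses a plain three-way merge against a common ancestor, which is an assumption for illustration and not necessarily the algorithm Monticello 2 uses; the function names and the (class, selector) keys are likewise invented.

```python
def merge_element(base, ours, theirs):
    """Three-way merge of one element. None means the element is absent.

    Returns (result, conflict).
    """
    if ours == theirs:
        return ours, False
    if ours == base:          # only they changed it
        return theirs, False
    if theirs == base:        # only we changed it
        return ours, False
    return None, True         # both changed it differently


def merge(base, ours, theirs, keys=None):
    """Merge the chosen elements; with keys=None, merge everything."""
    if keys is None:
        keys = set(base) | set(ours) | set(theirs)
    merged, conflicts = {}, []
    for key in sorted(keys):
        result, conflict = merge_element(
            base.get(key), ours.get(key), theirs.get(key))
        if conflict:
            conflicts.append(key)
        elif result is not None:
            merged[key] = result
    return merged, conflicts


base = {("Account", "balance"): "v1", ("Account", "deposit:"): "v1"}
ours = {("Account", "balance"): "v2", ("Account", "deposit:"): "v1"}
theirs = {("Account", "balance"): "v1", ("Account", "deposit:"): "v3",
          ("Account", "withdraw:"): "v1"}

# Cherry-pick a single element, then merge the whole set.
picked, _ = merge(base, ours, theirs, keys={("Account", "deposit:")})
everything, conflicts = merge(base, ours, theirs)
print(picked, conflicts)
```

Merging a whole package is then the same operation with the key set taken from the package's slice.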

Posted in monticello

Monticello 2 alpha release

One of the things that surprised me at Smalltalk Solutions this year was the continuing interest in Monticello 2 from outside the Squeak world. Now that I'm not working in VisualWorks day-to-day anymore, I've been more focused on solving the problems that we have with using Monticello 1 in Squeak.

However, there is a real need for tools to make cross-dialect development easier, and versioning is an important component of that. After doing a few demos, I had volunteers to maintain VisualAge and Dolphin ports. The VisualWorks folks all seem pretty busy, but I'm sure somebody will step up when MC2 gets to production quality.

With all that momentum coming out of the conference, I cleaned up the code a bit, wrote an installer and posted the first alpha to SqueakMap. The reaction has been mostly positive, particularly given that Monticello 2 is still very raw and there's no documentation at all.

To remedy that I'll post some discussion of the architecture and features of Monticello 2 over the coming weeks.

Posted in smalltalk monticello dolphin visualworks visualage squeak

Versioning Smalltalk

Having been working in Smalltalk for a few years now, I find I occasionally forget just how different it is from the mainstream world of programming. The other day Avi posted about the recent interest in versioning systems and how what we're doing in Monticello is both similar and different to what's going on in other languages.

On the one hand, we're wrestling with the same information-theoretic problems as all other versioning systems. Essentially we want to be able to merge the work done by developers working separately in such a way that changes that don't affect each other are handled automatically, but those that do conflict are detected so that a human can figure out how to harmonize them. We want the merge process to be fast, the history data to be compact, and the restrictions placed on how developers work to be minimal.

On the other hand, Smalltalk code isn't like that of other languages. The issue isn't so much where it's stored - text files or image files - but how it's created. The structures needed to execute the code at runtime, classes and compiled method objects, are built up directly by the development tools. The only text involved is little snippets that make up method bodies. Heck, even when Smalltalk is written out to a text file, that file just contains a series of expressions that can be compiled and executed to rebuild the same executable objects in another image.

So for large parts of a Smalltalk program, there is no text to version. This is a problem because it means versioning Smalltalk programs with the same tools that the rest of the world uses is very difficult.

It can be done, of course. The precursor to Monticello was called DVS, and was mainly concerned with representing Smalltalk code textually so that we could version it with CVS. It would scan the text files for CVS's conflict markers and present them to the user for resolution. This worked OK most of the time, and was an improvement over collaborating via change sets.

But CVS has problems (hence the need for Subversion, Arch, Monotone, darcs, Codeville, BitKeeper etc.), and DVS wasn't able to completely bridge the gap between the objects created by the Smalltalk dev tools and the textual representations that CVS was dealing with. The result was lots of bogus conflicts. If two developers created methods that sorted near each other alphabetically, for example, that would be a textual conflict as far as CVS was concerned, but not a conflict at all in the Smalltalk world.

In trying to work around these problems, DVS had grown from a "little utility" for versioning Smalltalk code with CVS into a versioning system that used CVS as a backend. The only way to improve it was to ditch CVS and do the versioning in Smalltalk. And this is where the lack of a textual representation turned into an opportunity.

A Monticello snapshot is a list of definitions that make up a package. Working with them is almost absurdly easy compared to working with text. The standard diffing and patching that tools like CVS do is trivial, and that let us put our effort into solving the harder problems that the post-CVS generation of tools are tackling. As Avi noted, the solutions we came up with work, but they're not very elegant, and now we're looking for better ones.
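To show why diffing and patching definitions is so much simpler than diffing text, here is a small Python sketch. A snapshot is modelled as a dictionary from a (class, selector) key to a definition; that representation and the function names are assumptions for the example, not Monticello's actual data structures.

```python
def diff(old, new):
    """Compare two snapshots and return (added, removed, modified)."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = set(old.keys() - new.keys())
    modified = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, modified


def patch(snapshot, changes):
    """Apply the result of diff() to a snapshot, returning a new snapshot."""
    added, removed, modified = changes
    result = {k: v for k, v in snapshot.items() if k not in removed}
    result.update(added)
    result.update(modified)
    return result


base = {("Account", "balance"): "balance ^ balance"}

# Two developers add methods that would sort next to each other in a text file.
alice = dict(base)
alice[("Account", "deposit:")] = "deposit: amount ..."
bob = dict(base)
bob[("Account", "depositAll:")] = "depositAll: amounts ..."

combined = patch(patch(base, diff(base, alice)), diff(base, bob))
print(sorted(combined))
```

The two additions have different keys, so they never touch each other; the adjacency that trips up a line-based tool doesn't exist at this level.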

Now, Smalltalkers tend to be enthusiastic about Smalltalk, and that can come across as arrogant. zippy's reaction isn't all that unusual. But I think language holy wars are a distraction from the intent of Avi's post. Smalltalk really is different from other languages, and that makes it interesting. What happened with Monticello is a recurring pattern. There are lots of tools out there that the Smalltalk community can't use, and so we're forced to write our own. Fortunately, doing so is easier than one might think, and what we end up with is pretty good.

The other thing that's easy to miss is that the Monticello approach can be applied to any language, not just Smalltalk. It's a bit more work, because you need to parse the language syntax before doing versioning operations, and of course, you lose the language-independence that text-based tools have traditionally enjoyed.

Even so, I think mainstream versioning systems will end up there eventually. IDEs are leading the way - Eclipse, IDEA and their ilk are gradually replacing generic text editors like vi and Emacs, opening the way for syntax-aware versioning. The Stellation project was pursuing this, though it doesn't seem to have made progress for a while.

In the meantime, it'll be interesting to see how Monticello evolves as we make the most of our handicap.

Posted in monticello

Essential Code

Not long ago, Avi Bryant posed an interesting question on the squeak-dev list: "which is more authoritative, the source code or the bytecode?" Or to put it another way, what is the essence of a program, and how can we represent that in the machine?

From an information-content standpoint, the two forms are nearly equivalent. Source code is compiled to bytecode which can be decompiled back into source code. But there are subtle differences - bytecode loses the temporary names and formatting of the original source code, and with it something of the author's intent. On the other hand, the bytecode is better connected to the runtime system - variables are bound, selectors have been interned, etc. But neither is really suitable as an "authoritative" form of a method, at least not from a tools perspective.

There are two problems with source code. The first is that it's out of date. It represents the method at the time the author compiled it, but (as Avi mentioned in his original post) that same string might not compile now, because of changes elsewhere in the system. At the same time, source code is really difficult for tools to work with. It has to be parsed for even such "simple" operations as selecting a message send or variable, to say nothing of an operation like "browse senders."
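The same difficulty shows up in any language. As an analogy only, here is a Python sketch that uses Python's standard ast module to find the call sites of a given name in Python source; senders_of is a made-up helper, and nothing here reflects how Squeak implements its senders query.

```python
import ast


def senders_of(source, name):
    """Return the line numbers where `name` is actually called."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                called = func.attr                   # receiver.name(...)
            else:
                called = getattr(func, "id", None)   # name(...), else None
            if called == name:
                lines.append(node.lineno)
    return sorted(lines)


source = '''
def total(items):
    # size is mentioned here but not called
    return sum(item.size() for item in items)

def label(item):
    return "size: " + str(item.size())
'''

print(senders_of(source, "size"))
```

A plain text search for "size" would also hit the comment and the string literal; telling a real send apart from a mention takes a parse.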

The problem with bytecode, on the other hand, is that it's an implementation detail. It's meant to be executed by the VM, and the performance of the system depends on the VM's ability to do that efficiently. So a CompiledMethod's ability to represent the abstract structure of the method is held hostage by the need to optimize its execution.

Now, for a lot of purposes, a method would ideally be represented as an abstract syntax tree. If it were carefully designed, the AST could carry enough information to reconstruct the original source code with the author's formatting intact, and would be equally easy to convert to optimized bytecode or even native code for execution. Best of all, it would be easier to write tools such as the Refactoring Browser or SmallLint which take advantage of an easy-to-examine-and-manipulate representation of methods.

In all the systems I'm familiar with, ASTs are very transient things - produced during compilation but immediately thrown away. We'd have to rearrange quite a few of the basic assumptions of the system in order to use ASTs as the canonical representation of source code. There would be practical considerations as well - how much space does an AST require compared to a CompiledMethod or a chunk in the .changes file? How long would it take to generate bytecode from an AST?

Serialization and compression may help overcome some of these problems. This paper by Stork and Haldar presents a way of encoding ASTs based on their grammar, and is designed for fast decoding by a Just-In-Time compiler.

I'm not planning on writing a new VM for Squeak anytime soon, but I can't help thinking that some of these ideas will find their way into Monticello and OmniBrowser, one way or another. OB already uses Squeak parse nodes to do syntax-based selection in code browsers, and I'm very interested in Andy Tween's Shout package, which was released today.

Posted in monticello