Thin Air

Everything about design

Scripting languages and IDEs

On the Squeak development list there's been a lot of talk lately about creating a scripting language based on Squeak. On the surface it seems like a great idea. Scripting languages are popular, dynamism is in vogue, and it would be nice to be able to use Smalltalk for all the day-to-day utilities and admin tools that tend to get done in Perl or Ruby. On top of that, the main drawback of scripting languages is that there aren't any good IDEs for them. Squeak has a great IDE, and should be able to provide a great script development environment.

I'm pretty skeptical of the idea, because I think scripting languages and IDEs are like oil and water. They just don't mix. What follows is a post I made to the Squeak list defending this position. First, I'd like to define some terms.

IDE - This is a program that allows one to view and manipulate another program in terms of its semantic elements, such as classes and methods, rather than in terms of the sequence of characters that will be fed to a parser. IDEs might happen to display text, but they also provide tools like class browsers, refactoring and other transformations, auto-completion of identifiers, etc. - things that require a higher-level model of the program than text. Examples include various Smalltalk implementations, Eclipse, Visual Studio, and IDEA.

Scripting language - a programming language and execution model where the program is stored as text until it is executed. Immediately prior to execution, the runtime environment is created, the program's source code is parsed and executed, and then the runtime environment is destroyed. This is an important point - the state of the runtime environment is not preserved when execution terminates, and one invocation of a program cannot influence future invocations.

Now, one might quibble over my definition of "scripting language." Fine, I agree that it's not a good general definition of everyday use of the term. But it's an important feature of languages like Ruby, Python, Perl, Javascript, and PHP and one that makes IDEs for those languages particularly hard to write.

Damien Pollet brought up the key issue in designing a Smalltalk-based scripting language - should the syntax be declarative or imperative?

Imperative syntax gives us a lot of flexibility and power in the language. A lot of the current fascination with Ruby stems from Java programmers discovering what can be done with imperative class definitions. The Ruby pickaxe book explains this well:

In languages such as C++ and Java, class definitions are processed
at compile time: the compiler loads up symbol tables, works out how much
storage to allocate, constructs dispatch tables, and does all those other
obscure things we'd rather not think too hard about. Ruby is different. In
Ruby, class and module definitions are executable code. 

Executable definitions are how metaprogramming is done in scripting languages. Ruby on Rails gets a lot of mileage out of this, essentially by adding class-side methods that can be called from within these executable class definitions to generate a lot of boring support code. In Java, we can't modify class definitions at runtime, and that's why Java folks use so much XML configuration.

Python does this too. Perl5 is pretty weird, but Perl6 is slated to handle class definition this way as well. Javascript doesn't have class definitions, but we can build up pseudoclasses by creating objects and assigning functions to their properties.

When writing an executable class definition, you have the full power of the language available. You can create methods inside of conditionals to tailor the class to its environment. You can use eval() to create methods by manipulating strings. You can send messages to other parts of the system. You can do anything.
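To make this concrete, here's a rough Python analogue of an executable class definition. All the names here are invented for illustration; the point is that the class body is ordinary code, so methods can be created conditionally or generated from data, and the resulting class depends on the runtime environment.

```python
import os

class Service:
    # tailor the class to its environment: this method exists
    # only if DEBUG is set when the definition is executed
    if os.environ.get("DEBUG"):
        def log(self, msg):
            print(msg)

    # generate accessors from a list, the way Rails generates
    # support code - exec() writes each def into the class body
    for field in ("name", "port"):
        exec(f"def get_{field}(self): return self._{field}")
    del field

svc = Service()
svc._name, svc._port = "api", 8080
assert svc.get_name() == "api"
assert svc.get_port() == 8080
```

A tool reading this file as text can't know whether Service has a log method, or which accessors exist, without actually executing the definition.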

I'm making a big deal out of this, because I think it's a really, really important feature of modern scripting languages.

Declarative syntax, on the other hand, gives us a lot of flexibility and power in the tools. Java, C++ and C# have declarative class definitions. This means that IDEs can read in the source code, create a semantic model of it, manipulate that model in response to user commands, and write it back out as source code. The source code has a canonical representation as text, so the code that's produced is similar to the code that was read in, with the textual changes proportional to the semantic changes that were made in between.

This is really hard to do with scripting languages, because we can't create the semantic units of the program just by parsing the source code. We actually have to execute it to fully create the program's structure. This is problematic for an IDE for many reasons: the program might take a long time to run, it might have undesirable side effects (like deleting files), and in the end, there's no way to tell whether the program structure we end up with depends on the input to the program.

Even if we did have a way to glean the program structure from a script, there would be no way to write it back out again as source code. All of the metaprogramming in the script would be undone, partially evaluated, as it were, and we'd be stuck with whatever structures were created on that particular invocation of the script.

So, it would appear that we can have either a powerful language, or powerful tools, but not both at the same time. And looking around, it's notable that there are no good IDEs for scripting languages, but none of the languages that have good IDEs lend themselves to metaprogramming.

There is, of course, one exception. Smalltalk.

With Smalltalk, we have the best of both worlds: a highly dynamic language where metaprogramming is incredibly easy, and at the same time, a very powerful IDE. We can do this because we sidestep the whole issue of declarative vs. imperative syntax by not having any syntax at all.

In Smalltalk, classes and methods are created by executing Smalltalk code, just like in scripting languages. That code creates objects which reflect the semantic elements of the program, just like in the IDEs for compiled languages. One might say that programs in compiled languages are primarily state, while programs in scripting languages are primarily behavior. Smalltalk programs are object-oriented; they have both state and behavior. The secret ingredient that makes this work is the image - Smalltalk programs don't have to be represented as text.

And that's why a Smalltalk-like scripting language wouldn't be worthwhile. It leaves out the very thing that makes Smalltalk work so well - the image. It would have to have syntax for creating classes - either imperatively or declaratively. We'd end up limiting either the language or the tools, or if we tried hard enough, both.

I'd much rather see a Smalltalk that let me create small, headless images, tens or hundreds of kilobytes in size, with just the little bits of functionality I need for a particular task. If they had good libraries for file I/O, processing text on stdin/stdout and executing other commandline programs, they'd fill the "scripting language" niche very well. If they could be created and edited by a larger IDE image, they'd have the Smalltalk tools advantages as well.

I have high hopes for Spoon in this regard. Between shrinking, remote messaging and Flow, it's already got most of the ingredients. It just needs to be packaged with a stripped down VM, and integrated into the host operating system.

Posted in ide semantics refactoring language design scripting squeak

Not Messages

In my last post, I speculated that a hypothetical "good modelling language" wouldn't revolve around message-sending, but rather focus on the relationships between objects and making explicit the patterns of cooperating objects that we see in good OO design. I shouldn't have imagined that I could get away with being that vague; Vincent Foley quickly asked the pertinent question:

I have a little question: I quite like Smalltalk (though I'm more a Ruby
guy), but I was wondering what you meant by a language that is not
message-oriented? What would that look like?

One of the first things I do when modelling is to identify the nouns in the domain language; these become the objects in the model. I also want to classify objects, so they should have classes or types attached to them.

I also want to describe relationships between objects. I'm imagining a way to build up complex relationships from simpler ones, in the same way that "high level" methods can call "low level" methods. Perhaps the language or libraries would provide basic relationships such as is-a and has-a. By combining several of these I could build complexity. For example, I might define the relationship between an invoice and line items like this:

  • an invoice has a collection of line items
  • an invoice has a total
  • a line item has a value
  • an invoice's total is the sum of the values of its line items
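
The four statements above can be sketched directly in Python. This is a toy rendering, not a language proposal - the class and attribute names are mine - but it shows the derived relationship declared once, rather than recomputed by hand at every call site:

```python
class LineItem:
    def __init__(self, value):
        self.value = value          # a line item has a value

class Invoice:
    def __init__(self):
        self.line_items = []        # an invoice has a collection of line items

    @property
    def total(self):
        # an invoice's total is the sum of the values of its line items
        return sum(item.value for item in self.line_items)

invoice = Invoice()
invoice.line_items.append(LineItem(10))
invoice.line_items.append(LineItem(5))
assert invoice.total == 15
```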

Now, this sort of declarative definition of relationships is great in the abstract, but I still need a way of causing computation to occur. The state of the program has to change over time, so I need a way to describe the changes that might happen in the relationships of objects at runtime. By defining a transformation, I provide a transition from one state to another. For example, I might say that given an invoice and a line item, the line item can be added to the invoice's collection.

Finally, all this needs to be hooked up to input and output. With the right hooks, input would create new objects, whose existence would trigger a cascade of transformations, state changes in the model, and ultimately, a result going to output.
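One way to imagine that cascade is as a tiny observer network. Here's a hedged Python sketch, with invented names: a state change flows through a transformation and out to output, without anyone writing an explicit control loop at the call site.

```python
class Cell:
    """A value that notifies observers whenever it changes."""
    def __init__(self, value=0):
        self._value = value
        self._observers = []

    def observe(self, fn):
        self._observers.append(fn)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new):
        self._value = new
        for fn in self._observers:    # the cascade: notify downstream
            fn(new)

outputs = []
total = Cell(0)
total.observe(lambda v: outputs.append(f"total is now {v}"))

# input: new line-item values arrive; each transformation updates the
# total, and the change propagates to output automatically
for item_value in (10, 5):
    total.value = total.value + item_value

assert outputs == ["total is now 10", "total is now 15"]
```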

I'm still doing a lot of hand-waving here, but one can at least imagine that such a language could exist, and perhaps what programming it might feel like. This notion of input triggering a cascade of transformations sounds a bit like monads, which makes me wonder if this is maybe a lazy functional language in disguise. I should probably take a cue from Blaine and learn Haskell.

Posted in design

Beyond Smalltalk

Well, I'm back from Smalltalk Solutions, and now that I've caught up on my sleep, I'm reflecting a bit on what I saw and heard. Rather than reporting blow by blow from the conference, I'd like to record what I learned while I was there.

For me, the highlight of the official program was Eric Evans' keynote. He took us through a fairly simple refactoring of a hypothetical shipping application, showing some of the techniques he uses to distill a domain expert's knowledge into a model that can drive the design of the application. His thesis was that a good model provides the language that the team uses to communicate about the domain, and should be directly reflected in the code.

One of the points Eric made in his talk was that he preferred language rather than pictures for modelling, and as such, he preferred modelling in Smalltalk or Java rather than UML. Later in the day, I joined a conversation he was having with Blaine Buxton, and the three of us spent quite some time arguing about language design. If Smalltalk is the best modelling language we've yet encountered, and we were to design something better, what would it be like?

One of the key things we settled on was that it would be object-oriented, but not message-oriented. Eric insisted that sending messages is a much lower-level operation than what we do when we talk about models with experts in the business domain, and that a good modelling language should operate at that higher level. What's missing from OOP, we decided, is explicit capture of the relationships between objects.

Thinking about it over the last few days, it occurred to me that relationships are also at the heart of many of the Design Patterns that have become popular in the OO world over the last several years. Indeed, Ubiquitous Language and Pattern Languages serve much the same purpose in a software development context. Of course, the term "language" is overloaded here: human languages such as English, computer languages such as Smalltalk, and pattern languages such as those created by Christopher Alexander or the Gang of Four.

One of the primary criticisms of design patterns is that they're really just techniques for working around language deficiencies. Dynamic typing, block closures and metaclasses make many of the patterns used in C++ and Java unnecessary in Smalltalk.

Iterator is a classic example. An iterator object encapsulates the state required to loop over the contents of a collection. In Smalltalk, one might use an iterator to send #doSomething to all the objects in a collection:

iterator := aCollection iterator. 
[iterator hasNext] whileTrue: 
    [iterator next doSomething]

The beauty of iterators is that by encapsulating the loop state, they make all collections polymorphic. One can use the same looping code to iterate over any collection, be it an Array, a Set or a LinkedList. The problem with iterators is that they are an incomplete abstraction. They capture the state of the loop, but leave the looping behavior itself implicit, and require the loop to be duplicated wherever the iterator is used. In contrast, Smalltalk's #do: method provides the complete abstraction: by making the implicit loop explicit, it provides reusable polymorphism.
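Python happens to support both styles, which makes the contrast easy to see. The external iterator forces every call site to restate the loop; internal iteration keeps the loop in one place and takes only the per-element behavior as an argument, much like #do: (the do helper here is my own stand-in for it):

```python
# External iterator: the looping machinery is rewritten at each call site.
it = iter([1, 2, 3])
squares = []
while True:
    try:
        squares.append(next(it) ** 2)
    except StopIteration:
        break

# Internal iteration: the loop lives in one place, and the caller
# supplies only the behavior to apply to each element.
def do(collection, fn):
    for each in collection:
        fn(each)

squares2 = []
do([1, 2, 3], lambda x: squares2.append(x ** 2))
assert squares == squares2 == [1, 4, 9]
```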

I think that step of making the implicit explicit is important. How might we make explicit, for example, the relationship between an AST and a Visitor? I don't know, but I think such a language would be good for domain driven development.

Posted in design

Modules and Late Binding

Travis Griggs just posted some musings on namespaces and imports in VW. We do things a bit differently at Quallaby. We have very few namespaces - a "main" one for most of the code, one for test cases, and a couple of other special purpose namespaces that help enforce conceptual boundaries. This works pretty well for us; most of the time we only think about it when creating a new package, and even then the norm is to import just the main namespace. Still, it feels like this is a way to avoid the problem without really solving it.

This issue has come up several times on the squeak-dev list in recent months and has been debated pretty extensively. There hasn't been anything even approaching a consensus, but a couple interesting tidbits have come up.

Forth has been put forward as an example of how to do namespaces right. The idea, as I understand it, is to decide on how the names in a module should be resolved, not when a module is defined, but when it's loaded. When you load a module you give the compiler an (ordered) list of namespaces to look in to resolve names, and a "target" namespace, where the names defined in the module will be placed. (This seems pretty unusual to me - I don't know of any other language that allows a module to be compiled without reference to its own contents!)
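Here's a rough Python approximation of that scheme, using exec and ChainMap purely as illustration: the loader, not the module, chooses the ordered search list that free names resolve against, and supplies the target namespace where the module's definitions land.

```python
from collections import ChainMap

def load_module(source, search_namespaces, target):
    # Reads fall through the ordered search list; writes (the names the
    # module defines) land in `target`, the first map in the chain.
    # Caveat: functions defined inside the module won't see the search
    # list, since a function's globals are fixed at definition time.
    env = ChainMap(target, *search_namespaces)
    exec(source, {}, env)
    return target

math_ns = {"double": lambda x: x * 2}
target = {}
load_module("result = double(21)", [math_ns], target)
assert target["result"] == 42
```

The same source could be loaded again against a different search list and resolve double to something else entirely - the binding decision belongs to the loader.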

This is an attractive idea to me because it casts the issue of namespace and imports as a question of early- vs. late-binding. Do we decide on how variables will be resolved when the code is written, or when it's compiled?

Another option that takes that idea even further is Dan Ingalls' "Environments," which was used as part of the (now-defunct) modules system in Squeak 3.3. It pushed name-binding even later, from compile-time to execution-time, by making it a message send. Instead of writing dotted names (Module.Class new), you'd send messages: Module Class new.
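In Python terms, an Environment would look something like resolving each name with an attribute access at execution time. This is a hedged sketch with invented names, not Squeak's actual mechanism:

```python
class Environment:
    """Resolve names at execution time, via message sends
    (here, attribute accesses) rather than compile-time lookup."""
    def __init__(self, bindings):
        self._bindings = bindings

    def __getattr__(self, name):
        try:
            return self._bindings[name]
        except KeyError:
            raise AttributeError(name)

# Environments can nest, so Module.Point reads like "Module Class new":
Module = Environment({"Point": type("Point", (), {})})
p = Module.Point()   # "Point" is looked up now, when the message is sent
assert type(p).__name__ == "Point"
```

Rebinding "Point" in the environment would change what every subsequent send resolves to, with no recompilation of the client code.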

It would be interesting to see how late-bound module dependencies work in practice.

Posted in design

Scalability

Now that I've been at Quallaby for a little while, I've begun to get a sense of what is going on in our app. The most striking thing is an apparent contradiction: At first glance it's an incredibly boring, even trivial application. We fetch files from remote machines, parse them, and load the data into a relational database. Users view the data via a web-based reporting tool. But when you look more closely, the gymnastics we have to go through to accomplish this are amazing.

One reason is scale. We're processing statistical data gathered from devices on very large networks. The exact volume varies from customer to customer of course, but it's pretty easy to get over a million records per minute going into the database, hour after hour, day after day. It's so much data that statistical reports on it can't be computed on the fly. It all has to be pre-computed as the data is loaded, or the reporting interface won't be responsive enough to be usable. Of course, that puts even more stress on the backend - that "trivial" application to fetch data and load it into the database.
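The pre-computation amounts to rolling each raw record into its aggregate buckets as it arrives, so that a report is a lookup rather than a scan of millions of rows. A toy Python sketch - the bucketing scheme and names are invented for illustration, not our actual pipeline:

```python
from collections import defaultdict

hourly_totals = defaultdict(int)   # (device, port, hour) -> packet count

def load_record(device, port, timestamp_minute, packets):
    # roll each raw record into its hourly bucket as it streams in
    hour = timestamp_minute // 60
    hourly_totals[(device, port, hour)] += packets

# two hours of per-minute records for one port on one device
for minute in range(120):
    load_record("router-1", 8, minute, 100)

# the report reads a small summary table, not 120 raw records
assert hourly_totals[("router-1", 8, 0)] == 6000
assert hourly_totals[("router-1", 8, 1)] == 6000
```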

Another source of interesting complications is the nature of the statistics we need to compute. Conceptually they're pretty simple; for example, the number of packets sent or received on a particular port of a particular device. But the method for locating those bits of data is enormously variable, as each type of network device presents the information differently. So we make this part of the application scriptable, and turn over the job of dealing with the quirks of ATM paths, in-octets and QoS thresholds to the networking experts.

As a result, the subproblems we have to solve to get data from A to B are fascinating. Take scripting: we currently have several DSLs for specifying how data should be handled as it goes through various stages of processing on its way to the database. There are too many of them, in fact, and we're currently working on consolidating the user-scriptable portions of the app on two languages: ECMAScript and SQL.

From a computer-science point of view these are really interesting choices. On one hand we have a dynamic, imperative, prototype OO/functional language. ECMAScript might be described as a cross between Self and Lisp, wrapped up in C syntax. It fits in nicely with many of the things we're used to doing in the Smalltalk world, but with a more mainstream syntax.

On the other hand, we have SQL, a declarative query language based on relational algebra. But instead of executing the queries against tables in a database, we're applying them to virtual tables representing data in network devices, intermediate results as it moves through the processing pipeline, or in any one of several tables in the central database. Naturally, the implementations of both languages have to be robust, memory efficient and fast.
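As an illustration of the idea - not of our implementation - here's the shape of "SQL over a virtual table" in Python, with an in-memory sqlite3 table standing in for rows that never live in the central database:

```python
import sqlite3

# rows streaming from a "virtual table": (device, port, packet count)
rows = [("router-1", 8, 100), ("router-1", 8, 250), ("router-2", 3, 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packets (device TEXT, port INTEGER, count INTEGER)")
conn.executemany("INSERT INTO packets VALUES (?, ?, ?)", rows)

# the user-facing query is plain SQL, regardless of where the rows came from
total = conn.execute(
    "SELECT SUM(count) FROM packets WHERE device = ?", ("router-1",)
).fetchone()[0]
assert total == 350
```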

Personally, I'm fascinated by computer languages, so for me this is the most interesting part of what we're doing. But there are gobs of other interesting problems that we run into: memory management, execution optimization, cluster computing, etc. Recently we've been digging into the research that Google Labs has been doing in this space. We aren't operating at anything like Google's scale, of course, but we're running into many of the same issues they are, and every good idea helps.

Posted in design