If we could do Scheme again …

If we could do Scheme again from scratch, what would we change? Even as a staunch defender of R7RS, I have to acknowledge the decisions we’ve made and are continuing to make that are bound by history. If we could do things from scratch, what would we change?

Symbols and identifiers

I think the developments in hygienic macro technology since the R4RS have made clear that once of Scheme’s biggest mistakes as an exclusively lexically-scoped, block-structured language was adopting the traditional Lisp concept of a ‘symbol’: that the string type used for variable names is uniquely instantiated so that comparisons of all kinds can be done by simple pointer comparison, rather than comparing character-by-character. If R4RS had been willing to commit more thoroughly to its hygienic macro idea, rather than punting it out as an optional appendix, we might have already got these semantics back then; but after R4RS came IEEE, and subsequent RnRSes chose to stick with IEEE compatible behaviours where possible.

Because they were unwilling to modify the semantics of a symbol, R4RS’s low-level macro system and the later syntax-case were stuck with wrapping symbols inside syntax objects to maintain hygiene. If we could start Scheme from scratch, I would make symbols non-interned, with identifier semantics without the need to wrap them, that is:

Two symbols are eqv? iff they appeared at the same location in a source expression that was read by the compiler or the read procedure, were created by the same string->symbol operation, or similar.
quote has the same effect on symbols (or on symbols within other structures) as syntax does in R6RS to non-pattern variables (or quote-syntax in some non-standard extensions to R6RS).
To usefully compare symbols for the purposes of macro expansion etc, free-identifier=? and bound-identifier=? must be used.

This would neatly allow procedural Scheme macros to look pretty much identical to Common Lisp macros while still being hygienic — though personally I’d like to keep Scheme’s more advanced pattern matching for destructuring.

Clojure shows that interned strings with classical Lisp semantics can still be useful, for keys in key-value mappings and the like. Like Clojure, I would add a separate keyword type, with distinct lexical notation, that behaves like this. (Perhaps I would call the non-interned, hygiene-aware type ‘identifiers’ and the interned, globally-namespaced type ‘symbols’, but what does it matter what colour the bike shed is?)

Mutability and persistent data structures

I think the current R7RS mess on string types makes the desirability of immutable strings abundantly clear. I would confine string-set! to the dustbin of history to trivially require amortized O(1) string-ref, regardless of the implementation’s underlying string representation.

Less clear is whether to keep pairs mutable. Racket made all vanilla pairs immutable, but also introduced a new mutable pair type at the same time — I’m not sure if that was for any reason other than just compatibility, to give people who really needed mutable pairs a way out. But having a completely disjoint mutable pair type is certainly more elegant and less likely to introduce surprise errors than the current situation in R5RS and above, where literal pairs read in from source code are silently immutable but practically all other pairs are mutable. (Most Scheme implementations have just chosen not to enforce the immutability.)

The fact is that Lispers, including Schemers, like to be able to freely cross the boundary between functional and imperative programming. It’s what makes SICP such a successful introductory CS textbook despite never introducing any language other than Lisp. The classical Lisp optimization trick of sprinkling a little mutation in here and there has allowed Lisp programs to outperform their competitors for decades.¹ Having data structures mutable by default is an important part of this. I think it would require research and experience to know to what extent making mutable types disjoint would make this kind of code optimization difficult.

The question naturally also extends to other built-in types like vectors. For vectors in particular, I think the temptation to make all vectors Clojure-style persistent branching trees should be avoided, as Scheme is somewhat lower-level than Clojure² and there are applications in which their performance and simplicity is important: Scheme’s built-in vectors are the primitive type with O(1) access to an arbitrary number of slots; if persistent vector-like data structures are to be proposed, they should be on top of the primitive ‘array of pointers’ Scheme’s vectors provide.

For bytevectors, on the other hand, the situation is pretty clear: they should be mutable. Their mutability currently allows them to be used to optimize low-level I/O code (by e.g. copying a file by reading chunks into the same allocated bytevector each time, instead of allocating a new bytevector for each chunk). They can also be used as views into memory allocated in a foreign-function interface, and as an ‘escape valve’ to implement things like mutable strings for applications that really do need them and can take care of the performance concerns themselves.

List representation

SRFI 101 proposed adopting an alternative representation for lists which gives O(log n) access to items rather than O(n). The benchmarks in the paper by Okasaki suggests that this comes at the cost of higher constant factors for small lists. It would be worth considering whether to adopt these (in specification terms, by imposing the same performance growth requirements) if we did Scheme again from scratch. (This would depend on making pairs immutable.)

Polymorphic equivalence predicates

The trifecta of eq?, eqv?, and equal? is at least one level of distinction too many. I would reduce it to two by renaming eqv? to eq?, and give consideration to making Henry Baker’s egal? the only polymorphic comparison predicate.

Delimited continuations

Enough said, I think. Though the SRFI 226 approach of making call/cc itself delimited when used in the right way is appealing, I’m not sure we would choose to adopt it if starting from scratch.

Mandatory proper tail recursion modulo cons

See my page on recursion. Recursion is indeed the most natural way to express many algorithms, but Scheme providing only regular tail calls means that many such implementations have to be bent into a quasi-iterative style with explicit accumulators, which the compiler could actually keep track of for us. The naïve implementations of single-list map, list length, factorial, etc. become iterative with proper tail recursion modulo cons.

Static typing, type names, and type checking

I’ve observed elsewhere that Scheme’s choice to use exclusively monomorphic procedures wherever possible makes static type checking without annotations trivial. Whether this would be a good idea in practice is another issue. In practice programmers would likely want at least some kind of type annotation for readability, and to ensure their assumptions about what types will be inferred are absolutely correct. Lisp has traditionally done this (as a compiler optimization hint, not an enforced check) with a procedure cleverly called the: (the list xs), (the integer n), etc.

One issue that arises here is the fact that Scheme traditionally (for minimalism, I think) does not name its types. Scheme has no type-of, only type predicates. I currently lean towards thinking we should change this and provide type descriptors, which would make a number of things (like generic polymorphic functions) simpler. But this is not strictly necessary for this purpose. Since the name the is slightly less satisfactory when used with a type predicate (with a ?) instead of a type name — and as it was intended for use as an optimization hint, not a type enforcer, it can’t easily be used to annotate function argument and return types — I would lean towards a special form called where:

(define (gcd a b)
  (where (exact-integer? a) (exact-integer? b)
         (exact-integer? (gcd a b)))
  (if (= b 0)
      a
      (gcd b (remainder a b))))

(You could, of course, put the where clause on the same line as the (gcm a b), to make it clearer that they belong together, or omit it entirely, in which case the type inferrer can work out from = and remainder that you wanted a and b to be numbers, at least.)

Developing a version of this proposal which is sound and decidable is left as an exercise to the reader.

Then arises the question of whether this would really be a good idea for Scheme. I think the option to use it, as gradual typing with strict static checking, would be a good idea, but it should be either opt-out or opt-in to allow completely dynamically-typed code. This suggests the correct approach may be like that of MyPy or TypeScript, where the type checker is a distinct program from the compiler. See also Steme.

Minor things

I would do away with cond and just have n-ary if. The => binding feature of cond I would provide with if-let as in Emacs Lisp, but likewise n-ary. Indeed I might choose to make all bodies (of let, lambda etc.) unary and require explicit begin (i.e. progn) wherever multiple expressions in a body are required. But this makes annotations and extensions like where mentioned above tricky, since they need to be special-cased or put inside the begin, which is icky.

I would probably keep the car and cdr names because they do compose well and their meaninglessness is still an advantage seeing that pairs can represent trees as well as lists, in which case alternatives such as first and rest are wrong. Of course, the compositions of cdr alone applied to lists can often be replaced by list-tail (though not always, as for example when they’re passed as arguments), and compositions of car to any number of cdr can be replaced by list-ref (also in higher-order functions, thanks to SRFI 1’s first, second, third etc). Use of those should be encouraged as a matter of good programming style. But it’s good that the primitive deconstructors of the primitive composite data type have such meaningless names, because it allows that data type to have infinitely many meanings. Real record types aren’t a complete replacement for SICP-style lists-as-records: sometimes it’s okay to use pairs to build lightweight data containers.

I would also like to rename lambda completely to λ, but I acknowledge that this is difficult to type for most people. Perhaps merely an alias like call/cc for call-with-current-continuation, as in Racket, but I find two names for the same construct somewhat inelegant. Perhaps a single backslash would be a reasonable ASCIIfication.

Monomorphism would still be the rule. One length, one vector-length, one string-length … making these polymorphic is cool but involves too much complexity for too little gain, violating the Prime Clingerism. Monomorphism of these procedures also makes them run faster even on non-optimizing implementations, an important property for a language that values small, embeddable implementations.

Do we need to do Scheme again from scratch?

No, I don’t think so. Not yet, anyway. These complaints are relatively minor, considering the grand picture of a language which is still on the cutting edge of language design after nearly fifty years. But it’s worth acknowledging the warts the language has grown and noting how we would fix them if we could — because one day, it may well be time to go back to the roots and start things all over again.

Footnotes:

Although one can ask how often such optimizations are really needed in 2021, vs in 2001 (when ITA were using it for what was, at the time, probably the most complex graph search application in the world), vs in 1981 (when saving actual memory usage as well as allocation/collection operations was a serious concern even for small applications).

Marc Nieper-Wißkirchen also pointed out to me the other day that mutating operations not only might not be semantically safe in modern multithreaded environments (if data mutates in one thread in a way another thread isn’t expecting), but also that they might actually be slower than consing because of write barriers.

I mean this in the sense of what kinds of applications it targets, rather than in the usual sense of ‘power’ or whatever: Clojure will never have a completely self-hosting compiler and runtime system (de rigeur for serious implementations of Common Lisp, though sadly less so for Scheme), nor will it ever be used to write an operating system (Loko Scheme can run on bare metal as of 2021 — if we as an industry weren’t so addicted to POSIX and the Web Platform, we might have a complete working production OS in a Lisp today).