May report of the Raku Development Grant of Jonathan Worthington

Tue, 16-Jun-2020 by Matthias Bloch edit post

Jonathan writes: In my [April report](https://news.perlfoundation.org/post/jonathan-2020-04) I mentioned that I had performed an amount of design work for a new generalized dispatch mechanism in MoarVM. In May, I forged ahead with the implementation work on this - although in reality, that work isn't just churning out code, but also making lots of smaller design decisions along the way too. I started work on MoarVM in 2012. I've learned a thing or two about building runtimes since then, and have applied many of those learnings to MoarVM. However, thus far, that has tended to come in the form of additions to what was already there - initially in aid of supporting all that the Raku language wanted to offer, and later in terms of running it more efficiently. The generalized dispatch work is a bit different: it's about asking how we can do moar with less. The entry point to it all is a new `dispatch` instruction. It identifies a dispatcher - something that knows how to take a set of arguments and turn them into an outcome - and provides it with some arguments. Those arguments may be aimed at the dispatcher (such as a method name), a final piece of code we choose to invoke (method arguments), or perhaps both (arguments in a multiple dispatch). This instruction will come to replace all existing invocation related instructions, including specializer plugin resolution, method caching, and multi dispatch caching. The dispatch instruction also marks the first time in MoarVM's history that the calling conventions will change. The current invocation sequence for a method call looks like this: ``` findmethod r3, r1, 'foo' prepargs <callsite> arg_o 0, r1 arg_o 1, r2 invoke_o r4, r3 ``` That is, look up a method `foo` on the object in register `r1`, and store it in the register `r3`. Then, take the code object in register `r3`, invoke it with object arguments in registers `r1` and `r2`, and store the return value into register r4. The callsite object provides the static properties of the arguments (for exmaple, which ones are named and which ones are flattening). The new `dispatch` instruction would look more like this: ``` const_s r0, 'foo' dispatch_o r4, 'raku-method', r0, r1, r2 ``` First, we put the method name into register `r0`. Then, we do a dispatch using the `raku-method` dispatcher, providing it with the arguments `r0`, `r1` and `r2`, and put the result of the method call into `r4`. It turns a 32-byte sequence into a 24-byte sequence. This leads to rather more compact bytecode. It also means we have just two instructions for the interpreter to dispatch, not 5. The whole concept of an "args buffer" also goes away; the register indices become a "map" of how to look up the arguments directly from registers. This change of calling conventions runs pretty deep - but thankfully, it was possible to introduce it alongside the current approach, using it only when a `dispatch` instruction initiated things. Of course, the end goal is that the old calling conventions will fade from use, and all code associated with them will go away. Without going into more details, I'll note that the argument processing and memory management with the new approach is also overall simpler. But what does `dispatch` do? The first time we reach a `dispatch`, we take the string name of the dispatcher and resolve it to something we can invoke to decide what will happen. That something receives the arguments as a capture object. It can read the arguments, and use them to decide what to do. For example, the `raku-method` dispatcher can take the invocant argument, the method name argument, and call `$obj.^find_method('foo')` in order to ask the MOP for the method we should call. A dispatcher can then *delegate* to another dispatcher. In fact, all user-defined dispatchers delegate. Only a handful of built-in ones, provided by MoarVM, are "terminals": these do things like evaluating to a constant, evaluating to one of the input arguments, invoking bytecode, or invoking some VM primitive function. When one dispatcher delegates to another, it provides it with an argument capture. This is derived from its input argument capture. There are a set of operations for doing this; one can: * Drop some arguments * Insert an argument at a difference place, or duplicate it * Insert an object or string constant argument * Read an attribute from an argument, and insert the result of that read These operations are part of building up what I call a "dispatch program": a program consisting of a small set of simple instructions that map the arguments into a final outcome (the outcome often being to run some bytecode, which is what you'd expect of a dispatch). As well as capture transformations, we can also add *guards* into the dispatch program: things that check the types of arguments. If the guards are not met, the dispatch program fails. To give an impression of how delegation will be used, let's consider a method dispatch in Raku. It will: * Start in a `raku-method` dispatch, which does a lookup of the method, and inserts the result at the start, and delegates to a `raku-method-resolved` dispatcher (which will also be reached by other forms, such as qualified method dispatch). * In `raku-method-resolved`, check if this is a multiple dispatch routine or not, and delegate as appropriate. Before it does so, it removes the method name argument (it will retain this as "dispatch state" for if we need to resume the dispatch for a callsame; this piece is yet to implement, however). * Assuming it was a single dispatch, end up in `raku-invoke`. This will see if it's a `Code` object of some kind, and if so extract the `$!do` attribute, which is a VM-level code handle. The Raku-level code object argument will be dropped, and and VM-level code handle inserted in its place. * Finally reach the `boot-code` dispatcher, which is the terminal, and results in invoking bytecode. After we've recorded a dispatch program, we transition the `dispatch` instruction from an "uninitialized" state into a "monomorphic" state - that is to say, we've set up one dispatch program, and so far as we know, every dispatch we encounter will look just like the one we already saw. And if we only ever call a method on the same type, for example, then that is how things will stay. If, however, the dispatch program fails - perhaps we are dispatching a method on a difference type - then we'll do a second recording. At this point, the `dispatch` instruction is transitioned to a "polymorphic" state. These states are stored in an array indexed based on bytecode position of the dispatch instruction, a means of achieving a kind of (polymorphic) inline cache, but without losing our ability to `mmap` the bytecode for sharing across multiple processes. The whole delegation from one dispatcher to another is erased in the dispatch program that we end up with. Furthermore, the way the transformations and guards are recorded, and then compiled, automatically eliminates duplicate guards, duplicate attribute lookup paths, duplicate guards on those attribute lookups, and so forth. All of the capture transformations are also erased; an access into an N-times modified capture that maps to an original argument will simply read that argument. In fact, in the common case where the dispatcher gets its arguments of interest at the start (for example, the method name), and the arguments for the final destination of the dispatch are all at the end, then the entire load of shuffling around between the dispatchers is optimized into a single addition to the arg indices pointer! By the end of May, I'd largely got what I described so far implemented, although was not at the point of having started using the new dispatcher in Rakudo to replace the existing specializer plugins. Thus, I'll write about that step in the next report. Doing the dispatcher work did entail a rather significant, but worthwhile, detour. We need to store the various bits of state associated with dispatchers somewhere. That "somewhere" must be somewhere that we can "slice off" when we take a continuation. However, so far, we'd only used our callstack to store call frames, and even then we wiped it clean at the point we did a heap promotion of a call frame. Thus, that all had to be refactored. The changes are rather nice. The new callstack can contain a range of kinds of record. Unlike before, even if a frame is heap promoted (because a GC-managed object needs to point to it), it still retains a presence on the callstack. The callstack can also store dispatch records and run records. But there's more! Previously, taking a continuation would force a heap promotion of all the frames on the callstack, and we'd have to traverse each frame. Under the new design, a continuation reset (the base of a delimited continuation) always starts a new callstack segment. This means that we can cheaply find the base of the continuation, and "slice" off the segment(s) containing its frames just by NULLing out a the pointer at the appropriate point in the linked list of segments. We also don't need to do heap promotion of all the frames; we just mark the callstack entries just like we would if they were referenced by virtue of being in dynamic scope rather than in a continuation. Invoking the continuation is then sticking those previously sliced off segments on the end of the segment linked list. (This is a slight lie: we do end up having to walk the frames on one side or the other to clear up dynamic variable cache entries. This is a solvable problem, and then we can really say that continuations are O(stack segments) rather than O(call frames).) As if having to revist continuations wasn't enough fun, I also had to revisit deoptimization, and of note, uninlining. Inlining is an optimization where we insert the body of a callee into the body of the caller to avoid the cost of the call and, more importantly, to allow us to do inter-procedural optimizations. However, since we do speculative optimizations, we have to be able to undo that, thus uninlining: recreating the callstack as it would have been if we had not done the inlining optimizations, at the point we discover that one of the speculations we made is not valid. Previously, uninlining would force all call frames to be heap promoted, which was kinda nice in that it meant we were now dealing with a linked list. In a case where a mixin causes us to have to deoptimize the whole callstack, it was thus easy to insert any frames producing by uninlining in the middle. But what when we don't have that option any more? The naive approach involves a whole load of hairy rewriting of the call stack, loads of `memcpy`, and no doubt trashing much of our CPU cache contents along the way. Thankfully, having spent plenty of the time in the Perl community, I know that laziness is a virtue. So what's the answer? Lazy deoptimization! We just mark call frames that we need to deoptimize and potentially do uninling of, and then, when we reach them while unwinding the callstack, go and do the required bit of uninlining before we continue execution. This means we're only ever fiddling around at the stack top. Something really surprising - at least to me - happened when I did this. I've always dreaded working on deopt, or having bugs there, and especially so if there was uninlining involved. It was a case of "if you had to be clever to write the code, you better be having a darn good day if you're going to debug it". Only ever touching the stack top, rather than dabbling with a linked list-ish thing, and recreating the frames outermost to innermost (it was the other way round before) comes out *vastly* simpler. As in, I can probably understand what the code is doing even on one of my average days now. This is welcome. Oh, and we no longer have to heap promote frames because we're doing a deoptimization, which makes speculative optimization cheaper in the caess reality doesn't match up with our speculations. All in all, I feel this is a solid start to the dispatcher work. Of course, there's quite a bit more between this and something we're shipping in production releases; I'll be working to close that gap in the month ahead. ``` 72:17 Implementation work on new dispatch mechanism, as described above Total reported time in this report: 72:17 Remaining approved hours on current grant period: 54:04 ```

Category: Raku Development

Comments (2)

[ 2191 ] | Wed, 17-Jun-2020 by fritzzaucker

Impress what kind of things are going on behind the curtain!

[ 2192 ] | Wed, 17-Jun-2020 by fritzzaucker

... impressive ...

May report of the Raku Development Grant of Jonathan Worthington

Comments (2)

Sign in to add comment

About TPF

Categories

Popular Tags

Recent Tags

Get Perl

Links