My grant time in June was spent on the new unified dispatch mechanism; my [May report](https://news.perlfoundation.org/post/jonathan-2020-05) provides some background on what that is, and this report will make more sense if you have read that one first. In May, I reached the point of having a significant part of the new dispatch mechanism implemented, and had exercised it with some unit tests. However, these were all rather artificial. During June, I started applying it to rather more realistic problems.
Today, Rakudo (the Raku compiler and standard library) uses a mechanism called "specializer plugins" to "explain" a range of language constructs to MoarVM (the virtual machine), so that it can optimize them decently well. In fact, the overall approach I explored in specializer plugins has been hugely informative for the design and implementation of the generalized dispatch mechanism that I am currently working on. I decided that replacing the specializer plugins with the new generalized dispatch mechanism was a good place to start. Of note, I replaced (in a branch):
* The return value type check plugin. The new generalized dispatch mechanism will allow for a shorter instruction sequence for these checks in the bytecode. However, there's another advantage. Specializer plugins work by looking at a number of arguments and mapping them, based on which guards they match, to something to invoke. In the case where we replace the type check with an exact type guard, the thing to invoke is the identity function. This invocation has overhead. We then rely on inlining of the identity function to eliminate that cost. Thus on hot paths we eventually end up with quite nicely optimized code. However, it takes time for the optimizer to kick in, and inlining costs something too. With the new dispatcher, we can produce identity results without having to invoke anything, and the optimizer will be able to handle them far more cheaply than by going through the inlining machinery.
* The return value container handling plugin. It also often wants to just return identity, so gets the benefits described above. And in the case it wants to decontainerize a `Scalar`, we don't need an invocation either, since dispatch programs can read attributes. (It's not done yet, but if it's a `Proxy`, we shall be able to wire the dispatch directly to a call to its `FETCH` function.)
* The private method call plugin. Again, these come out with a shorter bytecode sequence.
* The qualified method call plugin. Guess what? Yup, a shorter bytecode sequence for these too. Also, error handling is done in the dispatcher, rather than needing a further delegation.
* The maybe method call plugin (for `.?foo`). This produces `Nil` if there is no such method, which is now done rather cheaply (again, thanks to being able to produce value results without an invoke). Shorter bytecode too.
* The assign plugin. In fact, this is only partially complete, and still missing its optimized versions. Shorter bytecode again.
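The plugin replacements above all hinge on the same idea: a dispatch program matches guards against the arguments and can then produce a value directly, such as the identity of an argument, without invoking anything. Here is a toy model of that idea in Python; all of the names and representations are invented for illustration and are not MoarVM's actual internals.

```python
# Toy model of a dispatch program: a list of guards over the arguments,
# plus an outcome. The key point: the outcome can be "return argument i
# unchanged" (an identity result), which needs no invocation at all.
# All names here are illustrative, not MoarVM's real API.

class DispatchProgram:
    def __init__(self, guards, outcome):
        self.guards = guards      # callables: args -> bool
        self.outcome = outcome    # ("identity", index) or ("invoke", fn)

    def run(self, args):
        if not all(guard(args) for guard in self.guards):
            return None           # guards failed; fall back to slow path
        kind, what = self.outcome
        if kind == "identity":
            return args[what]     # value produced directly, no call frame
        return what(*args)        # only genuine invocations pay call cost

# Return value type check: guard on the exact type of the value, then
# produce it as an identity result instead of calling an identity function.
check_int_return = DispatchProgram(
    guards=[lambda args: type(args[0]) is int],
    outcome=("identity", 0),
)

print(check_int_return.run((42,)))     # 42, without any invocation
print(check_int_return.run(("oops",))) # None: guard failed, take slow path
```

The sketch shows why the identity case is interesting: where the old plugins had to hand back *something to invoke*, the new dispatcher can hand back a value outright.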
Return value type checks can involve coercion types, which means that we don't just need to check the type, but also to coerce it into another one. Since this involves a method call or potentially even a multi method call, this required me to start looking at implementing those too - ahead of switching all normal calls over to them. Standard method resolution is quite straightforward. Multiple dispatch, on the other hand, is rather more interesting.
Rather than wedge the current multi-dispatcher into the new dispatcher with minimal changes, I've decided to revisit how it works to take advantage of the new opportunities. This is still a work in progress. The main change is that I'm factoring it as a two-step process:
1. Perform a pre-filtering step on the candidates based on properties that we can guard using dispatch program guards.
2. If the outcome of the pre-filtering is a single candidate that can be deemed the correct choice purely based on the guards, then produce it as the result. Otherwise, retain the filtered outcome as the basis for doing the rest of the dispatch work per call.
Today, when there are `where` clauses or bind checks, we have to run the full dispatch algorithm to work out what to call, including the nominal type checks. Under the new design, the parts of the work that can be settled by the dispatch program guards need not be repeated. If there are many candidates that can be eliminated by type alone, this could be a significant saving, especially since guards are understood by the type specializer and so might disappear entirely.
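The two-step factoring can be sketched as follows; this is an illustrative Python model, not the actual implementation, and the candidate and guard representations are made up for the example (real Raku multiple dispatch considers far more than exact types).

```python
# Toy sketch of the two-step multiple dispatch factoring. Candidates carry
# nominal parameter types plus an optional `where`-style check that cannot
# be expressed as a guard. Names and structures are illustrative only.

class Candidate:
    def __init__(self, name, param_types, where=None):
        self.name = name
        self.param_types = param_types
        self.where = where            # extra per-call check, or None

def pre_filter(candidates, args):
    """Step 1: filter by properties we can guard on (here: exact types)."""
    return [c for c in candidates
            if len(c.param_types) == len(args)
            and all(type(a) is t for a, t in zip(args, c.param_types))]

def dispatch(candidates, args):
    """Step 2: single guard-determined candidate, or residual per-call work."""
    filtered = pre_filter(candidates, args)
    if len(filtered) == 1 and filtered[0].where is None:
        return filtered[0]            # settled purely by guards; reusable
    # Otherwise only the residual checks run per call, not the full sort.
    for c in filtered:
        if c.where is None or c.where(args):
            return c
    raise TypeError("no matching candidate")

candidates = [
    Candidate("ints", (int, int)),
    Candidate("strs", (str, str)),
    Candidate("small-ints", (int,), where=lambda a: a[0] < 10),
]

print(dispatch(candidates, (1, 2)).name)   # "ints": settled by guards alone
print(dispatch(candidates, (3,)).name)     # "small-ints": residual check ran
```

The payoff mirrors the text above: in the second call, the nominal filtering has already been settled by guards, so only the `where`-style residue needs to run on each call.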
Getting all of this to work took an amount of work down in MoarVM. Of note, I:
* Implemented various capture introspection operations on the new argument capture representation
* Completed implementing the "not literal object" guard, which makes sure that the object is *not* equal to some other object literal. This is used for "anything but `Nil`" in assignment optimization.
* Completed implementing support for complete rewrites of argument lists, where the final invocation is not just invoking a tail of the initial arguments.
* Implemented dispatch programs reading attributes and being able to guard against those. When different dispatchers in the chain want to read and assert against the same attribute, or even path of attributes, these reads are automatically deduplicated thanks to the way the dispatch program is recorded.
* Filled out various GC marking bits across the new dispatcher implementation; I'd stubbed these out, waiting until I had something that would exercise them before completing them.
* Got the specializer to at least be able to pass through dispatch instructions, and to start doing some of the analysis that will be needed to optimize dispatches.
* Assorted small bug fixes.
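The attribute-read deduplication mentioned above falls out naturally when the dispatch program is built by recording: a read of an attribute path resolves to a temporary, and a later read of the same path reuses that temporary rather than emitting another load. A toy model of that recording follows; the names and the op representation are invented for illustration, not MoarVM's actual recording code.

```python
# Toy model of dispatch program recording with attribute-read dedup.
# Each distinct attribute path is read into one temporary; further reads
# of the same path reuse that temporary instead of emitting another load.
# This mirrors the idea only; names and structures are illustrative.

class Recorder:
    def __init__(self):
        self.ops = []                 # recorded dispatch program ops
        self.read_temps = {}          # attribute path -> temporary index

    def read_attribute(self, path):
        if path not in self.read_temps:
            temp = len(self.read_temps)
            self.read_temps[path] = temp
            self.ops.append(("read", path, temp))
        return self.read_temps[path]

    def guard_literal(self, temp, literal):
        self.ops.append(("guard-literal", temp, literal))

rec = Recorder()
# Two dispatchers in the chain both want to look at the same attribute:
t1 = rec.read_attribute(("arg0", "$!descriptor"))
rec.guard_literal(t1, "SomeDescriptor")
t2 = rec.read_attribute(("arg0", "$!descriptor"))   # deduplicated
print(t1 == t2)                                     # True: same temporary
print(len([op for op in rec.ops if op[0] == "read"]))  # 1: a single read
```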
The final major task I accomplished was to implement handling of flattening arguments in the new calling conventions and dispatcher. Making use of argument flattening has traditionally meant missing out on a lot of optimizations: for example, multi-dispatch caches and specialization could not be used when the arguments of a call were flattened. I wanted to address that as part of this work.
Under the new design, we:
* Eliminate the cost of checking whether there is flattening at runtime, by having the callsite's unlinked state point to different callbacks, and by following the same pattern for monomorphic and polymorphic dispatch program callbacks too.
* If there is flattening to do, perform it as a first step, before any other dispatch-related work takes place. This means that everything downstream can assume it doesn't have to deal with argument flattening too.
* With limits (to avoid runaway memory use), intern the callsite that results from the flattening. Since interned callsites are keys to both dispatch programs and specializations, this lifts the optimization restrictions.
* Store the flattened arguments as a callstack record, which is possible thanks to the new callstack design I discussed in last month's report. This eliminates a pair of `malloc`/`free` calls that were required in the previous design, meaning the flattening process itself should also be cheaper.
Callsites that use flattening with relatively consistent patterns of numbers and types of arguments particularly stand to gain a lot. For things that flatten in massive numbers of arguments, which are highly variable in number, we probably lose out a bit under the new scheme. However, having studied the usage of flattening across a number of ecosystem modules, that seems decidedly rare. There'll be solutions if it turns out to actually be a problem.
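The flatten-then-intern scheme can be modeled compactly; this is an illustrative Python sketch, and the intern limit, the shape representation, and all names here are assumptions made for the example rather than MoarVM's actual values.

```python
# Toy model: flatten arguments first, then intern the resulting callsite
# shape so dispatch programs and specializations can key on it. A limit
# caps the intern table to avoid runaway memory use on wildly variable
# flattening. Names and the limit value are illustrative assumptions.

INTERN_LIMIT = 64
interned = {}                         # shape -> canonical callsite object

class Flatten:
    """Marks an argument whose elements should be spliced into the call."""
    def __init__(self, items):
        self.items = items

def flatten(args):
    """Splice any argument marked for flattening into a flat tuple."""
    out = []
    for a in args:
        if isinstance(a, Flatten):
            out.extend(a.items)
        else:
            out.append(a)
    return tuple(out)

def intern_callsite(flat_args):
    shape = tuple(type(a).__name__ for a in flat_args)
    if shape in interned:
        return interned[shape]        # optimizations keyed on this apply
    if len(interned) < INTERN_LIMIT:
        interned[shape] = ("callsite", shape)
        return interned[shape]
    return None                       # over the limit: stay unoptimized

flat = flatten((1, Flatten([2, 3]), "x"))
print(flat)                           # (1, 2, 3, 'x')
cs1 = intern_callsite(flat)
cs2 = intern_callsite(flatten((9, Flatten([8, 7]), "y")))
print(cs1 is cs2)                     # True: same shape, same callsite
```

The last line is the point of the design: two calls that flatten to the same number and types of arguments share one interned callsite, so dispatch programs and specializations recorded against it apply to both.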
46:01 Implementation work on new dispatch mechanism, as described above
Total reported time in this report: 46:01
Remaining approved hours on current grant period: 8:03