Grant report: Optimizations building upon the new Raku dispatch mechanism
Sat, 04-Dec-2021 by
My latest [grant proposal](https://news.perlfoundation.org/post/grant_proposal_raku_dispatch_optimizations)
was [recently approved](https://news.perlfoundation.org/post/grants_nov_2021_votes).
I had the opportunity to dedicate quality time to Raku work at the start of November, but knew I would be tied up with some other
work in the latter part of the month. Thus I optimistically forged ahead with some grant work while I could,
crossing my fingers for approval. This report covers what got done.
The main completed task was to reorganize and streamline return and stack unwind handling. My work on the
new dispatch mechanism introduced a new callstack layout. This created an opportunity to simplify the way
we handle stack unwinding - that is, removing frames either because we are returning or because of an
exception. Since this happens for every single non-inlined block or routine that we call, savings here
have an impact on all but the most micro of micro-benchmarks. For example, a recursive Fibonacci benchmark
(written recursively to frustrate inlining) showed a 5% improvement from this work. The work also eliminated
some `malloc` calls in favor of callstack allocation in a number of situations, and resulted in overall
simpler and smaller code in MoarVM. Faster and simpler is certainly welcome.
I also did some optimization on frame invocation, primarily by splitting the specialized and unspecialized
callframe setup paths, which allowed for eliminating a number of branches that the C compiler could not eliminate
on its own. This new factoring also revealed an opportunity to fold two `memset` calls into one, which was also a
welcome saving. This was worth a further 3% off the recursive Fibonacci benchmark. (To give a picture of how
Raku compares with Perl in this benchmark, Raku runs it in around two thirds of the time, despite the fact
that it has to cope with the potential upgrade of `Int` to a big integer.)
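The benchmark code itself is not shown in the report; a naive recursive Fibonacci of the kind described looks
something like this (a sketch, with the argument `30` chosen arbitrarily, not the exact benchmark):

```raku
# Naive recursive Fibonacci. The double recursion defeats the
# inliner, so every call exercises the callframe setup and
# stack unwind paths that this work optimized.
sub fib(Int $n) {
    $n < 2 ?? $n !! fib($n - 1) + fib($n - 2)
}
say fib(30);   # 832040
```

Since `fib` takes and returns `Int`, each addition must also allow for the possible upgrade to a big integer,
which is the extra cost Raku carries relative to Perl in this comparison.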
A central goal for the grant as a whole is to make progress on escape analysis. When I worked on this previously,
a particular challenge was the reliance on attribute containers vivifying (being allocated) upon first touch. This
was not so much an optimization as a means to determine if an attribute had been initialized, for the purpose of
running defaults. Unfortunately, however, it greatly complicates the escape analysis of object graphs at creation
time, and makes all attribute access a little more costly. (One could also get occasionally surprised by the fact
that reading an attribute during a constructor would count as initializing it too.) Thus, I started working on a
new approach, based upon container descriptors, which are also the mechanism used in array and hash element
auto-vivification. The work in progress is currently a pull request, which needs further work to analyze why
it causes regressions in a small number of modules; the new approach does, however, already pass the
specification test suite.
I also did some design work for a faster and simpler way to handle `LEAVE` blocks. Today they carry quite some
performance overhead: we are unable to ever inline them, and the need to support them imposes a small, but non-zero,
cost on the exit of every callframe, whether or not it has a `LEAVE` block. The new design I have worked on
should fix all of these issues, and I hope to implement it during December. Even if `LEAVE` is rarely directly
used, it plays an important part in ensuring locks are reliably released, and so is used implicitly in many programs.
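To illustrate why this matters (a hypothetical sketch, not code from the grant work): a `LEAVE` phaser runs
whenever its enclosing block is exited by any route, which is what makes lock release reliable.

```raku
# Hypothetical example: LEAVE runs on any exit from the enclosing
# block - a normal return, an exception, or any other stack unwind.
my $lock = Lock.new;
sub guarded(&work) {
    $lock.lock;
    LEAVE $lock.unlock;   # released even if work() throws
    work();
}
guarded({ say 'holding the lock' });
```

Constructs like `Lock.protect` provide this same guarantee, which is presumably the kind of implicit use the
report has in mind, and why every callframe exit currently has to account for possible `LEAVE` handlers.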
Finally, I also tracked down and fixed a bug in the intersection of dispatch resumption and inlining.
Time spent on the grant: 30 hours 52 minutes
Time remaining on the grant: 169 hours 8 minutes