Grant Proposal: Optimizations building upon the new Raku dispatch mechanism
Mon, 25-Oct-2021 by
Jason A. Crome
## Project title
Optimizations building upon the new Raku dispatch mechanism
## Author
Jonathan Worthington
## Synopsis
Recently, a new dispatch mechanism was merged into MoarVM, along with changes to
Rakudo to make use of it. This led to a number of immediate [performance improvements](https://6guts.wordpress.com/2021/09/29/the-new-moarvm-dispatch-mechanism-is-here/),
and incidentally fixed various long-standing bugs. This
is, however, just the start of what can be achieved with the new dispatch
architecture and the changes made to the runtime as part of implementing it.
This grant application requests funding to help deliver further improvements
unlocked by that work.
## How will your work specifically benefit the development of Raku?
Raku performance has improved significantly since the initial release of the
language. This is thanks to a multi-faceted effort by many Raku contributors,
including putting more powerful optimizations in the runtime, the careful
optimization of the standard library, and developing better tooling for
analyzing performance. However, there is certainly more to be done so that
performance is a limiting factor for an ever smaller number of developers
considering using Raku.
This work will deliver further performance improvements for many Raku scripts
and applications. The work is quite varied in nature, although it primarily
focuses on improvements at the level of the language runtime (MoarVM),
especially since the current compiler frontend will ultimately be replaced as
part of the ongoing RakuAST effort. Since I am also working on that, I am
well-placed to avoid doing throwaway work.
## Project Details
As with previous grants, I request an hourly rate, and propose a set of key
areas that I intend to work on, on the understanding that there may not be
time to complete all of them, or that studying profiler and compilation output
might reveal some more profitable improvements to work on (bigger performance
gains for less effort).
One key area of work is to reduce the cost of calls, which is relatively high.
A lot of this cost is hidden by MoarVM being able to inline (including multiple
levels deep) and, in the case that a speculative optimization proves invalid, to
uninline. However, in anything except the tightest microbenchmark, we can't
just inline everything, and so the cost of calling also needs to be reduced. Of
note, I propose to:
* Reduce the cost of callframe setup by allocating space for registers and,
where possible, the lexical environment directly on the callstack rather
than as separate allocations. This should make every non-inlined call
faster.
* Reorganize callframe entry to avoid repeated checks (for example, around
specialized/unspecialized, JIT compiled or not, etc.), perhaps getting rid
of many of the checks entirely in specialized code by compiling the steps
needed for callframe setup.
* Find a different, and cheaper, way to handle `LEAVE` and similar phasers.
These are often required for reliable release of locks, but suffer from
relatively poor performance. Their potential presence also imposes a small
cost even on frames that do not rely upon them, which would be good to
eliminate.
* Further rework callframe exit in order to make it cheaper. Of note, MoarVM
has a "special return" mechanism that is used internally to avoid nested
runloops in favor of a Continuation Passing Style approach, which currently
also imposes a small cost on every frame exit (checking if it is needed).
The callstack layout after the new dispatch mechanism work will allow
us to avoid this.
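To illustrate the `LEAVE` case mentioned above, here is a typical pattern in which the phaser guarantees lock release on every exit path (a minimal sketch; the sub and variable names are illustrative, not from the proposal):

```raku
my $lock = Lock.new;
my $balance = 0;

sub deposit($amount) {
    $lock.lock;
    LEAVE $lock.unlock;   # runs on any exit: normal return or exception
    die "negative deposit" if $amount < 0;
    $balance += $amount;  # the unlock still happens even if we die above
}
```

It is precisely this kind of reliability guarantee that makes `LEAVE` attractive for lock handling, and why its current cost, paid in part even by frames that never use it, is worth attacking.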
Prior to working on the new dispatch mechanism, I started some work on partial
escape analysis, an analysis that allows for eliminating or delaying object
allocations. Raku makes heavy use of objects, which not only imposes the
direct cost of their allocation and later garbage collection, but also the
indirect cost of having further optimizations frustrated by not being able to
look "inside" the object. Some basic escape analysis has already been merged,
while further work remained incomplete, in part blocked by issues that have
now been resolved. I would like to return to this work, bringing it piecemeal
into the main branch. Of note, this will deliver:
* Support for transitive escape analysis, so we can avoid the allocation of
entire object graphs (for example, a Scalar container holding a Complex
could potentially elide both allocations).
* Restructuring object creation. Today we often lazily allocate containers
for attributes, in part because we depend on it to know whether to apply
attribute default values. This creates complications for escape analysis,
as well as forcing initialization checks in quite a number of places.
* Delayed materialization of allocations. For example, if an object is read
and written many times before it escapes (that is, returned or stored
somewhere we can't reason about), we can delay the allocation of the object,
treating its attributes as local variables up to that point. In some cases
the allocation may only happen in one branch (for example, in an exceptional
path), and so be possible to eliminate entirely in the common case.
* Handling of `Int`s in scalar replacement (the process by which a non-escaping
object's attributes are stored in VM registers). `Int` is important given how
common integer math is, but also challenging because of the potential big
integer upgrade. Special care will be needed to never leak memory that is
allocated for a big integer, even in situations where an enclosing object's
allocation is elided.
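As a sketch of the transitive case described above, consider a sub in which a `Scalar` container and the `Complex` it holds never escape (a hypothetical example; the sub name is illustrative):

```raku
sub magnitude(Complex $c) {
    # $product is a Scalar container holding a freshly allocated Complex;
    # neither is returned or stored anywhere the optimizer cannot see, so
    # transitive escape analysis could elide both allocations.
    my $product = $c * $c.conj;
    $product.re.sqrt
}

say magnitude(3+4i);  # 5
```

With only non-transitive analysis, eliding the `Scalar` still leaves the `Complex` allocation behind; handling the whole graph is what makes the win substantial.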
Another area of work is to further improve or make further use of the new
dispatch mechanism. Of note:
* Apply the same caller-side decontainerization of `Scalar` containers approach
to also handle native references, so we can optimize their creation away in
many more cases. This should lead to performance improvements in code that
makes use of native types.
* Provide a way to attach dispatch programs a level down the callstack, which
may turn a megamorphic program point into a monomorphic one. For example, at
the moment, when we have a `proto` with a complex body and a `{*}`, the cache
at the `{*}` will end up with entries for all argument types that are seen.
However, if we were to locate the cache in the caller of the `proto`, we'd
often end up with only a single entry. The same applies in a number of other
situations (for example, the dispatch resumption is currently inside of the
`callsame` sub, so we scale poorly in large applications that use `callsame`
in many different places).
* Use the dispatch mechanism to better optimize use of the `FALLBACK` method.
* Use the dispatch mechanism to better optimize various forms of Raku type
check.
* In the case that we have specializations available, fold the selection of a
specialization into the dispatch program, avoiding repeated guard checks.
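The `proto` situation described above can be seen in a sketch like the following, where the dispatch cache at the `{*}` accumulates one entry per argument type seen across all callers (a minimal illustrative example; the names are hypothetical):

```raku
proto sub render(|) {
    note "rendering";   # shared work in the proto body
    {*}                 # today, the cache here sees Int, Str, ... from every caller
}
multi sub render(Int $n) { "int: $n" }
multi sub render(Str $s) { "str: $s" }

say render(42);         # each individual call site is often monomorphic, so
say render("hi");       # locating the cache in the caller yields one-entry caches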
## Project Schedule
I would start immediately upon grant approval, and expect to give a minimum of
40 hours a month to the grant; more should be possible in most months.
## Bio
I am the founder and architect of MoarVM, the most popular runtime for Raku users,
and the architect of the Rakudo compiler. I have contributed to the implementation
of numerous Raku language features, and played a key role in the design of the
concurrent and parallel aspects of the language. I hold a degree in Computer Science
from the University of Cambridge, and actively work in the field of developer
tooling and compilation.
## Amount requested
200 hours * $60 USD / hour = $12,000 USD
## Comments (4)
A very enthusiastic +1 from me. The new-disp work so far has been great, and I eagerly look forward to any and all further improvements.
Also +1 from me. Jonathan's work on Raku has been excellent and focused on exactly the right place. Raku is an amazingly expressive language for which the only downside is occasional performance deficit. Continued effort towards improvements in that area would be greatly appreciated.
+1 from me; Jonathan's previous performance work has not only realized direct gains, but the updated architecture also makes it easier for other core developers to contribute improvements.
Better performance (both startup and runtime) is what is holding back the wider use of Raku in production environments. With all the changes in new-disp, some areas have already seen quite an improvement. Obviously, Jonathan is the right person to implement the now-possible optimisations and also provide the groundwork for others to explore these new opportunities.