Grant Proposal: Optimizations building upon the new Raku dispatch mechanism
Mon, 25-Oct-2021 by
Jason A. Crome
## Project title
Optimizations building upon the new Raku dispatch mechanism
## Author
Jonathan Worthington
## Synopsis
Recently, a new dispatch mechanism was merged into MoarVM, along with changes to
Rakudo to make use of it. This led to a number of immediate [performance improvements](https://6guts.wordpress.com/2021/09/29/the-new-moarvm-dispatch-mechanism-is-here/),
and incidentally fixed various long-standing bugs. This
is, however, just the start of what can be achieved with the new dispatch
architecture and the changes made to the runtime as part of implementing it.
This grant application requests funding to help deliver further improvements
unlocked by that work.
## How will your work specifically benefit the development of Raku?
Raku performance has improved significantly since the initial release of the
language. This is thanks to a multi-faceted effort by many Raku contributors,
including putting more powerful optimizations in the runtime, the careful
optimization of the standard library, and developing better tooling for
analyzing performance. However, there is certainly more to be done so that
performance is a limiting factor for an ever smaller number of developers
considering using Raku.
This work will deliver further performance improvements for many Raku scripts
and applications. The work is quite varied in nature, although it primarily
focuses on improvements at the level of the language runtime (MoarVM),
especially since the current compiler frontend will ultimately be replaced as
part of the ongoing RakuAST effort. Since I am also working on that, I am
well-placed to avoid doing throwaway work.
## Project Details
As with previous grants, I request an hourly rate, and propose a set of key
areas that I intend to work on, on the understanding that there may not be
time to complete all of them, or that studying profiler and compilation output
might reveal some more profitable improvements to work on (bigger performance
gains for less effort).
One key area of work is to reduce the cost of calls, which is relatively high.
A lot of this cost is hidden by MoarVM being able to inline (including multiple
levels deep) and, in the case that a speculative optimization proves invalid, to
uninline. However, in anything except the tightest microbenchmark, we can't
just inline everything, and so the cost of calling also needs to be reduced. Of
note, I propose to:
* Reduce the cost of callframe setup by allocating space for registers and,
where possible, the lexical environment directly on the callstack rather
than as separate allocations. This should make every non-inlined call
faster.
* Reorganize callframe entry to avoid repeated checks (for example, around
specialized/unspecialized, JIT compiled or not, etc.), perhaps getting rid
of many of the checks entirely in specialized code by compiling the steps
needed for callframe setup.
* Find a different, and cheaper, way to handle `LEAVE` and similar phasers.
These are often required for reliable release of locks, but suffer from
relatively poor performance. Their potential presence also imposes a small
cost even on frames that do not rely upon them, which would be good to
eliminate.
* Further rework callframe exit in order to make it cheaper. Of note, MoarVM
has a "special return" mechanism that is used internally to avoid nested
runloops in favor of a Continuation Passing Style approach, which currently
also imposes a small cost on every frame exit (checking if it is needed).
The callstack layout after the new dispatch mechanism work will allow
us to avoid this.
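To illustrate the `LEAVE` case mentioned above, here is a typical pattern in which the phaser guarantees lock release on every exit path (a minimal sketch; the sub and variable names are illustrative, not from the proposal):

```raku
my $lock = Lock.new;
my $balance = 0;

sub deposit($amount) {
    $lock.lock;
    LEAVE $lock.unlock;   # runs on any exit: normal return or exception
    die "negative deposit" if $amount < 0;
    $balance += $amount;  # the unlock still happens even if we die above
}
```

It is precisely this kind of reliability guarantee that makes `LEAVE` attractive for lock handling, and why its current cost, paid in part even by frames that never use it, is worth attacking.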
Prior to working on the new dispatch mechanism, I started some work on partial
escape analysis, an analysis that allows for eliminating or delaying object
allocations. Raku makes heavy use of objects, which not only imposes the
direct cost of their allocation and later garbage collection, but also the
indirect cost of having further optimizations frustrated by not being able to
look "inside" the object. Some basic escape analysis has already been merged,
while further work remained incomplete, in part blocked by issues that have
now been resolved. I would like to return to this work, bringing it piecemeal
into the main branch. Of note, this will deliver:
* Support for transitive escape analysis, so we can avoid the allocation of
entire object graphs (for example, a Scalar container holding a Complex
could potentially elide both allocations).
* Restructuring object creation. Today we often lazily allocate containers
for attributes, in part because we depend on it to know whether to apply
attribute default values. This creates complications for escape analysis,
as well as forcing initialization checks in quite a number of places.
* Delayed materialization of allocations. For example, if an object is read
and written many times before it escapes (that is, returned or stored
somewhere we can't reason about), we can delay the allocation of the object,
treating its attributes as local variables up to that point. In some cases
the allocation may only happen in one branch (for example, in an exceptional
path), and so be possible to eliminate entirely in the common case.
* Handling of `Int`s in scalar replacement (the process by which a non-escaping
object's attributes are stored in VM registers). `Int` is important given how
common integer math is, but also challenging because of the potential big
integer upgrade. Special care will be needed to never leak memory that is
allocated for a big integer, even in situations where an enclosing object's
allocation is elided.
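As a sketch of the transitive case described above, consider a sub in which a `Scalar` container and the `Complex` it holds never escape (a hypothetical example; the sub name is illustrative):

```raku
sub magnitude(Complex $c) {
    # $product is a Scalar container holding a freshly allocated Complex;
    # neither is returned or stored anywhere the optimizer cannot see, so
    # transitive escape analysis could elide both allocations.
    my $product = $c * $c.conj;
    $product.re.sqrt
}

say magnitude(3+4i);  # 5
```

With only non-transitive analysis, eliding the `Scalar` still leaves the `Complex` allocation behind; handling the whole graph is what makes the win substantial.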
Another area of work is to further improve or make further use of the new
dispatch mechanism. Of note:
* Apply the same caller-side decontainerization of `Scalar` containers approach
to also handle native references, so we can optimize their creation away in
many more cases. This should lead to performance improvements in code that
makes use of native types.
* Provide a way to attach dispatch programs a level down the callstack, which
may turn a megamorphic program point into a monomorphic one. For example, at
the moment, when we have a `proto` with a complex body and a `{*}`, the cache
at the `{*}` will end up with entries for all argument types that are seen.
However, if we were to locate the cache in the caller of the `proto`, we'd
often end up with only a single entry. The same applies in a number of other
situations (for example, the dispatch resumption is currently inside of the
`callsame` sub, so we scale poorly in large applications that use `callsame`
in many different places).
* Use the dispatch mechanism to better optimize use of the `FALLBACK` method.
* Use the dispatch mechanism to better optimize various forms of Raku type
check.
* In the case that we have specializations available, fold the selection of a
specialization into the dispatch program, avoiding repeated guard checks.
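The `proto` situation described above can be seen in a sketch like the following, where the dispatch cache at the `{*}` accumulates one entry per argument type seen across all callers (a minimal illustrative example; the names are hypothetical):

```raku
proto sub render(|) {
    note "rendering";   # shared work in the proto body
    {*}                 # today, the cache here sees Int, Str, ... from every caller
}
multi sub render(Int $n) { "int: $n" }
multi sub render(Str $s) { "str: $s" }

say render(42);         # each individual call site is often monomorphic, so
say render("hi");       # locating the cache in the caller yields one-entry caches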
## Project Schedule
I would start immediately upon grant approval, and expect to give a minimum of
40 hours a month to the grant; more should be possible in most months.
## Bio
I am the founder and architect of MoarVM, the most popular runtime for Raku users,
and the architect of the Rakudo compiler. I have contributed to the implementation
of numerous Raku language features, and played a key role in the design of the
concurrent and parallel aspects of the language. I hold a degree in Computer Science
from the University of Cambridge, and actively work in the field of developer
tooling and compilation.
## Amount requested
200 hours * $60 USD / hour = $12,000 USD
## Comments (4)
A very enthusiastic +1 from me. The new-disp work so far has been great, and I eagerly look forward to any and all further improvements.
Also +1 from me. Jonathan's work on Raku has been excellent and focused on exactly the right place. Raku is an amazingly expressive language for which the only downside is occasional performance deficit. Continued effort towards improvements in that area would be greatly appreciated.
+1 from me; Jonathan's previous performance work has not only realized direct gains, but the updated architecture also makes it easier for other core developers to contribute improvements.
Better performance (both startup and runtime) is what is holding back the wider use of Raku in production environments. With all the changes in new-disp, some areas have already seen quite an improvement. Obviously, Jonathan is the right person to implement the now-possible optimisations and also provide the groundwork for others to explore these new opportunities.