Perl 6 Performance and Reliability Engineering: November 2017

Sun, 19-Nov-2017 by Coke edit post

As part of the Perl 6 core development fund, Jonathan Worthington has completed another 200 hour block of hours, and his report of what was completed follows the break. Many thanks to the TPF sponsors of this and other grants. If you're interested in supporting work like this, please donate: http://donate.perlfoundation.org/ ---- # Grant Completion Report: Perl 6 Performance and Reliability Engineering At the end of July, I was granted a [200 hour extension](http://news.perlfoundation.org/2017/07/grant-extension-approved-perl-.html) of my Perl 6 Performance and Reliability Engineering grant. I used this time primarily to focus on MoarVM's dynamic optimizer, although did many other fixes and improvements aside from that. ## Background on the dynamic optimizer improvements Modern runtimes, especially for dynamic languages, rely on dynamic optimization techniques. These analyze the behavior of the program at runtime, and use that data to drive optimization. MoarVM's dynamic optimizer is typically referred to as "the specializer", or "spesh" for short, and this nicely captures its core strategy: taking code with lots of potential dynamism, and producing one or more specialized versions of the code with as much of that dynamic behavior stripped out as possible. The specializer was planned as part of MoarVM from the start, although its implementation came after the initial public release of the VM. Soon after that, the focus switched to the Christmas release of Perl 6, where nailing down semantics was a much bigger focus than optimization. Since then, the specializer was improved in numerous ways, however various limitations in its design started to show up repeatedly as we analyzed Perl 6 program performance. Furthermore, stress testing showed up a range of bugs in the optimizer that had potential to cause incorrect behavior in user code. Therefore, for the sake of both performance and reliability, it was desirable to invest some time improving the specializer. ## Specialization in the background Modern hardware is parallel, and it is desirable to find ways to take advantage of that. Originally, the specializer would do optimization work on the same thread that user code was running on. This not only paused execution in order to do optimization, but it also meant that multiple threads running the same code (say, in a data parallel program) would all race to do and install the same optimization. I introduced a background thread for performing specializations. This not only meant that optimization work and JIT compilation would not interupt execution, but also resolved the issue of multiple threads scrambling to do the same optimization. Since there was now only one thread producing optimizations, some locking logic could also go away. A further upshot of this is that even a single-threaded user program can now get some advantage from multi-core hardware. One downside of this is that the exact timing of specializations being installed becomes unpredictable, and this can make debugging more difficult. Therefore, I added a deterministic mode, enabled by environment variable, which makes a thread pause while the optimization thread does its work. This, for a single-threaded user program, offers deterministic behavior. ## Better data for better decisions The specializer's decision making about what to optimize, and how to optimize it, will only be as good as the data available to it. The data model prior to my work under this grant was rather limiting. For example, it was not possible to get a high level overview of what was monomorphic code (same types always), polymorphic code (handful of different types) and megamorphic code (loads of different types). There were also too few samples to know if a type that was seen to differ once was really rare or not. When there are only ten or so samples, and a type differs one time, then it could vary up to 10% of the time; that will tend to be too costly to leave deoptimization to handle. By contrast, if there are a hundred samples and it happens one time, then it is much safer to leave that slow path to be handled by the interpreter, for the sake of running the common case potentially a lot faster. I implemented a lightweight interpreter logging mechanism. It writes compact logs about encountered types, invocation targets, and more into a sequential thread-local buffer. When the buffer is filled, it is sent to the specialization thread. Over on that thread, the recorded events are played back in a stack simulation, and a data model built up that aggregates the statistics. This is then used by a planner to decide what optimizations to produce. Along the way, I introduced a new kind of specialization, which specializes only on the shape of the callsite (how many arguments and which named arguments) rather than the incoming types. This means that megamorphic code (that is, code called on many different types) can now receive some optimization, as well as compilation into machine code. Before, a few specializations were produced, and then everything else was left to run slowly under the interpreter. ## New optimizations Besides allowing for better decision making, I introduced some new optimizations as well as making existing ones more powerful. * I enabled many more calls to be inlined (a powerful optimization where a call to a small routine is replaced with the code of the routine itself). This was achieved by using the new statistics to spot when the target of a call was reliably the same, and introducing a guard clause. Prior to this, inlining was only available to methods resolved through the cache or subs in the setting or the outermost scope. I also handled the case where the passed arguments were consistently typed, but it had not been possible for the optimizer to prove that, again using guard clauses. * I implemented inlining of closures (that is, code that refers to variables in an outer scope). * I made dead code removal happen far more eagerly, and improved the quality of type information available in code following the eliminated conditional. This is a significant improvement for parameters with default values, as well as branches based on types or constants. * I made frames that are reliably promoted from the call stack onto the heap be allocated right away on the heap, to save the promotion cost. (This promotion happens when a callframe needs to be referenced by a heap object.) * I changed the way that control exception's flow is represented to be more accurate, enabling elimination of handlers that become unreachable once the code they cover also becomes unreachable. The change also resulted in more accurate type information propagation, which can aid other optimizations. * I made the optimization that rewrites control exceptions into far cheaper `goto` instructions apply into inlines. ## Specializer fixes The specializer usually only operates on "hot" code, so that the time that it spends optimizing will have maximum benefit. However, it is possible to set an environment variable that lowers these thresholds, making the specializer try to optimize every bit of code that is encountered. This, combined with the deterministic mode, provides a means to stress test the optimizer, by forcing it to consider and optimize a great deal more code that it usually would do. Running the NQP and Rakudo builds , together with the Perl 6 test suite, in this way can flush out bugs that would not show up when only optimizing hot code. Prior to my work under this grant, failures would show up from this stress testing as soon as the NQP build. After a good amount of bug hunting and fixing, the NQP build and tests, together with the Rakudo build and basic tests, are completely clean under this stresstest. The handful of remaining failures in the stresstest are a result of inlining distorting backtraces (at the moment, the inlined frames do not appear in the backtrace), thus causing some error-reporting tests to fail. The fixes included addressing a long-standing but rarely occurring crash involving the intersection of specialization, multiple dispatch, and destructuring in signatures; a number of different crashes boiled down to this. Another important range of fixes involved eliminating poor assumptions around `Scalar`s and aliasing, which aside from fixing bugs also stands us in a better position to implement escape analysis (which requires alias analysis) in the future. ## Notable results from the specializer work The specialization improvements showed up in a number of benchmarks. Two that are particularly worth calling out are: * The daily `Text::CSV` module benchmark runs hit new lows thanks to the specializer improvements. * The "read a million line UTF-8 file" benchmark that I've discussed before, where Rakudo on MoarVM used to be just a bit slower than Perl 5, is now won by Rakudo. This is a result of better code quality after specialization. ## Improved GC thread sync-up I re-worked the way that garbage collection synchronizes running threads, to eliminate busy-waiting. The idea of the original design was that running threads would quickly react to the GC interupt flag being set on them. However, this presumed that the threads were all really running, which is never a certainty given CPU cores are competed over by many processes. Furthermore, under tools like valgrind and callgrind, which serialize all threads onto a single CPU core, the busy-waiting strategy produced hugely distorted results, and greatly increased the time these useful, but already slow, tools would take. Now the synchronization is done using condition variables, meaning that both the kernel and tools like valgrind/callgrind have a much better idea of what is happening. While callgrind showed a large (10%-15%) reduction in CPU cycles in some multi-threaded programs, the improvements under normal running conditions were, as expected, much smaller, though still worthwhile. ## Other work Along with the improvements described above, I also: * Added support to `Proc::Async` to plumb the output handle of one process into the standard input of another process, with the handles being chained together at the file descriptor level. * Hunted down and fixed a SEGV when many processes were spawned and all gave an error. * Fixed RT #131384 (panic due to bug in error reporting path of ASCII decoder) * Fixed RT #131611 (sigilles variable in coercion could generate internal compiler error) * Fixed RT #131740 (wrong `$*THREAD` after `await` due to lack of invalidation of the dynamic variable lookup cache) * Fixed RT #131365 and RT #131383 (`getc` and `readchars` read too many chars at the end of the file) * Fixed RT #131673 (`is rw` with anonymous parameter reported error incorrectly) * Fixed MoarVM issue 611 (memory errors arising from certain usage patterns of the decode stream) * Fixed MoarVM issue 562 (SEGV from a particular use of the `calframe(...)` function) * Fixed native callbacks when the callback is made on a thread other than the one that passed the callback in the first place * Avoided a linear lookup, knocking 5% off the code-gen phase of compiling `CORE.setting` * Removed the long-unused `Lexotic` feature from MoarVM, which allowed some code cleanup (this used to be used to handle `return`, but it now uses the control exception system) ## Dissemination I wrote a 4-part series on my blog about the MoarVM specializer. The posts walk through the MoarVM specializer's design and functionality, and mention the many improvements done as a result of this grant - explaining why the new way of doing things represents an improvement from the previous way. The posts are: * [Part 1: Gathering Data](https://6guts.wordpress.com/2017/08/06/moarvm-specializer-improvements-part-1-gathering-data/) * [Part 2: Planning](https://6guts.wordpress.com/2017/09/17/moarvm-specializer-improvements-part-2-planning/) * [Part 3: Optimizing Code](https://6guts.wordpress.com/2017/11/05/moarvm-specializer-improvements-part-3-optimizing-code/) * [Part 4: Argument Guards](https://6guts.wordpress.com/2017/11/09/moarvm-specializer-improvements-part-4-argument-guards/) I also travelled to the Swiss Perl Workshop and delivered a talk about the MoarVM specializer, titled "How does deoptimization help us go faster?". The [slides](http://jnthn.net/papers/2017-spw-deopt.pdf) and [video](https://www.youtube.com/watch?v=3umNn1KnlCY) were published online. ## Summary This latest grant extension enabled me to spend a significant amount of time on the MoarVM dynamic optimizer, both addressing bugs as well as overhauling the way information about program execution is collected and used. The new data allows for better decision making, and its availability allowed me to implement some new optimizations. Furthermore, I moved optimization work to take place on a background thread, so as not to interupt program execution. Aside from this work, I fixed other bugs and made some performance improvements unrelated to the dynamic optimizer. Finally, I gave a presentation about dynamic optimization at the Swiss Perl Workshop, and wrote an extensive 4-part blog series explaining the MoarVM optimization process.

Category: Grants

Perl 6 Performance and Reliability Engineering: November 2017

Comments (0)

Sign in to add comment

About TPF

Categories

Popular Tags

Recent Tags

Get Perl

Links