July 2018 Archives

Timo has developed a shiny new UI for the profiling tool. It displays thread-level garbage collection details.

Read more at: Wow, check out this garbage


Wow, check out this garbage

Hello everyone! It's been more than a month since the last report, but I've been able to put hours in and get code out. Now I'll show you what has come out of the last few weeks.

The Garbage Collector

One important aspect of dynamic languages is that memory is managed for the user. You don't usually see malloc and free in your Perl 6 code. But objects are constantly being created, often "in the background". Asking the user to free up space again for objects that may have been created and become obsolete within the same line of code sounds like a recipe for user dissatisfaction.
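
As a tiny, hand-wavy illustration (this exact line isn't from the post), even innocent-looking code allocates a bunch of objects behind the scenes:

    # allocates Str objects for each word, an Int per length,
    # a Seq for the mapping, and the Array holding the results -
    # none of which the user ever has to free explicitly
    my @lengths = "the quick brown fox".words.map(*.chars);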

To take this potential full-time job off the user's hands, MoarVM employs Garbage Collection, specifically a scheme based on "reachability". Whenever space needs to be freed up, the process begins. Starting from a set of objects that MoarVM knows for sure are currently needed, references from one object to another are followed until everything "reachable" has been reached and "marked" for being kept. Afterwards, everything that was created but not "marked" is removed in a step commonly called "sweeping". Garbage collectors following this scheme are called "Mark & Sweep Garbage Collectors". You can find a description of this scheme in the "Tracing Garbage Collection" Wikipedia article, though MoarVM's garbage collector has a few additional tweaks for performance reasons.
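
To make the scheme concrete, here is a deliberately toy-sized sketch of the mark & sweep idea as Perl 6 pseudocode; the real collector is C code inside MoarVM, and every name here (refs, mark-and-sweep, and so on) is invented for illustration:

    # @roots: objects known to be needed; each object's .refs lists
    # the objects it references (a made-up convention for this sketch)
    sub mark-and-sweep(@roots, @heap) {
        my $marked = SetHash.new;
        my @todo   = @roots;
        while @todo {
            my $obj = @todo.shift;
            next if $marked{$obj};
            $marked{$obj} = True;        # mark: reachable, keep it
            @todo.append: $obj.refs;     # follow references onward
        }
        @heap .= grep({ $marked{$_} }); # sweep: drop everything unmarked
    }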

Naturally, going through a large amount of objects consumes time. Seeing what the program spends its time doing is an important part of making it perform better, of course. That's why the MoarVM profiler records data about the Garbage Collector's activities.

GC in the profiler

The old HTML profiler frontend already showed you how many GC runs happened during your program's execution. For each time the GC ran, you'll see how long the run took and whether it went through the nursery only (a "minor" collection) or everything on the heap (a "major" collection, called "full" in the tools). I recently added a column displaying when the run started, in milliseconds from the start of recording. However, the old profiler frontend didn't do anything with information recorded by threads other than the main thread. Here's a screenshot from the old frontend:

[Screenshot: the GC table in the old profiler frontend]

I'm not quite sure why it says it cleared about 4 gigabytes of data during the first two runs, but that'll go on the pile of things to check out later.

For now, I can use this to explain a few things before I go on to show the first version of the new interface and what it does with multithreaded programs.

The profile was taken from a hypered (i.e. automatically multithreaded) implementation of the fannkuch benchmark. The rightmost column shows how much data was copied over from the nursery to the next nursery, how much was taken into the "old generation" (either called "old" or "gen2"), and how much was freed.
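
For reference (a generic sketch, not the benchmark code itself), hypering in Perl 6 only takes a small change to the serial version, after which batches of work are spread over a pool of worker threads:

    my @squares = (1..1_000_000).map(* ** 2);        # serial
    my @hyper   = (1..1_000_000).hyper.map(* ** 2);  # automatically multithreaded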

There's also a count of how many references there were from objects in the old generation to objects in the nursery, the "gen2 roots". This happens when you keep adding objects to a list, for example. At some point the list becomes "old" and fresh objects that are inserted into it have to be specifically remembered so that they are considered reachable, even if the old objects aren't fully analyzed.

The new frontend

[Screenshot: the GC section of the new profiler frontend]

Looking at the next screenshot, which shows the GC section of the new profiler frontend I've been working on, you'll see it looks very different. The existing information is re-organized. Instead of a column of "time taken" with a bar and "time since start", this info is now in simple text columns in addition to bar charts at the top. You can now expand individual GC runs to get at the other information from the previous profiler: The amounts kept, promoted, and freed, as well as the inter-generational roots. However, they are now per-thread, and there's an additional bar chart. It's lacking a legend or title right now, but if it had one, it'd say that it's the individual start and end times of each thread's GC work, in milliseconds. The little asterisk denotes which thread was responsible for kicking off the GC run.

You'll surely notice another difference: there are a whole lot more bars in the top bar chart than there were entries in the GC table of the old profiler frontend. The simplest explanation would be that I just took a profile from a different program; however, the same code generated both profiles. The difference comes from the fact that the old frontend currently only displays the GC runs it finds in the first thread's list. Those correspond to the entries in the new list that feature a 1 in the "threads" column.

The near future

There are many more things I want to do with this tab, like getting statistics on a per-thread rather than per-run level, and maybe a "participation tally" graph that would show which threads participated at which times. The graphs also show details I hadn't noticed before. For example, what causes some threads to join in near the end, and does that force the other threads to wait for a long time before doing anything at all? The screenshot below is an example of that pattern.

[Screenshot: a GC run pattern where some threads join in near the end]

The questions raised by this fresh perspective on data we have long been capturing have already made it worth quickly shoving the GC tab in between the other parts of the frontend. I had already started the Routine Overview tab, but it's not quite ready to be shown off. I hope it'll be pretty by the time my next report comes out.

Thanks to The Perl Foundation for the grant, and masak for this beautiful closing paragraph:

In conclusion, all of the people who oppose my plan for world domination will be treated in the most appropriate manner. I wish nothing but the best for my enemies. May you rest easily at night, and may your futures bloom and burgeon!

- Timo

Call for Grant Proposals (July 2018 Round)

The Grants Committee accepts grant proposals all the time. We evaluate them every two months, and another evaluation period is upon us.

May 2018 Grant Votes

The Grants Committee has concluded the voting of the May 2018 round. My apologies for the delay.

There were two proposals this round, but neither submission was approved.

Jonathan Worthington has requested an extension of $10,000 for his Perl 6 Performance and Reliability Engineering grant. This will allow him to dedicate another 200 hours to this work.

His previous work was successfully delivered, as described in his latest report.

Jonathan writes: "A further grant extension will enable me to continue my work on Perl 6 performance and reliability. The 200 hours I just completed have seen both optimizations delivered as well as work to pave the way for future optimizations; a further 200 hours will be spent largely on delivering those, together with the usual range of other fixes."

The funding for this extension would come from the Perl 6 Core Development Fund.

Before we make a decision on this extension we would like to have a period of community consultation that will last for seven days. Please leave feedback in the comments.

This report describes what I have accomplished in the latest 200 hours of my Perl 6 Performance and Reliability Engineering grant.

A new MoarVM specializer plugin mechanism

The most significant new optimizations developed in this grant period center around a new mechanism for helping the MoarVM specializer, which optimizes and JIT-compiles hot code, to understand a wider range of Perl 6 constructs. I used this mechanism to speed up:

  • Private method calls taking place within roles (6.5x faster)
  • Qualified method calls like $obj.Foo::meth (12x faster)
  • Duck dispatches like $obj.?foo (2.5x faster if there is usually only one type really seen at the callsite in a particular program, and 1.5x faster otherwise)

This mechanism was also used in a re-work of Scalar containers, discussed a little later in this report.
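
To make the sped-up call forms from the list above concrete, here is a small sketch of each one; Foo, Bar, meth, and foo are placeholder names, not code from the report:

    role Foo {
        method !secret { 'hidden' }
        method meth    { self!secret }  # private method call within a role
    }
    class Bar does Foo {
        method meth { 'overridden' }
    }
    my $obj = Bar.new;
    say $obj.Foo::meth;   # qualified method call: explicitly picks Foo's meth
    say $obj.?foo;        # duck dispatch: calls foo if it exists, else Nil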

MoarVM specializer usage analysis improvements

Various optimizations depend on having good information about which values are used, and where, in the program. Previously, we maintained a simple usage count, incrementing it to account for places where the value would be needed if the code were deoptimized (that is, if we had to return to the interpreter because a speculative optimization turned out to be wrong).

First, I introduced use chains, which allow easy finding of instructions that use the result of an operation. Along with this, I implemented a debug checker that makes sure the usage information is maintained correctly as optimizations are performed. This uncovered a number of usage bugs, all of which I fixed.

I used these chains to improve the elimination of useless set instructions (which copy the contents of one VM register into another). Various optimizations reduce other instructions to set, so this is an important cleanup step that saves wasteful work in the optimized code. The previous implementation of this optimization followed a number of ad-hoc rules, didn't rewrite the graph in a way that upheld the SSA invariants, and still left behind a number of set instructions that clearly could have been eliminated. The new algorithm, using the use chains, deletes more useless set instructions and performs just two different transformations, which between them cover what the ad-hoc rules did beforehand, and more.

I also used the chains to implement box/unbox pair analysis, discovering places where we box a value into an object and then unbox it later. In that case we can simply use the original unboxed value, and potentially eliminate the box operation and the memory allocation that goes with it. This optimization works; however, it is not yet very effective, since it cannot eliminate a box that took place inside an inlined call when the unbox happens outside of it.
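
In user-level terms, the box/unbox pattern looks roughly like this (the analysis itself operates on the VM instructions the compiler emits, not on source code):

    my int $n = 42;        # native, unboxed integer
    my $x = $n;            # boxes $n into an Int object (an allocation)
    my int $m = $x + 1;    # unboxes it again for native arithmetic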

The immediate reason that elimination of a box instruction in inlined code could not take place is that we don't yet have a way to delete guard instructions we can prove are no longer needed, and the inlining code places a guard on the return value's type. However, even if we could do that, we didn't have enough information to establish that deoptimization could no longer happen at that point, and so we'd still retain the box that we'd really like to get rid of.

In the course of this grant period, I took on the second part of the problem, which is the more difficult of the two. This involved designing a new algorithm to build more detailed information about what computations have to be retained for the sake of surviving a deoptimization. Then, if we prove that such a deoptimization can never happen, we may delete the instructions we retained just in case.

Taken together, this work has given the optimizer far more detailed usage information to work with, has already aided some optimizations, and will play an important part in future optimizations too. Along the way, the optimizer debug output was augmented to include this new information, which helps those working on the optimizer to better understand why instructions are being retained.

More inlining

Inlining is an important optimization that sees the code of small callees being incorporated into the caller, so they can be analyzed and optimized as a single unit. It also eliminates the overhead of making the call. For Perl 6, inlining is really important, because even basic operators like + are a multiple-dispatch subroutine call.
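
For example, until the call is inlined, even a simple numeric loop pays call overhead on every iteration, because + resolves to the multi sub infix:<+>:

    my $sum = 0;
    for 1..1_000_000 {
        $sum = $sum + $_;   # each + is a multiple-dispatch subroutine call
    }                       # until the specializer inlines it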

Previously, we would only inline calls where we had already optimized the callee. This usually worked out well, but in some cases the strategy fell down and the inlining opportunity was missed. With the latest changes, MoarVM can now perform the inlining in such cases anyway.

Further, I enabled speculative inlining of void context calls. Previously that had not mattered much, as we didn't generate very many of them. However, I implemented a code-generation improvement in NQP that led to many more void invoke instructions being generated. This got rid of a wasteful throwaway boxing operation when the last instruction of a loop block produced a natively typed value.
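
A hedged sketch of the kind of code that benefits: the last statement of the loop body produces a natively typed value, and before the change its boxed result was produced only to be thrown away:

    my int $i = 0;
    while $i < 1_000_000 {
        $i = $i + 1;   # the block's final statement yields a native int;
    }                  # boxing that value just to discard it was wasted work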

Scalar improvements

Scalar containers are one of the most used, and most allocated, types in most Perl 6 programs. For this reason, much of the code dealing with the most common operations on Scalar containers was originally written in C. This made sense when MoarVM lacked sophisticated optimization infrastructure. Today, however, that C code is an opaque blob the optimizer cannot see into. We'd like to implement optimizations that could, in some cases, completely eliminate the allocation of short-lived Scalar containers, but the optimizer's inability to see inside that code blocks this.

To enable all of that to happen, it was necessary to revisit Scalar. A further problem was that auto-vivification was done in a way that was hard to optimize. I worked out a way to solve that problem and, at the same time, make Scalar one attribute smaller, saving 8 bytes (assuming a 64-bit environment) off every Scalar container.

I used the new specializer plugin mechanism as part of my changes to Scalar containers. This opens up the possibility of eliminating the assignment type check and the Nil-assignment special case when the optimizer already knows the type being assigned.
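
For context, every scalar variable lives in a Scalar container, and assignment into it normally performs the type check and Nil handling mentioned above:

    my Int $x = 42;   # a Scalar container with an Int type constraint
    $x = Nil;         # Nil special case: resets $x to the container default (Int)
    $x = "nope";      # assignment type check fails: X::TypeCheck::Assignment

When the optimizer can prove the type being assigned, the specializer plugin lets it drop those checks.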

This work fixed a number of shortcomings in the original container handling code. One of those fixes (to Proxy handling in return) has been temporarily backed out, since a number of ecosystem modules were relying on the buggy behavior. A further, more subtle, precompilation-related issue - again an improvement on the surface, but one with an interesting interaction with an ecosystem module - is still under investigation.

These issues aside, I completed the Scalar overhaul, putting us in a good place for the next round of interesting optimizations.

Working on inlining, lookup, and context introspection problems

Inlining is essential for Perl 6 performance, however some features are quite difficult to provide in combination with it. As inlining has become more and more capable, these problems have been exposed. I spent some time on addressing them, and implemented a solution for most cases.

The problem shows up in operations that want to walk caller and outer chains. Inlining flattens away caller structure, which therefore has to be "reassembled" sufficiently to answer the lookup or introspection in question. That's tricky, but it was already being done in, for example, dynamic variable lookup. Other kinds of lookup could not handle it, however.

I introduced a new abstraction - the frame walker - that exposes an iterator over frames as they would exist had inlining not taken place. This allowed for a great simplification to the dynamic variable lookup code, followed by fixing a number of other forms of lookup.

This made things a great deal better. What remains is to deal with some of the most late-bound uses of pseudo-packages like CALLERS, where they are stored away rather than used for an immediate lookup. This is, thankfully, rarely done, but it still needs a fix.
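
A minimal example of the kind of late-bound lookup involved: dynamic variables are searched for along the caller chain at runtime, so the lookup must behave as if every inlined frame still existed:

    sub inner {
        say $*state;            # dynamic lookup walks the callers for $*state
    }
    sub outer {
        my $*state = 'found';   # even if the call to inner gets inlined,
        inner;                  # inner must still find outer's $*state
    }
    outer;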

Regex compilation and evaluation improvements

I made a number of improvements to regex compilation and evaluation. The most significant improvement helped with deciding which protoregex or alternation branch to pick in the case that there was a huge number of different literal prefixes. This was the case for parsing the infix category in the Perl 6 grammar, where there are a huge number of choices; the issue also shows up for terms. The search for literal prefixes now has logarithmic rather than linear complexity. In the compilation of the Rakudo CORE.setting, this led to a 13% reduction in the time spent performing Longest Token Matching, and took around 1% off the overall compilation time (which involves much more than just parsing, though parsing is a significant part of it). User grammars which also involve alternations or protoregexes with a large set of literal prefixes will also benefit, since they use the same engine as we use to parse Perl 6 source code.
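
The shape of grammar that benefits looks something like this sketch (made-up token names; Perl 6's own infix category has vastly more candidates):

    grammar Ops {
        proto token infix {*}
        token infix:sym<+>  { <sym> }
        token infix:sym<->  { <sym> }
        token infix:sym<*>  { <sym> }
        token infix:sym<**> { <sym> }
    }
    # Longest Token Matching picks the candidate by literal prefix;
    # that search is now logarithmic rather than linear
    say Ops.subparse('**', :rule<infix>);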

I also worked to reduce some of the boilerplate code that is emitted when compiling regexes, by analyzing whether it was needed. For example, the capture stack handling code need not be generated if we can see that nothing is captured.

Finally, I made the calling of action methods a little cheaper by providing a way to avoid a can/findmethod sequence of instructions; that sequence required two lookups in the method cache, whereas now only one is needed.

Assorted additional performance improvements

  • Fixed a performance bug in the thread pool scheduler where some sequences of operations would lead to an ever-growing call stack, thanks to mis-handling of continuations
  • Reduced the number of allocations in some Rat operations
  • Reduced the number of allocations in adverbed hash indexing operations, as well as other code using coercion types
  • Specialization of isnull and tryfindmeth

Assorted other fixes

  • Fixed a memory leak when a hot loop repeatedly forced GC
  • Fixed a memory leak involving Promise.in(...)
  • Added missing functionality to get the process ID of a spawned asynchronous process
  • Fixed a bug involving incorrect caching of type variable lookups
  • Marked a routine using samewith as not being eligible for inlining, fixing a problem that could occur if that happened
  • Fixed missing GC marking of contexts held by a serialization context, which showed up as an occasional segfault in some longer-running Perl 6 programs
  • Fixed a bug involving lazy deserialization and vivification of lexicals in combination with inlining
  • Marked getaddrinfo as blocking, so it taking a long time would not block GC and thus progress on other threads
  • Fixed an IO::Socket::Async resource leak, and also a regression introduced by work to provide the host/port of a connection
  • Fixed a bug where the results of callframe would become unpredictable if an onlystar proto was inlined
  • Fixed a Rakudo code-gen bug for C-style loops, whereby if it were the last statement in that sub, it would try to return a null, not a Nil like other loop constructs
  • Corrected an off-by-one uninlining bug that could occur when an instruction at the very end of an inline was deleted
  • Fixed return handlers of calls made in void context not getting the return value

Enabling others

I also took the time to write detailed replies on a number of Rakudo tickets to explain various issues, especially performance ones, so that others may be able to look into them.

Dissemination

I wrote five blog posts about this work.

I also gave a talk at the German Perl Workshop about the MoarVM optimization infrastructure.

The TPF Board of Directors approved the extension of these two grants under the Perl 5 Core Maintenance Fund.

The Board also approved allocating $40,000 to the Perl 5 Core Maintenance Fund to enable the above.

We thank the TPF sponsors for making these grants possible.

May/June Status

Apologies for the delay; the Grants Committee is now conducting the voting for the May/June round and will post results next week.

Thank you for your patience.

Please look for the next Call for Proposals before the end of the month for the July/August round.

Dave Mitchell has requested an extension of $20,000 for his Maintaining Perl 5 grant. This will allow him to dedicate another 333 hours to this work. During this grant he sent regular reports to the p5p mailing list as well as providing monthly summary reports that have been published on this site.

Before we make a decision on this extension, we would like to have a period of community consultation. Please leave feedback in the comments field below by July 10th 2018.

The funds for this extension would come from the Perl 5 Core Maintenance Fund.

This is a monthly report by Dave Mitchell on his grant under the Perl 5 Core Maintenance Fund. We thank the TPF sponsors for making this grant possible.

I spent the last month mainly working on:
* blocking issues for the 5.28.0 release
* BBC (Blead Breaks CPAN) tickets
* ASAN-related tickets


SUMMARY:
      2:22 RT #133204 Integer overflow leading to buffer overflow
     15:29 RT #133270 Blead Breaks CPAN: HMBRAND/DBD-CSV-0.53.tgz
      5:03 RT #133271 Blead Breaks CPAN: BRUMMETT/Devel-hdb-0.23.tar.gz
      1:06 RT #133305 heap-buffer-overflow in Perl_my_atof2
      1:22 process p5p mailbox
      2:30 review CPAN breakage
    ------
     27:52 TOTAL (HH:MM)

 245.9 weeks
3127.7 total hours
  12.7 average hours per week

There are 5.3 hours left on the grant.

Timo is still not completely well. However, he has been able to make good progress:

  1. Allocations from optional parameters are now correctly logged

  2. A crash that occurred when the profiler was used on multithreaded code has been fixed

Read more at: No Major Breakthroughs


No Major Breakthroughs

Sadly, the time since the last post on this blog hasn't been fruitful with regards to the profiling project. There have been slight improvements to the profiler inside MoarVM, like handling profiles with a very deep call graph better, making the first GC run show up again, capturing allocations from optional parameters properly, and hopefully finally making multithreaded programs no longer crash during the profile dumping phase. A recently merged branch by esteemed colleague brrt will allow me to properly fix one nasty inlining-related issue that remains in the profiler.

Even though I can't show off lovely screenshots of the profiler UI (if you consider Coder's Art™ lovely), I can briefly go over the changes that have happened and what's next on the list. And of course I'm still very much interested in finishing the grant work!


Missed Optional Parameters

The first change I'd like to talk about is the one that was causing allocations from boxing optional parameters to go missing from the profile. Optional parameters are implemented as an op that checks the passed arguments to see whether something was present or not. It then either runs code to put the default value in, if no argument was present, or skips over that code. Additionally, it handles arguments that were passed as native ints, nums, or strings.

If an object was expected by the code that uses the parameter, this op will also create a box for the value, for example an Int object. The crucial mistake was in the instrumentation by the profiler.

Finding everything that is allocated is done by putting a little "take note of this object" op after every op that may create an object. This op then checks if the object was probably allocated by the last instruction, or if it was probably already logged earlier. If it was just allocated, that allocation is recorded for the profile.

The problem in this case lies in the placement of the logging op: It was placed right after the instruction that grabs the argument. However, that made it land in the place that gets skipped over if an argument was present. So either no argument was passed, and the logging op was just asked to log that a null was allocated, or an argument was passed that was perhaps boxed, and the logging op was skipped over. Oops!

Fixing this was simply a matter of following the skip and putting the logging op in the right place.
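
The user-level shape of the affected code is simply a sub with an optional parameter (a sketch, not the original test case):

    sub f($x = 0) { $x + 1 }   # $x is an object parameter with a default
    my int $n = 10;
    f($n);   # argument present: the native int is boxed into an Int for $x,
             # and that allocation is what used to go unlogged
    f();     # argument absent: the default-value code runs instead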

Multi-threaded Programs Crashing Mysteriously

If you used the profiler on code that runs multiple threads, you may have seen very suspicious looking internal error messages like "const_iX NYI" pop up. This was caused by the instrumentation aspect of the profiler, more specifically what it did when the instrumentation was no longer needed. Allow me to explain:

Instrumentation in this context refers to creating a version of the program bytecode that does some extra work in the right places. For the profiler this includes putting ops in the code that record that a function was called or exited, and ops that record allocations of objects.

This instrumentation happens lazily, i.e. when a function is entered the first time, it runs into the "instrumentation barrier", which pauses the program and creates the instrumented code right then and there. The instrumented code then gets installed and the program continues. This is implemented by having a global "instrumentation level" that just gets increased by 1 every time functions should go through an instrumentation step. This is done when profiling starts, and it is done when profiling ends.
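
In pseudocode, the barrier behaves something like the following sketch; this is not MoarVM's actual C code, and all names here are invented:

    my $global-level = 0;   # bumped by 1 when profiling starts, and again when it ends

    sub enter-frame($frame) {
        if $frame.seen-level != $global-level {   # instrumentation barrier
            re-instrument($frame);                # (re)generate & install bytecode
            $frame.seen-level = $global-level;
        }
        run($frame);
    }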

Here's where the problem lies: Profiling is turned on before user code runs, which just happens to always be in single-threaded territory. However, profiling gets turned off as soon as the main thread is finished. This is done by increasing the instrumentation level by 1 again. Every function that is entered from now on will have to go through instrumentation again, which will restore the original bytecode in this case.

Other threads might still continue running, though. The first example that made this problem clear was finding the 1000th prime by grepping over a hypered range from 0 to infinity. Crucially, after finding the 1000th prime, some workers were still busy with their batch of numbers.

Here's where the instrumentation barrier becomes a problem. One of the remaining workers calls into a function, for example is-prime, for the first time since the instrumentation level was changed. That function will have its instrumented bytecode replaced by the original bytecode. However, the other threads, which may still be inside is-prime in this example, will not know about this. They keep happily interpreting the bytecode when, all of a sudden, the bytecode changes.

Since the uninstrumented bytecode is shorter than the instrumented bytecode, the worst case is that it reads code past the end of the bytecode segment, but the more common case is that the instruction pointer just suddenly points either at the wrong instruction, or in the middle of an instruction.

Instructions usually start with the opcode, a 16-bit number usually between 0 and 1000. The next part is often a 16-bit number holding the index of a register, which is usually below about 40, and quite often below 10. If the interpreter accidentally treats a register number as an opcode, it will therefore often land on ops with low numbers. Opcode 0 is no_op, i.e. "do nothing". The next three ops are const_i8 through const_i32, which all just throw the exception that I mentioned in the first paragraph: "const_iX NYI". Two spots further ahead is the op const_n32, which also throws an NYI error.

And there you have it, mystery solved. But what's the solution to the underlying problem? In this case, I took the easy way out. All the profiling ops first check if profiling is currently turned on or not anyway, so leaving the instrumented code in after profiling has ended is not dangerous. That's why MoarVM now just keeps instrumentation the same after profiling ends. After all, the next thing is usually dumping the profile data and exiting anyway.

The Next Steps

The MoarVM branch that brrt recently merged is very helpful for a very specific situation that can throw the profiler off and cause gigantic profile files: when a block has its bytecode inlined into the containing routine, and the inlined block had a "return" in it, the return has to "skip" over the inner block, since blocks don't handle returns themselves.

However, the block is still counted as a routine that gets entered and left. The long and short of it is that returning from the inner block jumps directly to the exit, while having the block inlined frees us from the whole "create a call frame, and tear it down afterwards" dance. That dance would have included telling the profiler that a frame was exited "abnormally": since the regular prof_exit op that records the exit is skipped over, tearing down the frame is where the logging would have happened.

In this particular case, though, no exit would be logged! This makes the call graph - think of it like a flame graph - look very strange. Imagine a function being called in a loop, and returning from an inner block as described above. It would miss all of the exits, so every time the function is called again, it will look like the function called itself, never returning to the loop. Every time around the loop, the call will seem to be nested deeper and deeper. Since the profiler keeps around the whole call graph, the file will just keep growing with every single iteration.
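
A sketch of the problematic shape (made-up code): a routine that returns from inside a block, called over and over from a loop:

    sub find-first(@values) {
        for @values -> $v {        # the for body is a block; once inlined,
            return $v if $v > 5;   # this return jumps straight to the exit,
        }                          # and no exit was ever logged for it
        Nil;
    }
    find-first(1..10) for ^1_000;  # each call seemed nested one level deeper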

Now, how does brrt's code change this situation? It makes it very easy to figure out how many inlines deep a "return from this routine" op is, so that the profiler can accurately log the right number of exits.

On the UI side of things, I want to bring the routine overview list into a good state that will finally be worth showing. The list of GC runs will also be interesting, especially since the profiler recently learned to log how each individual thread performed its GC run, but the current HTML frontend doesn't know how to display that yet.

Hopefully the wait for the next post on my blog won't be as long as this time!
- Timo

I'm happy to note that the Curating and improving Perl6 documentation grant has submitted a final report, and the work has been accepted by the Committee.

Thanks to JJ for his work on this project, and also to Mark for acting as grant manager.

About TPF

The Perl Foundation - supporting the Perl community since 2000. Find out more at www.perlfoundation.org.

About this Archive

This page is an archive of entries from July 2018 listed from newest to oldest.

June 2018 is the previous archive.

August 2018 is the next archive.

