_Nicholas Clark writes:_
As per my grant conditions, here is a report for the August period.
Multiple cores are the future! And we need to enjoy this, because we aren't going to get any choice about it. The current hardware for the hot backup for perl5.git.perl.org has 24 cores, and even mobile phones are thinking about going quad-core.
The upshot of this is that the more you can get the computer to do in parallel, the faster the entire task will complete. On *nix the make has been able to run in parallel for quite a while, and back in 2008 I hacked away on TAP::Harness until we got the tests running in parallel too. (Sadly only on *nix - Microsoft's divergent implementation of socket() frustrates the same approach being used on Win32.)
Well, to be strictly accurate, the make has been able to run in parallel except when it can't, due to race conditions we were previously unaware of. (Just like the tests, actually.) We chip away at them as we find them, and I guess that the best we can hope for is "no known problems", and an increasingly long time since the last "unknown unknown" turned up.
Frustratingly there had been one "known problem" that we hadn't figured out a way to solve. SDBM_File contains a subdirectory in which is the source code to libsdbm.a, and the Makefile.PL for SDBM_File ends up generating a Makefile which has two rules, both of which recurse into ext/SDBM_File/sdbm/ and try to build the same target. Run in parallel, and you can end up with one process deleting libsdbm.a at just the same time that a second process is trying to link it, resulting in that second process bailing out with an error and the build failing. One rule was the default "build my subdirectory" rule generated by ExtUtils::MakeMaker. The second was a custom rule generated by the Makefile.PL to say "to build the dependency libsdbm.a, you need to do this..." and the difficulty was that the rule that we needed to keep was the second one.
What has eluded all of us for a very long time is that actually it's pretty easy to stop ExtUtils::MakeMaker generating that default rule - if you know *how*. So the fix turns out to be quite simple, once you know the secret, and now we think that this bug is finally fixed, and as no others have cropped up in the Makefile for a while, it's starting to look like the Makefile might be bug free and reliable in parallel.
If multiplying is good for cores, then it must be good for BSDs too, as they seem to be breeding like rabbits*. The problem with this is that each BSD ends up having a hints file to tell Configure about its peculiarities and foibles, those hints files keep getting bigger, and when (for example) Bitrig forks off OpenBSD, then the Bitrig hints file starts out as a *near* copy of OpenBSD's. (Although, of course, not identical, as inevitably something key will change, such as the version numbers)
The upshot of adding support for Bitrig was that we now had 3 hints files (OpenBSD, MirBSD and Bitrig), each of which contained the same 59-line test to determine whether the C library has broken 64 bit functions inherited from OpenBSD.
Triplication is never good, and as the code in question looked to be pretty much portable and likely to pass on any sane system, I looked into moving it into Configure, and hence running it everywhere.
Just for a change, this all went strangely smoothly. AIX and HP-UX, both of which have choked on these sorts of things before, built it just fine, as did the less problematic Solaris, FreeBSD and Linux. Various smoke-me smokers reported it clean on Darwin, NetBSD and Cygwin, and BinGOs manually tested on both 32 and 64 bit Cygwin without problems. So it went "in", and the code base got (net) 124 lines shorter.
It would be wonderful to parameterise all the BSDs into one master hints file and a set of options ("oh, it's basically FreeBSD-like, but they use git for a VCS, the NetBSD ports system, but default to ZFS"), but I suspect that that's more work than it's worth. Still, it remains that the hints files for *BSD are already more than twice as long as the Linux hints file, and sadly I don't think that that ratio is going to improve.
Less conclusive was my investigation into dl_unload_all_files(). This feature of DynaLoader dates back 13 years, to just before v5.6.0 was released, and the open ticket is of a similar vintage, "ID 20000402.003" in the old system (migrated to RT as #2968).
The problem that dl_unload_all_files() is attempting to solve is that in an embedded environment, the perl interpreter's exit may not always correspond to the process exit. Worse still, the perl interpreter might be started a second time in the same process, and this ought to work. So to make everything work smoothly, it's important that when embedded, the interpreter frees up all resources that it was responsible for allocating.
One of the more obscure resources that had got forgotten was shared libraries that it had loaded, typically by DynaLoader when loading extensions dynamically. So dl_unload_all_files() was part of an attempt to have DynaLoader track all the shared objects that it loaded, and free them all at interpreter exit. Only it didn't quite work out, and despite the documentation claiming that it was active, it quietly got disabled unless DynaLoader.xs was compiled with DL_UNLOAD_ALL_AT_EXIT.
The summary is that "it's a mess".
The curious good news is that even building the 13-year-old code with the current gcc ASAN shows remarkably few problems. The quality of the core's code has been consistently good for over a decade.
But back to the bad news. Even for the single-threaded case, there are problems. The implementation of dl_unload_all_files() isn't robust. It uses perl_atexit(), which is called after object destruction, but *before* all the unblessed SVs get destroyed. The problem is that this is too early for this use case - anything used later that has a pointer into something dynamically loaded instead has a pointer into garbage. For example, anything that has tie magic will have the vtable point to static memory (in the sense that it's a static object allocated by the C compiler). If the tie was set up by XS code and that memory was in the shared object, that memory isn't there any more, but the SV still points to where it was. When the full global destruction happens it will attempt to call the relevant vtable routine to clear up the tied scalar, follow the pointer into the abyss, and then crash (or worse). The actual problem that highlighted this was the OP free hook in File::Glob, which again is a pointer from something "owned" by the interpreter into memory "owned" by the shared object.
It might be possible to fix these global destruction problems by adding a hook that gets run much much later, after all the SVs are destroyed, pretty much as the last thing before perl_destruct() returns. However, this won't fix the multi-threaded case. It's clear that no-one is building with it with ithreads (or using it with Win32 pseudo-forks), because there is an explosion of fail if you do. The problem is that DynaLoader is using perl_atexit() to unmap shared libraries, and that's per *interpreter* exit. So as soon as an ithread is spawned, its exit triggers the unmapping of all shared objects, and the next time the parent thread tries to use one, kaboom!
I think to properly solve the "at interpreter exit" cleanup we'd really need to change dl_load_file() to internally track what it's been called for (tracking shared across threads), and (I think) how often, with ownership being upped on thread clone. At which point, interpreter exit is safe to call dl_unload_all_files(). But only really late, when the top level parent interpreter exits. And given that we've lived with this bug for over 13 years now, it doesn't seem to be a high priority to fix.
Something else that proved to have more "depth" than was initially obvious was a bug for threads::shared (RT #119089), where the test case demonstrated that shared references returned from subroutines could disappear on a return. The submitter's concise test case enabled Dave Mitchell to quickly determine that the reported problem had been fixed by a commit in the core C code back in January 2012, and Dave also suggested a workaround to avoid the bug. Thus it seemed like case closed.
However, I had my suspicions about that change (6f48390ab209d16e), and whether it was just fixing the symptoms rather than the cause:
bq. [perl #95548] Returned magical temps are not copied
bq. return and leavesub, for speed, were not copying temp variables with a refcount of 1, which is fine as long as the fact that it was not copied is not observable.
bq. With magical variables, that *can* be observed, so we have to forego the optimisation and copy the variable if it's magical.
bq. This obviously applies only to rvalue subs.
What raised my concerns was that if assertions are enabled then the test case triggers an assertion failure on v5.14.0. If assertions are not enabled, it crashes. So I went looking back in history to find what introduced the assertion failure, and it turns out that there's a heck of a lot more to this than first meets the eye.
So, what caused the assertions to start failing between v5.12 and v5.14? A simple bisect suggests that it's a commit to use cBOOL for some casts, which has nothing directly to do with threads::shared, or with (not) copying values on subroutine return. It turns out after a bit of digging that it's this section of that commit that matters:
bc. diff --git a/mg.c b/mg.c
index 3fb8ec4..4a8d767 100644
@@ -193,7 +193,7 @@ Perl_mg_get(pTHX_ SV *sv)
const I32 mgs_ix = SSNEW(sizeof(MGS));
- const bool was_temp = (bool)SvTEMP(sv);
+ const bool was_temp = cBOOL(SvTEMP(sv));
bool have_new = 0;
MAGIC *newmg, *head, *cur, *mg;
/* guard against sv having being freed midway by holding a private
That hunk is deep in the routine that deals with get magic.
It's not obvious from the code above, but SvTEMP(sv) returns either 0 or 0x00080000, and in 2010 @bool@ was typedef'd as @char@, so the above code always set was_temp to 0, due to truncation. The cast was added to avoid a compiler warning about the truncation. The (buggy) truncation itself was added when the type of was_temp was changed from a suitably large integer to a @bool@ as part of a commit whose only description was "Adding const qualifiers" (ie the commit message doesn't even note that some variables' types were changed).
The variable in question is used to enable Perl_mg_get() to retain the SVs_TEMP bit set on values it is processing. This is needed because Perl_sv_2mortal() conflates "temporary" with "mortal" and automatically sets the SVs_TEMP flag bit on everything. Hence this code seeks to remove SVs_TEMP if Perl_sv_2mortal() set it. The upshot of the bug was that SVs_TEMP was always being removed, so the code on subroutine exit that does *not* copy TEMPs is skipped, because the value has become unmarked. Hence the values are always copied.
So, all that conceals what's actually going on. When that truncation bug was first introduced, it didn't matter that TEMPs weren't copied. When that bug was fixed, it mattered a lot that TEMPs weren't copied. What changed in the middle?
Finding out that involved bisecting between the two commits, patching the bug described above. Porting/bisect.pl makes even this pretty easy, as it can automatically apply user supplied patches to the tree before building. The true trigger of the behaviour regression was a change in March 2010 to how magic is handled on elements from magical arrays and hashes, which fixed bugs most visible with pos. The key part of that change was 1 line which got rid of a copy of a value. Hence that change removes one place that copied the return value, and the change with the cast fix removes the other place, hence why both are needed to trigger the problem. And hence why the identified patch fixes it, by re-instating a copy.
So, I assumed that taking blead, and reverting the patch that fixes the test case, would cause the test case to fail again. This is easy to check, and so I verified this assumption.
The assumption is wrong. More bisecting revealed that subsequently this change also has an influence:
bc. commit 4bac9ae47b5ad7845a24e26b0e95609805de688a
Author: Chip Salzenberg
Date: Fri Jun 22 15:18:18 2012 -0700
bq. Magic flags harmonization.
because part of it removes the entire was_temp (micro?) optimisation logic mentioned above, meaning that values will always be copied.
(Remember, the bug became exposed because the value ended up not being copied. If it's copied once, or twice (because a copy is copied), then the bug is hidden. And it's this insensitivity to the *number* of copies that makes it such a crafty little critter.)
So that explains the given test case. But the story is still not over.
Whilst temporary values should be copied (by default), LVALUEs returned from :lvalue subs should not. Knowing this, I could easily convert the test case to use :lvalue subroutines and make everything crash (v5.14.0, blead, all in between).
I believe that the problem is because threads::shared wrongly assumes that the proxy for a shared aggregate will always outlive any LVALUE proxies for its elements. This is generally true, but not for element proxies returned from :lvalue subroutines for aggregate proxies which go out of scope with the subroutine.
I believe that the fix is to change threads::shared to deal with cleanup differently, so that the last magic struct which frees up the relevant mg_obj is responsible for cleaning up shared space. This is a reasonably involved change to some complex code, and it's still pending, awaiting a confident second opinion that I got it right and that it's good to merge.
The full explanation is in the middle of RT #119089, and most easily viewed as "https://rt.perl.org/rt3/Ticket/Attachment/1236123/643477/":https://rt.perl.org/rt3/Ticket/Attachment/1236123/643477/
Last month it was HP-UX doing something strange. This month it's AIX, as an innocent looking test had started failing on it, but not on other Unix systems. Specifically, t/op/fork.t, which had not been modified, was now failing in two places, where previously it used to pass. This implies that some unrelated change had unfortunate side effects, but what? And why?
Fortunately this is just the sort of job that git bisect excels at, and now that most of it is automated it only takes a small amount of setup effort to get the computer busy figuring out the answer. Which turned out to be a perfectly innocent looking commit that fixed a frustrating bug in a harness function that t/op/fork.t was using:
bq. commit 684b0ecaad281760b04dd2da317ee0459cafebf6
Author: Tony Cook
Date: Tue Jul 16 14:57:20 2013 +1000
bq. [perl #116190] feed an empty stdin to run_multiple_progs() programs
bq. Two tests for -a were attempting to read stdin and blocking with the -a implies -n change.
So why does this cause a problem? And why only on AIX?
run_multiple_progs() runs its programs using backticks. (backticks turn out to be the only sufficiently portable way to easily capture output.) The code it calls already had a feature to feed things to STDIN, which for various reasons is done by using a pipeline from another perl process. So what that commit does is change the generated backtick command from
bc. `./perl -Ilib -e ...`
to

bc. `./perl -e "print qq()" | ./perl -Ilib -e ...`
so that stdin is at EOF, and the thing we're testing can't end up hanging if it inadvertently reads STDIN.
It turns out that this causes fun on AIX with two tests, because on AIX /bin/sh is actually ksh, and ksh does pipes differently (with one fewer process). With sh, for the latter command line the sh process forks two children, which use exec to start the two perl processes. The parent shell process persists for the duration of the pipeline, and the second perl process starts with no children. With ksh (and zsh), the shell saves a process by forking a child for just the first perl process, and execing itself to start the second. This means that the second perl process starts with one child which it didn't create. This breaks the tests' assumption that wait (or waitpid) will only return information about processes started within the test. One can replicate this on Linux:
bc. $ sh -c 'pstree -p $$ | cat'
$ ksh -c 'pstree -p $$ | cat'
Aside from who gets the child process, the behaviour is identical, because both structures end up with the exit status of the pipeline being that of the last command - in the ksh case because the last command *is* the parent, in the /bin/sh case because the sh process ensures that it calls exit() with the exit status of the child that it reaped.
The problem is that the tests in t/op/fork.t *are* sensitive to who gets the child process, because they are counting how many children get reaped. I thought about fixing the tests to make them immune to unexpected extra child processes, but then realised that it was probably easier to change the generated backtick command to be