Improving Perl 5: Grant Report for Month 15

No Comments

Nicholas Clark writes:

As per my grant conditions, here is a report for the December period.

A frustrating month.

Having completed the investigation into the state of hashing described in November's report, and becoming comfortable that it wasn't likely to explode without warning, I turned to dealing with the backlog of everything else. Given that I've only been able to do about 2 weeks' work in the past 2 months, routine traffic on the list means that a lot of "stuff" which has accumulated which I've not had time to deal with.

The obvious thing to fix first was the failing smoke testing reports. Smoke testing reports are great as long as they pass. Starting from the known state of "passing", you make a change, and if everything still passes you have reasonable confidence that you didn't break anything. Alternatively, if the tests start to fail, you quickly know that your change has an unexpected problem, and a clue as to where to start investigating.

Failing smoke tests are frustrating because you no longer have this clean distinction between "it passes" and "it fails, hence I introduced a problem". The tests fail, but they were going to fail anyway. So you have to look carefully to work out whether the failures you see are new (and hence you broke something), or the ones that were there before (and hence you probably didn't break anything, but you can't be sure).

The right thing to do is to fix the problems that are causing the smoke tests to fail, before trying to change anything else. In this case the problem is with the smoker that George Greer runs which is building perl with using clang's Address Sanitizer, a tool which looks for C programming errors such as uninitialised and out of range memory access. It's particularly important to investigate (and fix) such reports, as the problems found are potentially nasty security risks. In this case, as the smoker had started finding problems at some point since September's blead release, it can't be a problem in a production version, but it does need to be nailed before v5.18.0 ships, and the sooner the better.

Unfortunately, I couldn't replicate the problem on any machine I had access to on which I'd successfully built perl with clang. So I tried to replicate the problem on dromedary, the rather beefy 8 core server which acts as hot backup to the main git server. What followed was several days of discovering, stalling, and eventually working round bugs in building LLVM and clang, or in building perl with clang, in the hope of getting to the point where I could replicate the problem under a debugger, and hence fix it.

dromedary already had a version of clang installed into /usr/local. However, I'd never been able to get it to complete the build of perl. The perl binary would build successfully, but the build will bomb out whilst building XS extensions with errors such as

    encengine.o: In function `fstat64':
    /usr/include/sys/stat.h:449: multiple definition of `fstat64'
    Encode.o:/usr/include/sys/stat.h:449: first defined here
    encengine.o: In function `fstatat64':
    /usr/include/sys/stat.h:457: multiple definition of `fstatat64'
    Encode.o:/usr/include/sys/stat.h:457: first defined here
    encengine.o: In function `gnu_dev_major':

This time, however, I did manage to get to the bottom of that one. Whilst it's reported as an LLVM bug: http://llvm.org/bugs/show_bug.cgi?id=1699 the underlying problem is actually in the system headers provided by glibc, which specify that certain functions are to be inlined always. If this instruction is followed correctly by the compiler, then one copy is inlined into each object file, and this results in the clash seen above at link time.

So how come it builds with gcc? Well, the version of gcc on the system doesn't correctly honour the inlining request, and hence doesn't generate multiple clashing bodies for the the functions. In turn, this is why the glibc headers are buggy - no-one was aware of the problem back when they were released, because no compiler was good enough to spot the problem.

So, how come it only fails in the XS modules? After all, the perl binary is made by linking multiple object files together, and surely they have duplicate function definitions? It turns out that it's a side effect of how the core's build system automatically adds various warnings and strictures when building the core C code, which it does not when building extensions. The flags are only used for the core C code, because as it's code we control, we've progressively tidied it to address the problems the warnings highlighted, and hence keep the build output clean. (Like the smokes, silent is good, because it's the easiest to skim.)

One of these flags is -std=c89, forcing ANSI C89 behaviour, which ./cflags.SH determines is viable when building with clang on Linux. It turns out that this changes the behaviour of the headers, to avoid inlining. (The inline keyword came in with C++ and C99.) At which point the problem of multiple bodies goes away, and the link is clean. So, explicitly add -std=c89 to the standard build flags and the build completes, tests run, and tests pass.

However, sadly, the locally installed version of clang is too old to be useful, as it's 3.0, and Address Sanitizer wasn't merged in until 3.1. Not to worry, let's build 3.1...

Turns out that also this isn't that easy. For starters, LLVM and clang are big, very big. Several gigabytes for the build tree, and another several gigabytes for the installed tree. So this necessitates a clean up of my home directory, just to make room on the partition for the build. And then the build fails with an assertion deep in LLVM, which Googling suggests is due to a bug in g++ 4.2 - LLVM puts a lot of stress on the C++ compiler used to build it. The bug is fixed in later versions of g++, but as dromedary's OS is old, g++, like the system headers, isn't the most current.

Rather than build my own newer copy of gcc, I decided to build a newer LLVM, in case that fixed it. Which means getting it from source control - http://llvm.org/svn/llvm-project/llvm/trunk At which point I discover that there is no subversion client installed on dromedary - presumably none of us have ever needed it. This suggests something about the trends in version control systems used by open source projects. So more yak shaving as I download and build that, which turns out not to be too hard. Which gets me a current LLVM, which I can install and use to build perl.

But I still can't replicate the problem.

So I checked out the exact same svn revision of clang and LLVM that George's smoker reports that it is using, built and installed that.

Still no joy.

Wondering whether this was something to do with the age of the header files, I tried on one other machine (a rather newer Ubuntu desktop). Again, the exact same revision as George, and no joy. Similarly svn HEAD can't replicate the problem.

After three days of trying I had learned a lot about how (not) to build LLVM and clang, but failed to actually solve the problem that I was setting out to solve. So I'm no nearer to actually nailing the cause of the "unknown-crash" which his instance of Address Sanitizer cryptically reports for a complex expression deep in the code which compiles regular expressions.

Prodding further, we discovered that Merijn could replicate the report when he built on his laptop, and that after shipping the build tree to dromedary, I could run it there and replicate the report. However, as clang built with optimisation, it wasn't very useful trying to debug this with gdb. (Actually, it wasn't at all useful trying to use /usr/bin/gdb, as the debug format has expanded. I had to use the newer gdb I'd built the week before from source, as part of my earlier clang yak shaving.)

So, he built again with -g, to enable C level debugging. Problem goes away. Bother. So he built again, with both -O and -g, to keep the optimiser, but also add debugging symbols. Debugging optimised code is a pain, but it's better than nothing. However, even this hid the problem. At which point, we gave up again, because remote-delivery of "adding printf" style debugging tying up two busy people simply isn't a productive use of time.

I did get somewhat further with another yak shaving exercise. To fix a bug, DosGlob had been updated, but in the process it gained a declaration after statement in the C code, which broke the build on some platforms. I wondered, why hadn't we got an automated test for this, to be able to rapidly test the change on a branch, and avoid the problem in the future. At least on FreeBSD the system headers are clean enough to be able to build with gcc -ansi -pedantic, which I thought was good enough to trap this. It turns out that I'm wrong - what one needs is gcc -ansi -pedantic-error. Trying to build blead with this, I discovered that there are only two places that don't build with this, which feels like it should be small enough to fix. One, in perlio.c, is a long standing problem which generates one of the only warnings from the C code. I'd love to fix it, but will need a Configure test, or possibly even removing the code in question as part of a major PerlIO implementation cleanup.

The other error was in GDBM_File.xs, and looked trivially simple to fix, by adding a single cast. So I added it, but much to my surprise and frustration, this cast broke the build on C++.

So, an aside - why are we testing on C++? After all, Perl 5 is written in C, not C++. So why the pain of also testing on C++. It's because we support building C++ extensions, which means that Perl's headers must also be valid as C++. Most ANSI C is also valid as C++, but a couple of constructions aren't, and occasionally these slip in. If we don't test for it, we don't notice when it has happened, and if no-one notices before a release, then it's too late to fix. Specifically, one slipped into one of the stable releases of 5.8.x that I made. This isn't a mistake I'd like to repeat. (But the problem was also in the release candidates. So I'm not loosing sleep over this. Release candidates exist for this sort of reason - to permit people to test for the things that matter to them, and get problems addressed before they are made permanent in a release.)

So, how to test with C++? The Configure system doesn't have a way to probe the location of a C++ compiler (and doesn't need to), as it's only interested in the C compiler. So easiest way to automate testing this is to get the testing environment to name a "C" compiler which is actually a C++ compiler. This does mean that we have to ensure that all the C code is also conformant C++, but that's generally not too onerous, and often the changes made result in cleaner code.

So, how to make C++ compilers happy with GDBM_File.xs? I investigated not adding the cast. This took me down into the rat's nest...

GNU GDBM is stable. Very stable. (But not "specialist biologist word for stable".) The current version is 1.10. 1.9 was released in 2011, the version before that, 1.8.3 was released in 2002, while 1.7.3 was released in 1995, which was between the releases of perl 5.000 and 5.001.

The cast is on the fifth argument to gdbm_open(). The fifth argument is an optional callback function for fatal errors. Only GDBM is this flexible - the other four *DBM libraries that the core wraps don't have this last argument.

The GDBM_File module provides a Perl wrapper to the GNU GDBM. It's been in the core since the start. It also turns out that from the start GDBM_File has attempted to provide Perl access to this fifth argument. The XS code in 5.000 looks like this:

    gdbm_TIEHASH(dbtype, name, read_write, mode, fatal_func = (FATALFUNC)croak)

GBDM_File's typemap file defines FATALFUNC like this:

    FATALFUNC               T_OPAQUEPTR

and in turn the global typemap file defines T_OPAQUEPTR like this:

    unsigned long *         T_OPAQUEPTR

which is a pointer to data, which cannot legally be cast to/from a pointer to a function under ANSI C, hence why gcc in pedantic mode is getting upset. But thinking further, this isn't very useful - the XS code is expecting to pull a C function pointer out of the fifth parameter, not a reference to a Perl function. This isn't exactly a useful interface. Searching CPAN suggests that no-one currently uses it. But has it ever worked? So I set out to test it. Inevitably, this proved more "interesting" than expected.

So, to test it, we need to determine when the fatal_func callback is actually called. Digging around the GDBM source code suggests that it's only called for "should never happen" situations, such as low level reads or writes failing on an open file descriptor. So I wrote a test that ties a GDBM file to a hash, figures out the numeric file descriptor that the GDBM library is using to access it, and then closes that file descriptor, forcing some sort of write error and hence use of the callback.

The callback was called. But then it crashed with a SEGV. GDBM_File.xs defaults the callback to calling Perl_croak(), but gdb showed that execution was in different C function that simply isn't reached from anything Perl_croak() calls. This shouldn't be possible, yet clearly it happens.

It turns out that the problem is really subtle, and one that I didn't know about. Perl_croak takes a pointer to a string, followed by a variable number of arguments:

    void Perl_croak(const char* pat, ...)

The callback is prototyped to take a pointer to a function that takes a string. That's compatible, right? After all, a variable number number of arguments includes the possibility of zero extra arguments, which this is.

Problem is that the machine-level calling convention for variable argument functions can differ from that for fixed arguments. Specifically, on x86_64 (on which I was testing), for a variable argument function, %rax is set to the number of arguments passed in floating pointer registers. I didn't know any of this, and I doubt that most people do. But the upshot is that on x86_64, if the compiler knows that it's calling a function with fixed arguments, it doesn't bother "wasting" an instruction setting %rax to zero, whereas the function prelude for the called function assumes that %rax is set to something meaningful, and computes a jump based on it. If %rax contains out of range garbage, that jump can leap into any nearby code. Whoops!

So, this means that likely no-one has ever even tried to rely on the callback function, because it's a booby-trap that likely will only crash. At which point it seemed simplest to remove the complications of an unused feature, and drop the ability to change the function used for the callback, simply always using croak() as the callback.

The observant will notice that there's a second bug here - Perl_croak() takes a *printf-style format string, whereas there is no guarantee that the string GDBM passes to the callback does not contain % characters. We can fix both this bug and the problem with the calling convention by using a small wrapper function

    static void
    croak_string(const char *message) {
        Perl_croak_nocontext("%s", message);
    }

and now we're done.

Sadly not. Failing smoke tests reveal that the prototype for the callback function passed as the fifth argument has changed between GDBM 1.8.3 and 1.9.0, from void (*)() to void(*)(const char *). This distinction doesn't matter to a C compiler - to it, the former is "arguments unspecified", and the latter is "one argument specified", but to a C++ compiler, the () of the former means "zero arguments specified". So if you're compiling with a C++ compiler, the above static function is C++, the types conflict, and it's a compile-time error without a cast.

But which cast to use? Really, I wanted to simply prototype my function as "C" linkage - ie

    static "C" void
    croak_string(const char *message) {
        Perl_croak_nocontext("%s", message);
    }

Sadly this isn't allowed. extern "C" makes sense. static "C" is obviously useless. Except that I've just found a use case for it. Bother!

In the end, I opted to go for conditional compilation based on the C-pre-processor macros defined by GDBM, instead of something more complex such as probing. There are just 8 tarballs of GDBM released over the past 18 years, so it was perfectly possible to verify that this approach worked on all of them.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in http://perl5.git.perl.org/perl.git
RT #... is a bug in https://rt.perl.org/rt3/
CPAN #... is a bug in https://rt.cpan.org/Public/
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules

HoursActivity
0.25DosGlob
10.00GDBM_File
0.75PERL_HASH
0.50POSIX::strptime
0.25RT #115910
3.00RT #115928
1.00Storable
0.25Unicode data structures
3.75cflags, DosGlob, -std=c89
23.75clang
clang dromedary
1.00dist/threads-shared/t/stress.t
0.25gcc -ansi -pedantic-error
2.75perl.h, X2P
0.25process, scalability, mentoring
32.50reading/responding to list mail
0.25smoke-me/SuperFast
1.00used only once warning

81.50 hours total

Leave a comment

About TPF

The Perl Foundation - supporting the Perl community since 2000. Find out more at www.perlfoundation.org.

About this Entry

This page contains a single entry by Karen published on January 14, 2013 10:18 AM.

2013Q1 Call for Grant Proposals was the previous entry in this blog.

Fixing Perl5 Core Bugs: Report for Months 33 & 34 is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

OpenID accepted here Learn more about OpenID
Powered by Movable Type 4.38