Improving Perl 5: Grant Report for Month 8

2 Comments

Nicholas Clark writes:

The largest part of of the month was spent on various topics related to cleaning up the build process. In particular, simplifying the top-level Makefile, by removing duplication and self-recursion.

On platforms that run Configure (so Linux, Unix, Cygwin and OS/2), the Makefile is extracted from a shell script Makefile.SH. Most of it is written out verbatim, but there is some shell logic to generate OS or configuration specific code.

In turn that Makefile is used by makedepend to generate a second Makefile containing dependency information (named makefile on most systems, and GNUMakefile on OS X - both names chosen so that the make utility picks the newer makefile in preference).

The makefile compiles all the object files and links them into miniperl, which is pretty much the real perl, only without any form of dynamic linking. Between 5.005 and 5.6.0 things got a bit more complex, because perl switched to using File::Glob to implement globbing. [File::Glob uses the BSD glob code, written by that well-known Perl contributor Guido van Rossum :-), safely quarantined in its own file to avoid licensing cross-contamination]. As File::Glob needs miniperl to build, and miniperl isn't built yet, to solve the bootstrapping problem op.c is conditionally compiled as opmini.o, and that instead calls the old "shell out to csh" globbing code. This need to link with opmini.o instead of op.o becomes relevant later.

miniperl bootstraps the rest of the build, and is used as much as possible to avoid writing any subsequent build system 3 (or more) times - shell, batch files and DCL scripts. In particular, miniperl builds DynaLoader and perlmain.c, and links both with all those already compiled object files to make the real perl.

There is a configuration option to build a shared perl library. In this case, the C code is compiled to be position independent, and linked into a shared libperl, with the perl binary just a small stub linked against it. This also becomes relevant later, as libperl.so (or .dylib etc) contains the code for op.o, but not opmini.o.

Currently in blead, the Makefile has 20 places where it calls $(MAKE) to run a different target in the top level directory. Of these, this one is the most troubling:

$(LDLIBPTH) $(RUN) ./miniperl$(HOST_EXE_EXT) -w -Ilib -MExporter -e '<?>' || $(MAKE) minitest

as it can cause a fork bomb with a parallel make if miniperl builds but fails to work. Having been bitten by that fork bomb one time too many, I decided to eliminate it. However, that above line is actually replicated in 4 different places, 3 OS specific and the generic fallback. Hence the first digression into yak-shaving - reduce that number.

So, we have AIX (and BeOS), NeXT, OS X and "general". 3 of those are very much alive, so we can't just kill the code...

Firstly, AIX.

On AIX, when building with a shared perl library, one needs the list symbols that library should export when building it. For 5.005 and earlier, that list was generated by a shell script, so no bootstrapping problem.

However, the export list had to be manually updated in the shell script, so with commit 549a6b102c2ac8c4 in Jul 1999, the Win32 solution was generalised to work on both AIX and Win32. This uses Perl script, makedef.pl to parse various files and generate the current export list automatically. This introduces a bootstrapping problem - miniperl is needed to generate libperl.a, but libperl.a is needed to generate miniperl. This commit solves the problem by introducing a new target, MINIPERL_NONSHR, which builds a special staticly-linked miniperl (named miniperl_nonshr) in order to run makedef.pl, which in turn permits libperl.a and the regular miniperl to be built, and the build to proceed.

All was well until commits 52bb0670c0d245e3 and 8d55947163edbb9f (Dec 1999) changed the default for CORE::glob() to use the XS module File::Glob, and linked miniperl against an opmini.o, built from op.c but with compiler flags to use the old glob-via-csh code. The change made for AIX was to build miniperl_nonshr with the bootstrapping glob code, but leave the build of miniperl unchanged. This broke the build on AIX - miniperl would build just fine, but would fail to build any XS extensions, as the ExtUtils::MakeMaker code requires working globbing.

The AIX build was fixed with commit 18c4b137c9980e71 (Feb 2000) by changing Makefile.SH so that AIX used the same rules to build miniperl as NeXT. The rules for NeXT generated miniperl from an explicit list of object files, instead of using libperl.a. The result of this change was that on AIX, miniperl was now identical to miniperl_nonshr. Both correctly use the csh globbing code, but now neither require the shared libperl.a to work.

This makes miniperl_nonshr redundant. So I eliminated it. This starts to converge the code for AIX with the other platforms.

The default case:

Curiously, as a side effect of commit 908fcb8bef8cbab8 (Dec 2006) which moved DynaLoader.o into libperl.so, the default platform build rules for miniperl were changed to use an explicit list of object files, instead of C<-lperl>, which had the side effect of building miniperl non-shared. Hence all platforms now have a non-shared miniperl when building perl with a shared perl library.

Next, OS X:

Commit cb3fc4263509f28c (May 2003) removed the use of -flat_namespace from the link flags, but added it specially in Makefile.SH for miniperl, so that symbols from opmini.o overrode those in libperl.dylib. However, a side effect of commit 908fcb8bef8cbab8 (Dec 2006) was to change the linker line to use explicit object files, meaning that op.o was no longer part of linking, meaning that the override is no longer needed. Hence darwin's link does not need special-casing.

Lastly, NeXT:

Whilst it's not clear whether anyone is still using NeXT, since the previous changes have refactored the other 3 cases to use an explicit list of object files, the only difference remaining between the makefile rule for NeXT and the rest is that next doesn't use $(CLDFLAGS) when linking miniperl, whereas all the other do. It's not clear if this difference is significant or accidental, but lacking any NeXT system to test on, it was simple enough to preserve the difference but use simpler code to implement it.

The result - all 4 cases are merged. However, I've not yet actually eliminated that particular troublesome C<|| $(MAKE) ...> rule, as to test it properly requires C to pass first time, without first building all the non-XS extensions - work I've made progress on, but not yet completed.

As AIX now uses regular miniperl to run makedef.pl, the AIX-specific section of the Makefile now starts to look much more similar to the OS/2 specific section that runs makedef.pl there. With some refactoring of makedef.pl to avoid needing a command-line parameter to specify the platform to build for, and passing in a -DPERL_DLL on AIX which does nothing, the two converge even further. At which point it was a small matter of using Makefile macros to encode the different dependencies and filenames used, at which point the two can share code, and Makefile.SH gets simpler.

In turn, I also eliminated a lot of redundancy in the various variant install targets (each of which had been implemented with a call to $(MAKE) in the same directory), and the pre-requisites for various debugging targets.

One of these needs to pass a flag to installperl to instruct it to run strip on the installed binaries. installman doesn't need to strip anything, so doesn't accept such a flag. As it's the only difference between the invocations of the two, I decided that the simplest option was actually to make installman accept a --strip flag and do nothing. This, as ever, isn't as simple as it seems. installperl's strip flag is currently -s, but installman uses Getopt::Long so accepts -s as an abbreviation for --silent Hence the best solution is to have both accept a long-option --strip. This means refactoring installperl to use Getopt::Long, which in turn is "fun" because it accepts both +v and -v as distinct options, which Getopt::Long can't support. Or, more pragmatically, can't directly support, in the general case. Fortunately installperl isn't the general case, as it takes no non-option arguments, which permits a reasonably simple solution.

With this, more duplication died. En route, I spotted and eliminated some dead code in installperl related to a 5.000->5.001 change - specifically where in @INC autosplit installs files. Small instances of this sort of cruft accumulate all over the source tree, but generally the code is never annotated sufficiently as to the purpose to make it obvious that it had a specific time-limited purpose. And of course, it's maybe only 1% of the slightly "look twice and wonder why" code that is actually redundant - most is subtly useful on some platform or configuration corner case that is hard to test, but likely still needed somewhere.

This work is in the branches smoke-me/Makefile-miniperl-unification, smoke-me/Makefile-norecurse, smoke-me/make_ext and smoke-me/perlmodlib

Makefile.SH is a lot better than it, although there's still some more to do that will make it simpler still. As well as fixing minitest, still to do are some further simplifications of how ./miniperl is invoked to run various utilities. Most of the Makefile command lines have -Ilib -Idist/Cwd -Idist/Cwd/lib, dating back to the time when Cwd was detangled from ext/ and lib/ into a single directory in dist/ with the same layout as the CPAN distribution. However, since then I refactored miniperl to use the sitecustomize code to set up the build @INC automatically, meaning that everything after that first -Ilib is now redundant. So that's still to clean.

As ever, the age and gnarliness of the codebase, combined with the complexities resulting from being able to build various different configurations on many different platforms means that often I spend a lot of time investigating things, but little code changed as a result.

The most obvious example of this was while investigating an unrelated problem, happening to use valgrind on OS X, and discovering that it was reporting an error during interpreter startup [specifically down from S_init_postdump_symbols()]. Strange and troubling - strange because I thought we'd nailed all of these, and troubling because interpreter start up is a fairly common code path. As in, 100% of programs traverse it. So it's important not mess up and possibly open the door for malice. Except that the more I dug the more this seemed to be S.E.P.* - either a bug in gcc or in valgrind. I'm aware of Klortho's advice on this matter:

#11907 Looking for a compiler bug is the strategy of LAST resort. LAST resort.

but it did really seem to be a bug somewhere in the toolchain - malloc allocating 142 bytes, and then memmove attempting to read 8 bytes from offset 136, 6 before the end of the 8*n+6 sized block. So I noted the symptoms as RT #112198 in case anyone else hit them and searched for them, and then rejected the ticket, thinking I was done.

Of course, I wasn't. Tony Cook recognised the symptoms as being the same as a Mozilla bug: https://bugzilla.mozilla.org/show_bug.cgi?id=710438#c3 which leads to a valgrind bug: https://bugs.kde.org/show_bug.cgi?id=285662 which had been marked as resolved in their svn trunk. I've built valgrind from svn and verified that their r12423 genuinely fixes it.

While on the subject of Advice from Klortho #11907, I investigated further the failure of HP-UX to build blead, failing on File::Glob with very similar symptoms to a mysterious Cygwin failure - lib/File/Glob.pm is reported as containing syntax errors, because a postfix when is not recognised as being enabled syntax - the use feature 'switch'; is being ignored.

However, in this case it turns out that the bug isn't the same as Cygwin (lack of a suitable cast in our code), but instead similar to an earlier bug on AIX. In that case, a compiler bug in handling the C99 bool type with paired ! operators caused the control V byte in the name $^V not to be considered a control character. In this case, a compiler bug with bool and the && operator** caused features never to be enabled. Again, worked round with a trivial change in a header file in how we express a construction. It's a bit frustrating, but it does seem that 12 years isn't really enough time for the compiler writers to get all the bugs out of these new fangled bool thingymabobs. Stay tuned next month for further fun with HP's compiler.

I investigated what gcc's -ftrapv flag reveals about the prevenance of signed integer overflow in the core C code. Unsigned integer overflow in C is well defined - it wraps. Signed integer overflow, however, is undefined behaviour (not just implementation defined or unspecified) so really isn't something we should be doing. Not having to care about undefined behaviour gives C compilers the loopholes they need to optimise conformant code. Whilst right now compilers usually end up exposing the behaviour of the underlying CPU (which is nearly always 2s complement these days) the core's code is making this assumption, but should not.

The -ftrapv flag for gcc changes all operations that might caused signed overlow to trap undefined behaviour and abort(). This, of course, is still a conformant compiler, because undefined behaviour is, well, undefined. (Be glad that it didn't format your hard disk. That's fair game too, although one hopes that the operating system stops that.)

It turns out that there's quite a lot of bad assumptions in the code about signed integer overflow. These seem to fall into three groups

0) The code used to implement the arithmetic operators under use integer;

1) The assumption that the signed bit pattern for the most negative signed value, IV_MIN, is identical to the bit pattern for the unsigned negation of that value. ie on a 32 bit system, the IV -2147483648 and the UV 2147483648 have the same bit representation

2) A lot of code in the regular expression engine uses the type I32

The regular expression engine code is complex. It only causes aborts under -ftrapv on a 32 bit system, probably due to type promotion on 64 bit systems for the problematic expressions, such that the values don't overflow. However, it's clear that in places the engine is relying on sentinel values such as -1, and it's not clear whether these are tightly coupled with pointer arithmetic, so a brute force approach of trying to "fix" the problem by converting I32 to SSize_t is likely to cause far more bugs than it fixes.

Sadly the use integer code sadly is doing exactly what it's documented to do - "native integer arithmetic (as provided by your C compiler) is used" And as your C compiler's integer arithmetic is undefined on signed overflow, so will your Perl code be. So, we have to accept this is going to raise the ire of -ftrapv. However, this causes problems when testing, as code to test it will abort, and it's not possible to make exceptions from -ftrapv on a function by function basis. So the best solution to this seems to be to break out the integer ops into their own file, and set the C compiler flags specially for that file (which we already have the infrastructure for).

The middle set of code turns out to be relatively easy to fix. In part it can be done by changing conversions between IV_MIN and its negation from obvious-but-overflowing terse expressions to slightly longer expressions that sidestep the overflow. Most of the rest can be fixed by bounds checking - negating values in range, and explicitly using IV_MIN when negating the value 1-(IV-MIN+1). One surprise was the code used to avoid infinite looping when the end point of a .. range was IV_MAX, which needs a special case, but the special case in turn was written assuming signed integer overflow. That code is fixed in smoke-me/iter-IV_MAX, the rest in smoke-me/ftrapv, and some related cleanup of pp_pow in smoke-me/pp_pow. All will be merged to blead after 5.16.0 is released.

To continue this, I investigated using CompCert to compile Perl. CompCert is a "verified compiler, a high-assurance compiler for almost all of the ISO C90 / ANSI C language, generating efficient code for the PowerPC, ARM and x86 processors." -- http://compcert.inria.fr/

However, its description of "almost all" is accurate - it doesn't include variable argument lists, which is kind of a show stopper for us. However, as an experiment I tried out running Configure with it to see what happened. Result! We generate an invalid config.h file. So that's now RT #112494, with a fix ready to go in post 5.16.0

I also spent some time investigating precisely how and when we lost support for using sfio (in place of stdio). It turns out that the build of miniperl is restored by 7 one-line fixes (now preserved for posterity in the branch nicholas/misc-tidyup as they're not relevant to getting 5.16.0 released). More usefully, it revealed a section of Storable.xs which is no longer needed, so that's 14 lines of code to kill. However, this doesn't mean that sfio is nearly working and about to be useful again. With this fixed, the build fails thanks to a strange issue which truncates some Makefiles' output by ExtUtils::MakeMaker at 8192 bytes. I dug further, because I wanted to be sure it wasn't the only known symptom of some mysterious core buffering bug. Based on some historical anomalous CPAN smoker results, I'm suspicious that we may well have one. It turns out, fortunately, that in this case it's not our bug. The cause was far more interesting - something I'd never even considered. In C, a pointer to the element immediately after an array is valid for pointer arithmetic. (But, obviously, not the value it points to. So don't dereference it.) The usual use of this is to store a buffer end pointer - increment the active pointer, and if it's equal to the buffer end, do something.

However, what's not obvious from that is that the memory address immediately after the array might also be the start of some other array that the program holds a pointer to. The bug was that sfio's "do something" (ie empty the write buffer) was deferred until the next call that wrote to the stream. However, ahead of that check was a check related to sfreserve(), which permits user code to write directly into the buffer. This is implemented by handing out a pointer to the internal "next write" pointer, and the user code signals that it is done by calling sfwrite() with this (internal) pointer. The special case was intended to express "is the passed in write-from pointer identical to the internal next write pointer?" However, that check breaks if it just happens that the internal buffer is full ("next write" points one beyond the buffer), and the user's buffer just happens to be in the very next byte of memory. Which turns out to be possible on FreeBSD, where malloc() is able to allocate contiguous blocks of memory. So, fortunately, this isn't our bug. I've reported it to Phong Vo, who confirms that it's a bug in sfio.

Even with this fixed locally, a perl built with sfio fails even more tests than one built without PerlIO enabled. Unless someone external with an ongoing interest in sfio works to fix these issues, the days of the current sfio code are numbered. Dead non-working code just gets in the way.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in http://perl5.git.perl.org/perl.git
RT #... is a bug in https://rt.perl.org/rt3/
CPAN #... is a bug in https://rt.cpan.org/Public/
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules
ID YYYYMMDD.### is an bug number in the old bug system. The RT # is given
afterwards.

HoursActivity
0.25'd' flags (on valid_utf8_to_uv{chr,uni})
0.75AIX bisect
5.25AIX make bug
AIX make bug (ccache)
AIX make bug (https://bugzilla.samba.org/show_bug.cgi?id=8906)
AIX make bug (xlc -DDEBUGGING segv)
0.50DL shared library (RT #40652)
6.50HP-UX build failure (cc bug)
0.25HP-UX cc bug
7.25ID 20000721.002 (#3561)
0.25ID 20010801.037 (#7425)
1.00ID 20020513.013 (#9319)
2.50LDLIBPTH
0.25MSDOS died with 5.000, DJGPP only appeared in 5.004
2.00Makefile improvements
Makefile improvements (VMS)
7.00Makefile-norecurse
11.00Makefile.SH miniperl rule unification
0.25Pod::PerlDoc
3.50RT #108286
0.50RT #112312
4.00RT #112350
1.25RT #112370
0.25RT #112404
0.25RT #112478
1.25RT #112494
2.00RT #112504
0.25RT #112536
2.25RT #24250
0.50RT #33159
0.25RT #36309
0.50RT #40652
1.50Solaris -xO3 failure
0.75Solaris bisect.
Solaris bisect. (-xO3)
0.50Testing on OpenBSD
12.00bisect.pl
bisect.pl (HP-UX and AIX)
bisect.pl (HP-UX build failure (cc bug))
bisect.pl (user fixups)
bisect.pl (valgrind)
8.50build process [autodoc, perlmodlib, $(Icwd)]
0.25cross compilation
3.50embed_lib.pl
5.50installman
6.50minitest cleaner
1.50pp_pow
0.50process, scalability, mentoring
24.50reading/responding to list mail
4.00review davem/re_eval
7.50sfio
2.50split %SIG panic
1.50the todo list
9.50undefined behaviour caught by gcc -ftrapv
3.75valgrind error on OS X (RT #112198)

156.00 hours total

* http://en.wikipedia.org/wiki/SEP_field ** to be more accurate, an expression involving a variable, ||, && and a function returning bool.

2 Comments

Hi Nicholas

Congratulations for all this effort. As I think I've said before, I don't do this type of work myself, but I do deeply appreciate that someone is doing it. And the benefits flow far and wide.

Cheers
Ron

Thanks, that was a really interesting read (and thank you also for actually doing the work).

Leave a comment

About TPF

The Perl Foundation - supporting the Perl community since 2000. Find out more at www.perlfoundation.org.

Recent Comments

  • Smylers: Thanks, that was a really interesting read (and thank you read more
  • Ron Savage: Hi Nicholas Congratulations for all this effort. As I think read more

About this Entry

This page contains a single entry by Karen published on June 3, 2012 9:17 PM.

Fixing Perl5 Core Bugs: Report for Month 26 was the previous entry in this blog.

Grant Accepted: Improving Cross compilation of Perl 5 is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

OpenID accepted here Learn more about OpenID
Powered by Movable Type 4.38