Improving Perl 5: Grant Report for Month 19
Wed, 12-Jun-2013 by Karen Pauley
_Nicholas Clark writes:_
As per my grant conditions, here is a report for the April period.
I started the month looking at the Unicode Names code. At Karl's suggestion I changed it to parse the UnicodeData.txt file properly. Previously it had hardcoded various constants, particularly related to the CJK ideographs and Hangul syllables. The CJK ranges in Unicode have increased in the past, and so it's possible that they will increase again. Not only is this (more) future-proof, it also made it simpler to detect significant gaps in the allocated character ranges, which is useful for size optimisations. By handling the gaps better I reduced the data size by 13K, and by using two sizes of arrays for the trie structure, saved a further 25K.
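To give a flavour of why parsing the real file helps, here is a sketch (illustrative only, not the actual generator - the 256 code point threshold is an arbitrary choice for the example) of how finding the significant gaps becomes little more than a scan over the parsed code points:
bc. # Illustrative sketch, not the real generator. UnicodeData.txt is
# semicolon-separated: first field the code point in hex, second the name.
use strict;
use warnings;
open my $fh, '<', 'UnicodeData.txt' or die "Can't open UnicodeData.txt: $!";
my @points;
while (<$fh>) {
    my ($hex, $name) = split /;/;
    # NB: range entries such as "<CJK Ideograph, First>" come in
    # First/Last pairs - a real parser pairs those up rather than
    # hardcoding the ranges.
    push @points, hex $hex;
}
for my $i (1 .. $#points) {
    my $gap = $points[$i] - $points[$i-1] - 1;
    printf "gap of %d after U+%04X\n", $gap, $points[$i-1] if $gap >= 256;
}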
The intent of all this is to provide the data needed for the \N{} syntax directly as C code and static data, to avoid the tokeniser needing to load charnames if it sees \N{}. Given that the C code in question is generated by Perl, but to compile the Perl you need the C code, there's a potential bootstrapping problem here. Not wishing to ship >1M of generated code if avoidable, I experimented to see whether the \N{} escape syntax is needed by miniperl. It turns out that if you replace the \N{} handler by abort() in miniperl, you can still build perl perfectly. Excellent! Also, telling the Makefile to build distinct toke.o and tokemini.o is a one line change - it's nice when easy things *are* easy.
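The loading that all this is intended to avoid is easy to see from the command line - merely interpolating a \N{...} escape makes the tokeniser pull in the charnames machinery behind the scenes, which shows up in %INC even though the program never mentions any module (the exact files listed vary between perl versions):
bc. $ perl -le 'my $x = "\N{COLON}"; print for sort keys %INC'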
Frustratingly the work is not yet ready to merge into blead, as it's not yet finished enough, and other things keep taking priority.
We had a bit of a digression involving perl on s390. Merijn was given a CD of a Linux distribution for s390, and soon had it running on the emulator "Hercules". What does one do next? Obviously - try to build Perl. So he built blead (which worked) and ran the tests (which mostly worked). My initial reaction was:
bq. Is anyone actually using Perl on it?
bq. In that, we've not had any bug reports about this before, and for a lot of these somewhat esoteric platforms I'm sort of wondering at what point do they stop being "fun", and start being "work". Right now, I think it's more at the "fun" level, and it might be telling us something interesting about portability, as it might be that all these tests failing are down to the same problem.
The problems all seemed to involve conversion of large floating point values to integers. Which we do quite a lot of, and which should work, but historically we've been skirting the boundary of what's conformant ANSI C, sometimes on the wrong side. So Merijn and I exchanged code and results as I tried to remotely debug the cause. We tried to establish whether it was even handling large unsigned integers correctly (it was). We tried to rule out unions and ? : ternaries (which have confused compilers in the past). Nope. In the end, we ascertained that it was a bug in the supplied gcc 4.3.4 - it generated bad code for casting from unsigned integers to doubles.
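To give the flavour of it, here is a reconstruction (not the exact code we exchanged) of the kind of probe involved:
bc. # Reconstruction of the sort of probe exchanged, not the exact code.
# 0xFFFFFFFF is above IV_MAX on a 32-bit build, so it is held as a UV,
# and adding 0.5 forces exactly the unsigned-to-double conversion that
# the s390 gcc 4.3.4 miscompiled. The value is exactly representable
# in a double, so a correct compiler must round-trip it.
my $uv = 0xFFFFFFFF;
my $nv = $uv + 0.5;
printf "UV %u -> NV %.1f\n", $uv, $nv;   # expect: UV 4294967295 -> NV 4294967295.5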
At which point Niko Tyni replied that the particular problem was already diagnosed as a compiler bug, and had been fixed. Debian was building on s390 with gcc 4.6.3, and he believed that gcc 4.4.7 was fixed.
So that all ended up being rather a waste of time, thanks to the continued installation and use of an obsolete and buggy compiler. Particularly frustrating given that a fix exists in newer versions of that compiler.
==
A significant development this month was having serious second thoughts about $* and friends. As mentioned in last month's report, everything smoked fine, so the change was merged to blead. Only then did the problems emerge. Specifically, Father Chrysostomos demonstrated rather succinctly that the core's tests weren't comprehensive enough. The tests correctly verified that using any of @*, &*, ** and %* generated the desired deprecation warning. But the warning was also generated by *{*}, *{"*"} and C<$_ = "*"; *$_>, none of which "need" to be deprecated. Nothing tested these, so nothing noticed that they now warned. This was because I'd adapted the code used to warn for $*, and made it warn for all. However, there's a significant difference. The "magic" functionality of $* (globally setting multiline matching) was what was deprecated and then removed. It had nothing to do with parsing the two characters $* as a punctuation variable, hence the warning needed to be triggered by any *use* of the scalar variable *, independent of what syntax was used to refer to it. For this reason, the best place to inject that warning was the code which creates typeglobs, and which used to set up the magic that made $* work. As that code deals in typeglobs, it already had logic to determine whether the request was for the SCALAR slot or one of the others, so it was simple to extend it to warn for all slots, extending warnings from $* to all variables named *.
==
Simple, obvious and wrong.
The error being that the intent was to deprecate the *syntax* @* etc, not use of the variable itself. Hence the right place to insert a deprecation would be in the parser. Specifically, toke.c: 12151 lines of horror, most aptly summarised as 'It all comes from here, the stench and the peril.'
==
Strangely for toke.c, it seems that it's actually fairly easy to deprecate the parsing of @* etc. These tokens are parsed by a routine S_scan_ident() which is relatively self-contained, and the control flow around it is also fairly clear. So a warning can be issued by adding another parameter to that routine, and only setting it true from the 4 places in the parser that deal with things starting '@', '&', '*', and '%' respectively. This worked.
==
However, the second thoughts went deeper than that. I think that even this approach is wrong on two further levels.
==
Firstly the intent was to enable syntax of the form @*foo and %*bar. Having @*foo and %*bar would seem to imply that one can't also have @* or %*.
==
What hadn't sunk in is that we have both $# and $#foo (and $#$foo), and there doesn't seem to be a parsing problem with this. There *is* some special casing determining which punctuation variables $#... will work on, notably only @+, @- and @@:
bc. if (s[1] == '#' && (isIDFIRST_lazy_if(s+2,UTF) || strchr("{$:+-@", s[2]))) {
PL_tokenbuf[0] = '@';
and, unlike most of Perl 5, recognising $#... is space sensitive:
bc. $ perl -le '$foo = [1..3]; $# = \*STDERR; print $#{$foo}'
$# is no longer supported at -e line 1.
2
$ perl -le '$foo = "bar"; %# = (bar => "Baz"); $# = \*STDERR; print $# {$foo}'
Baz
(in the latter, $# {$foo} is a hash lookup for key $foo of hash %#)
but it generally works without surprising anyone.
==
As best I can figure out, one could add @*foo, &*foo, **foo, %*foo and $*foo without removing anything or breaking any sane code on CPAN. The only code which I think would change behaviour is code that is either using $* as the variable for a file handle passed to print (anything else?), or code which would parse $*+=1 as $*+ = 1 instead of $* += 1, or code which is making array slices on %*.
==
So I think that the right thing to do is not to blanket deprecate parsing @*, &*, **, %* and $*, but instead change the parser to warn or deprecate on the specific ambiguous constructions. Which means that the "new" "needed" constructions need to come first. Or at least some idea of them.
But, I think I'm wrong again, because the specific intent was to have consistent "slurpy" syntax for subroutine signatures. Consistent with Perl 6, and consistent between the Perl 5 signature and regular Perl 5 code.
And I got this wrong. In that, Perl 6 does have @*foo, but that's a dynamic scoping lookup. The slurpy syntax is *@foo (and *%foo and *$foo): "http://perlcabal.org/syn/S06.html#List_parameters":http://perlcabal.org/syn/S06.html#List_parameters
For which we don't need to worry about the various sigils used with the ** typeglob at all. We need to consider how the parser deals with the 3 typeglobs *@, *% and *$. And based on how $# and $#foo are handled, I think that everything that is wanted for the "new" syntax is currently a syntax error - or, if not, anything that is currently legal is an incredibly obscure corner case.
So the net result of all of this was better tests, a bit better understanding of another 0.1% of the tokeniser, and a bug fix, in that $* and $# now warn for every location that references them. Previously there were "holes" through which they could be used but avoid the warning. A lot of motion, but not much movement.
The "second thoughts" fun continued with RT #116989, a bug originally related to S_croak_memory_wrap(), but which turned out to be the tip a considerably more pervasive iceberg.
The perl source code uses a lot of macros. Some of these are used to implement what are effectively inline functions, and date from long before C compilers supported them. Now that the semantics of "static inline" functions are settled and widely supported, we've started to replace the (still-working) macros with real inline functions.
However, this threw up an unexpected surprise. You'd think that replacing well-behaved macros (those that only use their parameters, and not more than once) with inline functions would have no functional change. But we found an interesting gotcha...
Unreferenced macros disappear at the end of pre-processing. Unreferenced inline functions stick around for a while longer. Some compilers only remove them if the optimiser is enabled (certainly some versions of gcc), whereas others always remove them (for example clang).
The problem was with macros (and now functions) that reference global variables or other functions. If the dead code isn't removed by link time, it causes unsatisfied dependencies, and the build fails. The effect is that the header files have changed such that they can now only be included by code which links to the perl library.
Why is this a problem?
Probing code which wants to figure out if a function is present, or whether compiler flags work, will compile a small test program using the perl header files, and if it runs, assume that the probed-for thing can be used. The assumption is that "not running" is caused by the probed-for thing's absence, and not some problem in the probe itself. So the first problem is that this stopped being true, and the second problem is that we didn't realise it, because none of the probing code verifies this assumption by testing the null hypothesis - that a test program with nothing changed does run. So when the header files changed such that including them makes the test program fail to link, all the probing code silently stopped working - it simply thinks that it can't find anything, and routinely reports this, with results such as Time::HiRes failing to find platform-specific functions that it uses, and hence disabling optional functionality.
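The missing check is simple to express. A sketch of testing that null hypothesis (illustrative only - this is not the regression test that was actually added, and the compile line is a simplification):
bc. # Illustrative sketch of the missing "null hypothesis" check: before
# probing for anything, confirm that a do-nothing program which merely
# includes the perl headers still compiles, links and runs.
use strict;
use Config;
open my $fh, '>', 'try.c' or die "Can't write try.c: $!";
print $fh '#include "EXTERN.h"', "\n", '#include "perl.h"', "\n",
          'int main(void) { return 0; }', "\n";
close $fh;
my $compile = "$Config{cc} $Config{ccflags} -I$Config{archlibexp}/CORE try.c -o try";
if (system($compile) == 0 && system('./try') == 0) {
    print "null hypothesis holds - probe failures mean something\n";
} else {
    print "the headers alone sink the probe - all probe results are suspect\n";
}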
Probing code such as that in Time::HiRes could likely resolve this by linking its probe programs with the perl runtime library. But the code in cflags.SH wants to work out what extra compile flags it can use locally to *build* the perl library - ie which known potentially useful flags can't be added to the build command line locally because they cause compilation failures on this particular system. This is a bootstrapping problem, as we want to know what we can compile it with, but if the probe now only works once we can link with the perl codebase, then we can't answer that until after we've compiled it.
As well as me, Andy Dougherty and Tony Cook had both already worked on this, so we had a couple of proposed patches, along with a new regression test to actually test that null hypothesis of probe code. However, it still didn't quite feel complete. After thinking about the implications for a while, and experimenting with various approaches, I couldn't see a way to have everything, ie
1) static inline functions to replace macros
combined with
2) defaulting to use them
3) not screwing up the "traditional" use without linking
I think that the least worst option is to keep (1) and (2), and force (3) to work by adding a define that disables inclusion of "inline.h". Of course, that means that you can't *use* anything from it. But as all it contains are function *declarations* that are of no use without data structures initialised from the perl library, it's no loss. Hence we added a macro PERL_NO_INLINE_FUNCTIONS for that purpose - explicitly disabling the declaration of inline functions, for probes and similar code which need it. Having tested it on Solaris and HP-UX (as well as the more common platforms), I pushed it to a smoke-me branch expecting it to work everywhere. Fortunately I left it over the weekend to let as many machines as possible catch up with it.
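So a probe which only needs the headers now opts out explicitly. Continuing the sketch above - the macro is real, the surrounding code illustrative:
bc. # With the escape hatch in place: defining PERL_NO_INLINE_FUNCTIONS
# before including perl.h suppresses the inline function declarations,
# so the do-nothing probe once again compiles, links and runs without
# the perl library.
use strict;
open my $fh, '>', 'try.c' or die "Can't write try.c: $!";
print $fh '#define PERL_NO_INLINE_FUNCTIONS', "\n",
          '#include "EXTERN.h"', "\n", '#include "perl.h"', "\n",
          'int main(void) { return 0; }', "\n";
close $fh;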
==
This revealed one last silly sting in the tail. Everything was happy except for *Darwin*'s flavour of g++ 4.2.1, which objected to "-ofilename", petulantly demanding a space in between. It's just special. It's particularly special as the FreeBSD g++ 4.2.1 is happy with or without a space. Same compiler, same reported version, but clearly one vendor patched theirs. Well, *at least* one vendor :-(
==
So another round of testing with a space in, and finally everyone was happy and it was merged to blead, in time for v5.18.0.
I also looked at TAP::Parser. No-one pounced on my suggestion in the previous report on where to look for memory savings, so I followed it up myself. My guess was roughly correct - it uses an array for each test to store the numbers of the passing subtests. So for the usual case of all the subtests passing (in the right order, and matching the plan), this is an array [1 .. $plan]. Fortunately TAP::Parser is very nicely factored, with all the code using accessor functions instead of poking into the internals, so it was very easy to alter it to use an alternative storage representation, and hide this change from *everything* else just by changing the accessor functions to "unwrap" the optimisation.
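The shape of the change is roughly this (a sketch with invented names, not TAP::Parser's actual internals):
bc. # Sketch only: when the passing subtests are exactly 1 .. N (the
# common case), store just the single integer N, and rebuild the
# list on demand in the accessor.
sub set_passed {
    my ($self, @passed) = @_;
    if (join(',', @passed) eq join(',', 1 .. scalar @passed)) {
        $self->{passed} = scalar @passed;   # common case: one integer
    } else {
        $self->{passed} = \@passed;         # otherwise keep the real list
    }
}
sub passed {
    my $self = shift;
    my $p = $self->{passed};
    return ref $p ? @{$p} : (1 .. $p);      # callers never see the difference
}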
While working with the code I tried out a second approach to reducing memory usage, albeit somewhat "unwarranted chumminess with the [...] implementation" - forcing the test numbers to be stored as numbers. The test lines are parsed by a regular expression, hence the captured results are internally strings. The grammar uses qr/\d+/, so these strings can only be numbers, hence by adding 0, the value stored is an IV, not a PV, which is smaller.
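The effect of the trick is easy to see with the CPAN module Devel::Size (exact numbers vary by platform and perl version):
bc. # The "add 0" trick: a number captured by a regex is stored as a
# string (a PV); adding 0 makes perl store a plain integer (an IV)
# instead, which has a smaller memory footprint.
use Devel::Size qw(size);
my ($captured) = "ok 42" =~ /(\d+)/;   # $captured holds the string "42"
my $as_number  = $captured + 0;        # the same value, held as an integer
printf "as string: %d bytes, as number: %d bytes\n",
    size($captured), size($as_number);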
Combined, these dramatically cut memory use (about 90%). The bad news is that this change didn't make it in time for v5.18.0 - it's really not the thing you change during the pre-release code freeze. The good news is that it's already on CPAN, so you can upgrade all your installations, if you so wish. Including your older installed Perl versions.
A more detailed breakdown summarised from the weekly reports. In these:
16 hex digits refer to commits in "http://perl5.git.perl.org/perl.git":http://perl5.git.perl.org/perl.git
RT #... is a bug in "https://rt.perl.org/rt3/":https://rt.perl.org/rt3/
CPAN #... is a bug in "https://rt.cpan.org/Public/":https://rt.cpan.org/Public/
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules
| Hours | Activity |
| 0.50 | CPAN #83167 |
| 0.25 | ExtUtils::CBuilder |
| 1.00 | HvFILL |
| 0.25 | RT #113794 |
| 0.75 | RT #114502 |
| 17.00 | RT #116943/S_scan_ident |
| 16.50 | RT #116989, toolchain bootstrap, probing |
| 0.25 | RT #117003 |
| 0.50 | RT #117327 |
| 7.50 | RT #117501 (Open64 compiler) |
| 0.25 | RT #117543 |
| 1.00 | RT #117687 |
| 0.50 | RT #117743 |
| 2.75 | RT #54044 |
| 8.25 | TAP::Parser (CPAN #84939) |
| 20.00 | Unicode Names (No \N{} in miniperl) |
| 0.25 | caller with undef package |
| 1.25 | genpacksizetables.pl |
| 0.25 | mktables |
| 0.50 | pp_pack.c |
| 4.00 | process, scalability, mentoring |
| 26.50 | reading/responding to list mail |
| 1.50 | s390 |
**111.50 hours total**