April 2013 Archives

Nicholas Clark writes:

gcc 4.8.0 was released on 22nd March. This version of gcc has integrated Address Sanitizer, a Google project to provide a fast runtime heap and stack validation tool. I set it off building gcc from source, which pleasingly worked first time on the system I chose for the task. In turn blead built with it without problems, which is good in itself, and a known good starting point for building with Address Sanitizer enabled.

I didn't expect to find many problems with gcc's ASAN, as I believe we're currently clean when running under valgrind, and when building with the ASAN in clang 3.1. Fortunately I was right - I didn't find many. Unfortunately it wasn't zero, as I did find 3 problems, all of which are now fixed in blead.

Firstly, PerlIO_find_layer() was using the wrong length for memEQ(). So if it was looking for "perlio", it would compare 6 bytes, even if the current layer was named "unix". I believe that it's unlikely to be a problem in production (in this example, as "u" != "p", the comparison will fail without needing data from beyond the end of the shorter name). It would only exhibit itself if someone had layers named where one is initial substring of other and the random memory read happened to be identical to the longer layer, and if so would result in the shorter named layer being used where the longer should have been found. It is now fixed in blead and will be in v5.18.0.

Secondly, a test for heredoc terminators was sometimes reading beyond the end in a memNE(). The failing test case was for an unterminated heredoc used inside code a string eval, something which used actually to SEGV until fixed by Father Chrysostomos post v5.16.0. I think that the bug isn't triggered without something similarly obscure, but it is now fixed.

Thirdly, the memory referenced by the CV for &DynaLoader::boot_DynaLoader to record information about the C code used to generate it was pointing to the C stack, and so anything reading from it would pick up garbage. The tests for B::Deparse happened to read this name while searching for a different CV. The worst case would be that code using B would have a Perl string containing some of the contents of the C stack, when they were expecting something sensible. This is only going to cause bugs if that code was very interested in &DynaLoader::boot_DynaLoader specifically, and it's fixed now.

Something which might seem to be simple, but is actually an inherently complex problem, is the build system. The distribution needs to bootstrap itself up from source files to a complete working tested perl, without assuming that there is already a perl installed on the system (else you have a chicken and egg problem), but also without getting confused if there is a perl installed on the system (as it is probably a different version, or has incompatible build options). Moreover, most of the build system is written in Perl (being the same code as the CPAN tool chain), so that has to run against an uninstalled perl, and without touching files outside the build directory (so it can't use ~/.cpan) or talking to the Internet. Finally, there's no single tool you can rely on existing with which to start the bootstrap (DCL on Win32? Shell scripts on a vanilla VMS install? Batch files on *nix?), so there's more than one of it to get your head around.

YAPC::NA Call for Sponsors

The North American Yet Another Perl Conference is still seeking a few more sponsors to help make YAPC a success this year. Benefits of being a sponsor include increasing your brand awareness, recruitment, and giving back to a language that give you so much. There are many different levels of sponsorship with various perks available. So, please become a sponsor today.

TPF is looking for contributions to the Outreach Program for Women and I am pleased to announce that Renée Bäcker has started us off with a pledge of 1000 Euros. Together with existing funding this gets us one intern so far in the first round, and a opening for the second. Now we need additional generous donations to keep this rolling. If you would like to donate to this program please contact karen [at] perlfoundation.org

Applications for the program are due by May 1st, which will be here soon. The program requires that applicants have contributed to the project they will work on prior to their application. If you are are interested in an internship, and don't already have a Perl project you are working on, take a look at the list of projects and pick one. Contact the mentor listed, and get started now.

Dave Mitchell writes:

This month I mainly worked on one of the 5.18 blocker tickets; in this case how

overload::constant qr => sub { ...}

interacts with "constant" regexes such as qr/foo/ and qr/foo(?{bar})/ if the sub replaces constant strings like "foo" with an overloaded object. It turns out this was something I hadn't anticipated in my re_eval reworking, and my code didn't handle it at all well. I'm now about 3/4 the way to fixing it.

Over the last month I have averaged 6.5 hours per week

As of 2013/03/31: since the beginning of the grant:

160.0 weeks
1618.8 total hours
10.1 average hours per week

There are 82 hours left on the grant.

Report for period 2013/03/01 to 2013/03/31 inclusive


Effort (HH::MM):

15:00 diagnosing bugs
11:17 fixing bugs
1:00 reviewing other people's bug fixes
0:00 reviewing ticket histories
0:00 review the ticket queue (triage)
27:17 Total

Numbers of tickets closed:

5 tickets closed that have been worked on
0 tickets closed related to bugs that have been fixed
0 tickets closed that were reviewed but not worked on (triage)
5 Total

Short Detail

0:25 [perl #115004] perl 5.17.x can't use @var in regexp, but only $var
1:10 [perl #116569] Re: 5.17.7 breaks rules of assignment
0:30 [perl #116821] B::Deparse warnings during build of 5.17.9
18:17 [perl #116823] Regexp::Grammars broken since 5.17.1
0:30 [perl #117095] Undefined `state` $file_handle when using `$file_handle if 0`
0:10 [perl #117107] [Bug] IO::Socket::INET opens random port
2:55 [perl #117135] v5.17.9-80-g9f351b4 breaks SARTAK/Path-Dispatcher-1.04.tar.gz
1:40 [perl #117205] Infinite regex-engine loop with (??{})
0:30 [perl #117265] [PATCH] e213661 no warnings 'safesyscalls', fatal nul checks
0:30 [perl #117273] [PATCH] e4f39c0 binary safety when dumping svs and ops
0:40 [perl #117305] ISO C forbids braced-groups within expressions (cv.h, inline.h in 5.17.9)

Nicholas Clark writes:

As per my grant conditions, here is a report for the February period.

The first significant thing I worked on in February was a detailed review of Peter Martini's work towards an API for subroutine signatures. In particular, I wondered how much of the existing call checker hooks they could use. In turn, I wondered whether the call checker hooks were robust against some of the torture tests I'd suggested for signatures...

Your signatures code needs some evil test cases. I suspect fun things involving various combination of threads, string evals and closures would stress it.

So starting with simple things like

*) calling a function with signatures from a subthread.
*) creating a function within a string eval

then build up to

*) creating a function within a string eval within subthread
*) creating functions in a child thread, spawning a grandchild thread, child thread exists, grandchild thread runs (returning results to first thread)

the intent being to ensure that data structures are cloned properly, and by using an intermediate child thread that goes away, ensure that no pointers dangle into the freed up memory of their parent.

For both, the implementation will be similar in how it interacts with closures and with ithreads. Closures are implemented by run-time cloning of the compiled subroutine. Spawning an ithread is also a runtime clone.

So I had a good look at the call checker code. There had been bugs related to closures, but this had been discovered and fixed later on, although not all possibilities were tested yet, so I added some tests.

Sadly I concluded that the call checker code only offers half the hooks that signatures need. The call checker runs entirely at compile time, whereas a key part of the proposed signatures API is to add a hook at runtime, which is run instead of the code that creates _ (and all the overhead associated, not just creation, but all the explicit ops run by the user's code to unpack _ back into lexicals). So it's not the solution here.

However, the real fun started soon after thanks to Yves persisting in investigating what he felt was a problem with the hashing implementation in the stable releases. A few days after I wrote my previous monthly report, he sent [email protected] a short list of lowercase ASCII strings which when used as keys for a hash would exhaust memory, without the rehashing mechanism ever kicking in. The rehashing mechanism was meant to protect against attacks - this attack exploits a flaw in its implementation, and was announced as CVE-2013-1667.

I'm still not in a position to say everything, as 5 weeks later not all vendors have shipped updates. However, the fix an its commit message have been public for 3 weeks now, so it's safe to repeat what they reveal.

To be clear - I'll use the term "hash" or "hashes" for the Perl associative array, named with % and implemented as a hash table, and use "hash function" for the many-to-one function that converts a string to a fixed-size numeric value, a value which is needed to implement a hash table.

Hashes are implemented as a power-of-two sized array of linked lists of hash entries. The array holds the linked list heads, and the array entries are often referred to as "buckets". If the hash is working well, the linked lists are short. To keep the lists short requires that the hash is "split" periodically as elements are inserted, doubling the size of the array, reassigning list entries to new positions, and hopefully causing the entries to be spread across more positions, shortening the lists.

The implementation flaw was that a hash split could be triggered if a linked list became "too long". "Too long" was deemed to be 15 - much longer than the 0 to 3 entries of a well behaved hash, and we thought that this would only happen if a program was being purposefully attacked with crafted malicious data. This seemed like a good idea at the time (9.5 years ago), as it meant that an attack would be deflected more quickly (after 15 entries). Previously the split condition was that the number of keys was greater than the array size - ie that the mean chain length was about to exceed 1.0. This is reasonable for a well behaved hash, but would mean that malicious data where all keys in the hash used the same chain would not be detected until the chain had become as long as the size of the array, potentially meaning unhappy O(n) behaviour instead of amortised O(1) for quite a while as the hash grew.

All this went unnoticed for 9.5 years, until Yves tried to test the build by changing hv.h to use a pathologically bad hashing function - return the same constant whatever the string. He (and I) would expect that this would be slow - after all it's invoking an O(n) linked-list walk for every lookup, and loops over keys will go from O(n) to O(n²), but given enough CPU time we would expect it to complete. However, it doesn't. It crashes. It was this unexpected crash that caused him to think that something was wrong, and keep investigating.

What he discovered was that triggering a hsplit due to long chain length allows an attacker to create a carefully chosen set of keys which can cause the hash to use 2 * (2**32) * sizeof(void *) bytes of RAM. This will exhaust the address space of a 32 bit process, and potentially chew 32Gb on a 64 bit system. (However, the copying or even swapping this causes tends to be what actually makes the machine unhappy.) The flaw means that the size of the array becomes massively larger than the number of keys.

Eliminating this check, and only inspecting chain length after a normal hsplit() prevents the attack entirely, and makes such attacks relatively benign. (ie we've returned to the pre-2003 test of total keys > array size)

It turns out that a miss is as good as a mile - we got part of the memory exhaustion problem identified back then in 2003, as this comment shows:

/* Use the old HvKEYS > HvMAX condition to limit bucket splits on a rehashed hash, as we're not going to split it again, and if someone is lucky (evil) enough to get all the keys in one list they could exhaust our memory as we repeatedly double the number of buckets on every entry. Linear search feels a less worse thing to do. */

Whilst the problem is real (and the keys he generated easily demonstrated that "that's a nice 32Gb server - shame if anything were to happen to it"), I couldn't see how it was the cause of the build failure. Once the rehashing mechanism kicks in, the only split test is (as above) the old (and new) keys > array test, so with what I shall term a "chocolate teapot hash" function*, rehashing should trigger for any hash when its 15th keys is inserted, and after that hash splitting should not be subject to "attack" - the array will only grow when the number of keys exceeds the array size.

So I tried to replicate his build of a maint release with the "chocolate teapot hash". It chugged away merrily for more than a day, then failed because a process ran out of RAM. Which, upon re-running with the debugger attached was revealed to be a hash containing about 20 keys, but wanting to have an array with 2**28 entries. Which isn't supposed to happen.

It turns out that there was a second problem with the "long chain" mechanism for triggering a hash split and potential rehashing. This is also solved by published patch addressing CVE-2013-1667. Unlike the other attack, which requires a reasonable understanding of the internals plus several days CPU time to reproduce a set of attack keys, this one is can easily be reproduced as a local proof-of-concept from the text description alone, by re-using code from one of the regression tests. So as several vendors are still leaving their customers exposed, I'm loathe to describe it at this time, other than to say that requires hash re-use (so is harder to exploit), and it is fixed.

Working round this one problem in my maint build with the "chocolate teapot hash" completed, but the tests failed, with SEGVs. This was because some of MRO code assumes that all hashes are using the shared string table. (When the rehashing mechanism kicks in, the hash stops using the shared string table. This is part of what makes rehashing less efficient.) Once I was confident that the problem wasn't caused by hash bugs themselves, I figured that this wasn't worth dealing with. But I see it as more evidence that the rehashing approach is a bodge, rather than a robust solution.

So I turned back to blead (which has full randomised hashes all the time, no rehashing, and everything able to use the shared string table), and tried building it with the "chocolate teapot hash" function. The build eventually completes. After an even longer time, the tests pass. This is good.

I then reviewed the branch Yves had pushed with his proposed hash changes to protect against hash key discovery. As-was, it added another member to XPVHV, the structure used for every hash. As Perl 5 only has one hashes type, it's used for symbol tables, (most) objects, and, well, hashes, as in associative arrays storing user data, accessed by keys. Increase XPVHV and (nearly) every object gets bigger, which isn't ideal if the increase is only needed for iterating the hash. (Most code doesn't poke into objects in order to iterate over their attributes.)

So I looked to find a way to move the member into the on-demand allocated structure which holds iterator state. This revealed some fun - there are actually two functions for "splitting" a hash. A static function, S_hsplit(), which doubles the size of the array, only used by hash key insertion code, and an API function, Perl_hv_ksplit(), which can grow the hash array to "any" size requested (which is rounded up to the next power of two). The former makes simplifying assumptions because it knows that it is always only doubling, whereas the latter can not, as it needs to be able to multiply the array size by any power of two.

It's not clear why there are two functions. S_hsplit() dates back to 5.000. Perl_hv_ksplit() was added in September 1996. Archives exist for Perl5-Porters from 1996, but the patch from back then was actually a reworking of an earlier patch. Unfortunately that patch, and any subsequent discussion, was sent to the list before beginning of recorded time (21st August 1995), so it's a bit of a mystery as to why it duplicated code.

Fortunately, with a bit of massaging, it turned out to be relatively easy to refactor the two functions so that their implementations converged. This was helped by S_hsplit() being a static function, so I had complete control over both its callers, and was able to push some work out to them. (Actually, most usefully, just to one of them. As that work wasn't needed for the other call path. Eliminating work is the only sure form of optimisation.) When I got them close enough, I was able to replace the guts of Perl_hv_ksplit() with a call to S_split(), and the fork was healed.

Once it smoked clean, I merged this change back to blead, as it was independent of the other changes. With the change made, it became a lot easier to fix the original problem of moving the structure member, and Yves subsequently incorporated a variant of my suggested changes into his branch.

Nothing do to with hashing, Brian Fraser wondered something about a function in the tokeniser. The answer wasn't obvious...

Each of the core's C source code files start with a quote from Lord of the Rings. Many are right on the money. toke.c's is:

* 'It all comes from here, the stench and the peril.' --Frodo

This is quite apt, as toke.c is 12127 lines of near-impenetrable magic. Chaim Frenkel's summary is well known:

Perl's grammar can not be reduced to BNF. The work of parsing perl is distributed between yacc, the lexer, smoke and mirrors.

Larry recently offered a more quantitative evaluation:

of the four or five ways a compiler can cheat, Perl 5 uses about eight of them


So the question was concerning the static function S_force_word() in toke.c, which has a fifth argument `allow_initial_tick`, which seemed to be redundant. The function is documented, and the arguments are described as follows:

  • Arguments:
  • char *start : buffer position (must be within PL_linestr)
  • int token : PL_next* will be this type of bare word (e.g., METHOD,WORD)
  • int check_keyword : if true, Perl checks to make sure the word isn't
  • a keyword (do this if the word is a label, e.g. goto FOO)
  • int allow_pack : if true, : characters will also be allowed (require,
  • use, etc. do this)
  • int allow_initial_tick : used by the "sub" lexer only.

The function goes back a long way, with various changes to it and its callers as the result of bug fixes, but these days most of those call sites that had passed TRUE as the fifth argument had been further refactored to avoid calling it. So it seemed that the fifth argument it wasn't needed - ie it was safe to assume that it was always FALSE. Note, there's no way whatsoever from any of the documentation to work out whether this is the case. Although all the individual authors of this code are believed still to be alive and responsive to e-mail, previous attempts at asking simpler questions about code written years or decades ago have always resulted in polite replies to the effect of "I no longer remember". Which really isn't surprising.

Hence, trying to get any better understanding of how pretty much any part of the parser works requires carrying out an investigation like the one I'm about to describe.

So, for starters, there's the obvious first step - what happens if I change the code and do a full clean build and run the tests? Nothing fails. That hints that it might be redundant, but you never can be sure...

Digging around the previous historical versions where S_force_word() had been changed didn't real anything. Even the changes where that parameter has been renamed or code relating to it altered only confirmed that there had been bugs, but the changes fixed those bugs.

The approach that paid off was observing that until 2012 there were two other call sites passing TRUE to the function. So I built the version of blead just before they were refactored, and tried using FALSE instead. With that change, this code stops parsing:

    $ ./miniperl -e "sub 'foo {warn qq{ok}}; &'foo"

That suggests that the argument in question is something to do with disambiguating whether a ' is the start of a single quoted string, or a leading package separator (ie the Perl 4 way of saying ::foo)

With that knowledge, and a bit of trial and error, I was able to figure out that the code in question is needed to parse this correctly:

    $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
             -e. -e '$~ = "one"; write'

If you change the TRUE to FALSE, this happens:

    $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
             -e. -e '$~ = "one"; write'
    Undefined format "one" called at -e line 5.

(the parsing of 'format ::one =' is unaffected)

So, finally

  • a) we know what the code was for
  • b) we know how to write a test case
  • c) we can refactor the code to eliminate that argument (Brian Fraser had spotted that the tokeniser has already done the necessary work about a dozen lines earlier, with the result stored in a different variable)

But that was about 12 hours work to figure out 1 argument to 1 function in toke.c. There are 97 functions in toke.c, most have multiple arguments, and the mean function length is double that of S_force_word(). The interactions are staggeringly complex. Roughly 0.1% down, 99.9% to go.

The final significant thing I worked on this month was Unicode code point lookup by name. Perl 5 can convert from name to code point using the "\N{}" escape, which is implemented by automatically loading the charnames pragma. Using /usr/bin/time it's easy to see that loading charnames increases memory use considerably. Compare:

    $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello world\n"'
    Hello world
            Maximum resident set size (kbytes): 6976
    $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello\N{SPACE}world\n"'
          Hello world
Maximum resident set size (kbytes): 30672

Trying to get some idea of where that extra 23 meg came from:

$ /usr/bin/time -v ./perl -Ilib -le ''
            Maximum resident set size (kbytes): 6432
$ /usr/bin/time -v ./perl -Ilib -le 'use strict; use warnings'
            Maximum resident set size (kbytes): 10016
$ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"'
            Maximum resident set size (kbytes): 24144
    $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"; print charnames::vianame("SPACE")'
bc.             Maximum resident set size (kbytes): 30736

Just loading strict and warnings allocates another 4M. (Note, "Hello world" was .5M, so nothing is free.) Loading charnames allocates another 14M, and using it allocates a further 6M, presumably as various caches start to get filled.

But all these requests are for things that are already known, just not in a very convenient format for a fast lookup. The main Unicode Data file is 24430 lines, and doesn't include 4 large ranges of CJK unified ideographs, a large range of algorithmically named Hangul syllables, and about 500 aliases. Moreover, any Perl hash you build of this (or the subset that you are interested in), is allocated at runtime, from memory that isn't even going to be shared between threads, let alone between processes.

Karl and I have wondered whether it would be possible to encode the names as a trie structure, with lookup code written in C. The lookup data would be unmodified at runtime, so could be compiled by the C compiler into the constant data section of the executable (or shared library), which will be shared at least between threads, and on *nix (at least) between all processes. So I've made a start at looking at this. By the end of the week my code had reached the point where I have Perl code to parse all the files, and generate suitable data structures, along with Perl code written in a C style which can correctly look up any code point by name, except the CJK ideographs and Hangul names. Which is pleasing.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in http://perl5.git.perl.org/perl.gi
RT #... is a bug in https://rt.perl.org/rt3/
CPAN #... is a bug in https://rt.cpan.org/Public/
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules

11.75'foo special case
0.50Devel::PPPort/my $_
PERL_HASH "crap" hash
PERL_HASH "crap" hash (there is an assumption that MRO stays shared)
PERL_HASH hashes
PERL_HASH hashes -- Use the old HvKEYS > HvMAX condition
PERL_HASH hashes. Aha. rehash/off/on/off/on/off
1.75RT #109828
0.25RT #113620
0.25RT #116362
0.25RT #116615
2.50RT #116639
0.25RT #116789
0.25RT #116795
0.75RT #116833
0.25RT #116839
0.75RT #116899
0.75RT #116943
0.50RT #116989
0.25RT #59802
9.00Unicode Names
0.25abolish STRANGE_MALLOC
0.50benchmarking regressions
5.75hv_ksplit/hsplit merge
2.00inversion lists
0.25investigating security tickets
0.25lexical $_
3.25pahole on the interpreter struct
0.25parallel tests
5.50process, scalability, mentoring
39.50reading/responding to list mail
signatures (call checker)
2.50struct re_save_state

153.00 hours total

* For public reporting purposes it's a "chocolate teapot hash". I used a terser term myself in private.

Perl 5 Grant Completion

Ricardo Signes' Perl QA Hackathon grant has been successfully completed and closed. Details regarding the grant may be found on his blog.

Please consider making a donation of any amount to help support projects such as this. Details regarding the Perl 5 Core Maintenance Fund may be found on The Perl Foundation's web site.

About TPF

The Perl Foundation - supporting the Perl community since 2000. Find out more at www.perlfoundation.org.

About this Archive

This page is an archive of entries from April 2013 listed from newest to oldest.

March 2013 is the previous archive.

May 2013 is the next archive.

Find recent content on the main index or look in the archives to find all content.


OpenID accepted here Learn more about OpenID
Powered by Movable Type 6.2.2