Improving Perl 5: Grant Report for Month 17
Mon, 22-Apr-2013 by Karen Pauley
_Nicholas Clark writes:_
As per my grant conditions, here is a report for the February period.
The first significant thing I worked on in February was a detailed review of Peter Martini's work towards an API for subroutine signatures. In particular, I wondered how much of the existing call checker hooks it could use. In turn, I wondered whether the call checker hooks were robust against some of the torture tests I'd suggested for signatures...
bq. Your signatures code needs some evil test cases. I suspect fun things involving various combination of threads, string evals and closures would stress it.
bq. So starting with simple things like
bq. *) calling a function with signatures from a subthread.
*) creating a function within a string eval
bq. then build up to
bq. *) creating a function within a string eval within subthread
*) creating functions in a child thread, spawning a grandchild thread, child thread exits, grandchild thread runs (returning results to first thread)
bq. the intent being to ensure that data structures are cloned properly, and by using an intermediate child thread that goes away, ensure that no pointers dangle into the freed up memory of their parent.
For both the call checker and signatures, the implementation will be similar in how it interacts with closures and with ithreads. Closures are implemented by run-time cloning of the compiled subroutine. Spawning an ithread is also a runtime clone.
So I had a good look at the call checker code. There had been bugs related to closures, but these had been discovered and fixed later on, although not all the possibilities had yet been tested, so I added some tests.
Sadly I concluded that the call checker code only offers half the hooks that signatures need. The call checker runs entirely at compile time, whereas a key part of the proposed signatures API is to add a hook at runtime, run instead of the code that creates @_ (and all the associated overhead - not just the creation, but all the explicit ops run by the user's code to unpack @_ back into lexicals). So it's not the solution here.
However, the real fun started soon after, thanks to Yves persisting in investigating what he felt was a problem with the hashing implementation in the stable releases. A few days after I wrote my previous monthly report, he sent perl5-security-report@perl.org a short list of lowercase ASCII strings which, when used as keys for a hash, would exhaust memory without the rehashing mechanism ever kicking in. The rehashing mechanism was meant to protect against attacks - this attack exploits a flaw in its implementation, and was announced as CVE-2013-1667.
I'm still not in a position to say everything, as 5 weeks later not all vendors have shipped updates. However, the fix and its commit message have been public for 3 weeks now, so it's safe to repeat what they reveal.
To be clear - I'll use the term "hash" or "hashes" for the Perl associative array, named with % and implemented as a hash table, and use "hash function" for the many-to-one function that converts a string to a fixed-size numeric value, a value which is needed to implement a hash table.
Hashes are implemented as a power-of-two sized array of linked lists of hash entries. The array holds the linked list heads, and the array entries are often referred to as "buckets". If the hash is working well, the linked lists are short. To keep the lists short requires that the hash is "split" periodically as elements are inserted, doubling the size of the array, reassigning list entries to new positions, and hopefully causing the entries to be spread across more positions, shortening the lists.
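To make that concrete, here is a minimal sketch in C of the shape of the structure and of a lookup. The names are mine, not those used in hv.h/hv.c, and real details (the shared string table, iterators, and so on) are omitted:

bc. #include <stddef.h>
#include <string.h>
/* One entry in a bucket's linked list. */
struct entry {
    struct entry *next;   /* next entry sharing this bucket */
    char         *key;
    void         *value;
};
/* The hash itself: an array of list heads ("buckets"). */
struct table {
    struct entry **buckets; /* array size is always a power of two */
    size_t         max;     /* array size - 1, usable as a bit mask */
    size_t         keys;    /* number of entries currently stored */
};
/* 'hash' is the hash function's output for 'key'. As the array size is
   a power of two, masking with 'max' picks a bucket; finding the key is
   then a linear walk along that bucket's list. */
static struct entry *
find_entry(const struct table *t, const char *key, unsigned long hash)
{
    struct entry *e = t->buckets[hash & t->max];
    for (; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}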
The implementation flaw was that a hash split could be triggered if a linked list became "too long". "Too long" was deemed to be 15 - much longer than the 0 to 3 entries of a well behaved hash, and we thought that this would only happen if a program was being purposefully attacked with crafted malicious data. This seemed like a good idea at the time (9.5 years ago), as it meant that an attack would be deflected more quickly (after 15 entries). Previously the split condition was that the number of keys was greater than the array size - ie that the mean chain length was about to exceed 1.0. This is reasonable for a well behaved hash, but would mean that malicious data where all keys in the hash used the same chain would not be detected until the chain had become as long as the size of the array, potentially meaning unhappy O(n) behaviour instead of amortised O(1) for quite a while as the hash grew.
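In the same sketch style (using keys > max as a stand-in for the HvKEYS(hv) > HvMAX(hv) condition quoted below), the two split triggers look roughly like this:

bc. #include <stddef.h>
/* nkeys: entries stored; max: array size - 1; chain: length of the
   bucket list just inserted into. */
static int want_split_classic(size_t nkeys, size_t max) {
    /* the HvKEYS(hv) > HvMAX(hv) style condition - roughly, split once
       the mean chain length is about to exceed 1.0 */
    return nkeys > max;
}
static int want_split_2003(size_t nkeys, size_t max, size_t chain) {
    /* the same, plus the extra trigger that the fix removes: split
       early if any single chain has reached 15 entries */
    return nkeys > max || chain >= 15;
}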
All this went unnoticed for 9.5 years, until Yves tried to test the build by changing hv.h to use a pathologically bad hashing function - return the same constant whatever the string. He (and I) would expect that this would be slow - after all it's invoking an O(n) linked-list walk for every lookup, and loops over keys will go from O(n) to O(n²), but given enough CPU time we would expect it to complete. However, it doesn't. It crashes. It was this unexpected crash that caused him to think that something was wrong, and keep investigating.
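The pathologically bad hash function amounts to something like this - a stand-alone sketch of the idea, whereas Yves's actual change was to the hash function in hv.h:

bc. #include <stddef.h>
/* Every string hashes to the same value, so every key lands in the same
   bucket and every lookup degenerates into a linear list walk: O(n) per
   lookup, O(n**2) for a loop over all keys. */
static unsigned long
constant_hash(const char *key, size_t len)
{
    (void)key;   /* deliberately ignore the input entirely */
    (void)len;
    return 0;
}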
What he discovered was that triggering a hsplit due to long chain length allows an attacker to create a carefully chosen set of keys which can cause the hash to use 2 * (2**32) * sizeof(void *) bytes of RAM. This will exhaust the address space of a 32 bit process, and potentially chew 32Gb on a 64 bit system. (However, the copying or even swapping this causes tends to be what actually makes the machine unhappy.) The flaw means that the size of the array becomes massively larger than the number of keys.
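A back-of-the-envelope sketch (not the attack itself, just the arithmetic) shows why the bucket array races ahead of the key count if an attacker can force a doubling on almost every insertion:

bc. #include <stdio.h>
int main(void) {
    unsigned long long buckets = 8;    /* a fresh hash starts small */
    int keys = 0;
    while (buckets < (1ULL << 32)) {   /* one forced doubling per key */
        buckets *= 2;
        keys++;
    }
    printf("%d keys -> %llu buckets, %llu bytes of bucket pointers\n",
           keys, buckets, buckets * (unsigned long long)sizeof(void *));
    return 0;
}

On a 64 bit system that final line is a 32Gb array of bucket heads for a few dozen keys; on a 32 bit system the address space runs out long before the loop finishes. (The real attack, of course, needs a carefully crafted key set to force those splits.)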
Eliminating this check, and only inspecting chain length after a normal hsplit(), prevents the attack entirely and makes such attacks relatively benign. (ie we've returned to the pre-2003 test of total keys > array size)
It turns out that a miss is as good as a mile - part of the memory exhaustion problem had been identified back in 2003, as this comment shows:
bq. /* Use the old HvKEYS(hv) > HvMAX(hv) condition to limit bucket splits on a rehashed hash, as we're not going to split it again, and if someone is lucky (evil) enough to get all the keys in one list they could exhaust our memory as we repeatedly double the number of buckets on every entry. Linear search feels a less worse thing to do. */
Whilst the problem is real (and the keys he generated easily demonstrated that "that's a nice 32Gb server - shame if anything were to happen to it"), I couldn't see how *it* was the cause of the build failure. Once the rehashing mechanism kicks in, the only split test is (as above) the old (and new) keys > array test, so with what I shall term a "chocolate teapot hash" function*, rehashing should trigger for any hash when its 15th key is inserted, and after that hash splitting should not be subject to "attack" - the array will only grow when the number of keys exceeds the array size.
So I tried to replicate his build of a maint release with the "chocolate teapot hash". It chugged away merrily for more than a day, then failed because a process ran out of RAM. Which, upon re-running with the debugger attached, was revealed to be a hash containing about 20 keys but wanting an array with 2**28 entries. Which isn't supposed to happen.
It turns out that there was a second problem with the "long chain" mechanism for triggering a hash split and potential rehashing. This is also solved by the published patch addressing CVE-2013-1667. Unlike the other attack, which requires a reasonable understanding of the internals plus several days of CPU time to reproduce a set of attack keys, this one can easily be reproduced as a local proof-of-concept from the text description alone, by re-using code from one of the regression tests. So, as several vendors are still leaving their customers exposed, I'm loath to describe it at this time, other than to say that it requires hash re-use (so is harder to exploit), and that it is fixed.
Working round this one problem, my maint build with the "chocolate teapot hash" completed, but the tests failed with SEGVs. This was because some of the MRO code assumes that all hashes are using the shared string table. (When the rehashing mechanism kicks in, the hash stops using the shared string table. This is part of what makes rehashing less efficient.) Once I was confident that the problem wasn't caused by hash bugs themselves, I figured that this wasn't worth dealing with. But I see it as more evidence that the rehashing approach is a bodge, rather than a robust solution.
So I turned back to blead (which has full randomised hashes all the time, no rehashing, and everything able to use the shared string table), and tried building it with the "chocolate teapot hash" function. The build eventually completes. After an even longer time, the tests pass. This is good.
I then reviewed the branch Yves had pushed with his proposed hash changes to protect against hash key discovery. As it stood, it added another member to XPVHV, the structure used for every hash. As Perl 5 only has one hash type, it's used for symbol tables, (most) objects, and, well, hashes, as in associative arrays storing user data, accessed by keys. Increase the size of XPVHV, and (nearly) every object gets bigger, which isn't ideal if the increase is only needed for iterating the hash. (Most code doesn't poke into objects in order to iterate over their attributes.)
So I looked for a way to move the member into the on-demand allocated structure which holds the iterator state. This revealed some fun - there are actually two functions for "splitting" a hash: a static function, S_hsplit(), which doubles the size of the array and is only used by the hash key insertion code, and an API function, Perl_hv_ksplit(), which can grow the hash array to "any" size requested (rounded up to the next power of two). The former makes simplifying assumptions because it knows that it is only ever doubling, whereas the latter cannot, as it needs to be able to multiply the array size by any power of two.
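In sketch form (mine, not the hv.c code), the difference in what the two functions have to cope with is roughly:

bc. #include <stddef.h>
/* S_hsplit()-style growth: the array is always exactly doubled. */
static size_t grow_double(size_t oldsize) {
    return oldsize * 2;
}
/* Perl_hv_ksplit()-style growth: grow to at least the requested size,
   i.e. keep doubling until the next power of two at or above it. */
static size_t grow_to_at_least(size_t oldsize, size_t wanted) {
    size_t newsize = oldsize;
    while (newsize < wanted)
        newsize *= 2;
    return newsize;
}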
It's not clear why there are two functions. S_hsplit() dates back to 5.000. Perl_hv_ksplit() was added in September 1996. Archives exist for Perl5-Porters from 1996, but the patch from back then was actually a reworking of an earlier patch. Unfortunately *that* patch, and any subsequent discussion, was sent to the list before the beginning of recorded time (21st August 1995), so it's a bit of a mystery as to why it duplicated code.
Fortunately, with a bit of massaging, it turned out to be relatively easy to refactor the two functions so that their implementations converged. This was helped by S_hsplit() being a static function, so I had complete control over both its callers and was able to push some work out to them. (Actually, most usefully, just to one of them, as that work wasn't needed for the other call path. Eliminating work is the only sure form of optimisation.) When I got them close enough, I was able to replace the guts of Perl_hv_ksplit() with a call to S_hsplit(), and the fork was healed.
Once it smoked clean, I merged this change back to blead, as it was independent of the other changes. With the change made, it became a lot easier to fix the original problem of moving the structure member, and Yves subsequently incorporated a variant of my suggested changes into his branch.
Nothing to do with hashing: Brian Fraser wondered something about a function in the tokeniser. The answer wasn't obvious...
Each of the core's C source code files starts with a quote from The Lord of the Rings. Many are right on the money. toke.c's is:
bq. * 'It all comes from here, the stench and the peril.' --Frodo
This is quite apt, as toke.c is 12127 lines of near-impenetrable magic. Chaim Frenkel's summary is well known:
bq. Perl's grammar can not be reduced to BNF. The work of parsing perl is distributed between yacc, the lexer, smoke and mirrors.
Larry recently offered a more quantitative evaluation:
bq. of the four or five ways a compiler can cheat, Perl 5 uses about eight of them
"http://irclog.perlgeek.de/perl6/2013-01-10#i_6318090":http://irclog.perlgeek.de/perl6/2013-01-10#i_6318090
So the question concerned the static function S_force_word() in toke.c, whose fifth argument, `allow_initial_tick`, seemed to be redundant. The function is documented, and the arguments are described as follows:
bc. * Arguments:
 *   char *start : buffer position (must be within PL_linestr)
 *   int token : PL_next* will be this type of bare word (e.g., METHOD,WORD)
 *   int check_keyword : if true, Perl checks to make sure the word isn't
 *                       a keyword (do this if the word is a label, e.g. goto FOO)
 *   int allow_pack : if true, : characters will also be allowed (require,
 *                    use, etc. do this)
 *   int allow_initial_tick : used by the "sub" lexer only.
The function goes back a long way, with various changes to it and its callers as the result of bug fixes, but these days most of the call sites that had passed TRUE as the fifth argument have since been refactored to avoid calling it. So it seemed that the fifth argument wasn't needed - ie that it was safe to assume that it was always FALSE. Note that there's no way whatsoever from any of the documentation to work out whether this is the case. Although all the individual authors of this code are believed still to be alive and responsive to e-mail, previous attempts at asking simpler questions about code written years or decades ago have always resulted in polite replies to the effect of "I no longer remember". Which really isn't surprising.
Hence, trying to get any better understanding of how pretty much any part of the parser works requires carrying out an investigation like the one I'm about to describe.
So, for starters, there's the obvious first step - what happens if I change the code and do a full clean build and run the tests? Nothing fails. That hints that it might be redundant, but you never can be sure...
Digging around the previous historical versions where S_force_word() had been changed didn't reveal anything. Even the changes where that parameter had been renamed, or code relating to it altered, only confirmed that there had been bugs, and that those changes fixed them.
The approach that paid off was observing that until 2012 there were two other call sites passing TRUE to the function. So I built the version of blead just before they were refactored, and tried using FALSE instead. With that change, this code stops parsing:
bc. $ ./miniperl -e "sub 'foo {warn qq{ok}}; &'foo"
That suggests that the argument in question is something to do with disambiguating whether a ' is the start of a single quoted string, or a leading package separator (ie the Perl 4 way of saying ::foo).
With that knowledge, and a bit of trial and error, I was able to figure out that the code in question is needed to parse this correctly:
bc. $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
-e. -e '$~ = "one"; write'
One!
If you change the TRUE to FALSE, this happens:
bc. $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
-e. -e '$~ = "one"; write'
Undefined format "one" called at -e line 5.
(the parsing of 'format ::one =' is unaffected)
So, finally:
* a) we know what the code was for
* b) we know how to write a test case
* c) we can refactor the code to eliminate that argument (Brian Fraser had spotted that the tokeniser has already done the necessary work about a dozen lines earlier, with the result stored in a different variable)
But that was about 12 hours' work to figure out 1 argument to 1 function in toke.c. There are 97 functions in toke.c, most have multiple arguments, and the mean function length is double that of S_force_word(). The interactions are staggeringly complex. Roughly 0.1% down, 99.9% to go.
The final significant thing I worked on this month was Unicode code point lookup by name. Perl 5 can convert from name to code point using the "\N{}" escape, which is implemented by automatically loading the charnames pragma. Using /usr/bin/time it's easy to see that loading charnames increases memory use considerably. Compare:
bc. $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello world\n"'
Hello world
bc. Maximum resident set size (kbytes): 6976
bc. $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello\N{SPACE}world\n"'
Hello world
bc. Maximum resident set size (kbytes): 30672
Trying to get some idea of where that extra 23 meg came from:
bc. $ /usr/bin/time -v ./perl -Ilib -le ''
Maximum resident set size (kbytes): 6432
bc. $ /usr/bin/time -v ./perl -Ilib -le 'use strict; use warnings'
Maximum resident set size (kbytes): 10016
bc. $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"'
Maximum resident set size (kbytes): 24144
bc. $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"; print charnames::vianame("SPACE")'
32
bc. Maximum resident set size (kbytes): 30736
Just loading strict and warnings allocates another 4M. (Note, "Hello world" was .5M, so nothing is free.) Loading charnames allocates another 14M, and using it allocates a further 6M, presumably as various caches start to get filled.
But all these requests are for things that are already known, just not in a very convenient format for fast lookup. The main UnicodeData file is 24430 lines, and that doesn't include 4 large ranges of CJK unified ideographs, a large range of algorithmically named Hangul syllables, or about 500 aliases. Moreover, any Perl hash you build of this (or of the subset that you are interested in) is allocated at runtime, from memory that isn't even going to be shared between threads, let alone between processes.
Karl and I have wondered whether it would be possible to encode the names as a trie structure, with lookup code written in C. The lookup data would be unmodified at runtime, so it could be compiled by the C compiler into the constant data section of the executable (or shared library), which will be shared at least between threads, and on *nix (at least) between all processes. So I've made a start at looking at this. By the end of the week my code had reached the point where I had Perl code to parse all the files and generate suitable data structures, along with Perl code written in a C style which could correctly look up any code point by name, except for the CJK ideographs and Hangul names. Which is pleasing.
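As a flavour of the idea - this is my own illustrative C, covering just two names, not the code Karl and I are actually working on - the lookup data can live entirely in constant static data, walked by a small routine:

bc. #include <stdio.h>
#include <string.h>
/* A node consumes the string on its incoming edge; a code point of -1
   means no character name ends at this node. */
struct node {
    const char        *edge;
    long               codepoint;
    const struct node *const *kids;   /* NULL-terminated list, or NULL */
};
/* Constant data: a tiny trie holding "SPACE" (U+0020) and "SOLIDUS"
   (U+002F), sharing their leading "S". */
static const struct node n_pace   = { "PACE",   0x20, NULL };
static const struct node n_olidus = { "OLIDUS", 0x2F, NULL };
static const struct node *const s_kids[]    = { &n_pace, &n_olidus, NULL };
static const struct node n_s      = { "S",      -1,   s_kids };
static const struct node *const root_kids[] = { &n_s, NULL };
static const struct node root     = { "",       -1,   root_kids };
static long
name_to_codepoint(const char *name)
{
    const struct node *n = &root;
    while (*name) {
        const struct node *const *k = n->kids;
        const struct node *next = NULL;
        for (; k && *k; k++) {
            size_t len = strlen((*k)->edge);
            if (strncmp(name, (*k)->edge, len) == 0) {
                next = *k;
                name += len;
                break;
            }
        }
        if (!next)
            return -1;   /* no edge matches: unknown name */
        n = next;
    }
    return n->codepoint;
}
int main(void) {
    printf("%ld\n", name_to_codepoint("SPACE"));   /* prints 32 */
    printf("%ld\n", name_to_codepoint("SOLIDUS")); /* prints 47 */
    return 0;
}

Because nothing in that data is modified at runtime, the compiler can place it in the read-only data section, which is what gives the sharing between threads and processes described above.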
A more detailed breakdown summarised from the weekly reports. In these:
16 hex digits refer to commits in "http://perl5.git.perl.org/perl.git":http://perl5.git.perl.org/perl.git
RT #... is a bug in "https://rt.perl.org/rt3/":https://rt.perl.org/rt3/
CPAN #... is a bug in "https://rt.cpan.org/Public/":https://rt.cpan.org/Public/
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules
| Hours | Activity |
| 11.75 | 'foo special case |
| 0.50 | Android |
| 0.50 | Devel::PPPort/my $_ |
| | MAD |
| 1.75 | MVS |
| 24.00 | PERL_HASH |
| | PERL_HASH "crap" hash |
| | PERL_HASH "crap" hash (there is an assumption that MRO stays shared) |
| | PERL_HASH hashes |
| | PERL_HASH hashes -- Use the old HvKEYS(hv) > HvMAX(hv) condition |
| | PERL_HASH hashes. Aha. rehash/off/on/off/on/off |
| 0.50 | PL_interp_size_5_18_0 |
| 7.25 | PL_sv_objcount |
| 1.75 | RT #109828 |
| 0.25 | RT #113620 |
| 0.25 | RT #116362 |
| 0.25 | RT #116615 |
| 2.50 | RT #116639 |
| 0.25 | RT #116789 |
| 0.25 | RT #116795 |
| 0.75 | RT #116833 |
| 0.25 | RT #116839 |
| 0.75 | RT #116899 |
| 0.75 | RT #116943 |
| 0.50 | RT #116989 |
| 0.25 | RT #59802 |
| 0.50 | SvOK |
| 9.00 | Unicode Names |
| 0.75 | WinCE |
| 0.25 | abolish STRANGE_MALLOC |
| 1.25 | android |
| 0.50 | benchmarking regressions |
| 5.75 | hv_ksplit/hsplit merge |
| 2.00 | inversion lists |
| 0.25 | investigating security tickets |
| 0.25 | lexical $_ |
| 3.25 | pahole on the interpreter struct |
| 0.25 | parallel tests |
| 5.50 | process, scalability, mentoring |
| 39.50 | reading/responding to list mail |
| 3.50 | regcomp/setjmp |
| 18.00 | signatures |
| | signatures (call checker) |
| 2.50 | struct re_save_state |
| 0.25 | warnings.pm |
| 4.75 | yves/hv_h_split |
**153.00 hours total**
==
* For public reporting purposes it's a "chocolate teapot hash". I used a terser term myself in private.
==