Improving Perl 5: Grant Report for Month 16

1 Comment

Nicholas Clark writes:

As per my grant conditions, here is a report for the January period.

After catching up with some of the e-mail backlog, I tried having another prod at the awkward clang/ASAN problem described last month.

It turned out that I missed something. Key to replicating the problem was that one must turn on clang's optimiser. Of course, trying to debug a problem in the C code, I'd gone off down the path of trying debugging builds, ie -g, which by default don't enable the optimiser. (Debugging code with the optimiser enabled tends to screw with your head, because the debugger reports execution jumping backwards and forwards.) I hadn't tried compiling with -O on a local machine, and that does replicate the problem. Better still, on that machine, -O -g together still replicates the problem. So, now I can stick gdb on it:

    ok 1 - $^H{foo} doesn't exist initially
    ok 2 - $^H doesn't contain HINT_LOCALIZE_HH initially
    ok 3 - $^H{foo} is now 'a'
    ok 4 - $^H contains HINT_LOCALIZE_HH while compiling
    ok 5 - $^H{foo} is now 'b'
    ok 6 - $^H{foo} restored to 'a'
    ok 7 - $^H{foo} doesn't exist while finishing compilation
    ok 8 - $^H doesn't contain HINT_LOCALIZE_HH while finishing compilation
    Breakpoint 1, 0x000000000042e440 in __asan_report_error ()
    (gdb) up
    #1  0x000000000042f7a7 in __asan_report_load8 ()
    #2  0x00000000004fa030 in Perl_re_op_compile (patternp=<optimized out>,
        pat_count=-1292, expr=0xffffffffb6e, eng=<optimized out>, old_re=0x0,
        is_bare_re=0x7fffffffdba4, pm_flags=<optimized out>,
        orig_rx_flags=<optimized out>) at regcomp.c:5634
    5634        if (   old_re
    (gdb) p old_re
    $1 = (REGEXP * volatile) 0x0
    (gdb) p &old_re
    Address requested for identifier "old_re" which is in register $r8

Odd. How come a volatile variable thinks that it's in a register? And it's an LVALUE, so why can't I take its address? What does the compiler think - let's try some "printf" debugging:

    diff --git a/regcomp.c b/regcomp.c
    index a8b27dc..342e772 100644
    --- a/regcomp.c
    +++ b/regcomp.c
    @@ -5631,6 +5631,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_cou
         /* return old regex if pattern hasn't changed */
    +    PerlIO_printf(PerlIO_stderr(), "%p %p\n", old_re, &old_re);
         if (   old_re
             && !recompile
            && !!RX_UTF8(old_re) == !!RExC_utf8

and we stop at the same place:

    ok 1 - $^H{foo} doesn't exist initially
    ok 2 - $^H doesn't contain HINT_LOCALIZE_HH initially
    ok 3 - $^H{foo} is now 'a'
    ok 4 - $^H contains HINT_LOCALIZE_HH while compiling
    ok 5 - $^H{foo} is now 'b'
    ok 6 - $^H{foo} restored to 'a'
    ok 7 - $^H{foo} doesn't exist while finishing compilation
    ok 8 - $^H doesn't contain HINT_LOCALIZE_HH while finishing compilation
    0 7fffffffd760
    0 7fffffffd760
    Breakpoint 1, 0x000000000042e440 in __asan_report_error ()
    (gdb) up
    #1  0x000000000042f7a7 in __asan_report_load8 ()
    (gdb) up
    #2  0x00000000004fa060 in Perl_re_op_compile (patternp=<optimized out>,
        pat_count=-1292, expr=0xffffffffb6e, eng=<optimized out>,
        old_re=<optimized out>, is_bare_re=0x7fffffffdba4,
        pm_flags=<optimized out>, orig_rx_flags=<optimized out>) at regcomp.c:5634
    5634        PerlIO_printf(PerlIO_stderr(), "%p %p\n", old_re, &old_re);
    (gdb) p old_re
    $2 = <optimized out>
    (gdb) p &old_re
    Can't take address of "old_re" which isn't an lvalue.

Can't take the address of it? That's despite the printf clearly showing that its address is 0x7fffffffd760.

So something is clearly going badly wrong with the code generation here. It's probably a bug in clang. But (a) it's not clear how to reduce it to a terser test case (b) that's not actually helpful, as I really need to make this code work on the clang we have here and now, as without this, we can't usefully use ASAN any more to find bugs.

But some things started to come together. Whilst this specific failure is probably a bug in clang, the function isn't exactly innocent. A couple of people had already reported that they could get everything to pass if they marked more variables as volatile. But playing whack-a-mole with volatile is papering over the symptoms, not finding the cause. However, digging further into the blame log for the function revealed a couple of previous commits in the past few years adding volatile qualifiers, to fix compiler warnings due to the addition of a all to setjmp(). So what's that all about?

The setjmp() relates to an optimisation added by commit bbd61b5ffb7621c2 in September 2010. The backstory is that the compiled format for regular expressions differs if it needs to store Unicode. (The documentation refers to the human-readable form as the pattern, and the compiled form as the program.) The Unicode representation for the program is larger, and cannot be matched as efficiently, so it's preferable to use the tighter byte-based format where possible. Unfortunately, it's not always possible to know for sure which is the right decision until midway through parsing. If the pattern contains literal Unicode, it's obvious that the program needs to store Unicode. Otherwise, the parser optimistically assumes that the more efficient representation can be used, and starts sizing on this basis. However, if it then encounters something in the pattern which must be stored as Unicode, such as an \x{...} escape sequence representing a character literal, then this means that all previously calculated sizes need to be redone, using values appropriate for the Unicode representation.

The problem is that the point where the parser discovers that it needs to redo everything from the start is at least 4 functions down in the best case, and possibly a lot further if the "problem" construction is within parentheses groups. (The parser calls back to the top level to parse parentheses.) Before the commit mentioned above, the recalculation was implemented by setting a flag of "really, this needs Unicode", and simply carrying on. The top level call spotted the flag, and restarted the parse. Obviously, this is rather wasteful, particularly if the problem is detected very early in the pattern.

So that commit refactored things so that the top level caller used setjmp() to store a checkpoint, and upon detecting the need for a restart use longjmp() to warp straight back to it. This skips doing needless work, which is good. The problem is the warp bit - it's far more "warp" than "jump", as the C compiler is not expecting the perfectly innocent looking setjmp() function to return twice. C doesn't have continuations, so this doesn't happen. But of course, it does, and it is breaking the rules, introducing control flow that the compiler has no idea about, so you have to start (effectively) lying to the compiler left right and centre, by declaring variables volatile, forbidding it to do any (sane, reasonable, normal) optimisation, so that the non-local control flow doesn't come unstuck. And it's easy to miss one that matters, as the compiler doesn't warn.

So the question then became, can I implement the early return in some way other than longjmp()? It sort of looked plausible from an initial look at the functions - they only return NULL for special cases, and everywhere checks the return value and propagates NULL onward. So hook into that.

All, that is, except this call to S_regbranch():

        if (is_define)
            vFAIL("(?(DEFINE)....) does not allow branches");
        lastbr = reganode(pRExC_state, IFTHEN, 0); /* Fake one for optimizer. */
        regbranch(pRExC_state, &flags, 1,depth+1);

So is that a bug I've just found? Or something more subtle that doesn't matter. What's with the return value of S_regbranch(), and is it safe to ignore? The code in question dates from November 1997, and is part of Ilya's "Jumbo regexp patch" (commit c277df42229d99fe). More confusingly, it added code with two calls from S_reg() to S_regbranch(), one of which carefully checks the return value and generates a LONGJMP node if it returns NULL, the other of which is called in void context, and so both ignores any return value, or the possibility that it is NULL.

So, to try to figure this out...

As documented in pod/perlreguts.pod, the call graph for regex parsing involves several levels of functions in regcomp.c, sometimes recursing more than once.

The top level compiling function, S_reg(), calls S_regbranch() to parse each single branch of an alternation. In turn, that calls S_regpiece() to parse a simple pattern followed by quantifier, which calls S_regatom() to parse that simple pattern. S_regatom() can call S_regclass() to handle classes, but can also recurse into S_reg() to handle subpatterns and some other constructions. Some other routines call call S_reg(), sometimes using an alternative pattern that they generate dynamically to represent their input.

These routines all return a pointer to a regnode structure, and take a pointer to an integer that holds flags, which is also used to return information. After quite a bit of head scratching and figuring things out I have untangled the possible return values from these 5 functions (and related functions which call S_reg()). The return value is either a pointer to the last node added to the program, or NULL. The pointer is to something already accessible elsewhere, so it's just a convenience return. Nothing leaks by ignoring it. But what about those NULL returns?

The obvious cause of NULL returns is when S_reg() encounters a pragma-like construction - an embedded pattern match operator, such as (?i). In this case, it consumes it, acts on it, and sets the flags to TRYAGAIN and returns NULL. But the code seemed to be prepared to cope with other NULL returns, without the TRYAGAIN flag. So, can those happen?

Starting from the top:

S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma- like constructions that it handles. Otherwise, historically it would return NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece() returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL.

S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with TRYAGAIN, but (historically) if S_regatom() returns NULL without setting the flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls S_reg() it has similar behaviour when passing back return values, although often it is able to loop instead on getting a TRYAGAIN.

Which gets us back to S_reg(), which can only generate NULL in conjunction with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it called generated it. All other functions that these call that return regnode structures cannot return NULL. Hence

1) in the loop of functions called, there is no source for a return value of NULL without the TRYAGAIN flag being set

2) a return value of NULL with TRYAGAIN set from an inner function does not propagate out past S_regbranch()

Hence the only return values that most functions can generate are non-NULL, or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot return NULL. The longest sequence of functions that can return NULL (with TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch(). Rapidly returning right round the loop back to S_reg() is not possible.

Hence code added by commit c277df42229d99fe to handle a NULL return from S_regbranch(), along with some other code is dead.

More usefully, this means that all the twisted maze of code makes some sense - the only "can happen" NULL return value is with the flags TRYAGAIN. Which means that it is going to be possible to use a NULL return value with a different flag to signal "restart with Unicode", and eliminate the longjmp().

However, a couple of things still stood in the way. One was working out that all other possible return paths from calls to S_reg() were covered. Various routines in the parser for tricky constructions (eg case insensitive Unicode folds) are implemented by building up a pattern that represents the construction in question, and then making a call to S_reg() to parse that as if it were written in place of the original. So I needed to check each code path that could get to one of these, to work out whether it could trigger the restart. Most couldn't, but three or four could, and two of them didn't have tests. They do now.

Finally I killed the longjmp(). Firstly by moving it up into the same function as the setjmp() and then by replacing it with a goto. Yes, unfortunately still evil. But definitely the lesser of two evils. Particularly as clang was now happy, and address sanitiser reports no errors. I pushed it to a smoke-me branch to let it stew.

As part of this, I discovered a recent regression in the SV dumping code. Father Chrysostomos fixed problems with the interaction between regex objects and LVALUE magic by changing the location in the the SV wrapper of the pointer to the pattern. However, no-one noticed that the SV dumping code wasn't aware of this change, hence dumping regexs no longer worked. Dumping regexs is rather useful when you're trying to figure out what the regex parser is doing :-(. So I added tests for the dumper, and fixed the bug.

During all this I'd had a couple of further insights into hashing, so spent some more time investigating them. I sent an updated summary to the security list. There's nothing I want to report publicly, other than the fact that I'm still comfortable with the current state of all stable releases and blead.

Finally, of note this month, I killed the Rhapsody port. I've been forgetting to note that I'd been removing code related to one dead OS each month. This month was Rhapsody, an Apple OS that later evolved into Darwin and Mac OS X. It was initially only released to developers, but later became Mac OS X Server, wit releases in 1999 and 2000. It was obsoleted by Mac OS X 10.0, released in March 2001. That's 142 lines gone, no longer getting in the way:

    Configure               |   2 +-
    Cross/Makefile-cross-SH |   2 +-
    MANIFEST                |   1 -
    Makefile.SH             |   2 +-
    hints/       | 138 ---------------------------------------------
    installperl             |   4 +-
    pod/perldelta.pod       |   6 +--
    t/op/stat.t             |   3 +-
    8 files changed, 8 insertions(+), 150 deletions(-)

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules

0.25RT #24689
6.00SVt_REGEXP and sv_dump
1.00investigating security tickets
0.50module evictions
6.25process, scalability, mentoring
26.00reading/responding to list mail
regcomp/setjmp (killed longjmp)
0.25timely destruction

132.25 hours total

1 Comment

wow. I didn't understand it in detail, but I adore the way you tracked down the issue

About TPF

The Perl Foundation - supporting the Perl community since 2000. Find out more at

About this Entry

This page contains a single entry by Karen Pauley published on February 19, 2013 8:50 AM.

YAPC::NA::2014 Call for Location was the previous entry in this blog.

Perl 5 Grant Application Accepted is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.


OpenID accepted here Learn more about OpenID
Powered by Movable Type 6.2.2