Fixing Perl5 Core Bugs: Report for Months 29 & 30

Dave Mitchell writes:

As per my grant conditions, here is a report for the July/August period.

I spent a bit of time fixing a few issues caused by my rewriting of the /(?{})/ implementation, then started to look into the last unclosed ticket still attached to the re_eval meta-ticket. This concerns code within (?{}) that modifies the string being matched against, and generally causes assertion failures or coredumps:

my $text = "a"; $text =~ m/(.(?{ $text .= "x" }))*/;

While trying to understand what's going on, I ended up delving into the issue of how and when perl makes a copy of the string buffer in order to make $1, $& etc. continue to show the right value even if the string is subsequently changed. It turns out that in some circumstances this can have a huge performance penalty. For example, the following code takes several minutes to run, since it mallocs and copies a 1MB buffer a million times:

$&;
$_ = 'x' x 1_000_000;
1 while /(.)/g;

If you remove the $&, it runs fast (<1s), but this is only because pp_match has a special hack added that says "even if the pattern contains captures, in the presence of /g don't bother copying the string buffer". As a result, $1 still points into the live buffer, so the following prints zzz rather than aaa, and if the string buffer gets realloced in the meantime, it could print out garbage:

$_ = 'aaa';
/(\w+)/g;
$_ = 'zzz';
print "[$1]\n";
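
Independently of how the core handles this, one defensive idiom is to copy a capture into a lexical straight after the match, before the target string has a chance to change. A minimal sketch (the variable name $saved is purely illustrative):

```perl
use strict;
use warnings;

$_ = 'aaa';
my $saved;
if (/(\w+)/g) {
    # Copy the capture into our own scalar immediately; $saved owns its
    # buffer and no longer depends on the string that was matched.
    $saved = $1;
}
$_ = 'zzz';
print "[$saved]\n";    # prints [aaa]
```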

Attempts to fix this in the past have tried to implement some sort of Copy-On-Write behaviour, but have come up against the difficulty of making an SV always honour COW in all circumstances and/or not making the SV itself "unusual". Also, the regex engine API itself matches against a string buffer not an SV, so you aren't guaranteed to always have a valid SV to mess with.

My approach to this has been to only copy the substring of the string buffer needed to cover $1, $&, etc. The mechanism (PL_sawampersand) used to detect whether $`, $&, $' have been seen in code has been updated to record each of the three variables separately. The code then uses the index range of any captures, plus which of $`, $&, $' are present, plus the presence or not of /p, to decide what part of the string to copy. In the case of

$_ = 'x' x 1_000_000;
1 while /(.)/g;

(with or without $&), the range is a single byte rather than a megabyte that gets copied a million times, and the code now runs in subsecond time. This means that the hack can be removed, and printing $1 no longer risks a segfault.
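
The range calculation described above can be modelled roughly as follows. This is only a sketch in Perl of the idea, not the C code in the core: the function name range_to_copy and its named arguments are hypothetical, and treating /p as if all three variables had been seen is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper (not perl's internal code): work out which [from, to)
# slice of the target string must be kept alive after a successful match.
#   len      => length of the target string
#   match    => [start, end) offsets of the whole match
#   captures => list of [start, end) offsets, one per capture that matched
#   seen     => which of 'prematch', 'match', 'postmatch' appear in the code
#   p_flag   => true if the pattern was compiled with /p
sub range_to_copy {
    my (%arg) = @_;
    my ($m_start, $m_end) = @{ $arg{match} };

    my %seen = map { $_ => 1 } @{ $arg{seen} || [] };
    # Assumption: /p needs all three parts, whatever else has been seen.
    @seen{qw(prematch match postmatch)} = (1, 1, 1) if $arg{p_flag};

    # Start with the offsets of every capture group.
    my @from = map { $_->[0] } @{ $arg{captures} || [] };
    my @to   = map { $_->[1] } @{ $arg{captures} || [] };

    # $& needs the whole match.
    if ($seen{match})     { push @from, $m_start; push @to, $m_end }
    # $` needs everything from the start of the string up to the match.
    if ($seen{prematch})  { push @from, 0;        push @to, $m_start }
    # $' needs everything from the end of the match to the end of the string.
    if ($seen{postmatch}) { push @from, $m_end;   push @to, $arg{len} }

    return (0, 0) unless @from;    # nothing to preserve
    return (min(@from), max(@to));
}

# One iteration of  1 while /(.)/g  on a 1_000_000-char string, at offset 5:
my %iter  = (len => 1_000_000, match => [5, 6], captures => [[5, 6]]);
my @plain = range_to_copy(%iter);
my @amp   = range_to_copy(%iter, seen => ['match']);
my @post  = range_to_copy(%iter, seen => ['match', 'postmatch']);
print "@plain | @amp | @post\n";    # prints 5 6 | 5 6 | 5 1000000
```

In this model, once $' is seen the range stretches to the end of the string again, so most of the buffer would still have to be copied on every match.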

It also means that having just $& in your source code may no longer necessarily be the huge performance hog it used to be, although having $` and $' too will drag things down to previous levels.
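
Code that would rather not mention $`, $& or $' at all can recover the same text from the documented offset arrays @- and @+. A minimal sketch; the sample string and variable names are just for illustration:

```perl
use strict;
use warnings;

my $str = 'foo bar baz';
my ($pre, $match, $post);
if ($str =~ /b\w+/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $pre   = substr($str, 0, $-[0]);                # same text as $`
    $match = substr($str, $-[0], $+[0] - $-[0]);    # same text as $&
    $post  = substr($str, $+[0]);                   # same text as $'
}
print "[$pre][$match][$post]\n";    # prints [foo ][bar][ baz]
```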

In summary:

$_ = 'x' x 1_000_000; 1 while /(.)/g;

before: fast and segfaulty
now: fast and non-segfaulty

$&;
$_ = 'x' x 1_000_000; 1 while /(.)/g;

before: slow and non-segfaulty
now: fast and non-segfaulty

This is all working and tested, but hasn't been pushed out for smoking/merging yet, since I haven't fixed the original bug yet, i.e. the case of:

my $text = "a"; $text =~ m/(.(?{ $text .= "x" }))*/;

Over the last two months I have averaged 6 hours per week :-(.

As of 2012/08/31: since the beginning of the grant:

129.7 weeks
1353.2 total hours
10.4 average hours per week

There are 343 hours left on the grant.

Report for period 2012/07/01 to 2012/08/31 inclusive

Summary

Effort (HH:MM):

6:25 diagnosing bugs
46:45 fixing bugs
0:00 reviewing other people's bug fixes
0:00 reviewing ticket histories
0:00 review the ticket queue (triage)
-----
53:10 Total

Numbers of tickets closed:

3 tickets closed that have been worked on
0 tickets closed related to bugs that have been fixed
0 tickets closed that were reviewed but not worked on (triage)
-----
3 Total

Short Detail

45:00 [perl #3634] Capture corruption through self-modying regexp (?{...})
3:00 [perl #114242] TryCatch toke error with (??{$any}) $ws \\] )? @
1:00 [perl #114302] Bleadperl v5.17.0-408-g3c13cae breaks DGL/re-engine-RE2-0.10.tar.gz
2:10 [perl #114356] REGEXPs have massive reference counts
2:00 [perl #114378] cond_signal does not wake up a thread

About this Entry

This page contains a single entry by Karen published on September 12, 2012 1:01 PM.