Dave Mitchell writes:
As per my grant conditions, here is a report for the July/August period.
I spent a bit of time fixing a few issues caused by my rewriting of the /(?{})/ implementation, then started to look into the last unclosed ticket still attached to the re_eval meta-ticket. This concerns code within (?{}) that modifies the string being matched against, and generally causes assertion failures or coredumps:
my $text = "a"; $text =~ m/(.(?{ $text .= "x" }))*/;
While trying to understand what's going on, I ended up delving into the issue of how and when perl makes a copy of the string buffer in order to make $1, $& etc. continue to show the right value even if the string is subsequently changed. It turns out that in some circumstances this can have a huge performance penalty. For example, the following code takes several minutes to run, since it mallocs and copies a 1MB buffer a million times:
$&;
$_ = 'x' x 1_000_000;
1 while /(.)/g;
If you remove the $&, it runs fast (<1s), but this is only because pp_match has a special hack added that says "even if the pattern contains captures, in the presence of /g don't bother copying the string buffer". So the following prints zzz rather than aaa. And if the string buffer gets realloced in the meantime, it could print out garbage:
$_ = 'aaa';
/(\w+)/g;
$_ = 'zzz';
print "[$1]\n";
Attempts to fix this in the past have tried to implement some sort of Copy-On-Write behaviour, but have come up against the difficulty of making an SV always honour COW in all circumstances and/or not making the SV itself "unusual". Also, the regex engine API itself matches against a string buffer not an SV, so you aren't guaranteed to always have a valid SV to mess with.
My approach to this has been to only copy the substring of the string buffer needed to cover $1,$&, etc. The mechanism (PL_sawampersand) used to detect whether $`,$&,$' have been seen in code has been updated to log each of the three variables separately. The code then uses the index range of any captures, plus which of $`,$&,$' are present, plus the presence or not of /p, to decide what part of the string to copy. In the case of
$_ = 'x' x 1_000_000;
1 while /(.)/g;
(with or without $&), the range is a single byte rather than a megabyte that gets copied a million times, and the code now runs in subsecond time. This means that the hack can be removed, and printing $1 no longer risks a segfault.
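To illustrate the range calculation described above, here is a rough model written in Perl. It is only a sketch: the real logic lives in perl's C source, and copy_range, its arguments and the hash keys (prematch, match, postmatch, p) are names invented for this illustration. It also assumes that /p is treated as if all three of $`, $& and $' had been seen, which is a guess at the behaviour rather than something stated here.

```perl
use strict;
use warnings;

# Hypothetical model, not perl's actual implementation.
#   $len      - length of the string being matched
#   $captures - array ref of [start, end] offsets; element 0 is the whole match
#   $seen     - hash ref: which of prematch/match/postmatch appear in the
#               program, plus p for the /p flag
# Returns the (start, end) offsets that would need copying, or an empty list.
sub copy_range {
    my ($len, $captures, $seen) = @_;
    my ($min, $max);

    # Grow the [min, max) window so that it covers [start, end).
    my $widen = sub {
        my ($start, $end) = @_;
        $min = $start if !defined $min || $start < $min;
        $max = $end   if !defined $max || $end   > $max;
    };

    my ($m_start, $m_end) = @{ $captures->[0] };

    # $1, $2, ... always need their own ranges preserved.
    for my $i (1 .. $#{$captures}) {
        next unless defined $captures->[$i];    # group did not participate
        $widen->(@{ $captures->[$i] });
    }

    # Assumption: /p makes all three match variables available.
    my $all = $seen->{p};
    $widen->(0,        $m_start) if $all || $seen->{prematch};
    $widen->($m_start, $m_end)   if $all || $seen->{match};
    $widen->($m_end,   $len)     if $all || $seen->{postmatch};

    return defined $min ? ($min, $max) : ();
}

# One iteration of /(.)/g over a 1_000_000 character string, matching at offset 5.
my @captures = ([5, 6], [5, 6]);
print join(' ', copy_range(1_000_000, \@captures, {})), "\n";                # 5 6
print join(' ', copy_range(1_000_000, \@captures, { match => 1 })), "\n";    # 5 6
print join(' ', copy_range(1_000_000, \@captures, { prematch => 1 })), "\n"; # 0 6
```

In this model the first two calls need only one byte, while the third has to cover everything from the start of the string up to the end of the match.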
It also means that having just $& in your source code may no longer necessarily be the huge performance hog it used to be, although having $` and $' too will drag things down to previous levels.
In summary:
$_ = 'x' x 1_000_000; 1 while /(.)/g;
before: fast and segfaulty
now: fast and non-segfaulty
$&;
$_ = 'x' x 1_000_000; 1 while /(.)/g;
before: slow and non-segfaulty
now: fast and non-segfaulty
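The two cases in the summary can be compared on a given perl with a small timing script along these lines. This is a sketch under a few assumptions: time_global_match is a made-up helper name, the string is kept much shorter than a million characters so that the slow case stays bearable, and the $& is introduced through a string eval so that the first measurement happens before the compiler has seen it.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: walk a string of $len characters with /(.)/g and
# return the number of matches plus the elapsed wall-clock seconds.
sub time_global_match {
    my ($len) = @_;
    my $str   = 'x' x $len;
    my $count = 0;
    my $start = time;
    $count++ while $str =~ /(.)/g;
    return ($count, time - $start);
}

my $len = 20_000;

my ($n1, $t1) = time_global_match($len);
printf "before \$& is seen: %d matches in %.3fs\n", $n1, $t1;

# Compile code that mentions $& only now, at run time.
eval q{ my $unused = $&; 1 } or die $@;

my ($n2, $t2) = time_global_match($len);
printf "after  \$& is seen: %d matches in %.3fs\n", $n2, $t2;
```

On a perl without the change described here, one would expect the second figure to be noticeably larger than the first; with the change, the two should be close. No timings are quoted, since they depend on the perl version and the machine.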
This is all working and tested, but hasn't been pushed out for smoking/merging, since I haven't yet fixed the original bug, i.e. this case:
my $text = "a"; $text =~ m/(.(?{ $text .= "x" }))*/;
Over the last two months I have averaged 6 hours per week :-(.
As of 2012/08/31: since the beginning of the grant:
129.7 weeks
1353.2 total hours
10.4 average hours per week
There are 343 hours left on the grant.
Report for period 2012/07/01 to 2012/08/31 inclusive
Summary
Effort (HH::MM):
6:25 diagnosing bugs
46:45 fixing bugs
0:00 reviewing other people's bug fixes
0:00 reviewing ticket histories
0:00 review the ticket queue (triage)
-----
53:10 Total
Numbers of tickets closed:
3 tickets closed that have been worked on
0 tickets closed related to bugs that have been fixed
0 tickets closed that were reviewed but not worked on (triage)
-----
3 Total
Short Detail
45:00 [perl #3634] Capture corruption through self-modying regexp (?{...})
3:00 [perl #114242] TryCatch toke error with (??{$any}) $ws \\] )? @
1:00 [perl #114302] Bleadperl v5.17.0-408-g3c13cae breaks DGL/re-engine-RE2-0.10.tar.gz
2:10 [perl #114356] REGEXPs have massive reference counts
2:00 [perl #114378] cond_signal does not wake up a thread

