Fixing Perl5 Core Bugs: Report for Month 31
Wed, 03-Oct-2012 by
Karen Pauley
edit post
_Dave Mitchell writes:_
As per my grant conditions, here is a report for the September period.
This month I mostly spent my time:
1) Making regexes only copy the needed part of a string.
In general when a regex is executed, and it is seen that captures exist (either explicitly with /(...)/ or implicitly in the presence of $`, $& or $'), then perl makes a copy of the string that was matched against, so that subsequent changes to the string won't affect $1 etc.
That makes things like the following work:
bc. $_ = "abc";
/(\w)/;
$_ = 123";
ok($_ eq "a");
However in the following code, the copying on each match causes the match to go quadratic, copying a 1Mb string a million times:
bc. $_ = 'x' x 1_000_000;
while (/(.)/) { ... }
So in this case perl has always "optimised" that particular construct (//g in scalar context) to skip the string copying. This makes it fast but with potentially incorrect $1 etc values.
In the presence of $& etc, this "optimisation" is skipped, making the code slow but correct.
My work did two things. First, it split the detection of $`, $& and $& into three separate flags, so having just one of the three in your code won't necessarily do as much damage as before; secondly, when the string buffer is copied, only a single substring of it is copied: enough to cover the union of $1, $2, ... and whichever of $`, $& and $' have been seen. With this in place, this code
bc. $_ = 'x' x 1_000_000;
while (/(.)/) { ... }
(with or without $& present) will just copy a single character each time; so it's now both fast and correct. If $` say were present too, then it would be a lot slower, copying an incrementally larger string each time.
2) Making the regex engine handle non-null-terminated strings.
Traditionally, strings created from within perl have a hidden null byte at the end to make it easier to pass them to routines that expect C-style null-terminated strings. However, strings from other sources (such as XS) may not. It turns out that from its beginnings in perl 3, the engine has always worked on the assumption that any string will have an extra zero byte at the end, and relies on this for a number of things, from simply stopping at the right point, to various match types such as \b only working correctly at string ends due to this hidden byte.
I've overhauled the engine so that as far as I'm aware, it no longer accesses any byte after the end of the string for any reason. It turns out that the code which failed this assumption was fairly pervasive, and there's no guarantee that I got everything. I do know however that none of the tests under t/re/ trigger an overrun, since I ran valgrind on all those tests with a modified engine that realloced every string it was about to match against.
Note that both (1) and (2) were necessary prerequisites to fixing the self-modifying string bug:
bc. $s =~ /.({ $s .= 'x' })/
However, I've put off working on a final fix for that for while, as I've got a bit sick and tired of working the regex engine, and turned my attention to a bit of optimising:
3) Turn multiple PAD ops into a single PADRANGE op
Every declaration or use of a 'my' lexical variable involves the execution of a PADSV, PADAV or PADHV op. Since these often come in groups with targets in a consecutive range, it makes sense under some circumstances to collect them all into a single op (which will also carry out some ancilliary tasks). For example, the simple declaration:
bc. my ($a,@b,%c);
is currently compiled into 5 ops:
bc. pushmark
padsv[$a]
padav[@b]
padhv[%c]
list
With my current work, this now compiles into a single op:
bc. padrange[$a;@b;%c]
As well as being more efficient in general by avoiding the overhead of executing several lightweight ops, in this particular case (my declaration in void context), it skips altogether pushing those three vars onto the stack then popping them again, which is what used to pointlessly happen.
Although not done yet, there is scope for further optimisations; for example the 'my' declaration above causes a SAVECLEARSV to be pushed onto the save stack by each of the PAD ops. My padrange code still does 3 separate SAVECLEARSV pushes in a loop, but that could be replaced with a single SAVECLEARSV_RANGE save type, saving time on save and restore.
Over the last month I have averaged 22 hours per week
As of 2012/09/30: since the beginning of the grant:
bq. 134.0 weeks
1442.2 total hours
10.8 average hours per week
There are 258 hours left on the grant.
Report for period 2012/09/01 to 2012/09/30 inclusive
**Summary**
Effort (HH::MM):
bq. 0:00 diagnosing bugs
89:00 fixing bugs
0:00 reviewing other people's bug fixes
0:00 reviewing ticket histories
0:00 review the ticket queue (triage)
-----
**89:00 Total**
**Numbers of tickets closed:**
bq. 0 tickets closed that have been worked on
0 tickets closed related to bugs that have been fixed
0 tickets closed that were reviewed but not worked on (triage)
-----
**0 Total**
**Short Detail**
bq. 57:40 [perl #3634] Capture corruption through self-modying regexp (?{...})
31:20 [perl #114536] Optimize assigning to scalars from @_
Comments (0)