Fixing Perl5 Core Bugs: Report for Month 31

No Comments

Dave Mitchell writes:

As per my grant conditions, here is a report for the September period.

This month I mostly spent my time:

1) Making regexes only copy the needed part of a string.

In general when a regex is executed, and it is seen that captures exist (either explicitly with /(...)/ or implicitly in the presence of $`, $& or $'), then perl makes a copy of the string that was matched against, so that subsequent changes to the string won't affect $1 etc.

That makes things like the following work:

$_ = "abc";
/(\w)/;
$_ = 123";
ok($_ eq "a");

However in the following code, the copying on each match causes the match to go quadratic, copying a 1Mb string a million times:

$_ = 'x' x 1_000_000;
while (/(.)/) { ... }

So in this case perl has always "optimised" that particular construct (//g in scalar context) to skip the string copying. This makes it fast but with potentially incorrect $1 etc values.

In the presence of $& etc, this "optimisation" is skipped, making the code slow but correct.

My work did two things. First, it split the detection of $`, $& and $& into three separate flags, so having just one of the three in your code won't necessarily do as much damage as before; secondly, when the string buffer is copied, only a single substring of it is copied: enough to cover the union of $1, $2, ... and whichever of $`, $& and $' have been seen. With this in place, this code

$_ = 'x' x 1_000_000;
while (/(.)/) { ... }

(with or without $& present) will just copy a single character each time; so it's now both fast and correct. If $` say were present too, then it would be a lot slower, copying an incrementally larger string each time.

2) Making the regex engine handle non-null-terminated strings.

Traditionally, strings created from within perl have a hidden null byte at the end to make it easier to pass them to routines that expect C-style null-terminated strings. However, strings from other sources (such as XS) may not. It turns out that from its beginnings in perl 3, the engine has always worked on the assumption that any string will have an extra zero byte at the end, and relies on this for a number of things, from simply stopping at the right point, to various match types such as \b only working correctly at string ends due to this hidden byte.

I've overhauled the engine so that as far as I'm aware, it no longer accesses any byte after the end of the string for any reason. It turns out that the code which failed this assumption was fairly pervasive, and there's no guarantee that I got everything. I do know however that none of the tests under t/re/ trigger an overrun, since I ran valgrind on all those tests with a modified engine that realloced every string it was about to match against.

Note that both (1) and (2) were necessary prerequisites to fixing the self-modifying string bug:

$s =~ /.({ $s .= 'x' })/

However, I've put off working on a final fix for that for while, as I've got a bit sick and tired of working the regex engine, and turned my attention to a bit of optimising:

3) Turn multiple PAD ops into a single PADRANGE op

Every declaration or use of a 'my' lexical variable involves the execution of a PADSV, PADAV or PADHV op. Since these often come in groups with targets in a consecutive range, it makes sense under some circumstances to collect them all into a single op (which will also carry out some ancilliary tasks). For example, the simple declaration:

my ($a,@b,%c);

is currently compiled into 5 ops:

pushmark
padsv[$a]
padav[@b]
padhv[%c]
list

With my current work, this now compiles into a single op:

padrange[$a;@b;%c]

As well as being more efficient in general by avoiding the overhead of executing several lightweight ops, in this particular case (my declaration in void context), it skips altogether pushing those three vars onto the stack then popping them again, which is what used to pointlessly happen.

Although not done yet, there is scope for further optimisations; for example the 'my' declaration above causes a SAVECLEARSV to be pushed onto the save stack by each of the PAD ops. My padrange code still does 3 separate SAVECLEARSV pushes in a loop, but that could be replaced with a single SAVECLEARSV_RANGE save type, saving time on save and restore.

Over the last month I have averaged 22 hours per week

As of 2012/09/30: since the beginning of the grant:

134.0 weeks
1442.2 total hours
10.8 average hours per week

There are 258 hours left on the grant.

Report for period 2012/09/01 to 2012/09/30 inclusive

Summary

Effort (HH::MM):

0:00 diagnosing bugs
89:00 fixing bugs
0:00 reviewing other people's bug fixes
0:00 reviewing ticket histories
0:00 review the ticket queue (triage)
-----
89:00 Total

Numbers of tickets closed:

0 tickets closed that have been worked on
0 tickets closed related to bugs that have been fixed
0 tickets closed that were reviewed but not worked on (triage)
-----
0 Total

Short Detail

57:40 [perl #3634] Capture corruption through self-modying regexp (?{...})
31:20 [perl #114536] Optimize assigning to scalars from @_

Leave a comment

About TPF

The Perl Foundation - supporting the Perl community since 2000. Find out more at www.perlfoundation.org.

About this Entry

This page contains a single entry by Karen published on October 3, 2012 1:06 AM.

2012Q4 Call for Grant Proposals was the previous entry in this blog.

Improving Perl 5: Grant Report for Month 13 is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

OpenID accepted here Learn more about OpenID
Powered by Movable Type 4.38