_Dave Mitchell writes:_
I spent the majority of my time last month working on a function called Perl_re_intuit_start() in the regex engine.
This function is one of the major optimisations in the regex engine. For each compiled pattern, it is noted what is the longest fixed and floating strings that *must* appear in the string for the match to succeed, along with what char class (if any) the pattern must start with.
Using these heuristics, re_intuit_start() tries to quickly scan the string (using fast Boyer-Moore) to locate these known substrings. This allows us to either quickly reject the string as not matching, or to find a minimum start position in the string, before handing off to the normal NFA automaton.
Unfortunately it's a bit flaky when it comes to handling utf8-encoded strings, especially long ones. There are lots of bits of code that do length calculations on bits of the string, for example along the lines of "if the maximum start position plus the length of the floating string is less than the end of the string less minlen chars, then..." which ends up doing a lot of calculations of utf8 lengths of the substring between two pointers. Which for long strings becomes grindingly slow.
Also, chunks of the code are marked with reassuring comments like:
/* XXXX May be hopelessly wrong for UTF... */
The problem with this function being mainly an optimisation, is that as long as it doesn't falsely reject valid strings or choose to high a starting point, it won't fail any tests: even if it makes the match unnecessarily slow. So errors are hard to spot.
The code's also hard to understand. (Even ignoring 'fail:' and variants, it has 17 separate labels that can be goto'ed).
So I've spent most of last month going through this function, trying to understand how bits of it work, fixing issues, and generally cleaning up, refactoring and better documenting the code. I'm nowhere near finished yet.
bq. 5:32 RT#120426 Strange and unwarranted underflow in string-to-number
0:17 RT#120675 Unexpected tainting via regex using locale
38:33 RT#120692 Slow global pattern match with input from utf8
7:01 fix smoke issues
2:07 look at Compress-Raw-Zlib and -g3 issue
14:14 process p5p mailbox
**67:44 Total (HH::MM)**
bq. 4.4 weeks
67.7 total hours
15.3 average hours per week
As of 2013/12/31: since the beginning of the grant:
bq. 11.3 weeks
183.3 total hours
16.2 average hours per week
There are 217 hours left on the grant.