The changes are extremely valuable and will lead to widespread benefit to Perl programmers. We're quite excited about this grant. There are few people more qualified than Nicholas to do this work and the price is a bargain.
On a personal note, I'm particularly excited about his plan to save the regular expression engine state. I've been bit by bugs caused by the regex engine not being re-entrant and I'm hoping this will fix my bugs. Read on for the full text of the grant application.
Nicholas Clark
Improve Perl 5
Perl 5 maintenance and development is proceeding steadily, but there are a number of stubborn bugs that no volunteer has had the time to work on. This project will ensure that these bugs are resolved, as well as providing resource to develop new features for both 5.8.x and 5.10.
Perl has seen many maintenance releases over the past few years, and improvements have been contributed both for 5.10 to be, and 5.8.x. However, the core Perl development is entirely done by volunteer labour, with no significant time sponsorship by employers, which means that not everything is addressed that would be addressed in an ideal world.
This grant proposal is not for the regular work that will lead to 5.8.9 or 5.8.10. Instead it is intended to resolve long-term core bugs that have persisted unfixed for several maintenance releases and affect major Perl modules, and to complete unglamorous tasks that will benefit all users of Perl 5.10.
Last year I restructured how the data is stored in the scalar heads and bodies and reduced the memory usage for regular hashes and regular arrays considerably. On a system with 32 bit integers and pointers AVs and HVs now use 36 bytes, rather than the previous 56 and 60, a saving of over 33%. I also removed 1 level of pointer indirection on various types of lookups, reducing the chance of CPU cache misses, a subtle cause of program slowness.
I believe similar space savings can be made for GVs and CVs (and therefore PVLVs and PVFMs). As well as direct structure rearrangement, both have pointers to C strings which are malloc()ed as duplicates of existing strings, so there is potential to use the shared string infrastructure to save memory. In addition, moving the GP pointer to the SV head union would remove one level of pointer indirection for many common GV operations, which would reduce CPU cache misses. In his ``Illustrated Perl Guts'' Gisle Aas notes: ``Each stash is at least 4 levels deep and each glob is 3 levels, giving at least 24 pointer dereferences to access the data in the $foo::bar::baz variable from defstash.'' I'd argue that this is part of Larry's original design that is not so well suited to modern CPU trends, and so these changes to GVs would reduce the dereferences in that example by 3.
Based on my experiences with the AV and HV restructuring, I estimate that GVs and CVs would each take about 5 days of full time work.
These changes would only be applicable to 5.10.
Currently the location of Perl's library paths has to be specified at Configure time. This is less than flexible for anyone who wants to make a binary distribution that can be installed to an arbitrary location. The favoured workaround is to create an installer that patches paths inside the Perl binary. This is not ideal, as it precludes a simple ``untar and go'' package, and doesn't permit the distribution to be moved after installation.
All the code changes to enable the Perl binary to find the libraries relative to itself are already in place, tested and working. However, they aren't yet usable because the Configure script has no means of allowing the user to specify and enable the option.
All that is required to get this working is additional logic in Configure. The changes needed aren't entirely trivial, as they interact with the existing logic that lets the install step write to a different path from the final read location. I estimate 3 days are needed, 1 each for design, implementation and testing.
These changes can be merged back to 5.8.x.
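As a rough illustration of what finding the libraries relative to the binary involves, the following C sketch joins the directory of the executable with a relative library path. It is not the code already in the Perl core. The function name `lib_path_relative_to` and the example paths are hypothetical, and how the binary discovers its own path is deliberately left out.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<directory of exe_path>/<rel>" into out.
 * Returns 0 on success, -1 if exe_path has no directory part or the
 * result does not fit in out_size bytes. */
int lib_path_relative_to(const char *exe_path, const char *rel,
                         char *out, size_t out_size)
{
    const char *slash = strrchr(exe_path, '/');
    int written;

    if (slash == NULL)
        return -1;

    /* Copy everything before the last '/', then append the relative part. */
    written = snprintf(out, out_size, "%.*s/%s",
                       (int)(slash - exe_path), exe_path, rel);
    if (written < 0 || (size_t)written >= out_size)
        return -1;
    return 0;
}
```

With a binary at the assumed location /opt/perl/bin/perl and a relative part of ../lib/perl5, the result is /opt/perl/bin/../lib/perl5, which stays valid if the whole tree is moved.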
Currently unreachable code that contains constant expressions that cannot be evaluated will fail at compile time. For example:
    $ ./perl -le 'if (0) {0/0}'
    Illegal division by zero at -e line 1.
The consensus is that this is a bug, because constant folding is merely an optimisation, and therefore should not affect program behaviour in any way other than resource usage.
The intent is to wrap the constant folding code in an error handler, so that if an exception is thrown, folding is aborted and the optree retained. Because the implementation of the exception handling has been simplified in 5.9.x, I expect that these changes would only take about 1 day to implement there, but may well take 3 further days to backport into 5.8.x's framework. In the process I should be able to assess how to merge Marcus Holland-Moritz's simplified exception macros back to 5.8.x.
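The idea of wrapping folding in an error handler can be sketched in plain C. This is a simplified model, not the Perl core's code: the `div_op` structure and the `try_fold` function are hypothetical, and standard `setjmp`/`longjmp` stands in for the core's exception macros.

```c
#include <setjmp.h>

/* Hypothetical, simplified op: a division of two integer constants. */
typedef struct {
    int lhs;
    int rhs;
    int folded;   /* 1 if 'value' holds the precomputed result */
    int value;
} div_op;

static jmp_buf fold_trap;   /* where an error raised during folding lands */

/* Evaluate the division; signals an error by jumping to fold_trap. */
static int eval_div(const div_op *op)
{
    if (op->rhs == 0)
        longjmp(fold_trap, 1);
    return op->lhs / op->rhs;
}

/* Try to fold the op at compile time. Returns 1 if folded, 0 if the
 * evaluation failed and the op was left untouched for run time. */
int try_fold(div_op *op)
{
    if (setjmp(fold_trap) != 0) {
        /* Folding raised an error: abort the optimisation, keep the op. */
        op->folded = 0;
        return 0;
    }
    op->value = eval_div(op);
    op->folded = 1;
    return 1;
}
```

In this model a failed fold is not an error in itself. The op simply stays in the tree, so the division by zero is only reported if that code is ever reached at run time.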
For Perl 5.8.1, Jarkko added code to cache character offset to byte offset for strings. The caching code is fixed to store at most 2 positions, but the implementation constrains these to be related, so only substr can really benefit from more than one. Although the caching code is an optimisation, it is not optional, and several bugs have been found in it across several Perl versions. There is no certainty that all have been found yet, and no easy way to check.
I propose re-writing the caching code to make it optional. This way, if bugs are found in the field, end users have the option of disabling the cache to get the right answer, instead of being forced to have the wrong answer quickly. The caching code should also be able to run in ``assert'' mode for debugging purposes, where all cached answers are double checked against a full scan. Also, by changing the design, I believe that it will be trivial to allow the cache to be flexible in the number of character/byte pairs that it stores, on a string by string basis.
These changes should take about 5 days, and can be merged back to 5.8.x.
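A minimal C sketch of an optional cache with an ``assert'' mode follows. It is only a model of the proposal, not the core's implementation: the `ustring` type, the `cache_mode` names and `char_to_byte` are all hypothetical, the cache holds a single character/byte pair, and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache modes; the names are illustrative only. */
typedef enum { CACHE_OFF, CACHE_ON, CACHE_ASSERT } cache_mode;

typedef struct {
    const unsigned char *bytes;  /* UTF-8 data, assumed well formed */
    size_t len;                  /* length in bytes */
    int    cache_valid;
    size_t cached_char;          /* one cached character offset ... */
    size_t cached_byte;          /* ... and its byte offset */
} ustring;

/* Advance from byte offset 'byte' over 'nchars' characters. */
static size_t skip_chars(const ustring *s, size_t byte, size_t nchars)
{
    while (nchars > 0 && byte < s->len) {
        byte++;
        /* Skip continuation bytes (10xxxxxx). */
        while (byte < s->len && (s->bytes[byte] & 0xC0) == 0x80)
            byte++;
        nchars--;
    }
    return byte;
}

/* Map a character offset to a byte offset, using the cache as 'mode' allows. */
size_t char_to_byte(ustring *s, size_t char_off, cache_mode mode)
{
    size_t result;

    if (mode != CACHE_OFF && s->cache_valid && s->cached_char <= char_off) {
        /* Resume the scan from the cached pair instead of the start. */
        result = skip_chars(s, s->cached_byte, char_off - s->cached_char);
        if (mode == CACHE_ASSERT)
            assert(result == skip_chars(s, 0, char_off)); /* double check */
    } else {
        result = skip_chars(s, 0, char_off);
    }

    if (mode != CACHE_OFF) {
        s->cache_valid = 1;
        s->cached_char = char_off;
        s->cached_byte = result;
    }
    return result;
}
```

With `CACHE_OFF` every lookup is a full scan and the cache is never touched, which is the escape hatch described above. `CACHE_ASSERT` pays for both the cached and the full scan and aborts if they disagree.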
In theory the Perl core should already fully support IPv6. However, it's not clear how robust the support is; in particular whether the address buffers used by all the socket API calls are correct. As I have access to 2 IPv6 enabled machines I can test that all socket functions work equally well with IPv6 as with IPv4.
The Socket6 module on CPAN makes reference to the KAME project, which provided an IPv6 stack for BSD variants, which incorporates a Perl source tree (http://www.kame.net/dev/cvsweb.cgi/perl5/), but the Perl 5 Changes files make no mention of any changes being merged back to core. Hence as part of the audit I will check what differences there are between their tree and the version of Perl 5 they branched from.
These changes will be merged back to 5.8.x.
Perl 5.6 added the ability to put code references into @INC, which allows arbitrary code to be run before creating a file handle to pass back to require. An additional undocumented feature was the ability to pass back a source filter as a second return value. The code for this filter exists in core, with comments that suggest that some bugs are unresolved. The functionality is also provided by Filter::Simple, which entered the core for 5.8.0.
If the existing filter functionality were migrated to Filter::Simple, then the duplicate code could be removed from the C source, reducing size and maintenance liability.
These changes will be merged back to 5.8.x.
The regular expression engine stores its state in global variables, in the Perl interpreter structure. In turn, the engine needs to re-enter the Perl interpreter at times, for example to load UTF-8 data. It does this by saving the state of all the globals, variable by variable, inside Perl_save_re_context. This means that at scope exit there are 38 or more separate actions to perform.
The only loop inside Perl_save_re_context is for saving $1 etc. It should be possible to replace this with a save of the underlying data that $1 etc read from. With this changed, the entire regexp state can be stored in one fixed-size custom save structure, and restored in one action. In itself, this is not a huge speed or size gain, but it can be immediately followed by reorganising the interpreter structure to use this state structure rather than individual variables. The state structure in the interpreter structure can then be migrated out, with the interpreter structure holding a pointer to the current global state. The immediate benefit is that re-entering the regexp engine becomes a much less intensive operation. The broader benefit is that it will become much clearer what work remains to change the regexp engine to pass all state around in this fashion, and therefore be refactored from recursive to iterative.
I estimate that this will take 4 days. The cleanup and the state structure on the save stack can be merged back to 5.8.x.
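The difference between saving variable by variable and saving one fixed-size structure can be sketched in C as below. The `re_state` and `interp` types and their fields are hypothetical stand-ins, far smaller than the real engine state, and the save stack itself is not modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for the engine's state; the real list is longer. */
typedef struct {
    const char *input;      /* string being matched */
    size_t      pos;        /* current match position */
    size_t      starts[4];  /* capture group start offsets */
    size_t      ends[4];    /* capture group end offsets */
    int         flags;
} re_state;

/* The interpreter holds a pointer to the current state rather than
 * one global variable per field. */
typedef struct {
    re_state *current;
} interp;

/* Save the whole state in one action: a single structure copy. */
re_state save_re_state(const interp *in)
{
    return *in->current;
}

/* Restore it in one action at scope exit. */
void restore_re_state(interp *in, const re_state *saved)
{
    *in->current = *saved;
}
```

Because the capture offsets live inside the structure, they are saved by the same copy, which corresponds to replacing the per-variable loop with a save of the underlying data.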
There have been several changes made to 5.9.x that fix bugs, which are non-trivial to integrate to 5.8.x. All require at least 1 day of solid concentration to assess, and some are over a year old. Given that multiple 5.8.x releases have passed without these fixes being integrated, it is clear that whenever I have a whole day spare to work on 5.8.x, they are never top priority, because work on them never gets done. Hence I believe that integration will never actually happen unless it is made an explicit objective.
Bug 33734 (https://rt.perl.org/rt3/Ticket/Display.html?id=33734) reports that various numeric unpack operators expose the internal representation of the string (bytes or UTF-8). The bug is valid and tricky to solve in an acceptably backwards-compatible fashion, given the documented and supported inconsistencies in pack's handling of UTF-8.
Ton Hospel supplied a series of patches for pack to fix this and other bugs he found, clean up the code in pp_pack.c, and add a new ``W'' format specifier. However, the changes he made are not fully safe to go into the maintenance branch. The pack code is effectively currently forked between 5.9.x and 5.8.x, with the result that all subsequent cleanups and fixes in it haven't been merged to 5.8.x.
Prior to Ton's changes, the pack code was identical in both, with a minimal set of conditional compile directives to provide the functional differences between 5.8 and 5.9's supported pack features. I estimate that it will take 3 days total to assess the best compromise behaviour for 5.8.x, adapt the code from 5.9.x to provide this, and thereby heal the fork.
Bug 3038 (https://rt.perl.org/rt3/Ticket/Display.html?id=3038) is that qr/(.$)/m is not the same as /(?m-xis:(.$))/ with /g. A patch was applied to 5.9.x back in November 2004, but as it touches C code that was changed when $* was removed, it doesn't integrate into 5.8.x.
To fix several bugs, Dave Mitchell made two significant changes to the handling of magic in 5.9.x. Change 24942 makes a distinction between container magic and value magic, so that local on a variable with magic works more as the user would expect. Change 26569 adds a local slot to the magic vtable. There is no obvious reason why these changes are incompatible with 5.8.x, and it would be beneficial to integrate them to 5.8.x as they are prerequisites for several bug fixes. However, it would be imprudent to merge them without investigation, as it's possible that they introduce subtle behaviour changes which would be inappropriate for the maintenance branch, in effect introducing new bugs.
I estimate that it would take 3 days to review the changes, and if appropriate merge them across with any modifications necessary to ensure that binary compatibility is retained.
Changes 25138, 21546, 21664 and 21833 cleaned up the handling of C's argv[0], and how it is updated when the Perl program alters $0. However, these changes broke PAR's regression tests, so were never integrated to 5.8.x. Since then no-one has had the time or desire to dig down to understand why this was so. I estimate that it will take 2 days to figure this out, fix the problems, and get these changes integrated.
Bug 34297 (https://rt.perl.org/rt3/Ticket/Display.html?id=34297) reports a problem with returning Unicode strings via overloaded stringification.
The cause of this specific bug is in pp_length, which is calling DO_UTF8(sv) to switch between UTF8 and non-UTF8 code paths, prior to calling SvPV. This is a subtle bug, because the UTF8 flag will not be set on references to objects with overloaded stringification until after it is called for the first time in SvPV(). Fixing this particular bug is simple. However, a cursory inspection of pp.c alone suggests that it is quite widespread. pp_ucfirst/pp_lcfirst, pp_uc, pp_lc, pp_index and pp_rindex all seem to share the same bug, although it's likely that in some cases it will only trigger with 8 bit locales.
I propose to spend 3 days reviewing the Perl source code to find all instances of this systematic bug, and fix them. These changes will be merged back to 5.8.x.
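The ordering problem can be modelled in a few lines of C. This is an illustrative model only and does not use the real SV API: the `scalar` type, `stringify`, `length_buggy` and `length_fixed` are hypothetical, with `stringify` standing in for the first call that sets the UTF-8 flag.

```c
#include <stddef.h>

/* Hypothetical scalar whose UTF-8 flag is only accurate once the
 * overloaded stringification has run for the first time. */
typedef struct {
    const char *overloaded_text;  /* what stringification returns */
    int         text_is_utf8;     /* whether that text is UTF-8 */
    const char *pv;               /* string value, NULL until stringified */
    int         utf8;             /* flag, unset until stringified */
} scalar;

/* Stand-in for fetching the string: runs the overload and sets the flag. */
static const char *stringify(scalar *sv)
{
    sv->pv = sv->overloaded_text;
    sv->utf8 = sv->text_is_utf8;
    return sv->pv;
}

/* Count characters if utf8 is set, bytes otherwise. */
static size_t count_chars(const char *s, int utf8)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (!utf8 || ((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Buggy order: the flag is read before the string is fetched. */
size_t length_buggy(scalar *sv)
{
    int utf8 = sv->utf8;            /* too early, flag not yet set */
    const char *s = stringify(sv);
    return count_chars(s, utf8);
}

/* Fixed order: fetch the string first, then consult the flag. */
size_t length_fixed(scalar *sv)
{
    const char *s = stringify(sv);
    return count_chars(s, sv->utf8);
}
```

In this model the buggy version gives a byte count on the first call and a character count on later calls, which is why the fault is easy to miss in testing.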
Bug 34925 (https://rt.perl.org/rt3/Ticket/Display.html?id=34925) reports that reblessing a reference from a class with overloading into one without causes problems. This is not an abstract problem, as it affects code using Class::DBI. The solution to this would be to store the overloading flag on the referent rather than the reference. In turn, this involves rejigging the flag usage on hashes to free up a flag bit. I estimate that this work would take 3 days, and would probably be suitable for merging to 5.8.x.
As part of the Perl 6 project, Larry Wall has been working on a ``Perl 5 to Perl 5'' translator, as the foundation for a Perl 5 to Perl 6 translator. He's been developing against the 5.9.2 release, modifying the tokeniser and the parser as needed to retain the extra information they used to throw away.
In the minutes of the design call, Larry reports that work is reaching the point where it's time to merge it back into the core, and suggests that it might be a more efficient use of time if someone freed him from the burden of doing this. (http://use.perl.org/~luqui/journal/28319)
Larry hasn't yet made his code public. However, there have been many minor changes to the Perl source since 5.9.2 was released, particularly Andy's work on consting. Therefore it's likely that the diff he generates will have significant conflicts against the core, and so most if not all of Larry's changes will need to be hand applied. My guess is that this could take anything up to 5 days.
Robin's implementation of switch also provided most of the remaining framework needed for supporting lexical pragmas. %^H, the hash used to store the state of user defined pragmas, now propagates into eval. The major part remaining is that the compile time state of %^H for a block is not accessible at run time from that block. Rafael has a patch, but it fails tests. My guess is that it would take about 3 days to get this passing all tests and good enough to be considered feature complete.
Changes were made between 5.8.5-RC1 and 5.8.5 to work round a problem with Tk and the UTF-8 flag (changes 23084 and 23085). At the time these were seen as a late bodge to keep things working for the release, but the changes were never revisited. The specific problem with the bodge was that it ended up with SvPOK being true on a tainted scalar, which defeats subsequent checks for tainting. This isn't good, but no-one has had the time to revisit the problem and resolve it properly. I estimate that it would take about 3 days to investigate, develop and test a proper solution.
Project Schedule
Three months full time (60 working days). I can start immediately. I need to start soon, else my finances require that I get a new full time job to pay the rent. This will not cut into the time I spend volunteering on 5.8.x.
Bio
I'm the current Perl 5.8 maintenance pumpking. I have released 7 stable versions of Perl, made many improvements to the Perl core, and fixed many bugs. I work well on my own, and with the other core Perl developers. I care about the Perl 5 core, but on my own don't have sufficient time to work on everything I'd like to.
Amount Requested
$11,000. I estimate this gross income just keeps me financially standing still for 3 months. It's less than I was earning full time, and it's less than the equivalent $17,500 per quarter that the Parrot and Perl 6 proposal was budgeted at. I cannot accept a grant for less than this rate, but a shorter grant at the same rate is viable. Clearly this grant proposal and the Ponie proposal are exclusive, in as much as it's not possible for me to do more than 60 days work combined between the two.
Appendix
I feel that I should note two areas that aren't in this proposal.
Firstly "state variables" and "_ prototype character" are listed in Rafael's roadmap to 5.10. It would seem logical to have included them. However, these require work on the parser, optree generation and probably pads, all of which are areas of the core I'm not familiar with. Therefore I can't give a good estimate of how long the work would take me to do, which doesn't make a good grant proposal. More importantly, I'm not the best person to do these - someone else who already has experience of these areas would get the job done more quickly than I would.
Secondly I make no proposals for threads. Threads are important, and would benefit from work. However, time spent on resolving threads bugs is mostly taken tracking down the true cause, with the actual fixing being a minor component. The time taken to track down a threads bug is particularly hard to guess, therefore it's not possible to commit to any statement of deliverables better than a vague "work on threads bugs for $n days" with no guarantee of results, which historically hasn't been how TPF prefers to award grants. Similarly optimisation proposals such as copy on write are still unproven, with no clear way of estimating how long they would take to complete, or quantifying the benefits and risks.
I don't think that I'm likely to actually fix any regexp engine state bugs, as I was only planning to (comprehensively) simplify the existing state saving code, without actually changing what gets saved when.
It should make the logic more accessible, and hopefully other(s) will then be able to see where to take it next, which could well lead to bug fixes.
Congrats Nick!
(to someone else... why isn't this mentioned on the main page? I found out about it from use.perl.org)
Just wondering: Are the space savings mentioned ("33%") already implemented in Perl 5.8.8, or only in the 5.9.x line?