2006 - Q1 Grant Votes

Only one grant was approved this quarter, but it's fantastic. Nicholas Clark, the current 5.8 Pumpking, has applied for an $11,000 grant to improve Perl 5. Some of what he intends to do:

  • Fix bugs affecting Unicode handling, Class::DBI, Tk and PAR.
  • Allow relocatable Perl installations, which permits developers to distribute applications complete with a bundled Perl.
  • Support source filters as part of code references in @INC.
  • Complete lexical pragmas, a milestone on the way to 5.10.
  • Further memory saving for 5.10 - previous work saved >33% for hashes and arrays.
  • Merge Larry's p5-p6 work into core, which will allow others to start contributing to it.

The changes are extremely valuable and will lead to widespread benefit to Perl programmers. We're quite excited about this grant. There are few people more qualified than Nicholas to do this work and the price is a bargain.

On a personal note, I'm particularly excited about his plan to save the regular expression engine state. I've been bitten by bugs caused by the regex engine not being re-entrant, and I'm hoping this will fix my bugs. Read on for the full text of the grant application.

Name

Nicholas Clark


Email

nick@ccl4.org


Project Title

Improve Perl 5


Synopsis

Perl 5 maintenance and development is proceeding steadily, but there are a number of stubborn bugs that no volunteer has had the time to work on. This project will ensure that these bugs are resolved, as well as providing resource to develop new features for both 5.8.x and 5.10.


Benefits to the Perl Community

Perl has seen many maintenance releases over the past few years, and improvements have been contributed both for the future 5.10 and for 5.8.x. However, the core Perl development is entirely done by volunteer labour, with no significant time sponsorship by employers, which means that not everything is addressed that would be addressed in an ideal world.

This grant proposal is not for the regular work that will lead to 5.8.9 or 5.8.10. Instead it is intended to resolve long term core bugs that have persisted unfixed for several maintenance releases and affect major Perl modules, as well as completing unglamorous tasks that will benefit all users of Perl 5.10.


Deliverables

  • Fix bugs affecting Unicode handling, Class::DBI, Tk and PAR.
  • Allow relocatable Perl installations, which permits developers to distribute applications complete with a bundled Perl.
  • Support source filters as part of code references in @INC.
  • Complete lexical pragmas, a milestone on the way to 5.10.
  • Further memory saving for 5.10 - previous work saved >33% for hashes and arrays.
  • Merge Larry's p5-p6 work into core, which will allow others to start contributing to it.

Project Details


GV and CV structure shrinking (10 days)

Last year I restructured how the data is stored in the scalar heads and bodies and reduced the memory usage for regular hashes and regular arrays considerably. On a system with 32 bit integers and pointers, AVs and HVs now use 36 bytes, rather than the previous 56 and 60, a saving of over 33%. I also removed 1 level of pointer indirection on various types of lookups, reducing the chance of CPU cache misses, a subtle cause of program slowness.
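The quoted saving can be checked with a little arithmetic. The sketch below assumes that "56 and 60" are the old sizes of AVs and HVs respectively; the helper name saving_percent is made up for illustration.

```perl
use strict;
use warnings;

# Assumption: "56 and 60" are the old sizes of AVs and HVs respectively.
my %old_size = (AV => 56, HV => 60);
my $new_size = 36;

# Percentage of memory saved when a structure shrinks from $old to $new bytes.
sub saving_percent {
    my ($old, $new) = @_;
    return 100 * ($old - $new) / $old;
}

for my $type (sort keys %old_size) {
    printf "%s: %d -> %d bytes, %.1f%% saved\n",
        $type, $old_size{$type}, $new_size,
        saving_percent($old_size{$type}, $new_size);
}
```

Under that reading the savings are about 35.7% for AVs and 40% for HVs, both above the 33% quoted.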

I believe similar space savings can be made for GVs and CVs (and therefore PVLVs and PVFMs). As well as direct structure rearrangement, both have pointers to C strings which are malloc()ed as duplicates of existing strings, so there is potential to use the shared string infrastructure to save memory. In addition, moving the GP pointer to the SV head union would remove one level of pointer indirection for many common GV operations, which would reduce CPU cache misses. In his ``Illustrated Perl Guts'' Gisle Aas notes: ``Each stash is at least 4 levels deep and each glob is 3 levels, giving at least 24 pointer dereferences to access the data in the $foo::bar::baz variable from defstash.'' I'd argue that this is part of Larry's original design that is not so well suited to modern CPU trends, and so these changes to GVs would reduce the dereferences in that example by 3.

Based on my experiences with the AV and HV restructuring, I estimate that GVs and CVs would each take about 5 days of full time work.

These changes would only be applicable to 5.10.


Relocatable Perl binaries (3 days)

Currently the location of Perl's library paths has to be specified at Configure time. This is less than flexible for anyone who wants to make a binary distribution that can be installed to an arbitrary location. The favoured work around is to create an installer that patches paths inside the Perl binary. This is not ideal, as it precludes a simple ``untar and go'' package, and doesn't permit the distribution to be moved after installation.

All the code changes to enable the Perl binary to find the libraries relative to itself are already in place, tested and working. However, they aren't yet usable because the Configure script has no means of allowing the user to specify and enable the option.

All that is required to get this working is additional logic in Configure. The changes needed aren't entirely trivial, as they interact with the existing logic that writes the installation to a different path from the final read location. I estimate 3 days are needed, 1 each for design, implementation and testing.

These changes can be merged back to 5.8.x.


Error handling in constant folding (4 days)

Currently, unreachable code that contains constant expressions that cannot be evaluated will fail at compile time. For example:

    $ ./perl -le 'if (0) {0/0}'
    Illegal division by zero at -e line 1.

The consensus is that this is a bug, because constant folding is merely an optimisation, and therefore should not affect program behaviour in any way other than resource usage.

The intent is to wrap the constant folding code in an error handler, so that if an exception is thrown, folding is aborted and the optree retained. Because the implementation of the exception handling has been simplified in 5.9.x, I expect that these changes would only take about 1 day to implement there, but may well take 3 further days to backport into 5.8.x's framework. In the process I should be able to assess how to merge Marcus Holland-Moritz's simplified exception macros back to 5.8.x.
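The intended behaviour can be checked from Perl itself. The following is a minimal sketch, not part of the proposal: the sub name never_divides is invented, and string eval is used only so that a compile-time failure can be caught and reported instead of aborting the script.

```perl
use strict;
use warnings;

# Compile a sub whose only division by zero sits in an unreachable branch.
my $compiled = eval q{
    sub never_divides {
        if (0) { return 0/0 }    # dead code: should not affect compilation
        return 'ok';
    }
    1;
};

if ($compiled) {
    print "compiled; never_divides() returned ", never_divides(), "\n";
}
else {
    print "compilation failed: $@";
}

# A reachable division by zero should still fail, but only when executed.
my $zero   = 0;
my $result = eval { 1 / $zero };
print "run-time error: $@" unless defined $result;
```

On a perl that still has the bug, the first branch reports a compilation failure; once folding errors are trapped, the sub compiles and the error only appears for the division that is actually executed.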


Make the UTF-8 caching code optional, and more flexible (5 days)

For Perl 5.8.1, Jarkko added code to cache the mapping from character offsets to byte offsets in strings. The caching code is fixed to store at most 2 positions, but the implementation constrains these to be related, so only substr can really benefit from more than one. Although the caching code is an optimisation, it is not optional, and several bugs have been found in it across several Perl versions. There is no certainty that all have been found yet, and no easy way to check.

I propose re-writing the caching code to make it optional. This way, if bugs are found in the field, end users have the option of disabling the cache to get the right answer, instead of being forced to have the wrong answer quickly. The caching code should also be able to run in ``assert'' mode for debugging purposes, where all cached answers are double checked against a full scan. Also, by changing the design, I believe that it will be trivial to allow the cache to be flexible in the number of character/byte pairs that it stores, on a string by string basis.
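To make the three modes concrete, here is a pure-Perl model of such a cache. It is only a sketch of the idea, not the C code in the Perl core: the class name OffsetCache, the mode names 'off', 'on' and 'assert', and the max_pairs option are all assumptions made for this example.

```perl
use strict;
use warnings;

package OffsetCache;

# mode: 'off' (always scan), 'on' (use the cache),
#       'assert' (use the cache, then verify against a full scan)
sub new {
    my ($class, $bytes, %opt) = @_;
    my $self = {
        bytes     => $bytes,                  # UTF-8 encoded byte string
        mode      => $opt{mode}      // 'on',
        max_pairs => $opt{max_pairs} // 2,    # cache size for this string
        pairs     => [],                      # cached [char_offset, byte_offset]
    };
    return bless $self, $class;
}

# Advance $nchars characters from byte position $pos by skipping
# UTF-8 continuation bytes (those of the form 10xxxxxx).
sub _scan {
    my ($self, $pos, $nchars) = @_;
    my $len = length $self->{bytes};
    while ($nchars-- > 0 && $pos < $len) {
        $pos++;
        $pos++ while $pos < $len
            && (ord(substr($self->{bytes}, $pos, 1)) & 0xC0) == 0x80;
    }
    return $pos;
}

sub byte_offset {
    my ($self, $char) = @_;
    return $self->_scan(0, $char) if $self->{mode} eq 'off';

    # Start from the closest cached pair at or before the requested character.
    my ($from_char, $from_byte) = (0, 0);
    for my $pair (@{ $self->{pairs} }) {
        ($from_char, $from_byte) = @$pair
            if $pair->[0] <= $char && $pair->[0] > $from_char;
    }
    my $byte = $self->_scan($from_byte, $char - $from_char);

    if ($self->{mode} eq 'assert') {
        my $expected = $self->_scan(0, $char);
        die "offset cache gave $byte, full scan gave $expected\n"
            unless $byte == $expected;
    }

    # Remember the answer, dropping the oldest pair when the cache is full.
    push @{ $self->{pairs} }, [$char, $byte];
    shift @{ $self->{pairs} } while @{ $self->{pairs} } > $self->{max_pairs};
    return $byte;
}

package main;

my $text = "na\x{ef}ve \x{263a} text";    # 12 characters
utf8::encode($text);                      # now a 15-byte UTF-8 string
my $cache = OffsetCache->new($text, mode => 'assert', max_pairs => 4);
printf "char %d starts at byte %d\n", $_, $cache->byte_offset($_) for 3, 7, 5;
```

Switching mode to 'off' gives the slow but trustworthy answer, which is the escape hatch the proposal wants end users to have.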

These changes should take about 5 days, and can be merged back to 5.8.x.


Ensure that IPv6 support is complete (3 days)

In theory the Perl core should already fully support IPv6. However, it's not clear how robust the support is; in particular whether the address buffers used by all the socket API calls are correct. As I have access to 2 IPv6 enabled machines I can test that all socket functions work equally well with IPv6 as with IPv4.

The Socket6 module on CPAN makes reference to the KAME project, which provided an IPv6 stack for BSD variants, which incorporates a Perl source tree (http://www.kame.net/dev/cvsweb.cgi/perl5/), but the Perl 5 Changes files make no mention of any changes being merged back to core. Hence as part of the audit I will check what differences there are between their tree and the version of Perl 5 they branched from.

These changes will be merged back to 5.8.x.


Convert the @INC source filter to Filter::Simple (5 days)

Perl 5.6 added the ability to put code references into @INC, which allows arbitrary code to be run before creating a file handle to pass back to require. An additional undocumented feature was the ability to pass back a source filter as a second return value. The code for this filter exists in core, with comments that suggest that some bugs are unresolved. The functionality is also provided by Filter::Simple, which entered the core for 5.8.0.
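For readers who have not used the feature, this is roughly what a code reference in @INC looks like. It is a minimal sketch that shows only the documented part (returning a file handle); the undocumented filter return value discussed above is not shown. The %virtual_file table and the Hello::World module are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory "files", keyed by the path that require asks for.
my %virtual_file = (
    'Hello/World.pm' => <<'END_SOURCE',
package Hello::World;
sub greet { return 'hello from an @INC hook' }
1;
END_SOURCE
);

# require calls the hook with the hook itself and the relative file name.
unshift @INC, sub {
    my ($self, $filename) = @_;
    my $source = $virtual_file{$filename};
    return unless defined $source;    # not ours: let require try the next entry
    open my $fh, '<', \$source
        or die "cannot open in-memory file for $filename: $!";
    return $fh;                       # require reads the module from this handle
};

require Hello::World;
print Hello::World::greet(), "\n";
```

Returning an empty list tells require to carry on with the remaining @INC entries, so ordinary modules still load from disk.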

If the existing filter functionality were migrated to Filter::Simple, then the duplicate code could be removed from the C source, reducing size and maintenance liability.

These changes will be merged back to 5.8.x.


Saving the regular expression engine state (4 days)

The regular expression engine stores its state in global variables, in the Perl interpreter structure. In turn, the engine needs to re-enter the Perl interpreter at times, for example to load UTF-8 data. It does this by saving the state of all the globals, variable by variable, inside Perl_save_re_context. This means that at scope exit there are 38 or more separate actions to perform.

The only loop inside Perl_save_re_context is for saving $1 etc. It should be possible to replace this with a save of the underlying data that $1 etc read from. With this changed, the entire regexp state can be stored in one fixed-size custom save structure, and restored in one action. In itself, this is not a huge speed or size gain, but it can be immediately followed by reorganising the interpreter structure to use this state structure rather than individual variables. The state structure in the interpreter structure can then be migrated out, with the interpreter structure holding a pointer to the current global state. The immediate benefit is that re-entering the regexp engine becomes a much less intensive operation. The broader benefit is that it will become much clearer what work remains to change the regexp engine to pass all state around in this fashion, and therefore be refactored from recursive to iterative.

I estimate that this will take 4 days. The cleanup and the state structure on the save stack can be merged back to 5.8.x.


Reviewing complex changes for integration to 5.8.x

There have been several changes made to 5.9.x that fix bugs but are non-trivial to integrate to 5.8.x. All require at least 1 day of solid concentration to assess, and some are over a year old. Given that multiple 5.8.x releases have passed without these fixes being integrated, it is clear that whenever I have a whole day spare to work on 5.8.x, they are never top priority, because work on them never gets done. Hence I believe that integration will never actually happen unless it is made an explicit objective.

Migrate pack ``W'' changes to 5.8.x (3 days)

Bug 33734 (https://rt.perl.org/rt3/Ticket/Display.html?id=33734) reports that various numeric unpack operators expose the internal representation of the string (bytes or UTF-8). The bug is valid and tricky to solve in an acceptably backwards-compatible fashion, given the documented and supported inconsistencies in pack's handling of UTF-8.

Ton Hospel supplied a series of patches for pack to fix this and other bugs he found, clean up the code in pp_pack.c, and add a new ``W'' format specifier. However, the changes he made are not fully safe to go into the maintenance branch. The pack code is effectively currently forked between 5.9.x and 5.8.x, with the result that all subsequent cleanups and fixes in it haven't been merged to 5.8.x.

Prior to Ton's changes, the pack code was identical in both, with a minimal set of conditional compile directives to provide the functional differences between 5.8 and 5.9's supported pack features. I estimate that it will take 3 days total to assess the best compromise behaviour for 5.8.x, adapt the code from 5.9.x to provide this, and thereby heal the fork.

Fix bug 3038 in 5.8.x (1 day)

Bug 3038 (https://rt.perl.org/rt3/Ticket/Display.html?id=3038) is that qr/(.$)/m is not the same as /(?m-xis:(.$))/ with /g. A patch was applied to 5.9.x back in November 2004, but as it touches C code that was changed when $* was removed, it doesn't integrate into 5.8.x.
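One way to compare the two forms on a given perl is sketched below. The sample text is an assumption chosen for this example, not the test case from the ticket, and the script only reports whether the two forms agree.

```perl
use strict;
use warnings;

my $text = "ab\ncd\n";

# The pattern as a precompiled qr// object carrying the /m flag ...
my $compiled = qr/(.$)/m;
my @from_qr  = $text =~ /$compiled/g;

# ... and the same pattern written inline with the flags spelled out.
my @from_inline = $text =~ /(?m-xis:(.$))/g;

print "qr//m:  @from_qr\n";
print "inline: @from_inline\n";
print "the two forms ",
    ("@from_qr" eq "@from_inline" ? "agree" : "DISAGREE"), "\n";
```

With /m in effect, $ matches before each newline, so both forms should capture the last character of every line.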

magic and localisation (3 days)

To fix several bugs, Dave Mitchell made two significant changes to the handling of magic in 5.9.x. Change 24942 makes a distinction between container magic and value magic, so that local on a variable with magic works more as the user would expect. Change 26569 adds a local slot to the magic vtable. There is no obvious reason why these changes are incompatible with 5.8.x, and it would be beneficial to integrate them to 5.8.x as they are prerequisites for several bug fixes. However, it would be imprudent to merge them without investigation, as it's possible that they introduce subtle behaviour changes which would be inappropriate for the maintenance branch, in effect introducing new bugs.

I estimate that it would take 3 days to review the changes, and if appropriate merge them across with any modifications necessary to ensure that binary compatibility is retained.

$0 changes and PAR (2 days)

Changes 25138, 21546, 21664 and 21833 cleaned up the handling of C's argv[0], and how it is updated when the Perl program alters $0. However, these changes broke PAR's regression tests, so were never integrated to 5.8.x. Since then no-one has had the time or desire to dig down to understand why this was so. I estimate that it will take 2 days to figure this out, fix the problems, and get these changes integrated.


Significant bugs

Systematic problems with UTF-8 and overloaded stringification (3 days)

Bug 34297 (https://rt.perl.org/rt3/Ticket/Display.html?id=34297) reports a problem with returning Unicode strings via overloaded stringification. The cause of this specific bug is in pp_length, which is calling DO_UTF8(sv) to switch between UTF8 and non-UTF8 code paths, prior to calling SvPV. This is a subtle bug, because the UTF8 flag will not be set on references to objects with overloaded stringification until after it is called for the first time in SvPV(). Fixing this particular bug is simple. However, a cursory inspection of pp.c alone suggests that it is quite widespread. pp_ucfirst/pp_lcfirst, pp_uc, pp_lc, pp_index and pp_rindex all seem to share the same bug, although it's likely that in some cases it will only trigger with 8 bit locales.

I propose to spend 3 days reviewing the Perl source code to find all instances of this systematic bug, and fix them. These changes will be merged back to 5.8.x.
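At the Perl level the class of bug can be probed by applying an operator to the object and to its plain stringification, then comparing the answers. This is a hedged sketch: the Label class and the sample string are invented, and only length, index and rindex are exercised.

```perl
use strict;
use warnings;

package Label;    # hypothetical class, for illustration only
use overload '""' => sub { $_[0]{text} }, fallback => 1;

sub new {
    my ($class, $text) = @_;
    return bless { text => $text }, $class;
}

package main;

# Six characters; two of them need more than one byte in UTF-8.
my $label = Label->new("caf\x{e9} \x{263a}");

# Each operator is applied to the object, then to its stringified copy.
# On a perl without the bug the two columns should match.
my @checks = (
    [ 'length' => length($label),            length("$label") ],
    [ 'index'  => index($label, "\x{263a}"), index("$label", "\x{263a}") ],
    [ 'rindex' => rindex($label, "\x{e9}"),  rindex("$label", "\x{e9}") ],
);
printf "%-6s object=%d string=%d\n", @$_ for @checks;
```

A mismatch between the columns would point at an operator that inspected the UTF8 flag before the stringification had been performed.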

Reblessing references interacts badly with overloading (3 days)

Bug 34925 (https://rt.perl.org/rt3/Ticket/Display.html?id=34925) reports that reblessing a reference from a class with overloading into one without causes problems. This is not an abstract problem, as it affects code using Class::DBI. The solution to this would be to store the overloading flag on the referent rather than the reference. In turn, this involves rejigging the flag usage on hashes to free up a flag bit. I estimate that this work would take 3 days, and would probably be suitable for merging to 5.8.x.
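The situation is easiest to see with two references to the same referent. The sketch below uses invented class names (Loud and Plain) and simply prints what each reference stringifies to before and after the rebless; what it prints depends on the perl version it runs under.

```perl
use strict;
use warnings;

package Loud;     # hypothetical class with overloaded stringification
use overload '""' => sub { 'LOUD OBJECT' }, fallback => 1;

package Plain;    # hypothetical class without any overloading
sub describe { return 'plain object' }

package main;

my $object = bless {}, 'Loud';
my $other  = $object;             # a second reference to the same hash

print "before rebless: $object / $other\n";

bless $object, 'Plain';           # rebless through one of the references

# Both references point at the same referent, so both should now behave
# as Plain objects and stringify in the default Plain=HASH(0x...) form.
print "after rebless:  $object / $other\n";
```

If the overloading flag lives on the reference, only the reference passed to bless is updated and the copy can disagree; keeping the flag with the referent removes that possibility.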


Merge Larry's tokeniser changes (5 days)

As part of the Perl 6 project, Larry Wall has been working on a ``Perl 5 to Perl 5'' translator, as the foundation for a Perl 5 to Perl 6 translator. He's been developing against the 5.9.2 release, modifying the tokeniser and the parser as needed to retain the extra information they used to throw away.

In the minutes of the design call (http://use.perl.org/~luqui/journal/28319), Larry reports that work is reaching the point where it's time to merge it back into the core, and suggests that it might be a more efficient use of time if someone freed him from the burden of doing this.

Larry hasn't yet made his code public. However, there have been many minor changes to the Perl source since 5.9.2 was released, particularly Andy's work on consting. Therefore it's likely that the diff he generates will have significant conflicts against the core, and so most if not all of Larry's changes will need to be hand applied. My guess is that this could take anything up to 5 days.


Complete lexical pragmas (3 days)

Robin's implementation of switch also provided most of the remaining framework needed for supporting lexical pragmas. %^H, the hash used to store the state of user defined pragmas, now propagates into eval. The major part remaining is that the compile time state of %^H for a block is not accessible at run time from that block. Rafael has a patch, but it fails tests. My guess is that it would take about 3 days to get this passing all tests and good enough to be considered feature complete.
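For context, this is the shape a user-defined lexical pragma takes in perl 5.10 and later, where the hints hash of a call site is available at run time as element 10 of caller() (see perlpragma). The pragma name shout and the greet sub are invented, and the pragma is defined inline rather than in its own shout.pm, so treat this as a sketch only.

```perl
use strict;
use warnings;

package shout;    # hypothetical pragma, defined inline instead of in shout.pm

# Called at compile time; records the setting in the hints hash of the
# scope that is currently being compiled.
sub import   { $^H{'shout/enabled'} = 1 }
sub unimport { $^H{'shout/enabled'} = 0 }

# Element 10 of caller() is the hints hash that was in force at the call
# site of the frame $level levels up.
sub in_effect {
    my $level = shift // 0;
    my $hints = (caller($level))[10];
    return $hints && $hints->{'shout/enabled'} ? 1 : 0;
}

package main;

sub greet {
    my ($name) = @_;
    my $text = "hello $name";
    # Level 1: look at the scope that called greet(), not at greet() itself.
    return shout::in_effect(1) ? uc $text : $text;
}

my ($inside, $outside);
{
    BEGIN { shout->import }    # what "use shout;" would do for a real module
    $inside = greet('world');
}
$outside = greet('world');

print "inside the block:  $inside\n";
print "outside the block: $outside\n";
```

The same greet() call behaves differently depending only on the lexical scope it was compiled in, which is what makes the pragma lexical.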


Taint, UTF-8 and Tk (3 days)

Changes were made between 5.8.5-RC1 and 5.8.5 to work round a problem with Tk and the UTF-8 flag (changes 23084 and 23085). At the time these were seen as a late bodge to keep things working for the release, but the changes were never revisited. The specific problem with the bodge was that it ended up with SvPOK being true on a tainted scalar, which defeats subsequent checks for tainting. This isn't good, but no-one has had the time to revisit the problem and resolve it properly. I estimate that it would take about 3 days to investigate, develop and test a proper solution.

Project Schedule

Three months full time (60 working days). I can start immediately. I need to start soon, else my finances require that I get a new full time job to pay the rent. This will not cut into the time I spend volunteering on 5.8.x.

Bio

I'm the current Perl 5.8 maintenance pumpking. I have released 7 stable versions of Perl, made many improvements to the Perl core, and fixed many bugs. I work well on my own, and with the other core Perl developers. I care about the Perl 5 core, but on my own don't have sufficient time to work on everything I'd like to.

Amount Requested

$11,000. I estimate this gross income just keeps me financially standing still for 3 months. It's less than I was earning full time, and it's less than the equivalent $17,500 per quarter that the Parrot and Perl 6 proposal was budgeted at. I cannot accept a grant for less than this rate, but a shorter grant at the same rate is viable. Clearly this grant proposal and the Ponie proposal are exclusive, in as much as it's not possible for me to do more than 60 days work combined between the two.
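The schedule and the rate can be cross-checked against the per-task estimates. In this sketch the day counts are copied from the section headings of the proposal, the task names are shortened, and the comparison figure assumes the $17,500 quarter is also spread over 60 working days.

```perl
use strict;
use warnings;

# Day estimates copied from the section headings of this proposal.
my @estimates = (
    [ 'GV and CV structure shrinking'        => 10 ],
    [ 'Relocatable Perl binaries'            => 3 ],
    [ 'Error handling in constant folding'   => 4 ],
    [ 'Optional, flexible UTF-8 caching'     => 5 ],
    [ 'IPv6 support'                         => 3 ],
    [ '@INC source filter to Filter::Simple' => 5 ],
    [ 'Regular expression engine state'      => 4 ],
    [ 'pack W changes to 5.8.x'              => 3 ],
    [ 'Bug 3038 in 5.8.x'                    => 1 ],
    [ 'Magic and localisation'               => 3 ],
    [ '$0 changes and PAR'                   => 2 ],
    [ 'UTF-8 and overloaded stringification' => 3 ],
    [ 'Reblessing and overloading'           => 3 ],
    [ 'Tokeniser changes'                    => 5 ],
    [ 'Lexical pragmas'                      => 3 ],
    [ 'Taint, UTF-8 and Tk'                  => 3 ],
);

my $total_days = 0;
$total_days += $_->[1] for @estimates;

my $grant      = 11_000;
my $comparison = 17_500;    # assumption: also spread over 60 working days

printf "%d tasks, %d days in total\n", scalar @estimates, $total_days;
printf "this proposal: \$%.2f per day\n", $grant / $total_days;
printf "comparison:    \$%.2f per day\n", $comparison / 60;
```

The estimates add up to exactly the 60 working days requested, which works out at a little over $183 per day.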

Appendix

I feel that I should note two areas that aren't in this proposal.

    Firstly "state variables" and "_ prototype character" are listed in
    Rafael's roadmap to 5.10. It would seem logical to have included them.
    However, these require work on the parser, optree generation and probably
    pads, all of which are areas of the core I'm not familiar with. Therefore
    I can't give a good estimate of how long the work would take me to do,
    which doesn't make a good grant proposal. More importantly, I'm not the
    best person to do these - someone else who already has experience of these
    areas would get the job done more quickly than I would.

    Secondly I make no proposals for threads. Threads are important, and would
    benefit from work. However, time spent on resolving threads bugs is mostly
    taken tracking down the true cause, with the actual fixing being a minor
    component. The time taken to track down a threads bug is particularly
    hard to guess, therefore it's not possible to commit to any statement of
    deliverables better than a vague "work on threads bugs for $n days" with
    no guarantee of results, which historically hasn't been how TPF prefers
    to award grants. Similarly, optimisation proposals such as copy on write
    are still unproven, with no clear way of estimating how long they would
    take to complete, or quantifying the benefits and risks.

3 Comments

I don't think that I'm likely to actually fix any regexp engine state bugs, as I was only planning to (comprehensively) simplify the existing state saving code, without actually changing what gets saved when.
It should make the logic more accessible, and hopefully other(s) will then be able to see where to take it next, which could well lead to bug fixes.

Congrats Nick!

(to someone else... why isn't this mentioned on the main page? I found out about it from use.perl.org)

Just wondering: Are the space savings mentioned ("33%") already implemented in Perl 5.8.8, or only in the 5.9.x line?

About this Entry

This page contains a single entry by Curtis "Ovid" Poe published on February 22, 2006 4:46 PM.
