- Name: Eric Wilhelm
- Title: svn.cpan.org - revision control for all of CPAN
- Synopsis: This project will create a universally addressable Subversion space for the Perl community. It will be populated with historical and ongoing CPAN releases. Every CPAN author will be given a public version control repository, and may use a portion of it for any (reasonable) purpose. Alternate views, such as mappings for each distribution, will be handled via the http namespace.
svn.cpan.org - revision control for all of CPAN
Benefits to the Perl Community:
Preservation of History
A uniform set of release tags for each distribution will benefit Perl users, CPAN authors, and downstream distributors (e.g. Debian) by providing a structured system for querying changes and performing bisection bug searches. It will also offer convenience to the module author or maintainer by providing rigorous and dependable tags in cases where the original release tag history may have been nonexistent or lost.
This system will give CPAN authors an easy path to start using version control, but no action is required on their part for the system to be useful. They may choose to import their existing tree, copy a tag as a starting point, or simply let the release history continue to automatically accumulate.
The filesystem-like nature of Subversion and the ability to use HTTP namespaces as glue provide a uniformly discoverable tree. A proxy configuration could allow some paths to point at external SVN servers. Even external, non-svn repositories can be made discoverable to humans simply by placing instructions in a text file. Formalizations may evolve with use, such as a standard 'see_other.yml' file or other conventions supported by client-side tools.
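As a sketch of what such a convention might look like, here is a hypothetical 'see_other.yml' (the filename is floated above; the fields are invented here for illustration, not a settled format) that an author could place at the root of their space to point humans and tools at an external repository:

```yaml
# Hypothetical see_other.yml -- fields are illustrative, not a settled format.
repository:
  type: git                        # VCS type of the external repository
  url: git://example.org/foo.git   # where the real development history lives
  note: "Development happens off-site; releases are still tagged here."
```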
Rather than downloading a tarball to examine its contents, users will be able to fetch a single file (e.g. META.yml) or obtain a directory listing directly from the svn URL.
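Assuming the namespace sketched above, inspecting a release without downloading its tarball might look like this (the author, dist, and tag paths are illustrative):

```
# List the contents of a release directly from the svn URL
svn ls http://svn.cpan.org/E/EW/EWILHELM/Some-Dist/tags/1.02/

# Fetch a single file instead of the whole tarball
svn cat http://svn.cpan.org/E/EW/EWILHELM/Some-Dist/tags/1.02/META.yml
```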
Welcoming Experimentation and Collaboration
Giving every CPAN author a Subversion space provides convenience to both new and existing authors. It encourages them to publish code early and often, getting feedback from the community as they go. By providing a common middle ground between local changes and CPAN releases, new ideas can be tried and shared more conveniently.
The normalized HTTP namespace and historical data can act as a foundation for many new tools and techniques. It is hoped that the system will also enable new ways to search, new alpha-release channels for CPAN clients and testers, a structured base for documentation patches, and various other interesting possibilities for the Perl community's infrastructure.
Deliverables:
- Per-author repositories containing a historical record of package releases.
- Tools for automated, ongoing imports of new releases.
- Apache/etc configuration files.
- Experiments/notes for future HTTP namespace tricks.
- "Burn-in" monitoring, hosting, and administration.
One Repository Per Author
Previous attempts tried to fit the entire CPAN into one repository. It "fits", but the import must be monolithic and the ongoing administration and maintenance seems unwieldy. The repository would be roughly 4GB at over 100k revisions. This layout would be easily abused (accidentally or not) by requesting the full log or checkout. Further, any corruption (via corrupt bytes or the need to scrub a malicious checkin) would require a long dump/load and possibly extended downtime for all authors.
By breaking repositories at the natural "author boundary", most of the linearity of history and "diff compression" benefit is preserved. This scheme also allows for partitioning across multiple servers should the request load ever warrant it. The drawbacks of monolithic import and unwieldy size are traded for some complexity in cross-author issues.
The use of a single HTTP namespace will gloss over these issues for most users. Others may need to understand the boundaries (e.g. "http://svn.cpan.org/E/EW/EWILHELM/..." where the repository is at "EWILHELM" and the rest is in the httpd config.) But I believe that such issues will be minimal and, if necessary, can be handled by client-side tools (which are outside the scope of this proposal.)
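A minimal sketch of how a single HTTP namespace could front per-author repositories with mod_dav_svn (the directives are real Apache/mod_dav_svn ones; the filesystem layout and rewrite scheme are assumptions for illustration):

```apache
# One repository per author under a common parent directory;
# SVNParentPath exposes each as http://svn.cpan.org/repos/<PAUSEID>/
<Location /repos>
  DAV svn
  SVNParentPath /srv/svn/authors
  SVNListParentPath on
</Location>

# The public A/AB/AUTHORID layout is glued on top in the httpd config,
# e.g. /E/EW/EWILHELM/... -> /repos/EWILHELM/...
RewriteEngine on
RewriteRule ^/[A-Z]/[A-Z]{2}/([A-Z0-9]+)/(.*)$ /repos/$1/$2 [PT]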
Ordered, Tagged Imports from Backpan
The import process mainly involves organizing the tarballs and unpacking them in sequence with some svn actions. Early experiments show this to be I/O bound and suggest that it might yield well to some light clustering. The latest rethink based on per-author repositories helps here.
The initial import scheme has had a few dry-runs, but needs to be taught how to deal with more tarball oddities. It should also be more modularized and needs automated tests.
The import will use the author id to identify the source of each tarball. Handling of multi-author distributions is still an open question, but such packages will either be merged based on the "02packages.details" index or left to be addressed later.
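The author id is recoverable from each tarball's BackPAN path. A minimal sketch of that bookkeeping, assuming the standard authors/id/A/AB/AUTHORID layout (the function names are invented for illustration):

```python
import re

# BackPAN paths look like "authors/id/E/EW/EWILHELM/Some-Dist-1.02.tar.gz";
# the third component under id/ is the PAUSE author id.
PATH_RE = re.compile(r'authors/id/./../([A-Z][A-Z0-9]+)/(.+)$')

def author_of(path):
    """Extract the PAUSE id that identifies the source of a tarball."""
    m = PATH_RE.search(path)
    if not m:
        raise ValueError("not an author path: %s" % path)
    return m.group(1)

def by_author(paths):
    """Group tarball paths by author, so each group feeds one repository."""
    groups = {}
    for p in paths:
        groups.setdefault(author_of(p), []).append(p)
    return groups
```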
It appears that the fsfs svn data can be hacked to set the tarball timestamp as the commit date. If this does not work and no better alternative is found, the tarball date will simply go in the commit message.
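One supported alternative to editing the fsfs files directly is the svn:date revision property, which can be set on a local repository with svnadmin setrevprop (available in Subversion 1.5+; the repository path and revision below are examples):

```
# The tarball's timestamp, in the ISO-8601 format svn:date expects
echo -n "2003-07-15T09:30:00.000000Z" > date.txt

# setrevprop edits the revision property directly on a local repository,
# bypassing the pre-revprop-change hook
svnadmin setrevprop /srv/svn/authors/EWILHELM -r 42 svn:date date.txt
```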
Iterations and Test Cases
The details of the import may require a few passes through the entire data set to get all of the kinks out of the procedure and get everything to look right. The initial version bailed after 17 hours, and a full load-in was estimated at 36 hours by extrapolating from that. The first task with the import tool is therefore to break the process into logged and restartable pieces so that trouble spots can be incrementally addressed while the run-through continues.
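A minimal sketch of one way to make the run logged and restartable (the log format and function names are invented for illustration): record each successful import in an append-only log, skip logged work on restart, and collect failures instead of aborting.

```python
import os

DONE_LOG = "imported.log"  # one tarball path per line, appended after success

def already_done():
    """Read the log of completed imports, if any, so a rerun can skip them."""
    if not os.path.exists(DONE_LOG):
        return set()
    with open(DONE_LOG) as fh:
        return set(line.strip() for line in fh)

def run_imports(tarballs, import_one):
    """Import each tarball once, recording successes so the run can resume.

    import_one is the per-tarball worker; failures are collected rather
    than aborting, so trouble spots can be fixed while the run continues.
    """
    done = already_done()
    failed = []
    with open(DONE_LOG, "a") as log:
        for t in tarballs:
            if t in done:
                continue  # restart case: skip work already committed
            try:
                import_one(t)
            except Exception:
                failed.append(t)  # note the trouble spot, keep going
            else:
                log.write(t + "\n")
                log.flush()
    return failed
```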
The troublesome packages and edge cases will lead to changes in the code, either being punted as unsupported conditions or fixes to the algorithm. In either situation, the edge cases will be collected into a list (or hash) for feeding the test suite.
Once all of the trouble spots in the BackPAN archive have been addressed, the import may need one more full pass to ensure that any changes to the layout have been applied consistently throughout the repositories.
Automated Ongoing Imports
This will share some policy and procedure code with the initial import tools, but will have a different control flow (triggered rather than batch) and needs logging suitable for unattended runs.
HTTP Namespace Experiments/Notes
Configuring per-dist views based on the module index should be relatively straightforward. Adding to such views in a running server might not be particularly elegant, but a 'graceful' restart should suffice. I plan to implement at least this one view and explore other possibilities. The main goal with this portion is to prove the namespace concept and whet the imagination of the community.
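A sketch of one such per-dist view, generated from the module index and mapped with a plain redirect (the dist, author, and paths are examples; the generated file would be rewritten and the server gracefully restarted when the index changes):

```apache
# Generated from the 02packages.details index: each dist redirects into
# the owning author's repository. Regenerate, then 'apachectl graceful'.
Redirect /dist/Some-Dist /E/EW/EWILHELM/Some-Dist
```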
POD/HTML documentation about the system and common use-cases will be provided. This will include an explanation of the access policy and recommended directory layout. Administrative/interface tools and APIs will be documented in POD.
Hosting and Maintenance
Initial hosting will be provided by me, though in the event of excessive traffic, I may have to throttle the bandwidth or restrict access to humans who request it. If such issues arise, the project will be a definite success and should easily find a permanent home with adequate bandwidth.
The code, notes, and documentation will be available throughout the project in the following repository and I will blog weekly updates to use.perl.org.
Estimated time: 8-10 weeks
Schedule: Starting around June 1, 2008
- week 1: rework import tool into a distributable/restartable scheme; svn name and date features; run import and log trouble spots
- week 2: work around or skip troublesome tarballs
- week 3,4: full, clean import; set up PAUSE watch/fetch/import; solicit feedback
- week 5: documentation, automated tests
- week 6-8: monitor usage and triage; automate setup; experiment with mod_proxy configurations and 'views'
The schedule is front-loaded because the main import will take a large chunk of concentrated work mixed with long run times, and because CPAN is a moving target. I plan to work on the project 3-4 full days per week until the automatic imports are running. At that point, the gap between the history and "today" cannot grow, and interruptions are not penalized by extended run times.
The final weeks are at a slower pace to allow for communication latency. The documentation will be driven by feedback and questions from the community. The triage and monitoring will be about one hour per day.
By funding this proposal, TPF will be allowing me to focus my development in large enough chunks to get ahead of the continually growing import task. In smaller increments of work, the run time overwhelms the free time per day and development gets postponed.
All modules and utilities will be published under the Perl "GPL 2 / Artistic 1" license.
Michael Schwern helped formulate the initial idea and has provided feedback and realism. Jesse Vincent had a basic version of an import tool and several helpful suggestions. Various members of the Portland Perl Mongers and the Perl community at large have provided useful encouragement and additional feedback, and chromatic says I can quote him as saying "I'm looking forward to it."
Scott Kveton, former director of OSU's Open Source Lab, pledged hosting and bandwidth almost two years ago. It is possible that the OSL is still willing to host this.
I am the author of several CPAN modules, and a contributor to projects which include Module::Build, Test::Harness, Jifty, Moose, PAR, and Inline. I am also an active participant in several areas of the Perl community including wxPerl, parrot, perl-qa, and module-authors.
As president of the Portland Perl Mongers, I have organized monthly meetings (often with a speaker) for the past two years. I frequently give perl-related talks at pdx.pm, the local Linux group, and OSCON.
For Google's SoC 2008, I organized the community volunteers, submitted the org application and will be administering the TPF's 6 projects throughout the summer.
I was the architect and main developer for the dotReader open source book reader, founder of the VectorSection open source graphics format converter, and developer of the Perl-based imaging system responsible for A. Zahner Co's fabrication and installation of the patterned copper cladding on the deYoung museum in San Francisco.
The only external requirement is a DNS CNAME entry for svn.cpan.org.