2008Q2 Grant Proposal - Revision Control for all of CPAN

Name:
Eric Wilhelm

Title:
svn.cpan.org - revision control for all of CPAN

Synopsis:

This project will create a universally addressable Subversion space for the Perl community. It will be populated with historical and ongoing CPAN releases.

Every CPAN author will be given a public version control repository, and may use a portion of it for any (reasonable) purpose.

Alternate views, such as mappings for each distribution, will be handled via the http namespace.

Benefits to the Perl Community:

Preservation of History

A uniform set of release tags for each distribution will benefit Perl users, CPAN authors, and downstream distributors (e.g. Debian) by providing a structured system for querying changes and performing bisection bug searches. It will also offer convenience to the module author or maintainer by providing rigorous and dependable tags in cases where the original release tag history may have been nonexistent or lost.

Future History

This system will give CPAN authors an easy path to start using version control, but no action is required on their part for the system to be useful. They may choose to import their existing tree, copy a tag as a starting point, or simply let the release history continue to automatically accumulate.

Discoverability

The filesystem-like nature of Subversion and the ability to use HTTP namespaces as glue provide a uniformly discoverable tree. A proxy configuration could allow some paths to point at external SVN servers. Even external, non-svn repositories can be found by humans simply by placing instructions in a text file. Formalizations may evolve with use, such as a standard 'see_other.yml' file or other conventions supported by client-side tools.

Addressability

Rather than downloading a tarball to examine its contents, users will be able to fetch a single file (e.g. META.yml) or obtain a directory listing directly from the svn URL.

Welcoming Experimentation and Collaboration

Giving every CPAN author a Subversion space provides convenience to both new and existing authors. It encourages them to publish code early and often, getting feedback from the community as they go. By providing a common middle ground between local changes and CPAN releases, it makes new ideas more convenient to try and share.

Enabling Innovation

The normalized HTTP namespace and historical data can act as a foundation for many new tools and techniques. It is hoped that the system will also create new ways to search, new alpha-release channels for CPAN clients and testers, and a structured base for documentation patches, and that it will enable various other interesting possibilities for the Perl community's infrastructure.

Deliverables:

  1. Per-author repositories containing a historical record of package releases in "-CPAN-/$dist/tags/" directories.
  2. Tools for automated, ongoing imports of new releases.
  3. Apache/etc configuration files.
  4. Experiments/notes for future HTTP namespace tricks.
  5. Documentation.
  6. "Burn-in" monitoring, hosting, and administration.

Project Details:

One Repository Per Author

Previous attempts tried to fit the entire CPAN into one repository. It "fits", but the import must be monolithic and the ongoing administration and maintenance seem unwieldy. The repository would be roughly 4GB at over 100k revisions. This layout would be easily abused (accidentally or not) by requesting the full log or checkout. Further, any corruption (via corrupt bytes or the need to scrub a malicious checkin) would require a long dump/load and possibly extended downtime for all authors.

By breaking repositories at the natural "author boundary", most of the linearity of history and "diff compression" benefit is preserved. This scheme also allows for partitioning across multiple servers should the request load ever warrant it. The drawbacks of monolithic import and unwieldy size are traded for some complexity in cross-author issues.

The use of a single HTTP namespace will gloss over these issues for most users. Others may need to understand the boundaries (e.g. "http://svn.cpan.org/E/EW/EWILHELM/...", where the repository is at "EWILHELM" and the rest is in the httpd config). I believe that such issues will be minimal and, if necessary, can be handled by client-side tools (which are outside the scope of this proposal).
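
As a rough illustration of that boundary, the sketch below builds the "E/EW/EWILHELM" prefix from an author ID and splits a namespace URL into the part that would be the repository root and the part that lives inside the repository. The function names are hypothetical, and the assumption that the repository root is always exactly three path segments deep is mine, drawn only from the example URL above.

```python
BASE_URL = "http://svn.cpan.org"  # host name taken from the proposal


def author_prefix(author_id):
    """Return the CPAN-style 'E/EW/EWILHELM' prefix for an author ID."""
    author = author_id.upper()
    if len(author) < 2:
        raise ValueError("author ID too short: %r" % author_id)
    return "%s/%s/%s" % (author[0], author[:2], author)


def split_url(url):
    """Split a namespace URL into (repository root URL, path inside the repo).

    Assumes the repository root is the three-segment author prefix;
    everything after it is treated as a path within that repository.
    """
    if not url.startswith(BASE_URL + "/"):
        raise ValueError("not under %s: %r" % (BASE_URL, url))
    parts = url[len(BASE_URL) + 1:].split("/", 3)
    if len(parts) < 3:
        raise ValueError("URL does not reach an author repository: %r" % url)
    repo_root = "/".join([BASE_URL] + parts[:3])
    inner_path = parts[3] if len(parts) > 3 else ""
    return repo_root, inner_path
```

A client-side tool could use a split like this to decide which repository to ask for a log, without the user needing to know where the boundary sits.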

Ordered, Tagged Imports from Backpan

The import process mainly involves organizing the tarballs and unpacking them in sequence with some svn actions. Early experiments show this to be I/O bound and suggest that it might yield well to some light clustering. The latest rethink based on per-author repositories helps here.

The initial import scheme has had a few dry-runs, but needs to be taught how to deal with more tarball oddities. It should also be more modularized and needs automated tests.

The import will use the author ID to identify the source of each tarball. Handling of multi-author distributions is still an open question, but such packages will either be merged based on the "02packages.details.txt" index or left to be addressed later.
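
The ordering step might look roughly like the sketch below: group one author's tarballs by distribution, order each group by tarball timestamp, and set aside anything whose name does not parse. The filename pattern (`<dist>-<version>` plus a common archive extension) and the function names are assumptions for illustration, not the actual import tool.

```python
import re
from collections import defaultdict

# Assumed filename shape: <dist>-<version>.<archive extension>
_TARBALL = re.compile(
    r"^(?P<dist>.+)-(?P<version>v?\d[^-]*?)\.(?:tar\.gz|tgz|tar\.bz2|zip)$"
)


def parse_tarball(filename):
    """Return (dist, version), or None for names that do not fit the pattern."""
    match = _TARBALL.match(filename)
    if match is None:
        return None
    return match.group("dist"), match.group("version")


def import_plan(releases):
    """Group (filename, mtime) pairs by distribution, oldest release first.

    Returns (plan, skipped): plan maps dist -> [(mtime, version, filename)],
    skipped lists the filenames that need manual attention.
    """
    plan = defaultdict(list)
    skipped = []
    for filename, mtime in releases:
        parsed = parse_tarball(filename)
        if parsed is None:
            skipped.append(filename)
            continue
        dist, version = parsed
        plan[dist].append((mtime, version, filename))
    for entries in plan.values():
        entries.sort()  # timestamp order, not version-string order
    return dict(plan), skipped
```

Sorting on the timestamp rather than the version string avoids surprises such as "1.10" sorting before "1.9"; each entry in a group would then become one commit and one tag.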

It appears that the fsfs svn data can be hacked to enter the tarball timestamp as the commit date. If this does not work and no better way is found, the tarball date will simply go in the commit message.
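
One possible alternative to editing the fsfs data directly, sketched here and not something the proposal commits to, is to set the standard `svn:date` revision property after each commit. This only works if the repository's pre-revprop-change hook permits the change. The helper names are hypothetical; the code only formats the timestamp and builds the command line, it does not run svn.

```python
from datetime import datetime, timezone


def svn_date(epoch_seconds):
    """Format a Unix timestamp in the UTC form used by the svn:date property."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def set_date_command(repo_url, revision, epoch_seconds):
    """Build the argument list for backdating one revision.

    The caller would hand this list to subprocess.run(); the repository
    must have a pre-revprop-change hook that allows svn:date edits.
    """
    return [
        "svn", "propset", "svn:date", "--revprop",
        "-r", str(revision),
        svn_date(epoch_seconds),
        repo_url,
    ]
```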

Iterations and Test Cases

The details of the import may require a few passes through the entire data set to get all of the kinks out of the procedure and get everything to look right. The initial version bailed after 17 hours, and a full load-in was estimated at 36 hours by extrapolating from that. The first task with the import tool is therefore to break the process into logged and restartable pieces so that trouble spots can be incrementally addressed while the run-through continues.
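
A minimal sketch of "logged and restartable", assuming one JSON line per processed item in an append-only log file: on restart, items already logged as successful are skipped and failed ones are retried. The log format and the `import_one` callback are placeholders, not the actual tool's interface.

```python
import json
import os


def load_done(log_path):
    """Return the set of items that a previous run logged as successful."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                if record["status"] == "ok":
                    done.add(record["item"])
    return done


def run_restartable(items, import_one, log_path):
    """Import each item once, logging every outcome; return this run's failures."""
    done = load_done(log_path)
    failures = []
    with open(log_path, "a", encoding="utf-8") as log:
        for item in items:
            if item in done:
                continue  # finished in an earlier run
            try:
                import_one(item)
                status, detail = "ok", ""
            except Exception as exc:  # keep going; the log records the trouble spot
                status, detail = "failed", str(exc)
                failures.append(item)
            record = {"item": item, "status": status, "detail": detail}
            log.write(json.dumps(record) + "\n")
            log.flush()
    return failures
```

Because a failure is logged and skipped instead of aborting the run, one bad tarball no longer costs the hours already spent, and the list of failures doubles as a starting point for the edge-case test list.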

The troublesome packages and edge cases will lead to changes in the code, either punting them as unsupported conditions or fixing the algorithm. In either situation, the edge cases will be collected into a list (or hash) for feeding the test suite.

Once all of the trouble spots in the backpan of packages have been addressed, the import may need to be run through one full pass to ensure that any changes to the layout have been consistently applied throughout the repositories.

Ongoing Imports

This will share some policy and procedure code with the initial import tools, but will have a different control flow (triggered rather than batch) and needs logging/etc suitable for unattended runs.

HTTP Namespace Experiments/Notes

Configuring per-dist views based on the module index should be relatively straightforward. Adding to such views in a running server might not be particularly elegant, but a 'graceful' restart should suffice. I plan to implement at least this one view and explore other possibilities. The main goal with this portion is to prove the namespace concept and whet the imagination of the community.
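
To make the per-dist view concrete, here is one way the mapping could be derived from the module index (`02packages.details.txt` on CPAN mirrors, whose data lines hold a package name, a version, and an author-relative tarball path). The `/dist` view root, the `release_dir` parameter, and the use of Apache's `Redirect` directive are illustrative assumptions; the proposal itself talks about mod_proxy experiments, and a redirect is only the simplest stand-in.

```python
import re

# Assumed tarball naming: <dist>-<version>.<archive extension>
_DIST = re.compile(r"^(?P<dist>.+)-v?\d[^-/]*\.(?:tar\.gz|tgz|tar\.bz2|zip)$")


def dist_owners(index_text):
    """Map distribution name -> author prefix such as 'E/EW/EWILHELM'.

    Expects the index header to be separated from the data by a blank line.
    """
    _header, _, body = index_text.partition("\n\n")
    owners = {}
    for line in body.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        parts = fields[2].split("/")
        if len(parts) < 4:
            continue
        match = _DIST.match(parts[-1])
        if match:
            owners[match.group("dist")] = "/".join(parts[:3])
    return owners


def view_lines(owners, release_dir, base="http://svn.cpan.org", view_root="/dist"):
    """Emit one 'Redirect' line per distribution, sorted by name.

    release_dir is whatever directory inside each author repository
    holds the imported releases.
    """
    return [
        "Redirect %s/%s %s/%s/%s/%s/" % (view_root, dist, base, owners[dist], release_dir, dist)
        for dist in sorted(owners)
    ]
```

Regenerating this file when the index changes and then doing a 'graceful' restart would match the approach described above.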

Documentation

POD/HTML documentation about the system and common use-cases will be provided. This will include an explanation of the access policy and recommended directory layout. Administrative/interface tools and APIs will be documented in POD.

Hosting and Maintenance

Initial hosting will be provided by me, though in the event of excessive traffic, I may have to throttle the bandwidth or restrict access to humans who request it. If such issues arise, the project will be a definite success and should easily find a permanent home with adequate bandwidth.

Publishing Results

The code, notes, and documentation will be available throughout the project in the following repository, and I will blog weekly updates to use.perl.org.

http://scratchcomputing.com/svn4cpan

Project Schedule:

Estimated time: 8-10 weeks

Schedule: Starting around June 1, 2008

  • week 1: rework import tool into distributable/restartable scheme; svn name & date features; run import, log trouble spots
  • week 2: work around or skip troublesome tarballs
  • week 3,4: full, clean import; set up PAUSE watch/fetch/import; solicit feedback
  • week 5: documentation, automated tests
  • week 6-8: monitor usage and triage automatic setup; experiment with mod_proxy configurations and 'views'

The schedule is more heavily loaded to the front because the main import will take a large chunk of concentrated work mixed with long run times and because CPAN is a moving target. I plan to work on the project 3-4 full days per week until the automatic imports are running. At that point, the gap between the history and "today" cannot grow and interruptions are not penalized by extended run times.

The final weeks are at a slower pace to allow for communication latency. The documentation will be driven by feedback and questions from the community. The triage and monitoring will be about one hour per day.

By funding this proposal, TPF will be allowing me to focus my development in large enough chunks to get ahead of the continually growing import task. In smaller increments of work, the run time overwhelms the free time per day and development gets postponed.

License:

All modules and utilities will be published under the Perl "GPL 2 / Artistic 1" license.

Acknowledgements:

Michael Schwern helped formulate the initial idea and has provided feedback and realism. Jesse Vincent had a basic version of an import tool and several helpful suggestions. Various members of the Portland Perl Mongers and the Perl community at large have provided useful encouragement and additional feedback, and chromatic says I can quote him as saying "I'm looking forward to it."

Scott Kveton, former director of OSU's Open Source Lab, pledged hosting and bandwidth almost two years ago. It is possible that the OSL is still willing to host this.

Bio:
I am the author of several CPAN modules, and a contributor to projects which include Module::Build, Test::Harness, Jifty, Moose, PAR, and Inline. I am also an active participant in several areas of the Perl community including wxPerl, parrot, perl-qa, and module-authors.

As president of the Portland Perl Mongers, I have organized monthly meetings (often with a speaker) for the past two years. I frequently give perl-related talks at pdx.pm, the local Linux group, and OSCON.

For Google's SoC 2008, I organized the community volunteers, submitted the org application and will be administering the TPF's 6 projects throughout the summer.

I was the architect and main developer for the dotReader open source book reader, founder of the VectorSection open source graphics format converter, and developer of the Perl-based imaging system responsible for A. Zahner Co's fabrication and installation of the patterned copper cladding on the deYoung museum in San Francisco.

Amount/Resources Requested:
$3000.00

DNS CNAME entry for svn.cpan.org

16 Comments

I like it! Two thumbs up!

Some questions that might be important for this proposal though:

How are CPAN authors to gain access to their repository? How are they to set up fine-grained access control on their own (to enable close collaboration, for instance)? Can they? While not important for setting up the repositories, it seems important to hash these things out so that they can be easily used.

One thing that might be considered for a future proposal, or as an extension to this one: given the existence of the svn repositories, could there be a more direct interface to PAUSE (i.e. use the repo to publish directly to PAUSE somehow)?

I've played around with this kind of thing before and realise it's not trivial. It does sound to me like this is two separate projects: a subversion repository that people can play with and a subversion repository containing historical CPAN information.

Once this is done, it sounds like a high IO hosting problem. This sounds like a separate part of the project. Have you considered asking the perl.org admins whether they could host it?

Please refrain from flamewars on this point, but the main perl repository is about to move to Git. Would it not make sense for this to use the same version control system?

Honestly I don't really care whether it's svn or git (although if it's svn it would be compatible with git and svk, but the reverse isn't true). There are already a lot of existing tools to work with SVN so that's definitely a plus.

But regardless, I think this would be a huge win. Having to set up a separate sourceforge or google code project for a little CPAN module seems like overkill to me. And more openness is always good for OS code.

I think this is a terrible idea.

It is an attempt to centralize and normalize the CPAN module process, which should specifically NOT be normalized. Svn vs. Git is only one part of this aspect.

Perl needs to diversify and de-standardize, rather than funnel towards being a monoculture.

A lot of people won't use this, simply because they're already happy with their own version control setup (myself included). The end result will be unpredictable - some people in, others out. Those who choose to retain their own setup will be forced to repetitively explain to others that no, they don't use the "central" repository. How tedious.

The idea of free version control for CPAN authors, however, is a good one. A "PerlForge", with good integration with the other standard community tools, would be very helpful for those who don't know how to or can't set up their own repository. (Although people didn't seem to like the idea so much four years ago.) Then again, it would also be a lot of work spent on duplication of an existing, successful system, SourceForge.

Those who have their own setup (myself included) are unlikely to switch to using this. That raises the tedious prospect of having to repeatedly explain to people that no, your repository isn't the central one.

The idea of free version control hosting for CPAN authors (PerlForge?) is a nice one, but has had a lukewarm reception before because it only serves as a lot of work spent on duplicating SourceForge, a successful existing system. The saving benefit might be from integration with RT for CPAN.

Plus, yes, there is the what-version-control-system discussion. Pick svn and the git users will see it as a retrograde step. Pick git and the svn users may be unconvinced about having to put in the time to convert their existing repositories.

The whole thing strikes me as a boondoggle, to be honest.

This idea is flawed, because it attempts to mix release management with development management. Those are two very different things, if you study it in more detail.

CPAN is a distribution network, where complete sets of software are finding their way to end-users. It's about consistency, authority, preservation, maintainability, transport.

SVN is for development communication between authors; it's about change, which will only confuse end-users. SVN does not have release management features.

Compare it to a book-shop: the reader of the book really doesn't want to be bothered by the fights between the author and his publisher about the book content (SVN/revisions), but wants to see the book on a bookshelf, have a nice print, be affordable, and with all pages in the right order (CPAN/releases)

'diff uploads/downloads' are useless (that's not the way Perl's installation tools work). 'file extracts' are already available via search.cpan.org/browse. Having old releases in SVN without annotation of the separate changes is no added value.

IMHO, this project is not an improvement of CPAN. However, it can be useful to be able to start an SVN easily, to work on any perl module. Hm... we have that provided by sourceforge and google. (I agree there is a need for improvement, but you are thinking too small!)

As Andy says, diversity is good. I would like to continue to have people using their own repositories. I would prefer a proposal for a sourceforge/trac-like system for those who can't have their own, but not complicating the CPAN system.

I vote yes; my only point would be to think about changing the one repo per author to one repo per distribution. This could allow a distro to be taken over by another user without having to copy and fork the history.

Also I see this as a great way to allow a user to track the current progress of any module as well as submit more current patches as needed.

[SVN vs GIT]
I think that Eric is right for picking svn; it's generic enough that everything can build from it. From the standpoint of this being CPAN you do not gain anything from this being GIT (or anything else). It seems that Eric's goal is to provide a central repo to consolidate BackPAN and provide tools for future development of any module. =IF= you want to pull a test branch then you could; you can also pull a local copy and play in whatever you want. SVN, for all of its issues, is a flexible enough system to allow for just about anything.

[Andy]
I would argue that CPAN is already a centralized and normalized system. What Eric is proposing is keeping everything that we already have but also adding the ability to use SVN if you choose.

[Earle]
I agree that there might not be a mass adoption, though I do not see that as a hindrance for this project. I see this more as a method to move CPAN to a versioned fs that just becomes a more flexible system if we want to use it. Because all the tools are at one location, you do not need to create a separate account, set up a second environment, etc. Most importantly, though, is that everyone who has ever submitted anything to CPAN currently would have everything set up for them already. If you want to use it cool, it's there, if not, no worries.

[Mark]
I agree that it could be confusing to a user that starts to look behind the screen, but I do not think that is Eric's intent. What I am taking away from this is that the current 'trunk' of any dist would be the same as the current tar method. Then any previous 'tag' would become all the previous tars. So I don't see the UI to CPAN needing to change much if at all.

The idea of having all of Backpan, unpacked, imported into a revision system isn't a bad one, but it will perforce fail on many levels when it can't track things like file renaming or movement, the semantics of which are lost to the tarballs stored online.

The idea of offering a per-user repository seems like a nice enough thing to do at first, but it's already offered by plenty of other hosts, who offer options other than one VCS. They also allow per-project permissions, which would be required for collaboration. That would require adding more ACLs to the system to track "projects." Maybe those are distributions, but there are currently no user-dist permission mappings.

There is also the "puppies make bad presents" aspect to this grant request. It covers building something that then must be continually operated and allowed to grow as needed, without specifying any sure backing.

I don't see much benefit, but I see plenty of costs.

The great thing about this system is that it has different things to offer depending on what you want from it.

The unfortunate thing is that it has different things to offer depending on what you want from it. So, some comments say "we don't need this history", some say "we don't need this version control", etc. No single aspect requires complete buy-in from anyone for it to be useful in many ways to many people.

This is not, however, a trac/sourceforge/etc. Those things (and more) could be built on or linked to it, but they are all one layer up. This is only a versioned filesystem hosted on HTTP with some data arranged in a useful way.

The "I have my own repository" issues are addressed (though in not such great detail as earlier revisions.) Yes, even if your repository is git.

Andy: this does not impose anything on the "process". If anything, its potential as an aggregation technology will enable more discoverable decentralization. The issues of linking to external repositories played a bigger role in a previous (too expensive) proposal, so I have left them out in the hope of at least getting *something* started. In any case, I think you will still find it valuable even if the author-writable edge of the sword does nothing beyond providing a sort of registrar for external repositories.

The git vs svn question is a tough one. I have given it much thought. I decided that svn would be a better fit. Nothing in this work precludes creation of a parallel repository built on git. Further, the majority of this work involves unpacking the backpan -- which would also need to be done by anybody who wants to make it out of git. Yet, here I am proposing to get it to a point where it could easily be cloned directly into git.

Nit note: the "-CPAN-/$dist/" bit appears to be suffering from wiki markup (strikethrough) interpolation.

--Eric

I am concerned about long-term maintenance. It seems to me that the work of doing the import once is likely to be less in the long-run than the work of maintaining the up and running repository, answering questions about how to use it, finding hosting, renegotiating hosting periodically, etc.

If someone can line up a promise to take on that work long-term, I'd vote for this. Otherwise I'm planning to vote against it.

I think having an optional central place to version control all the code on CPAN is a good idea.
I don't think there is a need for importing code from backpan though and I don't think there should be a mandated way of tracking releases. Version control and release management are different as Mark has already pointed out.

I have my own SVN repository and what I think is missing is an easy way to allow other CPAN authors to contribute to my projects. AdamK has solved that in his SVN repository, others are using Google Code or Sourceforge.

I think it would be great if I could use my PAUSE ID to set up a repository on svn.perl.org and then to easily allow other CPAN authors to commit to my projects.

Gabor

I really love this idea and agree with some of the former commenters that an optional svn repository would probably be optimal.

My biggest problem with other project sites like SF or Google Projects is that they are aimed at larger projects, while CPAN is very modularized. Google for example reserves (IIRC) 100MB svn space for you, and limits you to a handful of projects per account. But for most libraries, I don't need that much space, but a better way to manage a large set of distributions with very different amounts of changes over time.

I also think this is a great idea. If people are
afraid of centralization, just make it optional.
I think many people would be glad to have a standard
rcs for CPAN. Why should people be dependent on other
project hosting services if CPAN could offer it?

There are already more ways to build a module, and
also two ways to install (CPAN, CPANPLUS). While
TIMTOWTDI is good, can't there be a default way for
hosting your CPAN code - the recommended way for those
who would benefit? A standard way doesn't need to
forbid other ways.

CPAN is already a killer app. If people who want to
write a module and don't already know where to host it,
see, that CPAN offers it, it would be even better.

Git would be a much better target for this because:

a) there are already partial conversions underway

b) it's possible to easily import detailed history from other revision control systems to git; the same is simply not possible with svn

c) it is a more flexible way forward.

svn has seen its day, let's move forward please.

About this Entry

This page contains a single entry by Alberto Simões published on May 1, 2008 9:00 PM.