2010 Grant Proposal: Enhancing Perl 6 Pattern Matching

| 7 Comments

Enhancing Perl 6 Pattern Matching with Ideas from Snobol4 and Other Sources

Name:

Morris M. Siegel, Ph.D.

Email:

[hidden email]

Amount Requested:

$3000 (negotiable)

The substance of my proposed alternative pattern-matching specification has already been essentially worked out as a self-funded research project, as it were. I felt this project was of such importance that it was worthwhile giving myself a sort of sabbatical to develop it. It has taken rather longer than I originally anticipated, and my personal funds have dropped to an uncomfortably low level. I can no longer afford to keep concentrating my efforts on this project on an unfunded basis.

My grant request is intended to retroactively fund some of my past development, and as well to enable me to continue focusing my efforts on the project. (The main task remaining is to write it up carefully and precisely, but there are some aspects that still need to be thought out.) Without a grant I would have to relegate the project to my limited spare time, and by the time it would be done it could well be too late for serious consideration.

I selected the figure of $3000 since I understand it is the upper range of typical Perl Foundation grants. I think the merits of the project, plus its time requirement (past and future), would justify a larger sum if the money is available. In addition, a larger sum would enable me to spend more time on fleshing out those ideas that still need to be thought out.

Synopsis

One of the chief reasons for Perl's popularity is its regex pattern-matching facility; no other part of Perl has been made into a stand-alone package (PCRE) or borrowed so extensively by other languages. The very name "Perl" alludes to the fundamental nature of pattern matching: the "E" of the acronym "PERL" stands for "extraction," which mostly means pattern matching. Perl 6 pattern matching is substantially more powerful than that of Perl 5.

Snobol4 is arguably the first widely-available language providing a pattern-matching facility, and despite its age, and despite all the new features of Perl 6, there are still some aspects in which Snobol4 pattern matching is more powerful than that of Perl 6.

Aside from the above, the current specification for Perl 6 pattern matching, Synopsis 05 (available as http://svn.pugscode.org/pugs/docs/Perl6/Spec/S05-regex.pod, http://perlcabal.org/syn/S05.html, or http://perl6.cz/wiki/Synopses/S05), is quite complicated to learn and remember, as is evident simply by reading through all of S05. Moreover, many of its multitudinous capture mechanisms lead to code which is brittle, hard to read and maintain, and often non-mnemonic. These complications and problems do not conform to the practical usability which is supposed to be a hallmark of Perl (the "P" of the acronym), and are not a necessary price that must be unfortunately be paid for the power.

The purpose of this project is to formulate an alternative specification for Perl 6 pattern matching that is (1) enhanced by ideas inspired by Snobol4 and other sources but adapted to Perl's idiom, (2) simpler to learn and use, and leading to code which is easier to read and maintain, and (3) at least as powerful, and arguably more powerful, than the current specification.

Benefits to the Perl Community

If pattern matching is enhanced as indicated by the preceding paragraph, then potentially all Perl programmers needing to do non-trivial pattern matching will benefit.

In addition, the very acceptance of Perl 6 in the wider computing community could well be facilitated, since I think it quite probable that many would be put off by the complexity of the current pattern-matching specification. Along these lines, enhanced pattern matching would be more likely to inspire the adaptation of PCRE to Perl 6 and similar imitation by other languages, and thereby benefit not just the Perl community but the entire computing community.

I am well aware that much work has already been done in the implementation of the current specification, and that much development has been done based on the current specification, notably Larry Wall's STD grammar for Perl 6, and also the grammars for the various Parrot-based languages. As such, it is admittedly bold to suggest at this late stage that the specification be significantly revised. However, I believe the advantages afforded by the alternative specification warrant its serious consideration. (In a conversation I had once with Larry Wall, he stated that although he does not agree with all my ideas, he finds it worthwhile to listen to them.)

At YAPC|10 in Pittsburgh, on Jun 24, 2009, I gave a talk entitled "Enhancing Perl 6 Pattern-Matching with Ideas from Snobol4" (http://yapc10.org/yn2009/talk/1988), whereby I intended to present my ideas to the Perl community and get feedback. Unfortunately, I did not time my talk well, and by the time I finished presenting an overview of Snobol4 (to provide background for my ideas) and a sampling of problems with the current specification (to justify revising it), there was not enough time left to actually explain the alternative specification. Conversations with other YAPC participants did reveal interest in hearing my ideas. In particular, in discussions with Patrick Michaud, who is the chief if not sole implementer of the current specification, he (1) acknowledged that the core Perl 6 developers realize that S05 is hard to read (Larry Wall confirmed this), (2) complimented me on my examples illustrating the brittleness and other problems of the current capture mechanism, and (3) stated that if an alternative specification were even better than the current one, he would be happy to implement it.

Technically it would be possible for both the current and the alternative specifications to coexist in the same implementation, so on-going Perl 6 development efforts could proceed unimpeded while the alternative specification was being implemented and refined. If at some point it were decided to actually replace the current specification with the alternative (which is the ultimate intent), I believe the conversion of existing code should not be too laborious, so the initial release of Perl 6 would not be unduly delayed. There is some precedent for this, viz. the two different threading models of Perl 5.

Deliverables

This project should initially result in a document published on the Web presenting (1) an overview of the relevant parts of Snobol4, to help motivate the Snobol4-inspired features of the alternative specification, (2) a discussion of problems with the current specification, and (3) the alternative specification, in fairly complete detail.

After publication, a notice would be emailed to the appropriate mailing lists (perl6-language, yapc, snobol, perhaps others) informing subscribers of the existence of the document and inviting feedback. Based on feedback, the alternative specification might be revised. After a few iterations of this, assuming sufficient interest expressed by the Perl community, the alternative specification should be stable enough to proceed to implementation and further refinement as appropriate.

Project Details

The alternative specification document would assume the reader knows Perl 5 and has read S05 at least cursorily, and would rely on S05 to provide details on those features common to the alternative and current specifications. However, the alternative specification has a sufficiently different flavor from the current one that the document would have to present many ideas from scratch, so it should be reasonably accessible even to someone whose understands pattern matching conceptually but is unfamiliar with S05.

It is difficult to go into more detail on the content of the alternative specification document without summarizing it, which I feel is beyond the scope of a grant proposal. The "inch-stones" listed below are far too terse to give the reader any notion of the content. However, to provide some sort of glimpse of the content, we list the following features of Snobol4 pattern matching that are absent from Perl 6 as it stands now but would be present in alternative pattern matching:

(A) Compile time vs. build time vs. match time

The pattern structures used in Snobol4 pattern matching are not built at compile time. Rather, at run time, a pattern structure is built as a result of evaluating a pattern-valued expression; once built, this structure can be used to do pattern matching, either immediately or later on.

As a result of having two distinct run-time operations, pattern building and pattern matching, the Snobol4 programmer has the ability (1) to chose during which operation to bind the value of pattern components (e.g. LEN(N) vs. LEN(*N)), and (2) to define new pattern-matching functions in a convenient high-level manner, without having to resort to writing macros or low-level code.

Understanding this three-way distinction among compile time, build time, and match time is crucial. On one hand, a careless or novice programmer who conflates compile time and build time can inadvertently write a program that inefficiently reconstructs the same pattern numerous times (although to mitigate this an optimizing compiler can precompute constant patterns or subpatterns). On the other hand, this distinction encourages a mind-set and facilitates a programming style in which the programmer writes pattern-valued functions to effectively extend the language of pattern-matching expressions, since the execution of these functions takes place during build time and does not cost anything at match time. Writing such pattern-valued functions seems at least conceptually easier than writing macros, and the ability to do so enhances the expressiveness of the programming language.

If the equivalent distinction existed in Perl 6, then not only would the expressiveness of pattern notation be increased, but also some of the complexity of the core pattern-matching specification could be offloaded to modules that define pattern-valued functions or methods.

(B) Conditional capture

In the Snobol4 operation of conditional value assignment (binary "."), assignment ("capture," in Perl terminology) takes place only if the value is captured from a subpattern that is part of an ultimately successful match. That is, in the semantics of Snobol4 pattern matching, there is a distinct "conditional" phase following the successful conclusion of a match and prior to the substitution phase, in which conditional value assignments (which may include arbitrary side effects) are carried out. This phase, which currently has no analogue in Perl whatsoever, enables the Snobol4 programmer to write patterns that backtrack without having to undo side effects performed by alternands that initially succeed but are later backtracked out of. Although unrestricted backtracking can result in unacceptably slow performance, limited backtracking can be quite efficient, and reworking a pattern to avoid backtracking entirely can be tedious and result in code that is less natural. If Perl 6 had a similar conditional phase, then the programmer would no longer have to rid his patterns of backtracking in order to avoid performing inappropriate side effects. This would clearly facilitate the task of formulating patterns, especially complex ones.

(C) Miscellaneous primitive patterns

Snobol4 has some useful primitive patterns which cannot easily be emulated in Perl 6: TAB and RTAB, which move the cursor to a given position from the beginning or the end of a string, and (in the Snobol4+ dialect) ATAB, ARTAB, and LEN, which can move the cursor to the left as part of normal pattern matching (not as look-behind). Unlike (A) and (B) above, these do not reflect a fundamental difference between the Snobol4 and Perl 6 pattern-matching models, and thus could be included (if desired) even into the current specification.

A significant part of the challenge of adapting these features for inclusion in Perl 6 lies not merely in altering the notation to conform to the style of Perl 6, but rather in appropriately generalizing the features themselves to harmonize with the rest of Perl 6 pattern matching, and in particular to accord with Perl features absent from Snobol4 such as lexical scope.

Inch-stones and Project Schedule

Experience with my dissertation and other long papers I have written has shown that writing up even already-worked-out ideas is more time-consuming than one anticipates, so I have tried to be conservative in the timing estimates for the inch-stones listed below. The estimates are in units of work days and appear in curly braces following each milestone. The sum total comes to 46 work days, or (allowing for slippage) about 10 work weeks. Taking into account some personal obligations during this period, I believe the essential deliverable -- the initial specification document -- could be completed in three elapsed months, which could begin at once. How much time would be needed after that for revision would obviously depend heavily on the promptness, quantity, and content of feedback.

I mentioned above that although the ideas are essentially worked out, there are still some aspects needing further reflection. They are: (a) verifying conformity with the other Synopses (which is non-trivial, given how voluminous and dense the Synopses are); (b) fleshing out details of a possible Pattern role; and (c) providing additional examples of patterns written according to the alternate specification. These are the issues that, given a larger grant, could get extra attention. Even if the project is expanded to include them, the initial document would not be delayed: I think it important that the Perl community be able to begin considering the alternative specification as soon as feasible, and any expansion of the project could be done thereafter while feedback would be (hopefully) received and considered.

The inch-stones of the specification document are:--

I. Summary of salient parts of Snobol4 {2}

II. Problems with the current specification of Perl 6 pattern matching (S05); justification for considering an alternate specification {2}

III. The alternative specification

0. influence of prior work; disclaimer: possible Perl5ish spirit; perhaps could be simplified {.4}

1. terminology: "pattern", "special form", "subpattern", "subrule", "P6c", "P6a" {.3}

2. overview of model: data structure, with arbitrary embedded values, that acts like code during pattern matching {1}

2.1. pattern code vs. normal code (PE vs. NE [PNE, listNE, numNE] {.2}

2.2. p{PE}, p/PE/, /PE/, p(@args):attrs{PE}; perhaps pat{PE} or pattern{PE}; rule, token {1}

2.3. incorporation (similar to Lisp quasiquotation) {.3}

2.4. compile time vs. build time vs. match time {.2}

2.5. named patterns (declared rather than assigned); build time at UNITCHECK/INIT/BEGIN {.2}

2.6. substantiation, persubstantiation {.2}

3. matching a string, :i etc., :approx (cf. agrep, TRE) {.4}

4. matching a Boolean (True, <null>, False, <fail>), <do> {.4}

5. matching a number, :fuzzy {.3}

6. matching a CharSet (O-O character classes -- <[ ... ]>; cf. Icon charsets) {1.3}

7. matching a [sub]pattern (primitive or composite) {.5}

8. matching a closure {.3}

9. parametrized patterns; <bind> {.5}

10. quantification with separation {.2}

11. scoping [sub]patterns: [ ... ], <LABEL>:[ ... ] {.5}

11.1. unique properties: $/ (pattern-local, not normal-local), emission, conditional emission {.5}

11.2. possible "minor scope": e.g. (:i PE) vs. [:i PE] {.1}

11.3. <my>, <state>, NORMAL:: {.3}

11.4. <abort>, undef {.1}

11.5. <commit>, :: {.2}

11.6. <emit>, @() or @EMIT; uncaptured emissions {.5}

11.7. <yield> {.3}

11.8. @($/.quant) {.2}

11.9. identities and other examples {.5}

12. :P5 {.2}

13. capture (of emission of scoping subpattern) {.5}

13.1. overview: data flow model, "~>" (if not ">>"), target lists, repetition, using coroutine logic to capture to next target not yet processed, :take(n) { 2 }

13.2. passive targets: scalars, arrays, slices, * {.5}

13.3. active targets: functions, code references, (perhaps) $*TAKE {.8}

13.3. active targets: plain blocks, pointy blocks, <do> {.8}

13.4. active targets: p{PE}, [PE], <named_pattern> {.8}

13.5. secondary targets, splitting and joining of data streams {.5}

13.6. chaining of subpatterns {.5}

13.7. examples {1}

14. conditional phase {.5}

14.1. "~>?": conditional capture {.5}

14.2. <do?>: conditional side-effect {.5}

14.3. <confirm> {.5}

14.4. behavior w.r.t. backtracking {.5}

14.5. examples {1}

15. <tell>, <seek>, <at>, :forward, :bidi {.5}

16. <reverse> {.2}

17. :decl (or :par or :parallel), :proc (or :seq or :sequential), (:proc) to establish sequence point {.8}

18. :canon, :quick {.5}

19. generic meaning of <name> and of <op args> {.3}

20. <before PE>, <after PE>, <!before PE>, <!after PE> {.5}

21. <eval>, :memo {.3}

22. {any @arr}, {cat @arr}, {all @arr}, :eval, :lazyeval, :memo {.6}

23. <to>, <from>, <(PE)> {.2}

24. <cut>; <subst> {.7}

25. Rationalized m, M, s {.3}

25.1. m {.4}

25.2. M {.1}

25.3. s {.3}

25.4. possible generalization of m {.3}

25.5. possible generalization of s {.3}

25.6. relation to "~~" {.1}

25.7. dwimmy laxity in placement of attributes for m, s, and p {.3}

26. OO interface {.2}

26.1 m {.4}

26.2 s {.3}

26.3 resumed matching after <yield> (coroutine-style) {.4}

27. :keepall {1}

28. :g -- top-level result is list/array of Match objects {.3}

29. <try>, <catch> {.6}

30. perhaps: <lazy PE> -- like {{ p{PE} }}, but when p{PE} is first evaluated it replaces (memoizingly) the closure { p{PE} } in the pattern structure. {.3}

31. <literal>, :eval {.4}

32. matching a Range {.3}

33. matching an arbitrary object: Pattern role (patternization method) {1}

34. summary of members of $/ {.5}

35. other notational differences {.1}

35.1. :sigspace should retain the colon (m:s, s:s, p:s). (If not, at least let m:s abbreviate to ms, not mm .) {.1}

35.2. {overlay(p,q)} instead of [p & q] or [p && q] {.2}

35.3. {juxta(a,b,c)} instead of [a ~ c b] {.3}

36. :panic {.1}

37. comparison of P6a with P6c; features of P6c not directly present in P6a -- i.e. handled differently or (like ~~ and <prior>) subsumed by other features {2}

38. more examples {3}

39. co-existence with current specification {.8}

40. concluding remarks {1}

Bio

I have a Ph.D. in Computer Science from Cornell University; my dissertation is entitled Proving Properties of Snobol4 Patterns. I have long been interested in regular expressions, context-free and other formal languages, and pattern matching.

As mentioned above, the ideas constituting my proposal are basically already worked out, and there was interest expressed by some participants at YAPC|10 in seeing them. As far as I know, no one else has proposed or intends to propose an alternate pattern-matching specification for Perl, so it would follow that I am the best person to do this.

7 Comments

Way back when MJD and I proposed SNOBOL-style pattern matching for Perl6. (I'll see if I can dig up the relevant document, though I'm not sure I still have it.) As I recall it didn't get a lot of traction, perhaps because we didn't present it in a compelling enough way.

I'm excited to hear that this is back on the table. SNOBOL's pattern matching was lovely to work with, and having something similar in Perl would be great.

Morris clearly feels strongly about his design, working on it for so long on purely personal initiative. The Perl 6 community interacts mainly in #perl6. It would be nice of Morris to appear there to discuss his ideas. Presenting a semi complete concept at YAPC|10 with more Snobol4 details than Perl 6 suggestions would not have been very persuasive.

Most programmers react very positively to the existing Perl 6 grammar and regex design (Synopsis 5). P6ers often rate it as a "killer feature". Support for abandoning it would be nil. So Morris proposes coexistence to enable gradual migration. S05 already allows for :Perl5 or :P5 regex modifiers, so an :Alt modifier should be no problem. But do enough people want it? The average reaction, as reported within the proposal, can be summarised as lukewarm.

The proposal withholds the details, instead touting Snobol4 features.

Feature (A) contrasts with S05 by adding a build time phase. S05 justified omitting one for execution efficiency. There are occasional questions in #perl6 about building patterns at runtime, so it merits
consideration.

Feature (B) contrasts with S05 by executing actions only after match time, eliminating the problem of undesired side effects. If Perl 6 needs that behaviour, it can just specify it in S05 now.

Feature (C) suggests a minor tweak of assertions, which S05 might also be stretched to incorporate, if desired.

I think Perl 6 has an adequate specification already, and critically lacks implementation. The current form of this proposal does not change this. Retroactive funding for unsolicited work is strange, it should be spent on removing project blockers. Since the proposal is negotiable, I recommend that some working proof of concept, preferably implemented in Perl 6 or Perl 5, be added as a deliverable.

Disclaimer: I am not a regex engine developer, and don't know SNOBOL ;)

I just want to highlight this bit of the grant - "My grant request is intended to retroactively fund some of my past development"

I don't think that's how the grant program is supposed to work.

Of course, if the proposed future work is sufficient, this is a non-issue, but it just seems weird.

To me, this is the most exciting grant proposal in the current batch. I've always appreciated Perl's regex abilities. The innovations in Perl6 impressed me greatly. I like the idea of Captures and the ability to unpack regex into a grammar. However, the whole design of regex stills seems somewhat ad-hoc. Though innovative, Perl6 regex do not have the appearance of a complete makeover and resulting tidyness reflected in other aspects of the Perl6 language specification. Regex is a central and extraordinarily complex feature. The community, and Perl6, could benefit from the experienced input of somebody like Siegel.

The language [Lua](www.lua.org) has a great pattern-matching library [LPeg](http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html) which follows too the Snobol tradition.

A module lpeg.re (written with lpeg) supports a more conventional regex syntax (http://www.inf.puc-rio.br/~roberto/lpeg/re.html).

LPeg has formal foundation (PEG : Parsing Expression Grammars) and some papers are available :
- A Text Pattern-Matching Tool based on Parsing Expression Grammars (http://www.inf.puc-rio.br/%7Eroberto/docs/peg.pdf)
- A Parsing Maching for PEGs (http://www.inf.puc-rio.br/%7Eroberto/docs/ry08-4.pdf)
- Parsing Expression Grammars: A Recognition-Based Syntactic Foundation (http://www.brynosaurus.com/pub/lang/peg.pdf)
- Packrat Parsing: Simple, Powerful, Lazy, Linear Time (http://www.brynosaurus.com/pub/lang/packrat-icfp02.pdf)

If there's no interest from the Perl 6 core developers and language designers, this seems like a non-starter.

I think Morris should start by working with them to come up with a proposal that they'd endorse.

I think this grant proposal is quite interesting, but one thing bothers me: It doesn't even mention multi dispatch as an/the alternative way to match non-strings, especially nested data structures.

When last I talked with Larry about tree matching, he said that multi dispatch and subsignatures were the currently intended way to do such a thing. I think it would be very beneficial to investigate a unification of patterns and multi dispatch. The proposed approach moves in the opposite direction, if I understood that correctly.

Leave a comment

About this Entry

This page contains a single entry by Alberto Simões published on February 6, 2010 5:00 PM.

2010 Grant Proposal: Perl Compiler was the previous entry in this blog.

2010Q1 Grant Proposal: perl core memory improvements is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.