internal fragmentation

a personal journal of hacking, science, and technology

Mercurial: extending revision specifiers and a dilemma

Wed, 26 May 2010 15:56 by mpm in mercurial

Right now we’re working on a way to greatly extend the way sets of revisions are specified in Mercurial. Currently most commands accept one or more ‘-r’ arguments that take single revisions and ranges of the form ‘start:end’, and revisions can be specified as a revision number, a relative revision number (e.g. -2), a changeset hash, a tag, a bookmark, a branch name, or built-ins like ‘.’, ‘null’, and ‘tip’.
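
For example, all of the following are accepted today (assuming the repository has a ‘1.0’ tag):

hg log -r 42          # a revision number
hg log -r -2          # a relative revision number, counting back from tip
hg log -r 1.0:tip     # a range from a tag to a built-in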

This is good enough for many purposes, but we often get requests for more flexibility. These requests usually show up as new options for the log command, which has now grown a small army of them. The more popular of these will also trickle over into other commands. This is less than ideal because it means we have lots of options and potentially lots of duplicated and hairy code.

The new plan is to have a simple query language that lets you easily specify more complex queries anywhere a revision option is taken. The query language looks like this:

  • all changes on a branch: branch(foo)
  • all descendants of 1.0: descendants(1.0)
  • all ancestors of 1.1 and descendants of 1.0:
    ancestors(1.1) and descendants(1.0) or 1.0..1.1
  • all ancestors of the working directory: ancestors(.) or follow()
  • all changesets from May containing the keyword ‘bug’: date(“may”) and keyword(“bug”)
  • all changesets from Bob or Alice: user(“bob”) or user(“alice”)
  • a complicated query:
    (user(“bob”) or user(“alice”)) and not merge() and 1.0..stable and keyword(“foo.py”) and not date(“may 10-may 20”)
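
Since these queries will work anywhere a revision option is taken, usage should eventually look something like this (hypothetical command lines; the syntax may still change before it lands):

hg log -r "branch(foo) and not merge()"
hg log -r 'date("may") and user("bob")'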

This just begins to scratch the surface of what we can do with this scheme. We’ve actually made quite a lot of progress on this in the past week. In fact we now have two nearly complete rival implementations, each with their own tokenizer, parser, evaluator, predicates, and query optimization strategy. The most interesting piece here is the approach to parsing.

My version uses a simple ‘Pratt parser’ that I wrote for this purpose. It makes parsing algebraic and query-like expressions very simple. On the upside, this parser is very small: the parser module itself is only 75 lines and can be driven entirely by data rather than by verbose code. On the downside, it’s rather mysterious to the uninitiated. The ‘grammar’ that I pass to the parsing engine looks like this in its entirety:

elements = {
    "(": (20, ("group", 1, ")"), ("func", 1, ")")),
    "..": (17, None, ("dagrange", 17)),
    "not": (15, ("not", 15)),
    ":": (10, None, ("range", 10)),
    "and": (5, None, ("and", 5)),
    "or": (4, None, ("or", 4)),
    ",": (2, None, ("list", 2)),
    ")": (0, None, None),
    "end": (0, None, None),
    "symbol": (0, ("symbol",), None),
    "string": (0, ("string",), None),
}

(The magic numbers here are operator precedence and associativity, with clauses specifying how the symbol works in infix or prefix situations.)
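
To give a feel for how this works, here is a stripped-down sketch of a table-driven Pratt parser in the same spirit (an illustration, not the actual Mercurial code; the table here uses uniform (binding strength, prefix rule, infix rule) triples and handles only a few of the operators above):

elements = {
    "not": (15, ("not", 15), None),    # prefix only
    "and": (5, None, ("and", 5)),      # infix only
    "or": (4, None, ("or", 4)),        # infix only
    "end": (0, None, None),            # binds nothing
    "symbol": (0, ("symbol",), None),  # leaf value
}

def parse(tokens):
    stream = iter(tokens)
    current = [next(stream)]  # one-token lookahead

    def advance():
        token = current[0]
        current[0] = next(stream, ("end", None))
        return token

    def expr(rbp):
        # prefix step: the next token must be able to start an expression
        kind, value = advance()
        prefix = elements[kind][1]
        if prefix is None:
            raise SyntaxError("unexpected " + kind)
        if len(prefix) == 1:                 # a leaf like ("symbol",)
            left = (prefix[0], value)
        else:                                # a prefix op like ("not", 15)
            left = (prefix[0], expr(prefix[1]))
        # infix step: keep consuming while the next token binds tighter
        while rbp < elements[current[0][0]][0]:
            kind, value = advance()
            name, bp = elements[kind][2]
            left = (name, left, expr(bp))
        return left

    return expr(0)

# "not a and b" parses as "(not a) and b" because "not" binds tighter:
tokens = [("not", None), ("symbol", "a"), ("and", None), ("symbol", "b")]
print(parse(tokens))
# -> ('and', ('not', ('symbol', 'a')), ('symbol', 'b'))

The key trick is the single expr() loop: each token’s binding strength decides whether it gets pulled into the expression currently being built or left for an enclosing call, which is all precedence and associativity really are.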

Henrik Stuart has taken a more conventional approach: he’s taken the full-featured Parsing.py LR(1) parser and built a BNF-style grammar on top of it. On the upside, everyone who knows anything about parsing knows how BNF works. On the downside, the parser itself weighs in at ~2000 lines of third-party, not-recently-updated code, with the Mercurial-specific query grammar adding a few hundred more.
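
For comparison, a BNF-style grammar for this query language might look roughly like the following (my sketch for illustration, not Henrik’s actual grammar; an LR generator would resolve the ambiguity here with separate precedence declarations rather than in the rules themselves):

expr ::= expr "or" expr
       | expr "and" expr
       | "not" expr
       | expr ".." expr
       | expr ":" expr
       | symbol "(" list ")"
       | "(" expr ")"
       | symbol
       | string

list ::= expr
       | list "," expr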

It’s clear that both versions are well up to the task at hand (and the other parsing-related task I’m looking at, hgweb’s templating engine), which means we have to weigh the strangeness and simplicity of the Pratt parser against the size and conventionality of the Parsing.py approach.

6 Comments

  1. […] This post was mentioned on Twitter by bboissin, Matt Mackall. Matt Mackall said: My blog is back with a post about new #Mercurial revision queries: http://www.selenic.com/blog/?p=613 […]

    Pingback by Tweets that mention Mercurial: extending revision specifiers and a dilemma « internal fragmentation -- Topsy.com — Wed, 26 May 2010 @ 17:17

  2. Using Parsing.py seems completely nutty when a hand-written recursive-descent parser for the query language would be so easy (and likely faster, besides).

    Comment by Bryan O'Sullivan — Wed, 26 May 2010 @ 17:48

  3. Yeah, I’m with Bryan — seems like a pretty simple hand-written recursive descent parser would be a better balance of the two approaches.

    Comment by John Mitchell — Wed, 26 May 2010 @ 19:04

  4. The “not-recently-updated” part of Parsing.py seems kind of irrelevant. Parser generators do not usually suffer from feature creep and changed expectations, so their code will usually stay stable for years and years.

    @Bryan: And before calling it nutty, it might be relevant to look at top-down parsing versus bottom-up parsing. Also, (LA)LR parser theory is much more widely understood and used.

    Granted, a Pratt parser is probably good enough for this domain; it will just require most people to read Pratt’s paper to understand how it works, whereas LALR is taught at most universities.

    In my opinion, both implementations are good, solid, and working. The important thing is probably to just pick either of them and work on getting the grammar/syntax just right.

    Comment by Henrik Stuart — Wed, 26 May 2010 @ 23:20

  5. Code age is interesting. Old code may be stale (-), but it may also be a low maintenance burden (+). Relevant either way.

    Ironically, I actually had Parsing.py in mind when I set off to write something simpler. Not only is it a fairly big chunk of code, but it’s from the everything-is-a-class school of design, which is one of the best ways available to bloat your source code. Several other Python parsers I’ve looked at are from this school. Further, as we discussed on IRC, I really think a Python parser should take its grammar as data, not as a huge mess of helper classes. (For extra points, it can take it in a language it can parse!)

    I think the way forward is:

    a) copy your tokenizer (already done)
    b) use my parser for now
    c) figure out if there are any grammar tweaks we want
    d) decide on the evaluator scheme
    e) add any extra predicates we want in the first cut
    f) start wiring in users

    If we get down the road and decide to switch parsers, we just need to switch to something that can generate a compatible parse tree for the evaluator.

    Comment by mpm — Thu, 27 May 2010 @ 00:11

  6. Is Parsing.py as easily extensible as the Pratt one? (I don’t know very well how Parsing.py works, but I remember that table-driven LALR parsers were very difficult, if not impossible, to extend live.)

    Note that you can lose a bit of the magic and gain a bit more extensibility by using a relation (and its closure) instead of numbers to define precedence (used on another project: http://wiki.event-b.org/index.php/Constrained_Dynamic_Parser#Defining_Compatibility_and_Precedences).

    You may also provide helpers used to build the ‘grammar’, such as:
    prefix(‘not’)
    infix(‘..’)

    Comment by matclab — Thu, 27 May 2010 @ 07:51
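
For illustration, helpers like the ones matclab suggests could be a couple of lines each on top of a table in the uniform-triple format sketched earlier (a sketch, not code from either actual implementation):

elements = {}

def prefix(op, binding):
    # a prefix-only operator: fill in just the prefix rule
    elements[op] = (binding, (op, binding), None)

def infix(op, binding):
    # an infix-only operator: fill in just the infix rule
    elements[op] = (binding, None, (op, binding))

prefix("not", 15)
infix("..", 17)
infix("and", 5)
infix("or", 4)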
