internal fragmentation

a personal journal of hacking, science, and technology

Unifying the Mercurial wire protocol

Thu, 15 Jul 2010 17:18 by mpm in Uncategorized

One Crazy Summer

The Mercurial ‘wire protocol’ is the set of commands that Mercurial uses to discover which changesets to push or pull and to actually exchange changesets with remote hosts. It was developed surprisingly rapidly, given that everything else in the project was also brand new:

  • April 20 2005: first Mercurial release
  • April 21 2005: static http pull support
  • May 10 2005: changeset bundle format created
  • May 11 2005: hgweb merged
  • May 12 2005: added smart http protocol to hgweb
  • May 25 2005: hg serve stand-alone server added
  • June 12 2005: crazy tunnel-http-over-ssh push hack
  • July 5 2005: ssh smart server support
  • July 6 2005: real ssh push

And that gets us up to the 0.6b release, less than three months after the 0.1 release.

Steady As She Goes

Because ssh push was good enough for most of our users at the time, not much happened with our wire protocol until about a year later when we finally got around to working on http push:

  • June 10 2006: add capability detection to protocol
  • June 15 2006: new push approach for ssh
  • June 20 2006: real http/https push
  • July 14 2006: add support for --uncompressed clones for fast networks

And again, the wire protocol hasn’t changed much since then. A branchmap command was added in May 2009 to improve named branch support, and pushkey was added a couple weeks ago to support bookmarks. We’ve also been careful to keep the protocol backwards compatible: Mercurial 0.5 can interoperate with even the most recent versions.

Preparing for the New Frontier

But there are a number of big projects on the horizon that will require extending the wire protocol, including:

  • a new, faster discovery algorithm
  • improved compression with parent deltas
  • support for shallow clones
  • support for lightweight copies

That means it’s high time the wire protocol got cleaned up! Currently, adding a new command to the protocol means adding code in four different places: the ssh client code, the ssh server code, the http client code, and the http server code. And those interfaces aren’t exactly pretty. Here’s what the old ssh and http server support for lookup looks like:

ssh:
    def do_lookup(self):
        arg, key = self.getarg()
        assert arg == 'key'
        try:
            r = hex(self.repo.lookup(key))
            success = 1
        except Exception, inst:
            r = str(inst)
            success = 0
        self.respond("%s %s\n" % (success, r))

http:
def lookup(repo, req):
    try:
        r = hex(repo.lookup(req.form['key'][0]))
        success = 1
    except Exception, inst:
        r = str(inst)
        success = 0
    resp = "%s %s\n" % (success, r)
    req.respond(HTTP_OK, HGTYPE, length=len(resp))
    yield resp

Fortunately, we’ve been careful to keep the ssh and http ‘calling conventions’ very similar. The bulk of the code is the same, but there are basic differences in how arguments are retrieved and how results are returned, and similar differences exist on the client side. Adding a new command meant writing one protocol implementation, then copying and pasting it into the other protocol and tweaking it.

Unification

So a long-standing wishlist item has been to unify these implementations, which is what I’ve been working on this week. Now the above command is implemented just once like this:

def lookup(repo, proto, key):
    try:
        r = hex(repo.lookup(key))
        success = 1
    except Exception, inst:
        r = str(inst)
        success = 0
    return "%s %s\n" % (success, r)

Note that the output is now delivered via return and the argument is passed in as a parameter. A central dispatcher handles argument unpacking and result delivery, and it has a table of all the function parameter lists (including support for variable-length argument lists!). Each function also takes a ‘proto’ argument, which gives it access to an abstracted interface for the http- or ssh-specific pieces, like bulk data transfers. The client-side support is unified in the same way, and both the client and server code now live in one file: wireproto.py. This will hopefully make future extensions to the wire protocol much less painful.
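
To make the dispatch idea concrete, here is a minimal sketch of a table-driven dispatcher in the same spirit. The names here (dispatch, getargs, FakeProto, FakeRepo) are illustrative stand-ins rather than the actual wireproto.py API:

# Illustrative sketch only -- not the real wireproto.py internals.

def lookup(repo, proto, key):
    # one shared handler, used by both the ssh and http transports
    return "1 %s\n" % repo.lookup(key)

# table of commands: name -> (handler, argument spec)
commands = {
    'lookup': (lookup, 'key'),
}

class FakeProto(object):
    """Stand-in for the per-transport (ssh or http) protocol object."""
    def __init__(self, args):
        self._args = args
    def getargs(self, spec):
        # each transport knows how to fetch the named arguments
        return [self._args[name] for name in spec.split()]

class FakeRepo(object):
    def lookup(self, key):
        return 'c0ffee'  # pretend to resolve a key to a changeset hash

def dispatch(repo, proto, cmd):
    func, spec = commands[cmd]
    return func(repo, proto, *proto.getargs(spec))

print(dispatch(FakeRepo(), FakeProto({'key': 'tip'}), 'lookup'))
# -> "1 c0ffee\n"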

The tricky part with a big change like this is, of course, how to break it down into pieces. Refactoring it all in one go might be possible, but it’s much safer to proceed step by step, testing along the way. So I started by writing a new ssh argument parser, hooking it into a generic command dispatcher, and hooking that dispatcher into the existing ssh infrastructure. Then I could move commands out of the ssh framework and into the generic framework one by one. Once I’d moved a bunch of the simpler ssh commands, I tackled merging them on the http side with a similar approach. Overall, the process took 18 changesets.

Mercurial: tracking down a performance regression

Fri, 28 May 2010 16:49 by mpm in mercurial

Today on the Mercurial-devel list, Jason Harris of MacHg noted:

In reference to matching the current version of Mercurial I noticed in my quick
testing that for quick status commands 1.4.2 was some 20% faster than 1.5.3.

That definitely shouldn’t happen. So I hopped over to a handy large repository and tried a quick test. I keep around builds of each Mercurial release for quick testing:

$ for i in `seq 5` ; do time hg153 st --all > /dev/null; done

real	0m1.754s
user	0m1.403s
sys	0m0.333s

real	0m1.749s
user	0m1.467s
sys	0m0.273s

real	0m1.741s
user	0m1.413s
sys	0m0.320s

real	0m1.761s
user	0m1.373s
sys	0m0.360s

real	0m1.772s
user	0m1.437s
sys	0m0.303s

$ for i in `seq 5` ; do time hg143 st --all > /dev/null; done

real	0m1.598s
user	0m1.283s
sys	0m0.297s

real	0m1.427s
user	0m1.070s
sys	0m0.343s

real	0m1.434s
user	0m1.127s
sys	0m0.293s

real	0m1.468s
user	0m1.127s
sys	0m0.307s

real	0m1.436s
user	0m1.100s
sys	0m0.313s

And indeed there’s a performance regression here. But as it happens, our normal performance-measuring tool doesn’t see any significant difference:

$ hg14 perfstatus
! result: 4156
! wall 0.664630 comb 0.660000 user 0.420000 sys 0.240000 (best of 15)
$ hg153 perfstatus
! result: 4156
! wall 0.674376 comb 0.670000 user 0.440000 sys 0.230000 (best of 14)

‘Perfstatus’ is part of contrib/perf.py, an extension we use to benchmark some of our most performance-critical code. It automates the process of choosing a reasonable sample size, much like Python’s timeit module. Note that it reports only the fastest time: practically all sources of noise in performance measurement (scheduling, power-saving, interrupts, cache effects) make things slower, not faster, so picking the fastest result focuses on the least noisy measurement.
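
The idea is the same as timeit’s repeat-and-take-the-minimum strategy; a rough sketch of it (not the actual contrib/perf.py code) looks like this:

import time

def best_of(func, tries=20):
    # run func repeatedly and keep only the fastest wall-clock time;
    # noise (scheduling, interrupts, caches) only ever adds time
    best = float('inf')
    for _ in range(tries):
        start = time.time()
        func()
        best = min(best, time.time() - start)
    return best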

So something interesting is happening outside the status fast path. And we also get a null result measuring normal startup time via ‘perfstartup’. This calls for running bisect on the Mercurial repo itself:

$ cd ~/hg
$ hg bisect -r  # reset the bisection state
$ hg bisect -b tip  # latest version is bad
$ hg bisect -g 1.4
Testing changeset 10549:b97736016273 (1359 changesets remaining, ~10 tests)
332 files updated, 0 files merged, 27 files removed, 0 files unresolved

After I mark a bad and a good revision, Mercurial jumps to a revision in the middle for me to test. To run a test, I simply hop over to another terminal, run my for loop from earlier, then hop back and report ‘hg bisect -g’ or ‘hg bisect -b’. After about 10 iterations, Mercurial reports:

The first bad revision is:
changeset:   10176:24ce8f0c0a39
user:        Augie Fackler <durin42@gmail.com>
date:        Thu Dec 31 17:19:30 2009 -0600
summary:     dirstate: don't check state of subrepo directories

Hmm, this changeset seems innocent enough. And the repo we’re testing against doesn’t use subrepositories anyway. But closer inspection finds:

+            subrepos = ctx1.substate.keys()

And looking at how substate works, we see that it’s doing:

if '.hgsub' in ctx:
    read('.hgsub')

Hidden in that ‘in’ operation is a call into the context class that demand-loads the manifest of the parent changeset to check whether a file is present. This doesn’t show up in the perfstatus benchmark because the first call primes our internal manifest cache. Reading a manifest is one of the more expensive operations in Mercurial’s core and can take a substantial fraction of a second on large repositories, so we generally try to avoid it. Can we avoid it here?
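
To illustrate the mechanism, here is a simplified sketch of the demand-loading behaviour (not the real changectx code; ChangeCtx and read_manifest are made-up names):

class ChangeCtx(object):
    """Simplified stand-in for a changeset context."""
    def __init__(self, repo):
        self._repo = repo
        self._manifest = None          # not loaded yet

    def manifest(self):
        if self._manifest is None:
            # expensive on large repositories
            self._manifest = self._repo.read_manifest()
        return self._manifest

    def __contains__(self, path):
        # "'.hgsub' in ctx" lands here and forces the manifest load
        return path in self.manifest()

The first ‘.hgsub’ membership test pays for a full manifest read; every later call hits the cache, which is exactly why a best-of-N benchmark never notices the cost.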

The answer is: yes, sometimes. When subrepos are in use, we do need to consult the committed .hgsub file. But in most cases, we can do a quick check in the dirstate (which we already have to demand-load for status checks) before going deeper:

         if working: # we need to scan the working dir
-            subrepos = ctx1.substate.keys()
+            subrepos = []
+            if '.hgsub' in self.dirstate:
+                subrepos = ctx1.substate.keys()
             s = self.dirstate.status(match, subrepos, listignored,
                                      listclean, listunknown)

(This patch is now in stable and will be part of the scheduled monthly minor release of 1.5.4 on June 1st.)

Mercurial: extending revision specifiers and a dilemma

Wed, 26 May 2010 15:56 by mpm in mercurial

Right now we’re working on greatly extending the way sets of revisions are specified in Mercurial. Currently, most commands can take one or more ‘-r’ arguments accepting single revisions and ranges of the form ‘start:end’, and a revision can be specified as a revision number, a relative revision number (eg -2), a changeset hash, a tag, a bookmark, a branch name, or a built-in like ‘.’, ‘null’, or ‘tip’.

This is good enough for many purposes, but we often get requests for more flexibility. These requests usually show up as new options for the log command, which has now grown a small army of them. The more popular of these will also trickle over into other commands. This is less than ideal because it means we have lots of options and potentially lots of duplicated and hairy code.

The new plan is to have a simple query language that lets you easily specify more complex queries anywhere a revision option is taken. The query language looks like this:

  • all changes on a branch: branch(foo)
  • all descendants of 1.0: descendants(1.0)
  • all ancestors of 1.1 and descendants of 1.0:
    ancestors(1.1) and descendants(1.0), or equivalently 1.0..1.1
  • all ancestors of the working directory: ancestors(.), or equivalently follow()
  • all changesets from May containing the keyword ‘bug’: date("may") and keyword("bug")
  • all changesets from Bob or Alice: user("bob") or user("alice")
  • a complicated query:
    (user("bob") or user("alice")) and not merge() and 1.0..stable and keyword("foo.py") and not date("may 10-may 20")

This just begins to scratch the surface of what we can do with this scheme. We’ve actually made quite a lot of progress on this in the past week. In fact we now have two nearly complete rival implementations, each with their own tokenizer, parser, evaluator, predicates, and query optimization strategy. The most interesting piece here is the approach to parsing.

My version uses a simple ‘Pratt parser’ that I wrote for this purpose. It makes parsing algebraic and query-like expressions very simple. On the upside, this parser is very small: the parser module itself is only 75 lines and can be driven entirely by data rather than by verbose code. On the downside, it’s rather mysterious to the uninitiated. The ‘grammar’ that I pass to the parsing engine looks like this in its entirety:

elements = {
    "(": (20, ("group", 1, ")"), ("func", 1, ")")),
    "..": (17, None, ("dagrange", 17)),
    "not": (15, ("not", 15)),
    ":": (10, None, ("range", 10)),
    "and": (5, None, ("and", 5)),
    "or": (4, None, ("or", 4)),
    ",": (2, None, ("list", 2)),
    ")": (0, None, None),
    "end": (0, None, None),
    "symbol": (0, ("symbol",), None),
    "string": (0, ("string",), None),
}

(The magic numbers here are operator precedence and associativity, with clauses specifying how the symbol works in infix or prefix situations.)
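
To give a feel for how such a table drives parsing, here is a toy Pratt parser for simple arithmetic expressions. This is a simplified illustration only; Mercurial’s actual parser.py and the revset grammar above differ in the details:

import re

def tokenize(text):
    # split the input into (type, value) pairs; whitespace is skipped
    for tok in re.findall(r'\d+|[()+*-]', text):
        if tok.isdigit():
            yield ('number', int(tok))
        else:
            yield (tok, tok)
    yield ('end', None)

class Parser(object):
    # token type -> (binding strength, prefix action, infix action)
    elements = {
        '(':      (20, 'group', None),
        '*':      (15, None, 'mul'),
        '+':      (10, None, 'add'),
        '-':      (10, None, 'sub'),
        ')':      (0, None, None),
        'number': (0, 'value', None),
        'end':    (0, None, None),
    }

    def __init__(self, tokens):
        self.tokens = tokens
        self.current = next(tokens)

    def advance(self):
        tok, self.current = self.current, next(self.tokens)
        return tok

    def parse(self, bind=0):
        ttype, value = self.advance()
        prefix = self.elements[ttype][1]
        if prefix == 'value':
            left = ('value', value)
        elif prefix == 'group':
            left = self.parse(0)
            assert self.advance()[0] == ')', 'expected closing paren'
        else:
            raise SyntaxError('unexpected token %r' % ttype)
        # keep consuming infix operators for as long as they bind more
        # tightly than the current context
        while (self.elements[self.current[0]][2] is not None
               and bind < self.elements[self.current[0]][0]):
            strength, _, name = self.elements[self.advance()[0]]
            left = (name, left, self.parse(strength))
        return left

print(Parser(tokenize('1 + 2 * (3 - 4)')).parse())
# ('add', ('value', 1), ('mul', ('value', 2), ('sub', ('value', 3), ('value', 4))))

The whole parsing engine is the parse method; everything specific to the expression language lives in the elements table, which is what keeps the real implementation so small.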

Henrik Stuart has taken a more conventional approach. He’s taken the full-featured Parsing.py LR(1) parser and built a BNF-style grammar on top of it. On the upside, everyone who knows anything about parsing knows how BNF works. On the downside, the parser itself weighs in at ~2000 lines of third-party, not-recently-updated code, with the Mercurial-specific query grammar adding a few hundred more.

It’s clear that both versions are well up to the task at hand (and the other parsing-related task I’m looking at, hgweb’s templating engine), which means we have to weigh the strangeness and simplicity of the Pratt Parser against the size and conventionality of the Parsing.py approach.