internal fragmentation

a personal journal of hacking, science, and technology

Mercurial Fellowship status for July

Sun, 1 Aug 2010 15:31 by mpm in mercurial (link)

A quick summary of what I’ve been up to in July:

  • 1.6 release
  • reviewed and merged 171 changesets
  • authored 29 changesets
  • 83 mailing list messages
  • major refactoring of Mercurial protocol support
  • in-depth investigation of the notorious issue1327
  • lots of bug triage
  • started 1.7 sprint planning
  • started developing 2010 user survey
  • 4 GSoC meetings
  • daily IRC office hours
  • interview with students from ETNA

Mercurial Fellowship status for June

Thu, 15 Jul 2010 17:53 by mpm in mercurial (link)

As I announced back in April, I’ve been putting together funding to work on Mercurial development and project leadership full-time for the next year without the usual distractions like finding support clients or working on embedded Linux products. That’s more or less come together, and I started working at the beginning of June. You can see most of my awesome sponsors here though there’s still one big one that the paperwork isn’t finished with yet that I’ll hopefully be announcing shortly.

Meanwhile, our project’s nonprofit home, the Software Freedom Conservancy has asked that I post some monthly status reports to give some visibility into what it is I actually do. So here’s a quick summary of what I did in June:

  • 1.5.4 release
  • 1.6 release code freeze
  • reviewed and merged 144 changesets
  • authored 61 changesets
  • introduced two major features (pushable bookmarks and revsets)
  • numerous bug fixes and other improvements
  • some wiki improvements (like the Trust page)
  • 5 weekly meetings with our GSoC students and related mentoring
  • 123 messages to the mercurial-devel list (out of 929)
  • 65 to the main list (out of 909)
  • 3 blog posts
  • daily “office hours” for developers and users on IRC
  • a Mercurial live video interview on FLOSS Weekly

The 1.6 release of course came out right on time on July 1st, and I’m pretty pleased with it.

Extending the Mercurial protocol with pushkey and bookmarks

Mon, 28 Jun 2010 14:17 by mpm in mercurial (link)

Historically, Mercurial can send and receive precisely one type of data when pushing and pulling: changesets. As just about everything that you’d ever want to push and pull is recorded in a project’s history (including tags and branches), this works out very cleanly and simply.

But recently (ok, 2008), we’ve borrowed a concept from Git for lightweight markers that aren’t part of history and leave no mark when they’re added, changed, or removed. Git calls these ‘branches’, though Mercurial already has a notion of branches that’s much more in keeping with the permanent markers used in other systems (and Mercurial’s general philosophy of permanent history), so Mercurial has named these ‘bookmarks‘. Like real bookmarks, Mercurial bookmarks mark your place, but can be moved around or removed without altering the book.

The only problem was that there was no way to push and pull these things in the protocol, so sharing your bookmarks with other users was tedious. And as sometimes happens with open source, we had a plan for a solution, but the implementor became too busy with other things for quite a while. Fortunately, extending Mercurial’s protocol isn’t terribly difficult so a couple weeks ago, I sat down to do it myself. Here’s how I did it.

First, I introduced a generic notion called “pushkeys”. Pushkeys are key/value pairs that can be pushed to the server and listed back. There are multiple namespaces and any service or extension can register a namespace. There’s also a special “namespaces” namespace that shows the registered namespaces. To set a key, you have to also send along the old value to avoid races. The whole of is about 30 lines long and is introduced in this changeset.

Next we add the pushkey capability and pushkey/listkey command support to the local repository, the ssh client and server, and the http client and server. Finally, the debugpushkey lets us test the new support from the command line:

$ hg debugpushkey ~/hg namespaces
bookmarks
namespaces
$ hg debugpushkey ~/hg bookmarks
test-bookmark    ff2704a8ded37fbebd8b6eb5ec733731d725da8a
$ hg debugpushkey ~/hg bookmarks test-bookmark ff2704a8ded37fbebd8b6eb5ec733731d725da8a ''
True
$ hg debugpushkey ~/hg bookmarks
$ 

The above shows off the bookmark namespace service for pushkeys, which is added in the next changeset. Now all that remains is to add client support for bookmarks. This has a few different pieces:

  • default push and pull behavior: update bookmarks that are already present on both sides
  • push and pull -B: copy over specific bookmarks while pulling
  • in and out -B: list unique bookmarks on the client or server

These changes can be found starting here. The improved bookmark support will show up in Mercurial 1.6, due to be released on July 1st, but you can download a copy for testing today. Note that you’ll also need bookmark support enabled on your server (services like Google Code and Bitbucket may take a bit to catch up!).

There’s more to be done with bookmarks support, and there are a lot of other interesting things that can be done pretty easily with the pushkey protocol (for instance, those horrible advisory locks CVS folks always think they want). I’m sure we’ll be seeing some interesting ideas appear here soon.

Mapping import graphs

Sat, 5 Jun 2010 11:42 by mpm in mercurial (tagged ) (link)

Occasionally when adding features to Mercurial, we’ll run into interesting circular dependencies. For instance, repositories want to know about subrepositories, which naturally want to know about repositories. Often this manifests as a traceback when trying to import. And because the main Mercurial tool uses demand-loading to avoid wasting time on unused imports, this often isn’t caught until we run the test suite (which has a couple non-demand-loading scripts) or even later.

Fixing these problems usually involves moving some code around to make the dependencies more hierarchical. But figuring out what the hierarchy looks like can be tricky: there are about 70 modules in the core. But with the help of Graphviz and the following quick helper script, we can start to get a handle on things:

import re, sys
ignore = set("".split())

watch = set()
for fn in sys.argv[1:]:
    watch.add(fn[:-3])

print "digraph G {"
for fn in sys.argv[1:]:
    f = open(fn)
    mod = fn[:-3]
    if mod in ignore:
        continue
    for l in f:
        m = re.match("^\s*import (.*?)( as .*)?$", l)
        if not m:
	    m = re.match("^\s*from (.*?)( import .*)?$", l)
        if m:
            s = m.group(1).split(",")
            for i in s:
                i = i.strip()
		if i in ignore or i not in watch:
                    continue
                print '"%s" -> "%s"' % (mod, i)
print "}"

Here’s what that looks like (click through to zoom in):

Ouch! With a bit of ignoring of some of the leaf nodes and minor modules, we can reduce it to this:

And finally, by using Graphviz’ tred filter (which removes edges that are reachable through other routes) and ignoring explicitly delayed imports, we can get down to something nice and tidy like this:

Now it’s fairly easy to find the trouble spots and think about how to fix them. Ideally all the arrows in this graph would point downward, but we can easily see a couple that point upward for help and templatekw.

Mercurial: tracking down a performance regression

Fri, 28 May 2010 16:49 by mpm in mercurial (tagged , , , ) (link)

Today on the Mercurial-devel list, Jason Harris of MacHg noted:

In reference to matching the current version of Mercurial I noticed in my quick
testing that for quick status commands 1.4.2 was some 20% faster than 1.5.3.

That definitely shouldn’t happen. So I hopped over to a handy large repository and tried a quick test. I keep around builds of each Mercurial release for quick testing:

$ for i in `seq 5` ; do time hg153 st --all > /dev/null; done

real	0m1.754s
user	0m1.403s
sys	0m0.333s

real	0m1.749s
user	0m1.467s
sys	0m0.273s

real	0m1.741s
user	0m1.413s
sys	0m0.320s

real	0m1.761s
user	0m1.373s
sys	0m0.360s

real	0m1.772s
user	0m1.437s
sys	0m0.303s

$ for i in `seq 5` ; do time hg143 st --all > /dev/null; done

real	0m1.598s
user	0m1.283s
sys	0m0.297s

real	0m1.427s
user	0m1.070s
sys	0m0.343s

real	0m1.434s
user	0m1.127s
sys	0m0.293s

real	0m1.468s
user	0m1.127s
sys	0m0.307s

real	0m1.436s
user	0m1.100s
sys	0m0.313s

And indeed there’s a performance regression here. But as it happens, our normal performance-measuring tool doesn’t see any significant difference:

$ hg14 perfstatus
! result: 4156
! wall 0.664630 comb 0.660000 user 0.420000 sys 0.240000 (best of 15)
$ hg153 perfstatus
! result: 4156
! wall 0.674376 comb 0.670000 user 0.440000 sys 0.230000 (best of 14)

‘Perfstatus’ is part of contrib/perf.py, which an extension we use to benchmark some of most performance-critical code. It automates the process of choosing a reasonable sample size, much like Python’s timeit module. Note that it reports only the fastest time: practically all sources of noise in performance measurement (scheduling, power-saving, interrupts, cache effects) only make things slower, not faster. Thus picking the fastest result lets you focus on the least noisy measurement.

So something interesting is happening outside the status fast path. And we also get a null result measuring normal startup time via ‘perfstartup’. This calls for running bisect on the Mercurial repo itself:

$ cd ~/hg
$ hg bisect -r  # reset the bisection state
$ hg bisect -b tip  # latest version is bad
$ hg bisect -g 1.4
Testing changeset 10549:b97736016273 (1359 changesets remaining, ~10 tests)
332 files updated, 0 files merged, 27 files removed, 0 files unresolved

After marking a bad and a good revision, Mercurial jumps me to a revision in the middle for me to test. To run a test, I simply hop over to another terminal and run my for loop from earlier, then hop back and report ‘hg bisect -g’ or ‘hg bisect -b’. After about 10 iterations, Mercurial reports:

The first bad revision is:
changeset:   10176:24ce8f0c0a39
user:        Augie Fackler <durin42@gmail.com>
date:        Thu Dec 31 17:19:30 2009 -0600
summary:     dirstate: don't check state of subrepo directories

Hmm, this changeset seems innocent enough. And the repo we’re testing against doesn’t use subrepositories anyway. But closer inspection finds:

+            subrepos = ctx1.substate.keys()

And looking at how substate works, we see that it’s doing:

if '.hgsub' in ctx:
     read('.hgsub')

Hidden in that ‘in’ operation is a call into the context class that demand-loads the manifest for the parent changeset to check if a file is present. This isn’t showing up in the perfstatus benchmark because the first call is priming our internal manifest cache. Reading a manifest is one of the more expensive operations in Mercurial’s core and on large repositories can take a substantial fraction of a second, so we generally try to avoid. Can we avoid it here?

The answer is: yes, sometimes. When subrepos are in use, we do need to consult the committed .hgsub file. But in most cases, we can do a quick check in the dirstate (which we already have to demand-load for status checks) before going deeper:

 if working: # we need to scan the working dir
-            subrepos = ctx1.substate.keys()
+            subrepos = []
+            if '.hgsub' in self.dirstate:
+                subrepos = ctx1.substate.keys()
 s = self.dirstate.status(match, subrepos, listignored,
 listclean, listunknown)

(This patch is now in stable and will be part of the scheduled monthly minor release of 1.5.4 on June 1st.)

Mercurial: extending revision specifiers and a dilemma

Wed, 26 May 2010 15:56 by mpm in mercurial (tagged ) (link)

Right now we’re working on a way to greatly extend the way sets of revisions are specified in Mercurial. Currently most commands can take one or more ‘-r’ arguments that take single revisions and ranges of the form ‘start:end’, and revisions can be specified as a revision number, a relative revision number (eg -2), a changeset hash, a tag, a bookmark, a branch name, or built-ins like ‘.’, ‘null’, and ‘tip’.

This is good enough for many purposes, but we often get requests for more flexibility. These requests usually show up as new options for the log command, which has now grown a small army of them. The more popular of these will also trickle over into other commands. This is less than ideal because it means we have lots of options and potentially lots of duplicated and hairy code.

The new plan is to have a simple query language that lets you easily specify more complex queries anywhere a revision option is taken. The query language looks like this:

  • all changes on a branch:  branch(foo)
  • all descendants of 1.0: descendants(1.0)
  • all ancestors of 1.1 and descendants of 1.0:
    ancestors(1.1 ) and descendants(1.0) or 1.0..1.1
  • all ancestors of the working directory: ancestors(.) or follow()
  • all changesets from May containing the keyword ‘bug’: date(“may”) and keyword(“bug”)
  • all changeset from Bob or Alice: user(“bob”) or user(“alice”)
  • a complicated query:
    (user(“bob”) or user(“alice”)) and not merge() and 1.0..stable and keyword(“foo.py”) and not date(“may 10-may 20”)

This just begins to scratch the surface of what we can do with this scheme. We’ve actually made quite a lot of progress on this in the past week. In fact we now have two nearly complete rival implementations, each with their own tokenizer, parser, evaluator, predicates, and query optimization strategy. The most interesting piece here is the approach to parsing.

My version uses a simple ‘Pratt Parser‘ that I wrote for this purpose. It makes parsing algebraic and query-like expressions very simple. On the upside, this parser is very small: the parser module itself is only 75 lines and can be driven entirely by data rather than by verbose code. On the downside, it’s rather mysterious to the uninitiated. The ‘grammar’ that I pass to the parsing engine looks like this in its entirety:

elements = {
    "(": (20, ("group", 1, ")"), ("func", 1, ")")),
    "..": (17, None, ("dagrange", 17)),
    "not": (15, ("not", 15)),
    ":": (10, None, ("range", 10)),
    "and": (5, None, ("and", 5)),
    "or": (4, None, ("or", 4)),
    ",": (2, None, ("list", 2)),
    ")": (0, None, None),
    "end": (0, None, None),
    "symbol": (0, ("symbol",), None),
    "string": (0, ("string",), None),
}

(The magic numbers here are operator precedence and associativity, with clauses specifying how the symbol works in infix or prefix situations.)

Henrik Stuart has taken a more conventional approach. He’s taken the full-featured Parsing.py LR(1) parser and built a BNF-style grammar based on it. On the upside, everyone who knows anything about parsing knows how BNF works. On the downside, the parser itself weighs in at ~2000 lines of third-party not-recently-updated code, with the Mercurial-specific query grammar adding a few hundred more.

It’s clear that both versions are well up to the task at hand (and the other parsing-related task I’m looking at, hgweb’s templating engine), which means we have to weigh the strangeness and simplicity of the Pratt Parser against the size and conventionality of the Parsing.py approach.

Mercurial 1.3.1 released

Thu, 23 Jul 2009 15:09 by mpm in mercurial (link)

Finally pushed a .1 release out the door. We had a surprising number of performance regressions related to the cleanups in 1.3, so everyone is encouraged to upgrade.

In the future, we’re going to have to find a way to get more real users on big projects to test pre-release code, which means: easily available nightly builds.

Mercurial 1.3 released!

Wed, 1 Jul 2009 17:59 by mpm in mercurial (link)

In keeping with our time-based release plan, Mercurial 1.3 is out the door. I’m pretty pleased with it.

Exciting things in the queue for 1.4 including an improved wire protocol and possibly a new hashing algorithm. And hopefully support for a couple non-native subrepository types.

Coming Soon!

Thu, 18 Jun 2009 18:55 by mpm in mercurial (link)

Looks like Mercurial: The Definitive Guide has finally hit paper. Looks like it’s set to ship to coincide with Mercurial’s planned 1.3 release on July 1st.

Mercurial on BigTable

Mon, 15 Jun 2009 03:34 by mpm in mercurial (link)

Also worth mentioning there’s now video for the recent Google I/O talk on adapting Mercurial to BigTable. A pretty decent introduction to both Mercurial internals and their BigTable technology (see Hbase from the Hadoop project for an open alternative).

Older Posts »