Performance with binary-heavy repositories

Matt Mackall mpm at selenic.com
Thu Aug 2 11:39:51 CDT 2007


On Thu, Aug 02, 2007 at 09:13:30AM +0200, Christoph.Spiel at partner.bmw.de wrote:
> Hi all!
> 
>         We are evaluating Mercurial for use in our department.
> Our typical projects contain a medium number of files (5000) and have
> a moderate size (200MB).  However, the projects are _very_ binary
> heavy, that is, around 2000 of the 5000 files contain pure binary
> data.

Mercurial's bdiff algorithm treats every file as a string of bytes and
breaks it on newline characters. For high-entropy "pure binary" files
like JPEGs, a newline byte should occur roughly once every 256 bytes,
so the average "line length" for a binary file is a bit longer than
for text, but not outrageously so.
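
As a rough check of that estimate, here is a minimal Python sketch
that splits a blob on newline bytes and reports the average "line
length". It is not Mercurial code: newline_stats is a made-up helper,
and os.urandom output is only a stand-in for compressed data such as
a JPEG.

```python
import os


def newline_stats(data: bytes) -> tuple[int, float]:
    # Split on the newline byte (0x0A), the same boundary a
    # line-based diff would use, and report the line count and
    # the average number of bytes per line.
    lines = data.split(b"\n")
    return len(lines), len(data) / len(lines)


# Assumption: uniformly random bytes approximate compressed data.
random_blob = os.urandom(1 << 20)  # 1 MiB
count, avg = newline_stats(random_blob)
print(f"{count} lines, average length {avg:.1f} bytes")
```

With uniformly random bytes, each byte has a 1-in-256 chance of being
a newline, so the printed average should land close to 256.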

What matters performance-wise is the number of newlines, since that
sets the number of lines to compare. The worst-case running time of
the algorithm is O(n^2) in the number of lines, like most diff
algorithms, so large files (binary or not) may perform badly.
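
To illustrate why the line count dominates, here is a sketch of a
textbook longest-common-subsequence table over lines, with a counter
for the line comparisons it performs. This is not bdiff, which uses a
different matching strategy; lcs_length is a hypothetical helper that
only shows the quadratic growth.

```python
def lcs_length(a: list[bytes], b: list[bytes]) -> tuple[int, int]:
    # Classic dynamic-programming LCS, keeping one row at a time.
    # Returns (length of the LCS, number of line comparisons made).
    prev = [0] * (len(b) + 1)
    cells = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cells += 1
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1], cells


# Doubling the number of lines quadruples the comparisons.
for n in (100, 200, 400):
    old = [b"old %d" % i for i in range(n)]
    new = [b"new %d" % i for i in range(n)]
    print(n, lcs_length(old, new)[1])
```

Here the two inputs share no lines at all, so the table is filled
completely and the comparison count is exactly n * n.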

We've considered adding a fallback to a faster but lower-compression
diff algorithm for larger files, and in fact such a patch just showed
up this past week (look for bsdiff in the list archives).

-- 
Mathematics is the supreme nostalgia of our time.

