Performance with binary-heavy repositories
Matt Mackall
mpm at selenic.com
Thu Aug 2 13:09:49 CDT 2007
On Thu, Aug 02, 2007 at 10:07:54AM -0700, Jens Alfke wrote:
>
> On Aug 2, 2007, at 9:39 AM, Matt Mackall wrote:
>
> >Mercurial's bdiff algorithm treats all files as strings of bytes and
> >breaks them on newline characters. For low-entropy "pure binary" files
> >like JPEGs, those should occur roughly every 256 characters so the
> >average "line length" for a binary file is a bit longer than for text,
> >but not outrageously so.
>
> Really? I thought, from reading the [excellent] paper on the innards
> of Mercurial, that it used a binary-delta algorithm (the old first
> version of xdelta, IIRC) for binary files.
Well it _is_ binary in the sense that it's completely 8-bit clean and
stores non-human-readable chunk output.
> I would imagine that a line-oriented text diff algorithm would
> achieve pretty poor compression on a binary file, much less than one
> designed for binary data.
Nope. Binary diff algorithms have worse compression! Binary algorithms
basically all trade aggressiveness of match search (O(n^2)) for bounded linear
runtime performance (O(n)) on large files.
--
Mathematics is the supreme nostalgia of our time.
More information about the Mercurial
mailing list