Performance with binary-heavy repositories

Matt Mackall mpm at selenic.com
Thu Aug 2 13:09:49 CDT 2007


On Thu, Aug 02, 2007 at 10:07:54AM -0700, Jens Alfke wrote:
> 
> On Aug 2, 2007, at 9:39 AM, Matt Mackall wrote:
> 
> >Mercurial's bdiff algorithm treats all files as strings of bytes and
> >breaks them on newline characters. For low-entropy "pure binary" files
> >like JPEGs, those should occur roughly every 256 characters so the
> >average "line length" for a binary file is a bit longer than for text,
> >but not outrageously so.
> 
> Really? I thought, from reading the [excellent] paper on the innards  
> of Mercurial, that it used a binary-delta algorithm (the old first  
> version of xdelta, IIRC) for binary files.

Well it _is_ binary in the sense that it's completely 8-bit clean and
stores non-human-readable chunk output.

> I would imagine that a line-oriented text diff algorithm would  
> achieve pretty poor compression on a binary file, much less than one  
> designed for binary data.

Nope. Binary diff algorithms have worse compression! Binary algorithms
basically all trade aggressiveness of match search (O(n^2)) for bounded linear
runtime performance (O(n)) on large files. 

-- 
Mathematics is the supreme nostalgia of our time.


More information about the Mercurial mailing list