Performance with binary-heavy repositories

Jens Alfke jens at mooseyard.com
Thu Aug 2 12:07:54 CDT 2007


On Aug 2, 2007, at 9:39 AM, Matt Mackall wrote:

> Mercurial's bdiff algorithm treats all files as strings of bytes and
> breaks them on newline characters. For high-entropy "pure binary" files
> like JPEGs, those should occur roughly every 256 characters so the
> average "line length" for a binary file is a bit longer than for text,
> but not outrageously so.

Really? I thought, from reading the [excellent] paper on the innards  
of Mercurial, that it used a binary-delta algorithm (the old first  
version of xdelta, IIRC) for binary files.

I would imagine that a line-oriented text diff algorithm would
achieve pretty poor compression on a binary file, much worse than an
algorithm designed for binary data.
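
For what it's worth, the "roughly every 256 characters" estimate quoted
above is easy to sanity-check. Here is a minimal Python sketch, assuming
uniformly random bytes are a fair stand-in for compressed data such as a
JPEG. It is not Mercurial's bdiff code, and the helper names
(average_line_length, random_bytes) are made up for illustration; it only
measures how long the "lines" get when bytes are split on newlines.

```python
import random


def average_line_length(data: bytes) -> float:
    """Mean size, in bytes, of the chunks obtained by splitting on newlines.

    Each newline byte is counted as part of the chunk it terminates.
    """
    lines = data.split(b"\n")
    return len(data) / len(lines)


def random_bytes(n: int, seed: int = 0) -> bytes:
    """Return n pseudo-random bytes (assumption: a proxy for compressed data)."""
    rng = random.Random(seed)
    return rng.getrandbits(8 * n).to_bytes(n, "little")


if __name__ == "__main__":
    sample = random_bytes(1 << 20)  # 1 MiB of uniformly distributed bytes
    print(f"average line length: {average_line_length(sample):.1f} bytes")
```

Since each byte value is equally likely, a newline should show up about
once per 256 bytes, so the printed average should land near 256. That
says nothing about how well a line-based delta compresses such a file,
only how long the "lines" are.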

The current xdelta, version 3 <http://xdelta.org/>, appears to be the  
state of the art in delta compression, and emits standard VCDIFF [RFC  
3284] format. (Although for my own work I've been using zdelta,  
largely because the license is more flexible.)

--Jens