bdiff/bsdiff: comprehensive performance measurements for large files [long, but nice graphics attached]
Ralf Leibold
Ralf.Leibold at nuance.com
Mon Sep 10 02:09:39 CDT 2007
Hi Christoph,
thanks a lot for all your experiments. You are right that bsdiff is a
lot slower than bdiff in most cases - and it takes more memory as well.
Afaik bsdiff was designed to make binary diffs small - but regarding the
implementation in mercurial there are 2 things to keep in mind:
1) bsdiff was designed to be used in conjunction with bzip2 - maybe the
output format was optimized to this compression method.
2) The patches created by original bsdiff algo are incompatible with the
patches made by mercurial. So I do the original bsdiff algo and
afterwards convert this format into the mercurial one. This is
sub-optimal and increases the patch size in some cases quite
significantly.
All in all it might be best to optimize the recurse function of bdiff.
Nevertheless at least in my case bdiff took 90 mins until it crashed
whereas the complete commit with bsdiff took just a couple of minutes.
But that might have been a rare case. That's why my default value to use
bsdiff instead of bdiff is quite high. :-)
Ralf
PS: As I don't own the bsdiff algo neither do I know the details I don't
know whether it's possible to make it faster.
-----Original Message-----
From: Christoph.Spiel at partner.bmw.de
[mailto:Christoph.Spiel at partner.bmw.de]
Sent: Monday, September 10, 2007 7:25 AM
To: Ralf Leibold; mercurial at selenic.com
Cc: bos at serpentine.com
Subject: AW: bdiff/bsdiff: comprehensive performance measurements for
large files [long, but nice graphics attached]
Ralf,
>These are interesting results. As stated in my patch-mail I initially
>wrote this patch as I couldn't commit at all due to internal stack
>overflow in mercurial.
The "recurse" function in the "bdiff" module could be optimized
for a smaller stack footprint. It would make the code a bit
uglier, though.
>So another interesting experiment would be to
>have large files of same size but with increasing number of changed
>lines. I would assume that the internal bdiff algo gets to its
>limits so that at some time the bsdiff algo is faster.
My tests show that bsdiff _could_ have an advantage for files
larger than 32MB. However, most of the time bdiff wins even
for this size.
>Did you also measure the resulting repository size? If so: Did you get
>smaller or bigger repositories with bsdiff?
Yes, I did over the weekend. bdiff and bsdiff produce repositories
of nearly the same sizes. Neither is better. This holds for small
files as well as for large ones (say 64MB).
>Did the patch proved to be stable for you?
Yes, it looks absolutely rock solid.
The big problem of bsdiff is its bad performance. It is an order
of magnitude slower than bdiff. This means that it takes ten times
as many operations to arrive at the same difference set. If a
commit takes an hour instead of just 6 minutes, well your
productivity is heading for the South Pole.
/cls
--
Dr. Christoph L. Spiel
BMW Forschungs- und Innovationszentrum, EA-410
Lauchstaedterstrasse 5, 80995 Muenchen
More information about the Mercurial
mailing list