Question on work-flow for big binary files

Jesse Glick Jesse.Glick at Sun.COM
Wed Jan 7 09:05:27 CST 2009


Peter Arrenbrecht wrote:
>> If we had TrimmingHistory [...] we could just check in and push whatever binary files we needed along with everything
>> else, and not worry about repository size.
> 
> Meaning you'd be happy to pull history starting only at a given
> revision for _all_ files, or just the binaries?

For all files. I would like to be able to do something like

   hg clone http://hg.netbeans.org/main@200810010000

and get everything that existed in October or was subsequently added, but nothing before that.

(Attempts to access prior history could pretend that the first stored revision was the first and show a warning, abort with an error, or download bits of it on demand.
I'm not sure which approach is best for each command. Downloading on demand sounds good in principle, but a routine call to 'hg ann some-old-source-file'
would need at least pieces of very old history, so it would have to be done carefully.)

The binary files are large, but still smaller than the true sources, and you need to download the current versions anyway if you want to do a build.
The problem with storing them in the repo today is that they are replaced fairly frequently (usually with newer versions) and the deltas would not compress well (*); the repo would quickly grow to multi-gigabyte proportions and become unwieldy.
File renames make this worse, since a rename doubles the disk storage for the file.
Even without binaries, our repo has grown large enough that many casual contributors are reluctant (or even unable) to clone it, and we have to think about offering a read-only SVN mirror.

(*) The delta compression for binaries actually seems good compared to (say) CVS, but could probably be better. In particular, it would be cool if a preprocessor for 
revlog storage automatically detected ZIP files and tried to handle them specially: retain the original index as is, then append the uncompressed entries in lexicographic 
order, after verifying that recompression produces an identical bytestream to the original ZIP. If this verification step could be shown to succeed for archives created 
by Info-ZIP, libzip, Apache Ant, and other widely used implementations, then incremental changes to ZIP files would compress excellently, even for compressed archives 
with random entry ordering. This would cover a large class of formats: Java JARs, OpenOffice documents, etc.
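
To make the verification step concrete, here is a minimal Python sketch of the idea. It is not Mercurial code and does not touch revlog storage. The function names (raw_entry_bytes, find_deflate_level, explode) are hypothetical, and it assumes the archive was produced by a zlib-based writer using plain deflate at one of the standard levels 1-9. Archives written by other implementations may well fail the check, which is exactly the case the proposal would need to measure.

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte ZIP local file header: signature, five 16-bit fields,
# three 32-bit fields (crc, sizes), then name length and extra length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")

def raw_entry_bytes(blob, info):
    """Return one entry's compressed bytes exactly as stored in the archive."""
    fields = LOCAL_HEADER.unpack_from(blob, info.header_offset)
    if fields[0] != b"PK\x03\x04":
        raise ValueError("bad local header for %s" % info.filename)
    name_len, extra_len = fields[-2], fields[-1]
    start = info.header_offset + LOCAL_HEADER.size + name_len + extra_len
    return blob[start:start + info.compress_size]

def find_deflate_level(raw, data):
    """Return a zlib level that reproduces `raw` from `data`, or None."""
    for level in range(1, 10):
        c = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate, as in ZIP
        if c.compress(data) + c.flush() == raw:
            return level
    return None

def explode(blob):
    """Return [(name, level, data)] sorted by name, or None if any entry
    cannot be recompressed to the identical bytes (store the ZIP as is then)."""
    out = []
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        for info in sorted(zf.infolist(), key=lambda i: i.filename):
            data = zf.read(info)
            if info.compress_type == zipfile.ZIP_STORED:
                level = 0  # marker for "stored uncompressed", not zlib level 0
            elif info.compress_type == zipfile.ZIP_DEFLATED:
                level = find_deflate_level(raw_entry_bytes(blob, info), data)
                if level is None:
                    return None
            else:
                return None  # unsupported compression method
            out.append((info.filename, level, data))
    return out
```

A real preprocessor would also have to keep the original index (central directory) and the local headers verbatim so the archive could be rebuilt byte for byte. The sketch covers only the "does recompression give back the same bytes" test and the lexicographic ordering of the uncompressed entries.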

