Proposed new "big file" extension
Chad Dombrova
chadrik at gmail.com
Sat Oct 3 18:49:38 CDT 2009
hi all,
first of all, i apologize for getting in on this conversation late. i
was researching git vs. mercurial several months ago for a project
that requires creating hundreds of cloned repositories containing
large binary files. at the time I decided that git was the best
option because mercurial was missing several important features,
primarily shared repositories, submodules, and the efficient handling
of large files. the first two of those limitations have been lifted
with 1.3, so i'm glad to see that the large files problem is still
under consideration, because I'm getting a bit frustrated with git.
the inefficiency of using large files in mercurial is on several levels:
1. time required to compute a delta
2. disk space / copy time required for clone repositories
3. disk space / copy time required for working copies
I believe the second issue can be addressed by using the new share
extension with version 1.3. With this extension there is no need to
have these large binary files stored in every cloned repository:
they all share each other's store.
As for deltas, my thought was that this could be solved by forcing
binary files (or specially flagged files, if you want specific
control) to always be stored as uncompressed snapshots instead of as
deltas. Since I've yet to code my own extension for mercurial, I'm
not sure if this can be accomplished in an extension. i'm guessing
not, but there might be a fairly localized bit of surgery to trick
mercurial into generating snapshots. from the mercurial docs:
"Once the cumulative amount of delta information stored since the last
snapshot exceeds a fixed threshold, it stores a new snapshot
(compressed, of course), instead of another delta."
so, we may be able to effectively reduce that threshold to 0 for
certain files. we'll also need to remove the snapshot compression,
because in order to solve the last problem, when 'hg update' is
called we need to hard link from the full snapshot of the file to the
working copy.
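
to make the threshold idea concrete, here is a toy model of the rule quoted from the docs. it is only a sketch, not mercurial's actual revlog code: the function names and the per-file `threshold` parameter are my own assumptions for illustration.

```python
# Toy model of the "snapshot once deltas exceed a threshold" rule.
# This is NOT Mercurial's revlog implementation; names are hypothetical.

def choose_storage(pending_delta_bytes, new_delta_bytes, threshold):
    """Return 'snapshot' or 'delta' for the next revision.

    pending_delta_bytes: total size of deltas stored since the last snapshot
    new_delta_bytes: size of the delta for the incoming revision
    threshold: assumed per-file limit on cumulative delta size
    """
    if pending_delta_bytes + new_delta_bytes > threshold:
        return "snapshot"
    return "delta"


def plan_revisions(delta_sizes, threshold):
    """Walk a sequence of delta sizes and record how each would be stored."""
    plan = []
    pending = 0
    for size in delta_sizes:
        kind = choose_storage(pending, size, threshold)
        plan.append(kind)
        # A snapshot resets the running total; a delta adds to it.
        pending = 0 if kind == "snapshot" else pending + size
    return plan
```

with `threshold=0`, every revision that has a non-empty delta comes out as a snapshot, which is the behaviour i'd want for flagged big files.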
there are a couple of things to keep in mind with the hard-linking to
the working copy (a feature that we both had in mind):
1. you'll likely want to write a setuid'd app that changes the
permissions and ownership of these "big files" to read-only by all,
except for some special "repouser". that way older revisions of files
that are supposed to be immutable are not accidentally modified by a
user with a hard link in their working copy.
2. tape backup systems see each hard linked file as a unique file to
backup, which makes sense since a backup might span many tapes and
each tape needs to have a "real" copy of the file. if using lots of
hard links, you'll either need to exclude working copies from the
backup, or do tape backups from some intermediate de-duped mirror.
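
a rough sketch of the hard-link step with the read-only protection from point 1. the function name and paths are hypothetical, and it only strips write permission; changing ownership to a "repouser" would need elevated privileges (hence the setuid'd app) and is left out here.

```python
import os
import stat


def link_into_working_copy(store_path, working_path):
    """Hard link an uncompressed snapshot into the working copy.

    The file is made read-only first, so the shared inode is not
    accidentally modified through the working-copy name.
    """
    read_only = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    os.chmod(store_path, read_only)
    # Both names now refer to the same inode; no data is copied.
    os.link(store_path, working_path)
```

since both names point at the same inode, the permissions set on the store file apply to the working-copy name too. note that this only works when the store and the working copy live on the same filesystem.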
alternately, you might be able to use symlinks, as long as you also
taught mercurial to treat symlinks pointing into its own internal
store (or to the bfiles store, in your case) the same as the real file.
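
the check mercurial would need for the symlink variant could look roughly like this. it is a sketch under my own assumptions: `points_into_store` and the idea of a single `store_root` directory are hypothetical, not an existing mercurial or bfiles API.

```python
import os


def points_into_store(link_path, store_root):
    """Return True if link_path is a symlink whose target lies in store_root."""
    if not os.path.islink(link_path):
        return False
    # Resolve both sides so relative links and nested symlinks compare correctly.
    target = os.path.realpath(link_path)
    root = os.path.realpath(store_root)
    return os.path.commonpath([target, root]) == root
```

a file for which this returns True would be treated as the real file it points to, while any other symlink would keep its normal handling.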
the advantage of my approach is that mercurial continues to work
largely as it normally would. there are no new methodologies to
learn, all the features are there, and it will work with default
'add', 'update', 'push', 'pull' which means it will also work with GUI
frontends like TortoiseHg. you've clearly put a lot of thought into
the design of bfiles, so i'm curious what you consider the downside of
my approach.
also, is there somewhere i can grab a copy of bfiles to try out? i'm
quite curious if it will meet our needs.
thanks,
chad