Proposed new "big file" extension

Greg Ward greg-hg at gerg.ca
Sun Oct 4 15:28:29 CDT 2009


On Sat, Oct 3, 2009 at 7:49 PM, Chad Dombrova <chadrik at gmail.com> wrote:
> the inefficiency of using large files in mercurial is on several levels:
>
> 1. time required to compute a delta
> 2. disk space / copy time required for clone repositories
> 3. disk space / copy time required for working copies

You missed one: the overhead of storing multiple revisions of a large
file that is not friendly to compressed deltas.  Then multiply that
overhead by the number of clones you need to maintain.  (Or maybe you
factored that into 2 and 3...whatever.)

And another: Mercurial has a pervasive assumption that it's acceptable
to hold 2 complete revisions of any file under its control entirely in
memory.  For typical source files, that's fine.  Not so for big files.

> I believe the second issue can be addressed by using the new share
> extension with version 1.3. With this extension there is no need to
> have these large binary files stored in every cloned repository :
> they all share each other's store.

I'm always careful to say "remote store", because bfiles' view of the
world is old-fashioned, client/server centralized VCS.  The canonical
remote store is *that* directory on *that* server over there.

Also, I've punted on implementing a local cache, because it's not too
important in our environment.  Should be possible to add it later, but
it's not on my critical path.

As a result, multiple clones on the same machine do not save any disk
space.  The good news is that those clones are independent; you cannot
screw up clone A by modifying big files in clone B.  (That was a
teensy little flaw in my original design.)

> As for deltas, my thought was that this could be solved by forcing
> binary files (or specially flagged files, if you want specific
> control) to always be stored as uncompressed snapshots instead of as
> deltas.

Interesting idea.  I've never really considered storing big files in
revlogs because I don't want them cluttering up the "real" repository.
My view is that it is generally a mistake to put large binary files
under source control.  It's often done by naive users who think of CVS
or Subversion as an infinite data store that imposes zero network or
server load, and who assume that everyone checking out from that
repository has infinite bandwidth and disk space.

The catch with converting that abused CVS/svn repository to Mercurial
is that your 30 MB mistake from 4 years ago will continue to take up
30 MB, in every clone, everywhere, forever.  Even once you fix your
build system to do the right thing and not need the 30 MB mistake
anymore.

So part of my secret covert goal with bfiles is to turn 30 MB mistakes
into 40 byte mistakes (the "standin" file containing the revision
hash).  Once you no longer need the 30 MB file in every checkout, you
delete the 40 byte standin.  The only price you pay is carrying around
the history of that 40 byte standin file forever.  Big deal.
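
To illustrate the standin idea (a minimal sketch only, not the actual
bfiles code): hash the big file in fixed-size chunks, so it never has
to be held entirely in memory, and write the 40-character hex digest
to a small standin file.  The function names, the ".standin" naming
and the 1 MB chunk size are assumptions made for this example, and
the 40 bytes assume a SHA-1 hex digest.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # assumed chunk size; any modest value works


def hash_big_file(path):
    """Return the SHA-1 hex digest of path, reading it chunk by chunk."""
    sha = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            sha.update(chunk)
    return sha.hexdigest()


def write_standin(bigfile_path, standin_path):
    """Write the big file's hash to a small standin file (hypothetical layout)."""
    digest = hash_big_file(bigfile_path)
    with open(standin_path, "w") as f:
        f.write(digest)
    return digest
```

Under these assumptions only the standin would be tracked by
Mercurial; the big file itself would live in the remote store.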

> there are a couple of things to keep in mind with the hard-linking to
> the working copy (a feature that we both had in mind)
> 1. you'll likely want to write a setuid'd app that changes the
> permissions and ownership of these "big files" to read-only by all,
> except for some special "repouser".  that way older revisions of files
> that are supposed to be immutable are not accidentally modified by a
> user with a hard link in their working copy

Hmmm.  It sounds to me like you are thinking of having your "remote
store" and all clones on the same machine.  To me, that's a minor use
case that's not worth optimizing.  I see room for optimizing by 1)
adding a local cache on each client machine and 2) using hard links
between the local cache and clones on the same client.  That's where
my "changes in clone A modify clone B" design flaw came from, and it's
part of the reason I dropped the local cache for now.

It has occurred to me to implement optional hardlinks to the local
cache, but mark the hardlinked files read-only.  (Similar to how
Perforce reminds you that you haven't told it you plan to edit a file.
Since big files are presumed non-mergeable, it makes some sense to be
a little paternalistic here: if you don't plan to edit this file, you
shouldn't be touching it!)
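
A rough sketch of how the read-only hardlink idea could look (this is
an illustration under assumptions, not bfiles code; the function names
and the ".tmp" suffix are hypothetical).  Because a hard link shares
one inode, clearing the write bits protects the cache entry and the
working-copy file together, and "planning to edit" means replacing the
link with a private, writable copy first.

```python
import os
import shutil
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def link_readonly(cache_path, working_path):
    """Hard-link a cached big file into a working copy and make it read-only.

    Both names refer to the same inode, so the chmod applies to the
    cache entry as well as the working-copy file.
    """
    os.link(cache_path, working_path)
    mode = os.stat(working_path).st_mode
    os.chmod(working_path, mode & ~WRITE_BITS)


def make_editable(working_path):
    """Break the hard link so edits cannot touch the cached copy."""
    tmp_path = working_path + ".tmp"  # hypothetical temporary name
    shutil.copyfile(working_path, tmp_path)  # copies contents to a new inode
    os.chmod(tmp_path, os.stat(tmp_path).st_mode | stat.S_IWUSR)
    os.replace(tmp_path, working_path)  # swap the private copy into place
```

The read-only bit is a reminder rather than real protection: anyone
who owns the file can chmod it back, which is the concern behind the
setuid suggestion quoted above.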

> the advantage of my approach is that mercurial continues to work
> largely as it normally would.  there are no new methodologies to
> learn, all the features are there, and it will work with default
> 'add', 'update', 'push', 'pull' which means it will also work with GUI
> frontends like TortoiseHg. you've clearly put a lot of thought into
> the design of bfiles, so i'm curious what you consider the downside of
> my approach.

The main goal of bfiles is to get big files entirely out of
.hg/store/data.  IIUC, you propose to keep them there but skip
computing deltas.  Also, your proposal doesn't do anything
about Mercurial's tendency to read whole revisions into memory.

> also, is there somewhere i can grab a copy of bfiles to try out?  i'm
> quite curious if it will meet our needs.

Not yet.  Hopefully this week.  Seems to mostly work, but I need to
write some docs.

Greg


