Proposed new "big file" extension

Chad Dombrova chadrik at gmail.com
Sat Oct 3 18:49:38 CDT 2009


hi all,

first of all, i apologize for getting in on this conversation late. i  
was researching git vs. mercurial several months ago for a project  
that requires creating hundreds of cloned repositories containing  
large binary files.  at the time I decided that git was the best  
option because mercurial was missing several important features,  
primarily shared repositories, submodules, and the efficient handling  
of large files.  the first two of those limitations have been lifted  
with 1.3, and i'm glad to see that the large files problem is still  
under consideration, because I'm getting a bit frustrated with git.

large files are inefficient in mercurial on several levels:

1. time required to compute a delta
2. disk space / copy time required for clone repositories
3. disk space / copy time required for working copies

I believe the second issue can be addressed by using the new share  
extension in version 1.3. With this extension there is no need to  
have these large binary files stored in every cloned repository:  
they all share each other's store.

As for deltas, my thought was that this could be solved by forcing  
binary files (or specially flagged files, if you want specific  
control) to always be stored as uncompressed snapshots instead of as  
deltas.  Since I've yet to code my own extension for mercurial, I'm  
not sure if this can be accomplished in an extension.  i'm guessing  
not, but there might be a fairly localized bit of surgery to trick  
mercurial into generating snapshots.  from the mercurial docs:

"Once the cumulative amount of delta information stored since the last  
snapshot exceeds a fixed threshold, it stores a new snapshot  
(compressed, of course), instead of another delta."
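
as a rough illustration of the rule quoted above, here is a toy python model. it is not mercurial's actual revlog code: the function names and the idea of passing the threshold in as a plain number are assumptions made only for this sketch, and it models only the revisions that follow an initial snapshot.

```python
# toy model only: not mercurial's real revlog implementation.
# all names here are hypothetical.

def choose_storage(deltas_since_snapshot, new_delta_size, threshold):
    """return 'snapshot' or 'delta' for the next revision of a file."""
    cumulative = sum(deltas_since_snapshot) + new_delta_size
    return "snapshot" if cumulative > threshold else "delta"


def store_revisions(delta_sizes, threshold):
    """walk a sequence of delta sizes and record how each would be stored."""
    decisions = []
    pending = []  # sizes of deltas stored since the last snapshot
    for size in delta_sizes:
        kind = choose_storage(pending, size, threshold)
        if kind == "snapshot":
            pending = []  # a new snapshot resets the running total
        else:
            pending.append(size)
        decisions.append(kind)
    return decisions
```

in this model, a threshold of 0 turns every non-empty change into a new snapshot, while a larger threshold lets several deltas accumulate first.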

so, we may be able to effectively reduce that threshold to 0 for  
certain files.  we'll also need to remove the snapshot compression,  
because in order to solve the last problem, when 'hg update' is  
called we need to hard link from the full snapshot of the file to the  
working copy.
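
the hard-link step could look something like the following python sketch. it assumes the snapshot already exists as a plain uncompressed file somewhere on the same filesystem as the working copy; the function name and paths are hypothetical, and this is not how 'hg update' is implemented today.

```python
import os
import shutil


def link_into_working_copy(snapshot_path, working_path):
    """make working_path a hard link to the snapshot, falling back to a copy."""
    os.makedirs(os.path.dirname(working_path) or ".", exist_ok=True)
    # replace whatever is currently checked out at this path
    if os.path.lexists(working_path):
        os.remove(working_path)
    try:
        os.link(snapshot_path, working_path)
        return "linked"
    except OSError:
        # e.g. the store and working copy are on different filesystems,
        # or the filesystem does not support hard links
        shutil.copy2(snapshot_path, working_path)
        return "copied"
```

the fallback to a plain copy is a design choice for the sketch: it keeps the update working when linking is not possible, at the cost of the disk space the link was meant to save.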

there are a couple of things to keep in mind with hard-linking to  
the working copy (a feature that we both had in mind):
1. you'll likely want to write a setuid'd app that changes the  
permissions and ownership of these "big files" to read-only by all,  
except for some special "repouser".  that way older revisions of files  
that are supposed to be immutable are not accidentally modified by a  
user with a hard link in their working copy
2. tape backup systems see each hard linked file as a unique file to  
backup, which makes sense since a backup might span many tapes and  
each tape needs to have a "real" copy of the file.  if using lots of  
hard links, you'll either need to exclude working copies from the  
backup, or do tape backups from some intermediate de-duped mirror.   
alternately, you might be able to use symlinks, as long as you also  
taught mercurial to treat symlinks pointing into its own internal  
store (or to the bfiles store, in your case) the same as the real file.
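
the two caveats above can be sketched in python as follows. this is only an illustration under some assumptions: it runs on a POSIX-style filesystem, it only clears write bits (changing ownership to a "repouser" would need privileges and is left out), and all three function names are hypothetical.

```python
import os
import stat


def make_read_only(path):
    """drop all write bits so the file is read-only for everyone."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # permissions live on the inode, so every hard link to it is affected
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def unique_files(paths):
    """keep one path per (device, inode) pair, so hard links count once."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique


def points_into_store(link_path, store_dir):
    """true if link_path is a symlink whose target lives under store_dir."""
    if not os.path.islink(link_path):
        return False
    target = os.path.realpath(link_path)
    store = os.path.realpath(store_dir)
    return os.path.commonpath([target, store]) == store
```

a de-duping mirror for backups could use something like unique_files to pick one path per underlying file, and the symlink variant would need a check along the lines of points_into_store before treating a link as the real file.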

the advantage of my approach is that mercurial continues to work  
largely as it normally would.  there are no new methodologies to  
learn, all the features are there, and it will work with default  
'add', 'update', 'push', 'pull', which means it will also work with GUI  
frontends like TortoiseHg. you've clearly put a lot of thought into  
the design of bfiles, so i'm curious what you consider the downside of  
my approach.

also, is there somewhere i can grab a copy of bfiles to try out?  i'm  
quite curious if it will meet our needs.

thanks,
chad



