Proposed new "big file" extension

Chad Dombrova chadrik at gmail.com
Sun Oct 4 17:27:14 CDT 2009


> So part of my secret covert goal with bfiles is to turn 30 MB mistakes
> into 40 byte mistakes (the "standin" file containing the revision
> hash).  Once you no longer need the 30 MB file in every checkout, you
> delete the 40 byte standin.  The only price you pay is carrying around
> the history of that 40 byte standin file forever.  Big deal.

i agree that pruning history is a requirement.  if i were to implement
my idea, we would have to add a command like perforce's obliterate,
which would replace a snapshot with a stub file.  this obliteration
would be fairly safe, because we would be storing snapshots and not
deltas.
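
A toy sketch of why that is safe, assuming every revision is stored as a full snapshot in its own file. The store layout, the `STUB` marker and the function names are all made up for illustration; this is not how Mercurial's store or Perforce's obliterate actually work.

```python
import os
import tempfile

STUB = b"obliterated\n"  # hypothetical stub contents


def snapshot_path(store_dir, rev):
    # Hypothetical layout: one full snapshot per revision, named <rev>.snap.
    return os.path.join(store_dir, "%d.snap" % rev)


def obliterate(store_dir, rev):
    """Replace the full snapshot for `rev` with a small stub."""
    with open(snapshot_path(store_dir, rev), "wb") as f:
        f.write(STUB)


def read_rev(store_dir, rev):
    """Return the stored bytes for `rev` (the stub if it was obliterated)."""
    with open(snapshot_path(store_dir, rev), "rb") as f:
        return f.read()
```

Because no revision is reconstructed from another one, stubbing out one snapshot leaves every other revision readable. With a delta chain, later revisions could depend on the removed data.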

One of my requirements is transparency. For example, once a file is  
tracked by big files, why not have push call bfput in the background,  
or why not have status show big files by default? It seems that many  
of the big files commands have existing analogues:

push - bfput
pull/update - bfget
commit - bfrefresh

IIUC, in your current design, the user must be aware each time a big  
file changes, and manually bfrefresh it.  it would seem to me that  
mercurial detecting when a file has been modified is a pretty  
important feature.  i wonder how many more of these features will be  
nullified by circumventing hg's normal workflow?
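
To make the "detect it automatically" point concrete, here is a rough sketch of how a status-style check could notice that a big file no longer matches its standin. It assumes the standin holds a SHA-1 hex digest of the big file's contents (40 characters, which would fit the "40 byte" standin described above, but that is a guess). `file_sha1` and `is_modified` are made-up names, not bfiles or Mercurial functions.

```python
import hashlib
import os
import tempfile


def file_sha1(path, chunk_size=1 << 20):
    """Hash a file in chunks so the whole file is never held in memory."""
    sha = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def is_modified(big_path, standin_path):
    """True if the big file's current hash differs from the one recorded
    in its standin (assumed to contain just the hex digest)."""
    with open(standin_path) as f:
        recorded = f.read().strip()
    return file_sha1(big_path) != recorded
```

A commit hook or a wrapped status command could call something like `is_modified` for each tracked big file and refresh the standin without the user having to remember to.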

the reason that transparency is so important to me is twofold:  more  
than half of our data would be considered 'big files' and i want our  
users to be able to use standard mercurial frontends to control and  
visualize their dataflow.  i realize that our use case is fairly
fringe, so i know i will have to do some custom work at some point;
i'm just trying to find the path of least resistance toward an
end-product that meets all of my requirements.  (git's deltaless data
store with support for pruning is much better suited for my purposes,
but the lack of a true api of any sort and the general
"designed-by-linux-hackers" mentality have been real turnoffs.)

>
>> there are a couple of things to keep in mind with the hard-linking to
>> the working copy (a feature that we both had in mind)
>> 1. you'll likely want to write a setuid'd app that changes the
>> permissions and ownership of these "big files" to read-only by all,
>> except for some special "repouser".  that way older revisions of files
>> that are supposed to be immutable are not accidentally modified by a
>> user with a hard link in their working copy
>
> Hmmm.  It sounds to me like you are thinking of having your "remote
> store" and all clones on the same machine.  To me, that's a minor use
> case that's not worth optimizing.

well, for us, this is the use case. I am looking into mercurial and  
git to revision control a lot of data, at least half of which is large  
binary assets, and all of which will be revision controlled on a  
large, centralized server.  The repos, each containing gigabytes worth
of data, will be cloned hundreds of times to give different
users/teams access to read-only working copies of each other's work,
so it is essential that there is no redundancy in either the stores or
the working copies.  No one at our studio is allowed to do work on
their local hard drives, and there would be little benefit anyway,
since our server performs as well as a standard HD.
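
A minimal sketch of the read-only, hard-linked checkout idea, assuming the store and the working copies live on the same filesystem (hard links cannot cross filesystems). `checkout_hardlink` is a made-up name, and the ownership change to a special "repouser" via a setuid helper is not modeled here; this only clears the write bits.

```python
import os
import stat
import tempfile

READ_ONLY = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH  # mode 0444


def checkout_hardlink(store_path, work_path):
    """Expose a stored big file in a working copy without copying it."""
    # Permissions belong to the inode, so making the stored file
    # read-only also makes every hard link to it read-only.
    os.chmod(store_path, READ_ONLY)
    # Create a second directory entry pointing at the same data.
    os.link(store_path, work_path)
```

Each additional clone then costs one directory entry rather than another copy of the data, which is the "no redundancy" property needed here.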

>
>> the advantage of my approach is that mercurial continues to work
>> largely as it normally would.  there are no new methodologies to
>> learn, all the features are there, and it will work with default
>> 'add', 'update', 'push', 'pull' which means it will also work with
>> GUI frontends like TortoiseHg. you've clearly put a lot of thought
>> into the design of bfiles, so i'm curious what you consider the
>> downside of my approach.
>
> The main goal of bfiles is to get big files entirely out of
> .hg/store/data.  IIUC, you propose to keep them there, but don't
> bother computing deltas.  Also, your proposal doesn't do anything
> about Mercurial's tendency to read whole revisions into memory.

memory consumption with big files IS a big deal, and among the
critical factors that sent me toward git in the first place.  what
is the purpose of this behavior? is it possible to purge revisions
from memory, or to avoid reading them in on certain commits? what
would be the ramifications?

so, to summarize, my current idea skeleton:

share extension
force snapshots for bigfiles
obliterate command
read-only, hard-linked checkouts
reduce memory consumption for bigfiles???


>
>> also, is there somewhere i can grab a copy of bfiles to try out?  i'm
>> quite curious if it will meet our needs.
>
> Not yet.  Hopefully this week.  Seems to mostly work, but I need to
> write some docs.

look forward to it.

by the way, do you have any estimates on the difficulty level of some  
of the modifications i'm proposing?  i'm assuming these changes are  
way out of extensions territory.

-chad




