[ANNOUNCE] An extension to handle big files

Andrei Vermel avermel at mail.ru
Fri Oct 17 04:51:52 CDT 2008


Due to memory and performance limitations, big files shouldn't be
stored in an hg repo (hg runs out of memory checking in a 170 MB file on my
2 GB box).
Matt suggests splitting the file into chunks and reassembling it during the
make stage.

Here's an extension that uses a different approach. It seems to work, but it
is only a proof of concept so far. I'd like to hear comments.

Big files are not stored in the hg repo. They are listed in a file called
'.bigfiles', which also serves as an ignore file similar to .hgignore, so
they do not clutter the output of hg commands. The file also stores the
checksums of the big files in the form of comments (after a '#' on each
line). '.bigfiles' is versioned by hg, so each changeset knows from the
names and checksums which big files it uses. The file can be diffed and
merged, which is nice.

The versions of big files are stored in a dedicated directory, with
checksums appended to their names. The extension overrides 'hg update', so
that it can compare the contents of '.bigfiles' before and after the update
and remove or fetch the appropriate big files. The directory storing
versions of big files can be synced with a remote one (the extension doesn't
do this, but it lists the files that are needed). The versions
corresponding to old changesets can be removed to save space.
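
The comparison of '.bigfiles' before and after an update can be sketched in
a few lines of Python 3. This is an illustration only, not code from the
attached extension: parse_manifest and plan_update are hypothetical names,
the digests are shortened placeholders, and the 'path#checksum' line format
is assumed from the sample session below.

```python
def parse_manifest(text):
    """Parse '.bigfiles' content: one 'path#checksum' entry per line."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#', so a '#' inside the path would survive.
        path, _, digest = line.rpartition('#')
        if path and digest:
            entries[path] = digest
    return entries


def plan_update(before, after):
    """Return (paths to delete, {path: digest} to fetch) for an update."""
    # Files no longer listed are removed from the working directory.
    to_remove = sorted(p for p in before if p not in after)
    # New files, or files whose checksum changed, must be fetched.
    to_fetch = {p: d for p, d in after.items() if before.get(p) != d}
    return to_remove, to_fetch


# Placeholder manifests standing in for two revisions of '.bigfiles'.
old = parse_manifest("lib/a.zip#1111\nlib/b.exe#2222\n")
new = parse_manifest("lib/a.zip#3333\nlib/c.dll#4444\n")
to_remove, to_fetch = plan_update(old, new)
print(to_remove)
print(to_fetch)
```

Keeping the plan separate from the file operations makes it easy to print
what an update would do before touching the working directory.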

To add a new big file, 'hg add' is used - ignore the size warning.
To remove a tracked big file, just delete it.

'hg bstat' shows added big files as 'A', removed big files as '!', modified
big files as 'M', files tracked by hg that got too big as 'B', and big files
listed in '.bigfiles' but missing from the big files versions directory as
'R'. Big files that got small enough to be tracked by hg are shown as 'S'.
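
As a rough illustration of how these codes relate to each other, here is a
Python 3 sketch that classifies files from in-memory data instead of a real
repository. big_status and its arguments are hypothetical; the 40000000 byte
threshold and the 'S' code are taken from the listing further down, and the
digests are placeholders.

```python
MAX_SIZE = 40000000  # size threshold used in the posted bigfiles.py


def big_status(manifest, working, stored, hg_added=(), hg_modified=()):
    """Classify files into bstat-style (code, path) pairs.

    manifest:    {path: digest} as recorded in '.bigfiles'
    working:     {path: (size, digest)} for files in the working directory
    stored:      set of (path, digest) pairs present in the versions dir
    hg_added:    paths added to hg but not yet committed
    hg_modified: paths tracked by hg that were modified
    """
    report = []
    for path in hg_modified:
        if path in working and working[path][0] > MAX_SIZE:
            report.append(('B', path))   # tracked by hg, got too big
    for path in hg_added:
        if path in working and working[path][0] > MAX_SIZE:
            report.append(('A', path))   # added to hg, too big
    for path, digest in sorted(manifest.items()):
        if path not in working:
            report.append(('!', path))   # deleted from working directory
            continue
        size, current = working[path]
        if size <= MAX_SIZE:
            report.append(('S', path))   # small enough for hg again
            continue
        if (path, digest) not in stored:
            report.append(('R', path))   # recorded version is missing
        if current != digest:
            report.append(('M', path))   # content differs from the record
    return report


big = MAX_SIZE + 1
manifest = {'a.zip': 'h1', 'b.exe': 'h2', 'c.pdb': 'h3', 'd.dll': 'h4'}
working = {'a.zip': (big, 'h1x'), 'c.pdb': (10, 'h3'),
           'd.dll': (big, 'h4'), 'new.iso': (big, 'h5')}
stored = {('a.zip', 'h1'), ('b.exe', 'h2'), ('c.pdb', 'h3')}
report = big_status(manifest, working, stored, hg_added=['new.iso'])
for code, path in report:
    print(code, path)
```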

'hg brefresh' updates '.bigfiles' and the versions directory according to
the current state of the working directory. Added big files get forgotten
and added to '.bigfiles' instead. Removed big files are deleted from
'.bigfiles'. Files tracked by hg that got too big are removed from hg and
added to '.bigfiles'. Copies of new and modified big files are stored in the
versions directory.
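
Storing a copy in the versions directory amounts to hashing the file and
copying it under a name with the checksum attached. The Python 3 sketch
below shows one way to do that; sha1_of_file and store_version are
hypothetical names, and the use of shutil.copy2 (which also copies the
modification time) is a choice made for this sketch, not something the
attached extension does.

```python
import hashlib
import os
import shutil
import tempfile


def sha1_of_file(path, chunk_size=1000000):
    """Hash a file in fixed-size chunks so it never sits in memory whole."""
    digest = hashlib.sha1()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def store_version(workdir, relpath, store):
    """Copy workdir/relpath to store/relpath.<sha1>; return the digest."""
    src = os.path.join(workdir, relpath)
    digest = sha1_of_file(src)
    dst = os.path.join(store, relpath + '.' + digest)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if not os.path.exists(dst):   # an identical version is already stored
        shutil.copy2(src, dst)
    return digest


# Demonstration with temporary directories and a tiny stand-in file.
workdir = tempfile.mkdtemp()
store = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, 'lib'))
with open(os.path.join(workdir, 'lib', 'data.bin'), 'wb') as fp:
    fp.write(b'hello')
digest = store_version(workdir, os.path.join('lib', 'data.bin'), store)
print(os.listdir(os.path.join(store, 'lib')))
```

Because the checksum is part of the stored name, two versions of the same
file never overwrite each other, and an unchanged file is not copied twice.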

'hg bupdate' fetches files from the versions directory as recorded in
'.bigfiles', and complains about necessary files missing in the versions
directory.
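
A simplified Python 3 sketch of that fetch step follows. It is not the
extension's code: fetch_big_files is a hypothetical name, the manifest is
passed in as a dict, and the digests 'abc' and 'def' are placeholders.

```python
import os
import shutil
import tempfile


def fetch_big_files(manifest, workdir, store, dry_run=False):
    """Restore big files absent from workdir; return (fetched, missing)."""
    fetched, missing = [], []
    for relpath, digest in sorted(manifest.items()):
        dst = os.path.join(workdir, relpath)
        if os.path.exists(dst):
            continue                      # already in the working directory
        src = os.path.join(store, relpath + '.' + digest)
        if not os.path.exists(src):
            missing.append(src)           # needed, but not in the store
            continue
        if not dry_run:
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
        fetched.append(relpath)
    return fetched, missing


# Demonstration: the store holds one of the two recorded versions.
workdir = tempfile.mkdtemp()
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, 'lib'))
with open(os.path.join(store, 'lib', 'a.bin.abc'), 'wb') as fp:
    fp.write(b'payload')
manifest = {'lib/a.bin': 'abc', 'lib/b.bin': 'def'}
fetched, missing = fetch_big_files(manifest, workdir, store)
print(fetched)
print(missing)
```

The list of missing store paths is what would have to be synced from a
remote versions directory before the working copy is complete.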

To set up the extension, add this to a config file:
[bigfiles]
repo = path/to/versions/dir 
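
For illustration, the Python 3 sketch below reads such a setting and applies
the same two checks the extension performs (path configured, path is a
directory). Mercurial has its own config reader; configparser is only a
stand-in here, versions_dir is a hypothetical name, and the [extensions]
entry with its /path/to/bigfiles.py placeholder shows the usual way a
Mercurial extension is enabled rather than anything stated in this post.

```python
import configparser
import os
import tempfile


def versions_dir(config_text):
    """Return the [bigfiles] repo path, or raise ValueError if unusable."""
    # interpolation=None keeps '%' in paths from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    path = parser.get('bigfiles', 'repo', fallback='').strip()
    if not path:
        raise ValueError('bigfiles.repo path not configured')
    if not os.path.isdir(path):
        raise ValueError("can't access bigfiles repo: %s" % path)
    return path


# A temporary directory stands in for the real versions directory.
store = tempfile.mkdtemp()
sample = (
    "[extensions]\n"
    "bigfiles = /path/to/bigfiles.py\n"
    "\n"
    "[bigfiles]\n"
    "repo = %s\n" % store
)
print(versions_dir(sample) == store)
```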

Below is a sample session:

F:\repos\test>hg init
 
F:\repos\test>hg stat
? 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
? 3rdParty/lib3rd/SqlServerExpress/SQLEXPR32.EXE
? 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.dll
? 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
? 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
? 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
? 3rdParty/lib3rd/microsoft/dotnet/dotnetfx.exe
 
F:\repos\test>hg add
adding 3rdParty\lib3rd\SqlServerExpress\SQLEXPR.EXE
adding 3rdParty\lib3rd\SqlServerExpress\SQLEXPR32.EXE
adding 3rdParty\lib3rd\acis\bin\NT_VC8_DLLD\SpaACISd.dll
adding 3rdParty\lib3rd\acis\bin\NT_VC8_DLLD\SpaACISd.pdb
adding 3rdParty\lib3rd\aciscatiav5rd\aciscatiav5rd.zip
adding 3rdParty\lib3rd\microsoft\dotnet\NetFx64.exe
adding 3rdParty\lib3rd\microsoft\dotnet\dotnetfx.exe
3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE' to unadd the file)
3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb' to unadd the file)
3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip' to unadd the file)
3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe' to unadd the file)
 

F:\repos\test>hg stat
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR32.EXE
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.dll
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
A 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
A 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
A 3rdParty/lib3rd/microsoft/dotnet/dotnetfx.exe
 
F:\repos\test>hg bstat
abort: bigfiles.repo path not configured
 
F:\repos\test>echo [bigfiles]>>.hg/hgrc
 
F:\repos\test>echo repo= f:/repos/bigrepo>>.hg/hgrc
 
F:\repos\test>hg bstat
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
A 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
A 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
 
F:\repos\test>hg bref
forgetting 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
forgetting 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
forgetting 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
forgetting 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
 
F:\repos\test>hg stat
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR32.EXE
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.dll
A 3rdParty/lib3rd/microsoft/dotnet/dotnetfx.exe
? .bigfiles
 
F:\repos\test>hg bstat
 
F:\repos\test>cat .bigfiles
3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE#3b10e09f2ad52d15d8cabdcdfb197b98ffc78402
3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb#da1b88073a76fa7dc85b8a55e89e2907907bca64
3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip#75b8180bf172332d2418c965c787e1cc59522345
3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe#e59cca309463a5d98daeaada83d1b05fed5126c5
 
F:\repos\test>g:\test_cygwin\bin\find f:/repos/bigrepo -type f
f:/repos/bigrepo/3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb.da1b88073a76fa7dc85b8a55e89e2907907bca64
f:/repos/bigrepo/3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip.75b8180bf172332d2418c965c787e1cc59522345
f:/repos/bigrepo/3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe.e59cca309463a5d98daeaada83d1b05fed5126c5
f:/repos/bigrepo/3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE.3b10e09f2ad52d15d8cabdcdfb197b98ffc78402
 
F:\repos\test>hg add
adding .bigfiles
 
F:\repos\test>hg ci -m 1
 
F:\repos\test>hg stat
 
F:\repos\test>hg bstat
 
F:\repos\test>echo qqq>>3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
 
F:\repos\test>hg bstat
M 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
 
F:\repos\test>hg bref
 
F:\repos\test>hg stat
M .bigfiles
 
F:\repos\test>hg diff
diff --git a/.bigfiles b/.bigfiles
--- a/.bigfiles
+++ b/.bigfiles
@@ -1,4 +1,4 @@
 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE#3b10e09f2ad52d15d8cabdcdfb197b98ffc78402
 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb#da1b88073a76fa7dc85b8a55e89e2907907bca64
-3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip#75b8180bf172332d2418c965c787e1cc59522345
+3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip#b0d7e27ba22aa02ff304237278dd557bf2882409
 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe#e59cca309463a5d98daeaada83d1b05fed5126c5
 
F:\repos\test>hg ci -m 2
 
F:\repos\test>hg co -r 0
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
fetching f:/repos/bigrepo/3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip.75b8180bf172332d2418c965c787e1cc59522345
 
F:\repos\test>hg co -r 1
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
fetching f:/repos/bigrepo/3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip.b0d7e27ba22aa02ff304237278dd557bf2882409
 
F:\repos\test>


# bigfiles.py 
#
# Copyright 2008 Andrei Vermel <andrei.vermel at gmail.com>
#
# This software may be used and distributed according to the terms
# of the GNU General Public License, incorporated herein by reference.

from mercurial.i18n import _
from mercurial.node import *
from mercurial import commands, cmdutil, hg, node, util
import os, stat, shutil

_sha1 = util.sha1

def setup_bigfiles_ignore(ui, repo):
    str = ui.config('ui', 'ignore.bigfiles')
    if str:
        if str != '.bigfiles':
            # Format after translation so the message catalog lookup works
            raise util.Abort(
                _('ui.ignore.bigfiles is %s, not .bigfiles') % str)
    else:
        fname = repo.wjoin('.hg/hgrc')
        lines = []
        try:
            fp = open(fname)
            lines = fp.readlines()
            fp.close()
            fp = open(repo.wjoin('.hg/hgrc'), 'w') 
            added = False       
            for str in lines:
                fp.write(str)
                spl =  str.strip().split('#')            
                if not added and spl and '[ui]' in spl[0]:
                    fp.write('\nignore.bigfiles = .bigfiles\n')
                    added = True
            if not added:
                fp.write('\n[ui]\n')
                fp.write('ignore.bigfiles = .bigfiles\n')
            fp.close()
        except:
            fp = open(repo.wjoin('.hg/hgrc'), 'w')        
            fp.write('[ui]\n')
            fp.write('ignore.bigfiles = .bigfiles\n')
            fp.close()
        
    fname = repo.wjoin('.bigfiles')
    try:
        os.stat(fname)
    except:
        open(fname, 'w').close() # create empty file

def parse_bigfiles(repo):
    fname = repo.wjoin('.bigfiles')
    bigfiles = {}
    try:
        for str in open(fname):
            path, hash = str.strip().split('#')
            bigfiles[path] = hash
    except:
        pass
    return bigfiles 

def bigfiles_repo(ui):
    brepo = ui.config('bigfiles', 'repo')
    if not brepo:
        raise util.Abort(_('bigfiles.repo path not configured'))
    try:
        st = os.stat(brepo)
    except OSError:
        raise util.Abort(_("can't access bigfiles repo: %s") % brepo)
    # Checked outside the try block so this message is not masked
    if not stat.S_ISDIR(st[stat.ST_MODE]):
        raise util.Abort(
            _('specified bigfiles repo %s is not a directory') % brepo)
    return brepo

def _hash(f):
    # SHA-1 of the file, read in 1 MB chunks to keep memory use flat
    file = open(f, 'rb')
    s = _sha1("")
    try:
        while True:
            text = file.read(1000000)
            if text == '':
                break
            s.update(text)
    finally:
        file.close()
    return s.hexdigest()

def _bigstatus(ui, repo, pats, opts):
    MAX_SIZE = 40000000
    brepo = bigfiles_repo(ui)

    tracked_gotbig = [] # not in .bigfiles
    added_big = []      # not in .bigfiles
    modified = []       # already in .bigfiles
    removed = []        # missing, but still in .bigfiles
    gotsmall = []       # still in .bigfiles
    missinginrepo = []  # file recorded in .bigfiles not in bigfiles repo

    node1, node2 = cmdutil.revpair(repo, None)
    mod_all, added_all = repo.status(node1, node2, 
        cmdutil.match(repo, pats, opts), None, None, True)[0:2]
    bigfiles = parse_bigfiles(repo)

    for file in mod_all:
        f=repo.wjoin(file)
        fsize=os.stat(f)[stat.ST_SIZE]
        if fsize > MAX_SIZE:
            tracked_gotbig.append(file)

    for file in added_all:
        f=repo.wjoin(file)
        fsize=os.stat(f)[stat.ST_SIZE]
        if fsize > MAX_SIZE:
            added_big.append(file)

    for file, hash in bigfiles.iteritems():
        f=repo.wjoin(file)
        try:
            st = os.stat(f)
        except OSError:
            removed.append(file)
            continue
        if st[stat.ST_SIZE] <= MAX_SIZE:
            gotsmall.append(file)
            continue
        frepo = "%s/%s.%s" % (brepo, file, hash)
        try:
            st_repo = os.stat(frepo)
            if st[stat.ST_SIZE] == st_repo[stat.ST_SIZE] and \
               st[stat.ST_MTIME] == st_repo[stat.ST_MTIME]:
                continue
        except OSError:
            print "missing frepo:", frepo
            missinginrepo.append(file)
        fhash = _hash(f)
        if fhash != hash:
            modified.append(file)
    return tracked_gotbig, added_big, modified, removed, gotsmall, \
        missinginrepo

def bigstatus(ui, repo, *pats, **opts):
    '''show changed big files in the working directory
    Show status of big files in the repository.

    The codes used to show the status of files are:
    B = tracked by hg, got too big. 
    A = added to hg, too big
    M = modified, tracked by big file repo 
    ! = deleted from working directory, tracked by big file repo 
    S = got small, so can now be tracked by hg
    R = tracked by big file repo, but the data is missing'''

    setup_bigfiles_ignore(ui, repo)
    bst = _bigstatus(ui, repo, pats, opts)
    codes = ('B', 'A', 'M', '!', 'S', 'R')
    for files, code in zip(bst, codes):
       for f in files:
         if opts['no_status']:
             ui.write("%s\n" % f)
         else:
             ui.write("%s %s\n" % (code, f))

def _updatebigrepo(ui, repo, files, brepo, bigfiles):
    for file in files:
        f = repo.wjoin(file)
        hash = _hash(f)
        bigfiles[file] = hash
        rf = "%s/%s.%s" % (brepo, file, hash)
        util.makedirs(os.path.dirname(rf))
        shutil.copy(f, rf)
        shutil.copymode(f, rf)

def bigrefresh(ui, repo, *pats, **opts):
    '''update big files tracking to match the working directory

    Added big files get forgotten and added to '.bigfiles' instead.
    Removed big files are deleted from '.bigfiles'. Files tracked by hg 
    that got too big are removed from hg, and added to '.bigfiles'. Copies 
    of new and modified big files are stored in versions directory.'''
 
    setup_bigfiles_ignore(ui, repo)
    tracked_gotbig, added_big, modified, removed, gotsmall, \
        missinginrepo = _bigstatus(ui, repo, pats, opts)
    for f in added_big:
        ui.write("forgetting %s\n" % f) 
    if not opts['dry_run']:
        repo.forget(added_big)

    for f in tracked_gotbig:
        ui.write("removing %s\n" % f) 
    if not opts['dry_run']:
        repo.remove(tracked_gotbig, unlink=False)

    for f in removed:
        ui.write("recording removal of %s\n" % f) 

    brepo = bigfiles_repo(ui)
    bigfiles = parse_bigfiles(repo)

    if not opts['dry_run']:
        _updatebigrepo(ui, repo, tracked_gotbig + added_big + modified,
           brepo, bigfiles)
        for file in removed:
            del bigfiles[file]

        fp = open(repo.wjoin('.bigfiles'), 'w')
        for f in util.sort(bigfiles.keys()):
            fp.write("%s#%s\n" % (f, bigfiles[f]))
        fp.close()

def bigupdate(ui, repo, *pats, **opts):
    '''fetch files from versions directory as recorded in '.bigfiles'. 
 
    Also complain about necessary files missing in the version directory'''
    setup_bigfiles_ignore(ui, repo)
    tracked_gotbig, added_big, modified, removed, gotsmall, \
        missinginrepo = _bigstatus(ui, repo, pats, opts)
    brepo = bigfiles_repo(ui)
    bigfiles = parse_bigfiles(repo)
    for file in removed:
        f = repo.wjoin(file)
        hash= bigfiles[file]
        rf = "%s/%s.%s" % (brepo, file, hash)
        ui.write("fetching %s\n" % rf) 
        util.makedirs(os.path.dirname(f))
        if not opts['dry_run']:
            shutil.copy(rf, f)
            shutil.copymode(rf, f)

    if missinginrepo:
        ui.write("\nMissing in bigrepo:\n") 
    for file in missinginrepo:
        f = repo.wjoin(file)
        hash = _hash(f)
        bigfiles[file] = hash
        rf = "%s/%s.%s" % (brepo, file, hash)
        ui.write("%s\n" % rf) 

def my_update(ui, repo, node=None, rev=None, clean=False, date=None):
    bigfiles0 = parse_bigfiles(repo)
    res = commands.update(ui, repo, node, rev, clean, date)
    bigfiles1 = parse_bigfiles(repo)
    for file in bigfiles0.keys():
        if file not in bigfiles1:
            util.unlink(repo.wjoin(file))

    tofetch = {}
    for file, hash in bigfiles1.iteritems():
        if file not in bigfiles0 or bigfiles0[file] != hash:
            tofetch[file] = hash

    if tofetch:
        brepo = bigfiles_repo(ui)
        for file, hash in tofetch.iteritems():
            f = repo.wjoin(file)
            rf = "%s/%s.%s" % (brepo, file, hash)
            ui.write("fetching %s\n" % rf) 
            util.makedirs(os.path.dirname(f))
            shutil.copy(rf, f)
            shutil.copymode(rf, f)
        
    return res

tmp = commands.table["^update|up|checkout|co"]
my_update.__doc__ = tmp[0].__doc__
commands.table["^update|up|checkout|co"] = (my_update, tmp[1], tmp[2])


cmdtable = {
    '^bigstatus|bstatus':
        (bigstatus,
         [('n', 'no-status', None, _('hide status prefix')),
         ] + commands.walkopts,
        _('hg bigstatus [SOURCE]')),
    '^bigrefresh|brefresh':
        (bigrefresh,
         [('n', 'dry-run', None,
           _('do not perform actions, just print output')),
         ] + commands.walkopts,
        _('hg bigrefresh [SOURCE]')),
    '^bigupdate|bup|bigcheckout|bco':
        (bigupdate,
         [('n', 'dry-run', None,
           _('do not perform actions, just print output')),
         ] + commands.walkopts,
        _('hg bigupdate [SOURCE]')),
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bigfiles.py
Type: application/octet-stream
Size: 9220 bytes
Desc: not available
Url : http://selenic.com/pipermail/mercurial/attachments/20081017/620c860a/attachment.obj 

