[ANNOUNCE] An extension to handle big files
Andrei Vermel
avermel at mail.ru
Fri Oct 17 04:51:52 CDT 2008
Due to memory and performance limitations, big files shouldn't be
stored in an hg repo (hg runs out of memory checking in a 170 MB file on
my 2 GB box).
Matt suggests splitting the file into chunks and reassembling it during
the make stage.
Here's an extension that uses a different approach. It seems to work, but
it is only a proof of concept so far. I'd like to hear comments.
Big files are not put into the hg repo. They are listed in a file called
'.bigfiles', which also serves as an ignore file similar to .hgignore, so
they do not clutter the output of hg commands. The file also stores the
checksums of the big files, in the form of '#' comments after each path.
'.bigfiles' is versioned by hg, so each changeset knows which big files
it uses from the names and checksums.
The file can be diffed and merged, which is nice.
The versions of big files are stored in a dedicated directory, with
checksums appended to the file names.
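To make the naming concrete, here is a minimal standalone sketch (plain
Python, independent of the extension and of Mercurial) of the two textual
forms that appear in the sample session: a '.bigfiles' entry of the form
path#checksum, and a versions-directory file named path.checksum. The
helper names bigfiles_entry and version_path are made up for this
illustration.

```python
def bigfiles_entry(path, checksum):
    # One line of '.bigfiles': the path, then the checksum after '#'.
    return "%s#%s" % (path, checksum)


def version_path(store, path, checksum):
    # Name of the stored copy: <store>/<path>.<checksum>
    return "%s/%s.%s" % (store, path, checksum)


if __name__ == "__main__":
    path = "3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip"
    checksum = "75b8180bf172332d2418c965c787e1cc59522345"
    print(bigfiles_entry(path, checksum))
    print(version_path("f:/repos/bigrepo", path, checksum))
```

Because the checksum is part of the stored name, two versions of the same
big file can sit side by side in the versions directory.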
The extension overrides 'hg update', so that it can compare the contents
of '.bigfiles' before and after the update and then remove and fetch the
appropriate big files.
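The comparison can be sketched as a difference between two path-to-checksum
mappings. This is only an illustration in standalone Python; plan_update is
a hypothetical name, and the mappings are assumed to have been parsed from
'.bigfiles' before and after the update.

```python
def plan_update(before, after):
    """Compare two {path: checksum} mappings read from '.bigfiles'.

    Returns (to_remove, to_fetch): the paths that are no longer listed, and
    a mapping of the paths that are new or whose checksum changed.
    """
    to_remove = sorted(p for p in before if p not in after)
    to_fetch = {p: h for p, h in after.items() if before.get(p) != h}
    return to_remove, to_fetch


if __name__ == "__main__":
    before = {"a.zip": "111", "b.exe": "222", "c.pdb": "333"}
    after = {"a.zip": "111", "b.exe": "999", "d.dll": "444"}
    print(plan_update(before, after))
```

Files whose checksum is unchanged appear in neither result, so an update
between changesets that share most big files copies very little.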
The directory storing versions of big files can be synced with a remote
one (the extension doesn't do this, but it lists the necessary files).
Versions corresponding to old changesets can be removed to save space.
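The extension has no pruning command. As a rough sketch of how stale
versions could be found by hand, the standalone Python below lists files in
the versions directory that are not in a given set of wanted names. The
function name unreferenced_versions is hypothetical, and collecting the
wanted names (path.checksum for every '.bigfiles' revision you still care
about) is left to the caller.

```python
import os


def unreferenced_versions(store, referenced):
    """List files under `store` whose store-relative name is not wanted.

    `referenced` is a set of names of the form '<path>.<checksum>', using
    '/' as the separator, for every version that must stay fetchable.
    """
    stale = []
    for dirpath, _dirnames, filenames in os.walk(store):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, store).replace(os.sep, "/")
            if rel not in referenced:
                stale.append(rel)
    return sorted(stale)
```

The sketch only reports candidates; deleting them is a separate, deliberate
step.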
To add a new big file, use 'hg add' and ignore the size warning.
To remove a tracked big file, just delete it.
'hg bstat' shows added big files as 'A', removed big files as '!',
modified big files as 'M', files tracked by hg that got too big as 'B',
and big files listed in '.bigfiles' but missing from the big files
versions directory as 'R'.
'hg brefresh' updates '.bigfiles' and the versions directory according
to the current state of the working directory. Added big files get
forgotten and added to '.bigfiles' instead. Removed big files are deleted
from '.bigfiles'. Files tracked by hg that got too big are removed from
hg and added to '.bigfiles'. Copies of new and modified big files are
stored in the versions directory.
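The "store a copy" step can be sketched in standalone Python 3 as follows.
This is not the extension's code (that is Python 2 and uses Mercurial's
util module); sha1_of_file and store_version are names invented here, and
using shutil.copy2 instead of shutil.copy is my own choice for the sketch.

```python
import hashlib
import os
import shutil


def sha1_of_file(path, chunk_size=1000000):
    # Hash in chunks so a large file is never held in memory at once.
    digest = hashlib.sha1()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_version(workdir, store, relpath):
    """Copy workdir/relpath to store/relpath.<sha1>; return the checksum."""
    src = os.path.join(workdir, relpath)
    checksum = sha1_of_file(src)
    dst = os.path.join(store, relpath + "." + checksum)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if not os.path.exists(dst):
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
    return checksum
```

The returned checksum is what would be written after the '#' in the
matching '.bigfiles' line.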
'hg bupdate' fetches files from the versions directory as recorded in
'.bigfiles', and complains about necessary files missing from the
versions directory.
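The fetch-and-complain behaviour can be pictured with the standalone
Python 3 sketch below. It is an illustration under assumptions, not the
extension's implementation: fetch_versions is a hypothetical name, and
entries is assumed to be the path-to-checksum mapping read from
'.bigfiles'.

```python
import os
import shutil


def fetch_versions(store, workdir, entries):
    """Copy each store/<path>.<checksum> to workdir/<path>.

    Returns the names (in path order) of stored versions that were needed
    but not found, so the caller can report them.
    """
    missing = []
    for path, checksum in sorted(entries.items()):
        src = os.path.join(store, path + "." + checksum)
        dst = os.path.join(workdir, path)
        if not os.path.isfile(src):
            missing.append(path + "." + checksum)
            continue
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)
    return missing
```

The list of missing names is exactly what would have to be synced from a
remote versions directory.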
To set up the extension, add to a config file:
[bigfiles]
repo = path/to/versions/dir
Below is a sample session:
F:\repos\test>hg init
F:\repos\test>hg stat
? 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
? 3rdParty/lib3rd/SqlServerExpress/SQLEXPR32.EXE
? 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.dll
? 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
? 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
? 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
? 3rdParty/lib3rd/microsoft/dotnet/dotnetfx.exe
F:\repos\test>hg add
adding 3rdParty\lib3rd\SqlServerExpress\SQLEXPR.EXE
adding 3rdParty\lib3rd\SqlServerExpress\SQLEXPR32.EXE
adding 3rdParty\lib3rd\acis\bin\NT_VC8_DLLD\SpaACISd.dll
adding 3rdParty\lib3rd\acis\bin\NT_VC8_DLLD\SpaACISd.pdb
adding 3rdParty\lib3rd\aciscatiav5rd\aciscatiav5rd.zip
adding 3rdParty\lib3rd\microsoft\dotnet\NetFx64.exe
adding 3rdParty\lib3rd\microsoft\dotnet\dotnetfx.exe
3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE' to unadd the file)
3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb' to unadd the file)
3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip' to unadd the file)
3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe: files over 10MB may cause memory and performance problems
(use 'hg revert 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe' to unadd the file)
F:\repos\test>hg stat
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR32.EXE
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.dll
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
A 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
A 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
A 3rdParty/lib3rd/microsoft/dotnet/dotnetfx.exe
F:\repos\test>hg bstat
abort: bigfiles.repo path not configured
F:\repos\test>echo [bigfiles]>>.hg/hgrc
F:\repos\test>echo repo= f:/repos/bigrepo>>.hg/hgrc
F:\repos\test>hg bstat
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
A 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
A 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
F:\repos\test>hg bref
forgetting 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE
forgetting 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb
forgetting 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
forgetting 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe
F:\repos\test>hg stat
A 3rdParty/lib3rd/SqlServerExpress/SQLEXPR32.EXE
A 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.dll
A 3rdParty/lib3rd/microsoft/dotnet/dotnetfx.exe
? .bigfiles
F:\repos\test>hg bstat
F:\repos\test>cat .bigfiles
3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE#3b10e09f2ad52d15d8cabdcdfb197b98ffc78402
3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb#da1b88073a76fa7dc85b8a55e89e2907907bca64
3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip#75b8180bf172332d2418c965c787e1cc59522345
3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe#e59cca309463a5d98daeaada83d1b05fed5126c5
F:\repos\test>g:\test_cygwin\bin\find f:/repos/bigrepo -type f
f:/repos/bigrepo/3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb.da1b88073a76fa7dc85b8a55e89e2907907bca64
f:/repos/bigrepo/3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip.75b8180bf172332d2418c965c787e1cc59522345
f:/repos/bigrepo/3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe.e59cca309463a5d98daeaada83d1b05fed5126c5
f:/repos/bigrepo/3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE.3b10e09f2ad52d15d8cabdcdfb197b98ffc78402
F:\repos\test>hg add
adding .bigfiles
F:\repos\test>hg ci -m 1
F:\repos\test>hg stat
F:\repos\test>hg bstat
F:\repos\test>echo qqq>>3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
F:\repos\test>hg bstat
M 3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip
F:\repos\test>hg bref
F:\repos\test>hg stat
M .bigfiles
F:\repos\test>hg diff
diff --git a/.bigfiles b/.bigfiles
--- a/.bigfiles
+++ b/.bigfiles
@@ -1,4 +1,4 @@
 3rdParty/lib3rd/SqlServerExpress/SQLEXPR.EXE#3b10e09f2ad52d15d8cabdcdfb197b98ffc78402
 3rdParty/lib3rd/acis/bin/NT_VC8_DLLD/SpaACISd.pdb#da1b88073a76fa7dc85b8a55e89e2907907bca64
-3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip#75b8180bf172332d2418c965c787e1cc59522345
+3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip#b0d7e27ba22aa02ff304237278dd557bf2882409
 3rdParty/lib3rd/microsoft/dotnet/NetFx64.exe#e59cca309463a5d98daeaada83d1b05fed5126c5
F:\repos\test>hg ci -m 2
F:\repos\test>hg co -r 0
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
fetching f:/repos/bigrepo/3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip.75b8180bf172332d2418c965c787e1cc59522345
F:\repos\test>hg co -r 1
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
fetching f:/repos/bigrepo/3rdParty/lib3rd/aciscatiav5rd/aciscatiav5rd.zip.b0d7e27ba22aa02ff304237278dd557bf2882409
F:\repos\test>
# bigfiles.py
#
# Copyright 2008 Andrei Vermel <andrei.vermel at gmail.com>
#
# This software may be used and distributed according to the terms
# of the GNU General Public License, incorporated herein by reference.
from mercurial.i18n import _
from mercurial.node import *
from mercurial import commands, cmdutil, hg, node, util
import os, stat, shutil
_sha1 = util.sha1
def setup_bigfiles_ignore(ui, repo):
    # Make sure hg treats '.bigfiles' as an additional ignore file.
    ignore = ui.config('ui', 'ignore.bigfiles')
    if ignore:
        if ignore != '.bigfiles':
            # format after translating, not before
            raise util.Abort(_('ui.ignore.bigfiles is %s, not .bigfiles') %
                             ignore)
    else:
        fname = repo.wjoin('.hg/hgrc')
        try:
            fp = open(fname)
            lines = fp.readlines()
            fp.close()
            fp = open(fname, 'w')
            added = False
            for line in lines:
                fp.write(line)
                spl = line.strip().split('#')
                if not added and spl and '[ui]' in spl[0]:
                    fp.write('\nignore.bigfiles = .bigfiles\n')
                    added = True
            if not added:
                fp.write('\n[ui]\n')
                fp.write('ignore.bigfiles = .bigfiles\n')
            fp.close()
        except IOError:
            # no readable hgrc yet: write one with just the [ui] section
            fp = open(fname, 'w')
            fp.write('[ui]\n')
            fp.write('ignore.bigfiles = .bigfiles\n')
            fp.close()
    fname = repo.wjoin('.bigfiles')
    try:
        os.stat(fname)
    except OSError:
        open(fname, 'w').close() # create empty file
def parse_bigfiles(repo):
    fname = repo.wjoin('.bigfiles')
    bigfiles = {}
    try:
        for str in open(fname):
            path, hash = str.strip().split('#')
            bigfiles[path] = hash
    except:
        pass
    return bigfiles
def bigfiles_repo(ui):
    brepo = ui.config('bigfiles', 'repo')
    if not brepo:
        raise util.Abort(_('bigfiles.repo path not configured'))
    try:
        st = os.stat(brepo)
    except OSError:
        raise util.Abort(_("can't access bigfiles repo: %s") % brepo)
    # checked outside the try block, so this Abort is not swallowed by
    # the handler above and replaced with the "can't access" message
    if not stat.S_ISDIR(st[stat.ST_MODE]):
        raise util.Abort(
            _('specified bigfiles repo %s is not a directory') % brepo)
    return brepo
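def _hash(f):
    # SHA-1 of the file contents, read in 1000000-byte chunks so that the
    # whole file is never held in memory
    fp = open(f, 'rb')
    try:
        s = _sha1("")
        while True:
            text = fp.read(1000000)
            if text == '':
                break
            s.update(text)
    finally:
        fp.close()
    return s.hexdigest()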
def _bigstatus(ui, repo, pats, opts):
    MAX_SIZE = 40000000
    brepo = bigfiles_repo(ui)
    tracked_gotbig = [] # not in .bigfiles
    added_big = [] # not in .bigfiles
    modified = [] # already in .bigfiles
    removed = [] # missing, but still in .bigfiles
    gotsmall = [] # still in .bigfiles
    missinginrepo = [] # file recorded in .bigfiles not in bigfiles repo
    node1, node2 = cmdutil.revpair(repo, None)
    mod_all, added_all = repo.status(node1, node2,
        cmdutil.match(repo, pats, opts), None, None, True)[0:2]
    bigfiles = parse_bigfiles(repo)
    for file in mod_all:
        f = repo.wjoin(file)
        fsize = os.stat(f)[stat.ST_SIZE]
        if fsize > MAX_SIZE:
            tracked_gotbig.append(file)
    for file in added_all:
        f = repo.wjoin(file)
        fsize = os.stat(f)[stat.ST_SIZE]
        if fsize > MAX_SIZE:
            added_big.append(file)
    for file, hash in bigfiles.iteritems():
        f = repo.wjoin(file)
        try:
            st = os.stat(f)
        except OSError:
            removed.append(file)
            continue
        if st[stat.ST_SIZE] <= MAX_SIZE:
            gotsmall.append(file)
            continue
        frepo = "%s/%s.%s" % (brepo, file, hash)
        try:
            st_repo = os.stat(frepo)
            if st[stat.ST_SIZE] == st_repo[stat.ST_SIZE] and \
               st[stat.ST_MTIME] == st_repo[stat.ST_MTIME]:
                continue
        except OSError:
            print "missing frepo:", frepo
            missinginrepo.append(file)
        fhash = _hash(f)
        if fhash != hash:
            modified.append(file)
    return tracked_gotbig, added_big, modified, removed, gotsmall, \
           missinginrepo
def bigstatus(ui, repo, *pats, **opts):
    '''show changed big files in the working directory

    Show status of big files in the repository.

    The codes used to show the status of files are:
    B = tracked by hg, got too big.
    A = added to hg, too big
    M = modified, tracked by big file repo
    ! = deleted from working directory, tracked by big file repo
    S = got small, so can now be tracked by hg
    R = tracked by big file repo, but the data is missing'''
    setup_bigfiles_ignore(ui, repo)
    bst = _bigstatus(ui, repo, pats, opts)
    codes = ('B', 'A', 'M', '!', 'S', 'R')
    for files, code in zip(bst, codes):
        for f in files:
            if opts['no_status']:
                ui.write("%s\n" % f)
            else:
                ui.write("%s %s\n" % (code, f))
def _updatebigrepo(ui, repo, files, brepo, bigfiles):
    for file in files:
        f = repo.wjoin(file)
        hash = _hash(f)
        bigfiles[file] = hash
        rf = "%s/%s.%s" % (brepo, file, hash)
        util.makedirs(os.path.dirname(rf))
        shutil.copy(f, rf)
        shutil.copymode(f, rf)
def bigrefresh(ui, repo, *pats, **opts):
    '''update big files tracking according to the current state of working
    directory.

    Added big files get forgotten and added to '.bigfiles' instead.
    Removed big files are deleted from '.bigfiles'. Files tracked by hg
    that got too big are removed from hg, and added to '.bigfiles'. Copies
    of new and modified big files are stored in versions directory.'''
    setup_bigfiles_ignore(ui, repo)
    tracked_gotbig, added_big, modified, removed, gotsmall, \
        missinginrepo = _bigstatus(ui, repo, pats, opts)
    for f in added_big:
        ui.write("forgetting %s\n" % f)
    if not opts['dry_run']:
        repo.forget(added_big)
    for f in tracked_gotbig:
        ui.write("removing %s\n" % f)
    if not opts['dry_run']:
        repo.remove(tracked_gotbig, unlink=False)
    for f in removed:
        ui.write("recording removal of %s\n" % f)
    brepo = bigfiles_repo(ui)
    bigfiles = parse_bigfiles(repo)
    if not opts['dry_run']:
        _updatebigrepo(ui, repo, tracked_gotbig + added_big + modified,
                       brepo, bigfiles)
        for file in removed:
            del bigfiles[file]
        fp = open(repo.wjoin('.bigfiles'), 'w')
        for f in util.sort(bigfiles.keys()):
            fp.write("%s#%s\n" % (f, bigfiles[f]))
        fp.close()
def bigupdate(ui, repo, *pats, **opts):
    '''fetch files from versions directory as recorded in '.bigfiles'.

    Also complain about necessary files missing in the version directory'''
    setup_bigfiles_ignore(ui, repo)
    tracked_gotbig, added_big, modified, removed, gotsmall, \
        missinginrepo = _bigstatus(ui, repo, pats, opts)
    brepo = bigfiles_repo(ui)
    bigfiles = parse_bigfiles(repo)
    for file in removed:
        f = repo.wjoin(file)
        hash = bigfiles[file]
        rf = "%s/%s.%s" % (brepo, file, hash)
        ui.write("fetching %s\n" % rf)
        util.makedirs(os.path.dirname(f))
        if not opts['dry_run']:
            shutil.copy(rf, f)
            shutil.copymode(rf, f)
    if missinginrepo:
        ui.write("\nMissing in bigrepo:\n")
        for file in missinginrepo:
            f = repo.wjoin(file)
            hash = _hash(f)
            bigfiles[file] = hash
            rf = "%s/%s.%s" % (brepo, file, hash)
            ui.write("%s\n" % rf)
def my_update(ui, repo, node=None, rev=None, clean=False, date=None):
    # remember '.bigfiles' before and after the real update, then sync
    # the big files in the working directory to the difference
    bigfiles0 = parse_bigfiles(repo)
    res = commands.update(ui, repo, node, rev, clean, date)
    bigfiles1 = parse_bigfiles(repo)
    for file in bigfiles0.keys():
        if file not in bigfiles1:
            util.unlink(repo.wjoin(file)) # was 'utils', an undefined name
    tofetch = {}
    for file, hash in bigfiles1.iteritems():
        if file not in bigfiles0 or bigfiles0[file] != hash:
            tofetch[file] = hash
    if tofetch: # 'tofetch is not {}' was always true
        brepo = bigfiles_repo(ui)
        for file, hash in tofetch.iteritems():
            f = repo.wjoin(file)
            rf = "%s/%s.%s" % (brepo, file, hash)
            ui.write("fetching %s\n" % rf)
            util.makedirs(os.path.dirname(f))
            shutil.copy(rf, f)
            shutil.copymode(rf, f)
    return res
tmp = commands.table["^update|up|checkout|co"]
my_update.__doc__ = tmp[0].__doc__
commands.table["^update|up|checkout|co"] = (my_update, tmp[1], tmp[2])
cmdtable = {
    '^bigstatus|bstatus':
        (bigstatus,
         [('n', 'no-status', None, _('hide status prefix')),
         ] + commands.walkopts,
         _('hg bigstatus [SOURCE]')),
    '^bigrefresh|brefresh':
        (bigrefresh,
         [('n', 'dry-run', None,
           _('do not perform actions, just print output')),
         ] + commands.walkopts,
         _('hg bigrefresh [SOURCE]')),
    '^bigupdate|bup|bigcheckout|bco':
        (bigupdate,
         [('n', 'dry-run', None,
           _('do not perform actions, just print output')),
         ] + commands.walkopts,
         _('hg bigupdate [SOURCE]')),
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bigfiles.py
Type: application/octet-stream
Size: 9220 bytes
Desc: not available
Url : http://selenic.com/pipermail/mercurial/attachments/20081017/620c860a/attachment.obj