repository disk usage

Matt Mackall mpm at selenic.com
Thu Aug 11 02:32:39 CDT 2005


On Thu, Aug 11, 2005 at 12:21:19PM +0530, Aneesh Kumar wrote:
> Today i compared the disk usage of a git and hg repository. Here is
> what i found
> 
> linux-2.6-git$ du -s -h 
> 335M    .
> linux-2.6-hg$ du -s -h 
> 449M    .
> 
> both top of the tree. 

Packed git is indeed smaller. Almost all of the savings comes from
cramming everything into a very small number of files, which avoids
per-file block overhead. Consider:

- 18335 files checked in
- Average wasted block space per .d file: 2k
- Average wasted block space per .i file: ~4k (decreases to 2k in the limit)
- total wasted space: 18335 * 6k = 112M

If you run hg on an ext3 filesystem with 1k blocks, that overhead will
of course shrink to something like 30M.
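
The arithmetic above can be reproduced with a rough sketch like the
following. It assumes 4k blocks for the first case (implied by the 2k
average, not stated), reads "M" as 10^6 bytes, and assumes the per-file
waste scales linearly with block size; the function name is made up for
illustration.

```python
# Back-of-the-envelope slack estimate from the figures in the post.
FILES = 18335   # files checked in, one .d and one .i file each
KIB = 1024

def total_slack(files, d_waste, i_waste):
    """Total wasted block space in bytes for all .d and .i files."""
    return files * (d_waste + i_waste)

# Assumed 4k blocks: ~2k wasted per .d file, ~4k per .i file.
slack_4k = total_slack(FILES, 2 * KIB, 4 * KIB)

# Assumed 1k blocks: both terms shrink by 4x (linear-scaling assumption).
slack_1k = total_slack(FILES, KIB // 2, 1 * KIB)

print(f"4k blocks: {slack_4k // 10**6}M")   # 4k blocks: 112M
print(f"1k blocks: {slack_1k // 10**6}M")   # 1k blocks: 28M
```

The 1k figure comes out at roughly 28M, in line with "something like
30M".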

But packed git still has all the scalability problems of git, only
more so. When we fail to find an unpacked blob, we have to look in the
pack files. As the number of pack files increases over time, lookup
cost degrades to linear in the number of versions. And disk
locality is still more or less worst case (creation order), so
retrieving a given version will take O(files) seeks spread across the
entire repo.
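
To illustrate the lookup cost described above, here is a toy model,
not git's actual code: loose objects and each pack are plain dicts,
and find_object is a hypothetical helper that counts how many stores
it has to probe.

```python
# Toy model: a store is a dict mapping object id -> data.
# This is NOT git's implementation, only the shape of the lookup.

def find_object(obj_id, loose, packs):
    """Return (data, probes): check loose objects, then each pack in turn."""
    probes = 1
    if obj_id in loose:
        return loose[obj_id], probes
    for pack in packs:          # one extra probe per pack file
        probes += 1
        if obj_id in pack:
            return pack[obj_id], probes
    return None, probes

# Hypothetical repository with one loose object and three packs.
loose = {"aa": b"loose blob"}
packs = [{"bb": b"old"}, {"cc": b"newer"}, {"dd": b"newest"}]

data, probes = find_object("dd", loose, packs)
print(probes)   # 4: the loose store plus all three packs
```

In this model a miss costs one probe per pack, so the worst case grows
with the number of packs.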

Mercurial lays things out by filename, so it tends to have very nice
monotonic disk locality. Checkout performance does not degrade
significantly over time.
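
The difference between the two orderings can be sketched with a toy
model. Everything here is an assumption for illustration: blobs are
(filename, revision) pairs, a "layout" is just a sort order, and the
history is made up. It is not how either tool stores data on disk.

```python
# Toy model: a layout assigns each (filename, revision) blob an offset.

def offsets(blobs, key):
    """Map each blob to its position on disk under the given ordering."""
    return {blob: pos for pos, blob in enumerate(sorted(blobs, key=key))}

def checkout_offsets(blobs, layout):
    """Offsets read when checking out the newest blob of each file, by name."""
    tip = {}
    for name, rev in blobs:
        tip[name] = max(rev, tip.get(name, rev))
    return [layout[(name, tip[name])] for name in sorted(tip)]

def backward_seeks(reads):
    """Count reads that land before the previous read."""
    return sum(1 for prev, cur in zip(reads, reads[1:]) if cur < prev)

# Hypothetical history: (filename, revision that created the blob).
blobs = [("a", 0), ("b", 0), ("c", 0), ("a", 1), ("c", 2), ("b", 3)]

by_creation = offsets(blobs, key=lambda b: (b[1], b[0]))  # creation order
by_filename = offsets(blobs, key=lambda b: (b[0], b[1]))  # grouped by file

creation_reads = checkout_offsets(blobs, by_creation)
filename_reads = checkout_offsets(blobs, by_filename)

print(creation_reads, backward_seeks(creation_reads))  # [3, 5, 4] 1
print(filename_reads, backward_seeks(filename_reads))  # [1, 3, 5] 0
```

With blobs grouped by filename, a checkout in name order reads
strictly increasing offsets; in creation order the reads jump around.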

More importantly, the extra indices in Mercurial make it a more
powerful system. We can do annotate and get revision graphs of single
files painlessly.

And of course, you never need to 'pack' Mercurial repos.

-- 
Mathematics is the supreme nostalgia of our time.

