Converting from CVSNT
Michael Haggerty
mhagger at alum.mit.edu
Tue Jun 30 16:28:55 CDT 2009
Greg Ward wrote:
> If I convert to Mercurial following chronological order (as
> cvs2{svn,git} generate), then I get a ~5 GB manifest file. If I let
> Mercurial toposort the way it wants, I get a ~150 MB manifest file (if
> memory serves). (And if I tweak the toposort algorithm to generate a
> more sensible but not quite space-optimal sort, I think I got a ~180
> MB manifest.) That makes the difference between "darn, Mercurial
> looks neat but is unsuitable for us" and "Mercurial wins".
>
> The annoying thing is that this is all just an implementation
> detail of Mercurial, but it's a *killer* implementation detail.
Please correct me if I'm wrong in inferring that Mercurial doesn't care
whether the commits are in chronological order, but benefits from having
similar revisions near each other because the revlog format computes
diffs between revisions that happen to be adjacent in the file, not
revisions that are adjacent in the DAG. (I always wondered how the
revlog format represents branches, since it seems like such a linear
format. Obviously the answer is that it *doesn't* represent branches
very efficiently, at least in the general case.)
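
To make that inference concrete, here is a toy model in Python. It is not Mercurial's revlog code: the "delta" is just a count of added and removed lines, the two branches are made to share no content at all so the effect is exaggerated, and every name in it is made up for illustration.

```python
def delta_cost(prev, cur):
    # Toy delta size: lines that must be added or removed to turn prev into cur.
    return len(prev ^ cur)

def storage_cost(revisions):
    # The first revision is stored whole; every later one is stored as a
    # delta against whatever revision physically precedes it in the file.
    total = len(revisions[0])
    for prev, cur in zip(revisions, revisions[1:]):
        total += delta_cost(prev, cur)
    return total

def make_branch(prefix, n_revs, base_lines=100):
    # Revision i of a branch = base_lines branch-specific lines plus i extra lines.
    base = {f"{prefix}{j}" for j in range(base_lines)}
    return [frozenset(base | {f"{prefix}x{j}" for j in range(i)})
            for i in range(n_revs)]

branch_a = make_branch("a", 5)
branch_b = make_branch("b", 5)

# "Chronological" order alternates between the branches; "grouped" order
# writes one whole branch and then the other.
interleaved = [rev for pair in zip(branch_a, branch_b) for rev in pair]
grouped = branch_a + branch_b

interleaved_cost = storage_cost(interleaved)
grouped_cost = storage_cost(grouped)
print(interleaved_cost, grouped_cost)
```

In this model the interleaved order pays for a near-total rewrite at every step, while the grouped order pays that price only once, at the switch from one branch to the other.
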
Anyway, if it is not too much work to determine a more advantageous (in
the sense of "efficient for Mercurial") ordering from among all of the
topologically valid choices, maybe cvs2svn could generate the changesets
in that order.
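
One possible heuristic for picking such an ordering, sketched in Python: a topological sort that keeps following the line of development it is already on and only switches when it runs out of ready children. This is an assumption about what "more advantageous" might mean, not what cvs2svn or Mercurial's convert extension actually does; the function name and the input shape (a dict mapping each changeset to its list of parents) are hypothetical.

```python
def branch_friendly_toposort(parents):
    # parents: dict mapping node -> list of parent nodes (every parent must
    # itself be a key). Returns a topologically valid order that prefers a
    # child of the node emitted last, so each line of development stays
    # contiguous for as long as possible.
    children = {n: [] for n in parents}
    pending = {n: len(ps) for n, ps in parents.items()}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    ready = sorted(n for n, count in pending.items() if count == 0)
    order = []
    last = None
    while ready:
        # Prefer a ready node that continues from the last one emitted;
        # otherwise fall back to the ready node that has waited longest.
        pick = next((n for n in ready if last in parents[n]), ready[0])
        ready.remove(pick)
        order.append(pick)
        last = pick
        for c in children[pick]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    return order

# Changesets numbered chronologically: trunk is 0-1-3-5, the branch is 0-2-4-6.
history = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}
order = branch_friendly_toposort(history)
print(order)
```

For this history the result is the whole trunk followed by the whole branch, rather than the two alternating as they do in chronological order.
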
On the other hand, it might make more sense to develop a tool that can
optimize an existing Mercurial repository. It seems to me that such a
tool could usefully be applied to repositories that have been in use for
a while--a kind of "defragmentation" of lines of development, or "hg
repack" by analogy with git. Then the convert extension wouldn't have
to deal with this issue at all.
> On Sun, Jun 28, 2009 at 3:56 PM, Michael Haggerty<mhagger at alum.mit.edu> wrote:
>> By the way, I've done some work (not yet published) on changing cvs2git
>> to generate the revision contents much more efficiently (by using the
>> internal checkout code instead of calling "cvs co" each time). This
>> would not save so much time for cvs2hg because hg-fastimport requires
>> inline blobs
>
> That last bit is no longer true. I fixed hg-fastimport to accommodate
> blob refs months ago:
> http://vc.gerg.ca/hg/hg-fastimport/rev/9e9c215fcbd8 . (That's why I
> mostly use cvs2git for testing hg-fastimport; I pretty much ignore
> cvs2hg. I think the only difference between cvs2hg and cvs2git needs
> to be max number of merge parents, since hg-fastimport does not handle
> octopus merges yet.)
That's good to know. Is this in a released hg version? If so, then
I would probably change the cvs2hg-example.options file to generate
non-inline blobs by default (but leave the restriction on the max number
of merge parents).
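
For readers unfamiliar with the distinction, the Python sketch below builds the two forms of a one-file commit in the git fast-import stream format: one where the blob is emitted up front and referenced by mark, and one where the content is inlined into the commit. The helper names, committer identity, and timestamp are placeholders, and this is not the output of cvs2git or cvs2hg themselves.

```python
def data_cmd(payload):
    # fast-import "data" command: byte count, then the raw bytes, then an LF.
    return b"data %d\n%s\n" % (len(payload), payload)

# Placeholder commit header; the identity and timestamp are made up.
COMMIT_HEADER = (
    b"commit refs/heads/master\n"
    b"committer Example <example@example.invalid> 1246400935 -0500\n"
)

def with_blob_ref(path, content, message):
    # The blob is written once, tagged with mark :1, and the commit's
    # filemodify line refers to that mark.
    return (b"blob\nmark :1\n" + data_cmd(content)
            + COMMIT_HEADER + data_cmd(message)
            + b"M 100644 :1 " + path + b"\n")

def with_inline_blob(path, content, message):
    # The file content travels inside the commit itself.
    return (COMMIT_HEADER + data_cmd(message)
            + b"M 100644 inline " + path + b"\n" + data_cmd(content))

ref_stream = with_blob_ref(b"README", b"hello\n", b"initial")
inline_stream = with_inline_blob(b"README", b"hello\n", b"initial")
print(ref_stream.decode())
print(inline_stream.decode())
```

With blob references, the file contents can be generated in a separate pass from the commits, which is what makes the faster revision-content generation useful.
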
By the way, is the order that revisions are recorded for a single file
required to be the same as the order that changesets are committed to
the repository? Obviously, file revisions have to be topologically
sorted. But it seems to me that efficiency gains could be had by
sorting file revisions for each file separately so as to optimize the
revlog for the file, and separately sorting changesets in such a way as to
optimize the revlog for the manifest. Another job for "hg repack" :-)
Michael