tracking third-party sources

Giorgos Keramidas keramida at ceid.upatras.gr
Sat Jan 3 20:08:10 CST 2009


On Sun, 4 Jan 2009 00:48:02 +0000 (UTC), Pierre Asselin <pa at panix.com> wrote:
> I'm looking for a mercurial equivalent to "cvs import".
> Based in part on the wiki I came up with this.
>
> Create the repository normally ...
>     hg init foo-incoming
>
> ... and perform all code drops as follows:
>     tar xjvf foo-x.y.tar.bz2
>     cd foo-x.y
>     ln -s ../foo-incoming/.hg .      # clever !
>     hg addremove --similarity 70
>     hg commit -m foo-x.y
>
> Questions:
>     1)  Am I being *too* clever ?
>     2)  Is the behavior worth supporting officially, e.g.
> 	    hg addremove -s 70 -R ../foo-incoming
> 	(doesn't work as of hg 1.0.2 but the symlink does.)

Hi Pierre,

No, you are not being too clever.  At least not if you avoid the symlink
hack.  I am not sure how well the symlink hack will work, but you can
definitely use a single 'incoming' clone for importing upstream sources.
It is, in fact, exactly what the vendor branches in CVS are used for.
But see below for a caveat about the copies 'addremove -s' may record.

I used something similar a few months ago, to collect top(1) snapshots
from all over the network.  I am still working on a Mercurial 'forest'
of repositories with as many top releases as I can find, but the method
used so far is something similar to yours:

I started with a clean import of the oldest release I could find:

    % mkdir release
    % hg init release/3.1
    % cd release/3.1

      ... extract top-3.1.tar.gz in .

    % hg commit -A -m 'Import a clean source tree of top-3.1 release'
    % hg tag top-3.1

Then I cloned release/3.1 to release/3.2, updated to the tip of the 3.1
repository and ran:

    % hg up -C tip
    % hg revert -r null .               # 'remove' workspace files

      ... extract top-3.2.tar.gz in current directory

    % hg addremove -s 90

This worked pretty well, and I am half done collecting the snapshots of
top that were interesting :-) :-)

There is, however, one important detail about using addremove for code
drops.  It may record too many file copies, even with large similarity
percentage values (over 95%).

After every 'addremove' I verified with 'hg diff --git' that there are
no false copies.  Sometimes, when there are small files in the workspace
and they have huge copyright notices but very little "real" content,
using --similarity 90 ends up recording a few false positive copies.
These can be manually reverted after addremove, and recorded as a pair
of separate 'hg add' and 'hg remove' operations.
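
The check described above can be scripted.  What follows is only a
rough sketch, not something I have run against every Mercurial version:
it assumes the git-style 'copy from' / 'copy to' and 'rename from' /
'rename to' header lines start in column 0, and the function name
list_copies is made up for this example.

```shell
# list_copies: read 'hg diff --git' output on stdin and print one
# "kind: source -> destination" line per recorded copy or rename.
list_copies() {
    awk '
        # Remember the kind (copy/rename) and the source file name.
        /^(copy|rename) from / { kind = $1; src = substr($0, length($1) + 7) }
        # On the matching "to" line, print the pair.
        /^(copy|rename) to /   { print kind ": " src " -> " substr($0, length($1) + 5) }
    '
}
```

It would be used as 'hg diff --git | list_copies' right after every
addremove run; every line it prints is a copy that deserves a look.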

A false positive shows up as 'copy from' or 'rename from' lines near the
start of the diff for every copied file.  If you are a bit careful you
can selectively revert the addremove copies even if a single source file
is auto-detected as a source of multiple copies, e.g.:

    % hg root
    /tmp/demo

    % cat << EOF > foo
    > /* Large boilerplate comment.  This should be enough to trigger
    >    a false positive when addremove tries to autodetect file moves
    >    with a small enough similarity percentage. */
    >
    > foo
    > EOF

    % hg commit -A -m 'import snapshot #1'
    adding foo

Now let's assume that you received a new code drop that does not include
the file called 'foo', but only two new files: 'bar' and 'quux'.  After
deleting all the workspace files and dropping the new sources into the
workspace, you could end up with something similar to the result of:

    % rm foo

    % cat << EOF > bar
    > /* Large boilerplate comment.  This should be enough to trigger
    >    a false positive when addremove tries to autodetect file moves
    >    with a small enough similarity percentage. */
    >
    > bar
    > EOF

    % cat << EOF > quux
    > /* Large boilerplate comment.  This should be enough to trigger
    >    a false positive when addremove tries to autodetect file moves
    >    with a small enough similarity percentage. */
    >
    > quux
    > EOF

    % hg stat
    ! foo
    ? bar
    ? quux

The 'foo' file is gone from this snapshot.  There are two new files, so
we can run addremove.  The boilerplate comments are a large percentage
of these tiny files, so running 'hg addremove -s 75' at this point
records file 'foo' as the source of *two* copy operations:

    % hg addremove -s 75
    adding bar
    removing foo
    adding quux
    recording removal of foo as rename to bar (97% similar)
    recording removal of foo as rename to quux (97% similar)

Note how the similarity level of the files is pretty high, in spite of
their tiny size.  If you _carefully_ review the diff, and you are
certain that the vendor source copied 'foo' only to 'bar' but _not_ to
'quux', you can partially revert one of the two copies:

    % hg revert quux            # This does not delete the _workspace_
                                # file; it merely undoes the recorded
                                # copy operation.

    % hg add quux               # Now record it as a simple 'new' file.

Then only 'bar' shows up as a copied file in 'hg diff --git' output and
file 'quux' shows up as a new file that was simply added in this source
snapshot (look at the lines marked with '=>' below):

    % hg diff --git
    diff --git a/foo b/bar
=>  copy from foo
=>  copy to bar
    --- a/foo
    +++ b/bar
    @@ -2,4 +2,4 @@
        a false positive when addremove tries to autodetect file moves
        with a small enough similarity percentage. */

    -foo
    +bar
    diff --git a/quux b/quux
    new file mode 100644
    --- /dev/null
    +++ b/quux
    @@ -0,0 +1,5 @@
    +/* Large boilerplate comment.  This should be enough to trigger
    +   a false positive when addremove tries to autodetect file moves
    +   with a small enough similarity percentage. */
    +
    +quux
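
A source file that addremove picked for more than one copy, like 'foo'
in this demo, can also be spotted from the messages addremove prints.
This is a minimal sketch that assumes the 'recording removal of X as
rename to Y (N% similar)' message format shown earlier, which may differ
between Mercurial versions; the function name multi_copy_sources is
invented for illustration.

```shell
# multi_copy_sources: read 'hg addremove -s N' output on stdin and print
# every removed file that was recorded as the source of more than one
# rename.
multi_copy_sources() {
    # Keep only the source name from each "recording removal" message,
    # then report the names that occur more than once.
    sed -n 's/^recording removal of \(.*\) as rename to .* ([0-9]*% similar)$/\1/p' |
        sort | uniq -d
}
```

Saving the addremove output to a file and running
'multi_copy_sources < that-file' lists the sources worth reviewing
first in the 'hg diff --git' output.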

If you keep this in mind and you verify all the addremove runs by
looking at the 'diff --git' output for file copies, you can definitely
use Mercurial to track upstream source drops :-)

HTH,
Giorgos


