tracking third-party sources
Giorgos Keramidas
keramida at ceid.upatras.gr
Sat Jan 3 20:08:10 CST 2009
On Sun, 4 Jan 2009 00:48:02 +0000 (UTC), Pierre Asselin <pa at panix.com> wrote:
> I'm looking for a mercurial equivalent to "cvs import".
> Based in part on the wiki I came up with this.
>
> Create the repository normally ...
> hg init foo-incoming
>
> ... and perform all code drops as follows:
> tar xjvf foo-x.y.tar.bz2
> cd foo-x.y
> ln -s ../foo-incoming/.hg . # clever !
> hg addremove --similarity 70
> hg commit -m foo-x.y
>
> Questions:
> 1) Am I being *too* clever ?
> 2) Is the behavior worth supporting officially, e.g.
> hg addremove -s 70 -R ../foo-incoming
> (doesn't work as of hg 1.0.2 but the symlink does.)
Hi Pierre,
No, you are not being too clever, at least not if you avoid the symlink
hack. I am not sure how well the symlink hack will work, but you can
definitely use a single 'incoming' clone for importing upstream sources.
It is, in fact, exactly what the vendor branches in CVS are used for.
But see below for a caveat about the copies 'addremove -s' may record.
I used something similar a few months ago, to collect top(1) snapshots
from all over the network. I am still working on a Mercurial 'forest'
of repositories with as many top releases as I can find, but the method
used so far is something similar to yours:
I started with a clean import of the oldest release I could find:
% mkdir release
% hg init release/3.1
% cd release/3.1
... extract top-3.1.tar.gz in .
% hg commit -A -m 'Import a clean source tree of top-3.1 release'
% hg tag top-3.1
Then I cloned release/3.1 to release/3.2, updated to the tip of the 3.1
repository and ran:
% hg up -C tip
% hg revert -r null . # 'remove' workspace files
... extract top-3.2.tar.gz in current directory
% hg addremove -s 90
This worked pretty well, and I am half done collecting the snapshots of
top that were interesting :-) :-)
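The per-release steps above could be wrapped in a small shell function.
This is only a sketch: the name import_drop, the default similarity of
90, and the use of --strip-components=1 (which assumes the tarball has a
single top-level directory and a tar that supports the option) are my
assumptions, not part of the procedure described in this message.

```shell
# import_drop TARBALL [SIMILARITY]
# Hypothetical helper; run it from the root of the clone made for the
# new release (e.g. release/3.2).
import_drop() {
    tarball=$1
    similarity=${2:-90}

    # Start from a clean checkout of the previous release.
    # Note that -C discards any uncommitted changes.
    hg update -C tip || return 1

    # 'Remove' the workspace files, as in the transcript above.
    hg revert -r null . || return 1

    # Unpack the new release into the current directory.
    # Assumption: one top-level directory inside the tarball.
    tar -xzf "$tarball" --strip-components=1 || return 1

    # Let Mercurial record additions, removals and likely renames.
    hg addremove -s "$similarity" || return 1

    # Print the recorded copies and renames for review before committing.
    # grep exits non-zero when nothing matches, which is not an error here.
    hg diff --git | grep -E '^(copy|rename) (from|to) ' || true
}
```

The function deliberately stops short of committing, so that the copy
records can be checked by hand first.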
There is, however, one important detail about using addremove for code
drops. It may record too many file copies, even with large similarity
percentage values (over 95%).
After every 'addremove' I verified with 'hg diff --git' that there are
no false copies. Sometimes, when there are small files in the workspace
that have huge copyright notices but very little "real" content,
using --similarity 90 ends up recording a few false positive copies.
These can be manually reverted after addremove, and recorded as a pair
of separate 'hg add' and 'hg remove' operations.
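To make that review quicker, the copy records can be pulled out of the
git-style diff with a few lines of awk. This is a sketch, not an hg
feature: list_copies is a hypothetical helper name, and it relies only
on the 'copy from'/'copy to' and 'rename from'/'rename to' header lines
that 'hg diff --git' prints.

```shell
# list_copies: read a git-style diff on stdin and print one
# "source -> destination" line per recorded copy or rename.
# Content lines in a diff start with '+', '-' or a space, so only
# the diff headers can match the patterns below.
list_copies() {
    awk '
        /^(copy|rename) from / {
            sub(/^(copy|rename) from /, "")
            src = $0            # remember the source path
            next
        }
        /^(copy|rename) to / {
            sub(/^(copy|rename) to /, "")
            print src " -> " $0 # pair it with the destination path
        }
    '
}

# Typical use inside a workspace, after addremove:
#   hg diff --git | list_copies
```

A source file that appears on the left of more than one line is worth a
closer look, since a single removed file rarely turns into several new
ones.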
A false positive shows up as 'copy from' or 'rename from' lines near the
start of the diff for each copied file. If you are a bit careful you can
selectively revert the addremove copies even if a single source file is
auto-detected as the source of multiple copies. For example:
% hg root
/tmp/demo
% cat << EOF > foo
> /* Large boilerplate comment. This should be enough to trigger
> a false positive when addremove tries to autodetect file moves
> with a small enough similarity percentage. */
>
> foo
> EOF
% hg commit -A -m 'import snapshot #1'
adding foo
Now let's assume that you received a new code drop that does not include
the file called 'foo', but only two new files: 'bar' and 'quux'. After
deleting all the workspace files and dropping the new sources into the
workspace, you could end up with something similar to the result of:
% rm foo
% cat << EOF > bar
> /* Large boilerplate comment. This should be enough to trigger
> a false positive when addremove tries to autodetect file moves
> with a small enough similarity percentage. */
>
> bar
> EOF
% cat << EOF > quux
> /* Large boilerplate comment. This should be enough to trigger
> a false positive when addremove tries to autodetect file moves
> with a small enough similarity percentage. */
>
> quux
> EOF
% hg stat
! foo
? bar
? quux
The 'foo' file is gone from this snapshot. There are two new files, so
we can run addremove. The boilerplate comments are a large percentage
of these tiny files, so running 'hg addremove -s 75' at this point
records file 'foo' as the source of *two* copy operations:
% hg addremove -s 75
adding bar
removing foo
adding quux
recording removal of foo as rename to bar (97% similar)
recording removal of foo as rename to quux (97% similar)
Note how the similarity level of the files is pretty high, in spite of
their tiny size. If you _carefully_ review the diff, and you are
certain that the vendor source copied 'foo' only to 'bar' but _not_ to
'quux', you can revert just one of the two copies:
% hg revert quux # This does not delete the _workspace_
# file; it merely undoes the recorded
# copy operation.
% hg add quux # Now record it as a simple 'new' file.
Then only 'bar' shows up as a copied file in 'hg diff --git' output and
file 'quux' shows up as a new file that was simply added in this source
snapshot (look at the lines marked with '=>' below):
% hg diff --git
diff --git a/foo b/bar
=> copy from foo
=> copy to bar
--- a/foo
+++ b/bar
@@ -2,4 +2,4 @@
a false positive when addremove tries to autodetect file moves
with a small enough similarity percentage. */
-foo
+bar
diff --git a/quux b/quux
new file mode 100644
--- /dev/null
+++ b/quux
@@ -0,0 +1,5 @@
+/* Large boilerplate comment. This should be enough to trigger
+ a false positive when addremove tries to autodetect file moves
+ with a small enough similarity percentage. */
+
+quux
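When several false copies need the same treatment, the revert-and-add
pair shown above could be wrapped in a loop. This is a minimal sketch;
unrecord_copy is a hypothetical name, and it only repeats the two hg
commands from the example for each file given.

```shell
# unrecord_copy FILE...
# Hypothetical helper: turn files that addremove recorded as copies
# back into plain additions. The workspace files are left in place.
unrecord_copy() {
    for f in "$@"; do
        hg revert "$f" || return 1  # undo the pending add and its copy record
        hg add "$f" || return 1     # record the file as a simple new file
    done
}

# Example, for the demo above:
#   unrecord_copy quux
```

Running 'hg diff --git' again afterwards is still the way to confirm
that only the intended copies remain.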
If you keep this in mind and you verify all the addremove runs by
looking at the 'diff --git' output for file copies, you can definitely
use Mercurial to track upstream source drops :-)
HTH,
Giorgos