Issue552

Title Rollback can race with pull
Priority bug Status chatting
Superseder Nosy List DomQ, ThomasAH, alexis, brendan, mpm, tonfa
Assigned To mpm Topics

Created on 2007-05-01.16:26:21 by ThomasAH, last changed 2008-05-07.00:29:21 by mpm.

Messages
msg6016 (view) Author: mpm Date: 2008-05-07.00:29:19
Changing subject/priority.
msg4751 (view) Author: ThomasAH Date: 2007-12-30.18:50:30
Assigning to mpm to coordinate further fixing.
msg3185 (view) Author: ThomasAH Date: 2007-06-09.09:21:36
With current crew tip (b51a8138292a) I get the errors reported in msg3043 and
msg3047 (see superseded issue585).
msg3052 (view) Author: mpm Date: 2007-05-02.13:58:37
While this is a good patch, it's by no means complete.

Pulls can take an arbitrarily long time. If we begin a pull, start reading
changelogs, manifests, and files, and another user does rollback (by failed
commit or otherwise) somewhere in the middle, it will potentially hit files that
pull is reading.

The only way to be sure this hasn't happened is to make sure the tip at the
start of the pull still exists in the revlog at the end.

Actually, even this isn't foolproof. A rollback that removes a commit followed
by an identical commit can cause a pull that appears to be complete but is
missing some file revisions.
msg3051 (view) Author: alexis Date: 2007-05-02.07:12:57
We may indeed have some problems if somebody does a pull from a repo that is in
the middle of a rollback, but this is not the problem here - the rollback
happens in the local repo.

As a fix for issue132, we've started rereading the changelog and the manifest
when we grab the repo lock.  To avoid unnecessary work, we check whether it
looks like these revlogs have changed.  Right now this check consists of doing a
stat and comparing st_dev, st_ino, st_mtime and st_ctime with the values we
saved earlier.  If they all match, we skip the reloading (this is revlog.load in
mercurial/revlog.py).

With Brendan's patch, the commit and the rollback end up happening in the same
second, and so we're unable detect any change by looking just at mtime and ctime.

Changing revlog.load to also compare st_size (see below) fixes this case and the
2-integrity-errors case from crew-stable.  I haven't seen the 4-integrity-errors
version.

BTW, another way to make this easier to reproduce is to change the sleeps to
something less than a second (you'll need a non-ancient version of coreutils).
Changing the hook sleep to 0.3 and the other one to 0.2 makes it work for me
60-70% of the time (which is expected).

diff -r 15289406f89c mercurial/revlog.py
--- a/mercurial/revlog.py	Tue May 01 13:23:32 2007 -0700
+++ b/mercurial/revlog.py	Wed May 02 04:07:01 2007 -0300
@@ -347,7 +347,8 @@ class revlog(object):
                 if (oldst and st.st_dev == oldst.st_dev
                     and st.st_ino == oldst.st_ino
                     and st.st_mtime == oldst.st_mtime
-                    and st.st_ctime == oldst.st_ctime):
+                    and st.st_ctime == oldst.st_ctime
+                    and st.st_size == oldst.st_size):
                     return
                 self.indexstat = st
             if len(i) > 0:
msg3050 (view) Author: mpm Date: 2007-05-01.22:20:59
The diagnosis is that rollback races with pull. As pull is a reader, it takes no
locks, so it's hard to prevent.

One solution is to have pull check that the changelog tip didn't vanish at the
end, and if it did, abort the changegroup.
msg3049 (view) Author: mpm Date: 2007-05-01.22:15:13
Brendan managed to make it immediately reproduceable with this patch:

diff -r e33ad7cea15f tests/test-clone-pull-corruption
--- a/tests/test-clone-pull-corruption        Wed Apr 25 13:14:01 2007 -0700
+++ b/tests/test-clone-pull-corruption        Tue May 01 14:13:51 2007 -0700
@@ -15,8 +15,9 @@ hg ci -m 'change foo'
 
 # Add a hook to wait 5 seconds and then abort the commit
 cd ../corrupted
+mkfifo pwatch
 echo '[hooks]' >> .hg/hgrc
-echo 'pretxncommit = sleep 5; exit 1' >> .hg/hgrc
+echo 'pretxncommit = echo foo > pwatch; while ! test -s log; do true; done;
exit 1' >> .hg/hgrc
 
 # start a commit...
 touch bar
@@ -24,8 +25,8 @@ hg ci -m 'add bar' &
 hg ci -m 'add bar' &
 
 # ... and start a pull while the commit is still running
-sleep 1
-hg pull ../source 2>/dev/null
+cat pwatch > /dev/null
+hg pull ../source > log 2> /dev/null &
 
 # see what happened
 wait
msg3048 (view) Author: mpm Date: 2007-05-01.20:51:28
Can you try adding a sleep before the verify? If it is a race here, this might
make it disappear.

Also, can you try on tmpfs? This should rule out a large set of filesystem issues.
msg3047 (view) Author: ThomasAH Date: 2007-05-01.20:43:08
I still get failures with 47ef2d228375 (which holds lock on verify):

ERROR: test-clone-pull-corruption output changed
--- Expected output
+++ Test output
@@ -7,10 +7,15 @@
 adding changesets
 adding manifests
 adding file changes
-added 1 changesets with 1 changes to 1 files
-(run 'hg update' to get a working copy)
+added 1 changesets with 1 changes to 1 files (+1 heads)
+(run 'hg heads' to see heads, 'hg merge' to merge)
 checking changesets
+incorrect link (2) for changeset revision 1
+unpacking changeset 6fa9e873119c: list index out of range
 checking manifests
+bad manifest link (2) at revision 1
 crosschecking files in changesets and manifests
 checking files
+foo:4c6c04b2bf00 points to unexpected changeset 2
 1 files, 2 changesets, 2 total revisions
+4 integrity errors encountered!

ERROR: test-clone-pull-corruption failed with error code 256
msg3046 (view) Author: mpm Date: 2007-05-01.19:12:01
Perhaps rollback is racing with verify? Verify should hold the write lock to
ensure the repo is quiescent when we're checking it.

--- a/mercurial/verify.py       Sun Apr 29 12:33:24 2007 -0500
+++ b/mercurial/verify.py       Tue May 01 14:13:40 2007 -0500
@@ -16,6 +16,8 @@ def verify(repo):
     errors = [0]
     warnings = [0]
     neededmanifests = {}
+
+    lock = repo.lock()

     def err(msg):
         repo.ui.warn(msg + "\n")
msg3045 (view) Author: ThomasAH Date: 2007-05-01.18:58:22
<mpm> Failure happened on 2.6.16-xen SMP with Python 2.4 here.
<mpm> 256MB of RAM in dom0.
msg3044 (view) Author: ThomasAH Date: 2007-05-01.18:57:52
mpm hit this problem (4 integrity errors on ext3 on 2.6.16) when generating
artificial load in 1/6400 runs of this test.

Setting to need-eg until there are new ideas or other boxes where it fails more
than once in a million.
msg3043 (view) Author: ThomasAH Date: 2007-05-01.17:58:57
The previous runs were on XFS (Linux kernel 2.6.18), the following happened with
ext3 on the same machine (this time python 2.4 instead of python 2.3, but the
other problems occured with both versions, too):

ERROR: test-clone-pull-corruption output changed
--- Expected output
+++ Test output
@@ -7,10 +7,17 @@
 adding changesets
 adding manifests
 adding file changes
-added 1 changesets with 1 changes to 1 files
-(run 'hg update' to get a working copy)
+added 1 changesets with 1 changes to 1 files (+1 heads)
+(run 'hg heads' to see heads, 'hg merge' to merge)
 checking changesets
+incorrect link (2) for changeset revision 1
+unpacking changeset 4d5da2c89859: list index out of range
 checking manifests
+bad manifest link (2) at revision 1
+reading delta for manifest d8265428f0b7: list index out of range
 crosschecking files in changesets and manifests
 checking files
+foo: 1:4c6c04b2bf00 not in manifests
+foo:4c6c04b2bf00 points to unexpected changeset 2
 1 files, 2 changesets, 2 total revisions
+6 integrity errors encountered!

ERROR: test-clone-pull-corruption failed with error code 256
msg3042 (view) Author: ThomasAH Date: 2007-05-01.17:56:19
The last of the three output changes is because I changed sleep 5 in the test to
sleep 2, please ignore.
msg3040 (view) Author: ThomasAH Date: 2007-05-01.16:26:17
Probably related to issue132:

Wenn running 100 tests (all or only test-clone-pull-corruption doesn't matter) I
get the following problem in crew (tested with a6436e9333db) in 0-8% of the runs:

ERROR: test-clone-pull-corruption output changed
--- Expected output
+++ Test output
@@ -7,10 +7,15 @@
 adding changesets
 adding manifests
 adding file changes
-added 1 changesets with 1 changes to 1 files
-(run 'hg update' to get a working copy)
+added 1 changesets with 1 changes to 1 files (+1 heads)
+(run 'hg heads' to see heads, 'hg merge' to merge)
 checking changesets
+incorrect link (2) for changeset revision 1
+unpacking changeset 6f47a25083d9: list index out of range
 checking manifests
+bad manifest link (2) at revision 1
 crosschecking files in changesets and manifests
 checking files
+foo:4c6c04b2bf00 points to unexpected changeset 2
 1 files, 2 changesets, 2 total revisions
+4 integrity errors encountered!

ERROR: test-clone-pull-corruption failed with error code 256


With 80d3f6f0d8e5 (crew-stable) I get different problems even less common
(between 0.1% and 1% of all runs):

ERROR: test-clone-pull-corruption output changed
--- Expected output
+++ Test output
@@ -11,6 +11,9 @@
 (run 'hg update' to get a working copy)
 checking changesets
 checking manifests
+reading delta for manifest d8265428f0b7: list index out of range
 crosschecking files in changesets and manifests
 checking files
+foo: 1:4c6c04b2bf00 not in manifests
 1 files, 2 changesets, 2 total revisions
+2 integrity errors encountered!

ERROR: test-clone-pull-corruption failed with error code 256

ERROR: test-clone-pull-corruption output changed
--- Expected outputĀ·
+++ Test outputĀ·
@@ -2,15 +2,10 @@
 pulling from ../source
 abort: pretxncommit hook exited with status 1
 transaction abort!
+searching for changes
 rollback completed
-searching for changes
-adding changesets
-adding manifests
-adding file changes
-added 1 changesets with 1 changes to 1 files
-(run 'hg update' to get a working copy)
 checking changesets
 checking manifests
 crosschecking files in changesets and manifests
 checking files
-1 files, 2 changesets, 2 total revisions
+1 files, 1 changesets, 1 total revisions
History
Date User Action Args
2008-05-07 00:29:21mpmsetpriority: urgent -> bug
nosy: mpm, ThomasAH, tonfa, brendan, alexis, DomQ
messages: + msg6016
title: Errors in test-clone-pull-corruption -> Rollback can race with pull
2007-12-30 18:50:31ThomasAHsetnosy: mpm, ThomasAH, tonfa, brendan, alexis, DomQ
messages: + msg4751
assignedto: ThomasAH -> mpm
2007-12-08 02:43:29mpmsetstatus: in-progress -> chatting
nosy: mpm, ThomasAH, tonfa, brendan, alexis, DomQ
2007-10-19 10:44:28DomQsetnosy: + DomQ
2007-06-22 20:30:42mpmunlinkissue132 superseder
2007-06-22 20:30:17mpmlinkissue132 superseder
2007-06-09 09:21:36ThomasAHsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3185
2007-06-09 09:19:23ThomasAHlinkissue585 superseder
2007-05-02 13:58:38mpmsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3052
2007-05-02 07:12:58alexissetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3051
2007-05-01 22:21:20mpmsetstatus: need-eg -> in-progress
nosy: mpm, ThomasAH, tonfa, brendan, alexis
2007-05-01 22:20:59mpmsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3050
2007-05-01 22:15:16mpmsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3049
2007-05-01 20:51:28mpmsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3048
2007-05-01 20:43:22ThomasAHsetstatus: testing -> need-eg
nosy: mpm, ThomasAH, tonfa, brendan, alexis
2007-05-01 20:43:08ThomasAHsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3047
2007-05-01 19:12:01mpmsetstatus: need-eg -> testing
nosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3046
2007-05-01 18:58:22ThomasAHsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3045
2007-05-01 18:57:52ThomasAHsetstatus: chatting -> need-eg
nosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3044
assignedto: ThomasAH
2007-05-01 17:58:57ThomasAHsetnosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3043
2007-05-01 17:56:19ThomasAHsetstatus: unread -> chatting
nosy: mpm, ThomasAH, tonfa, brendan, alexis
messages: + msg3042
2007-05-01 16:26:21ThomasAHcreate