Created on 2007-12-19.21:11:33 by jglick, last changed 2008-06-29.11:47:14 by kirr.
| msg6437 (view) |
Author: kirr |
Date: 2008-06-29.11:47:13 |
|
Guys, I understand there are technical challenges in this issue, but maybe
Something Could Be Done?
I think this issue should be on the major list -- people usually convert
their svn repos with hg and git and compare sizes to see which DVCS to use.
And you know, because of this issue hg often loses.
|
| msg5807 (view) |
Author: Omnifarious |
Date: 2008-04-02.14:26:58 |
|
Oh, better idea for write...
Pass in an optional external data handler on write. If there is one it should
be able to provide the data for the base of the revision for diff purposes, and
it should be able to provide a cookie that will be given to the external data
handler for read.
That way the external data handler doesn't have to remember any associations
between the revision and the data. It will be able to rely on the revlog to hand
it the cookie, which will allow it to fetch the data.
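A minimal sketch of that contract, under stated assumptions: ToyRevlog,
DictHandler, base_for_write and fetch are hypothetical names, not Mercurial
APIs, and the prefix/suffix delta is a stand-in for whatever diffing algorithm
revlog would really use.

```python
def make_delta(base, new):
    """Toy delta: shared prefix length, shared suffix length, changed middle."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[len(base) - 1 - s] == new[len(new) - 1 - s]:
        s += 1
    return p, s, new[p:len(new) - s]


def apply_delta(base, delta):
    """Rebuild the new text from the base and a toy delta."""
    p, s, middle = delta
    return base[:p] + middle + base[len(base) - s:]


class DictHandler:
    """Hypothetical external data handler backed by a dict of path -> bytes."""

    def __init__(self, files, source_path):
        self.files = files
        self.source_path = source_path

    def base_for_write(self):
        # Return the diff base plus an opaque cookie naming it.
        return self.files[self.source_path], self.source_path

    def fetch(self, cookie):
        # Given a cookie the revlog stored earlier, return the same base.
        return self.files[cookie]


class ToyRevlog:
    """Hypothetical revlog that stores the cookie next to the delta."""

    def __init__(self):
        self.entries = []  # list of (cookie or None, payload)

    def write(self, text, handler=None):
        if handler is None:
            self.entries.append((None, text))
        else:
            base, cookie = handler.base_for_write()
            self.entries.append((cookie, make_delta(base, text)))
        return len(self.entries) - 1

    def read(self, rev, handler=None):
        cookie, payload = self.entries[rev]
        if cookie is None:
            return payload
        if handler is None:
            raise ValueError("revision needs an external data handler")
        return apply_delta(handler.fetch(cookie), payload)


files = {"f": b"kernel image bytes"}
log = ToyRevlog()
handler = DictHandler(files, "f")
rev = log.write(b"kernel image bytes", handler)
restored = log.read(rev, handler)
```

The handler keeps no per-revision state here: the only link between the stored
delta and its base is the cookie that the revlog hands back on read.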
|
| msg5806 (view) |
Author: Omnifarious |
Date: 2008-04-02.14:22:45 |
|
I understand that. Perhaps instead of moving that much work up, the revlog could
be given an external data handler when you ask it for data. And for the write
side you could give it an optional argument with the data for revlog to use as
the base for whatever diffing algorithm it might choose to use.
The contract would be that the external data handler you pass on read must be
able to retrieve that base for any revision for which you passed such a base
on write.
|
| msg5794 (view) |
Author: mpm |
Date: 2008-04-01.19:59:54 |
|
You're missing the first conceptual hurdle: if we change what we're storing in
the revlog, we change the hashes. Revlog is a self-contained black box. You hand
it "data", it hands you back an identifier hash. If we change our data from
"copy + full revision" to "copy + delta", revlog will hand us back a different
identifier. Thus, old and new clients will disagree about the hash for "file x
containing X, copied from y@z".
To get past this, we would need to hoist both the hash calculation and checking
up out of revlog into filelog (and changelog, and manifest). Then when we
checked in a copy, we'd have to first calculate the hash for "copy + full
revision", then calculate the delta, then tell revlog "please store 'copy +
delta' but with the hash for 'copy + full revision'".
To recover a revision, we'd have to get "copy + delta", look up the copy,
reconstruct that revision, apply the delta to get the full revision, then
calculate the hash of "copy + full revision" and compare it with the identifier
we were asked to retrieve.
On pull over the existing wire protocol, we'd have to do the above, and then
take our reconstructed "copy + full revision" and turn it into a delta (usually,
but not always, against an empty file).
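A rough sketch of the hoisted scheme, under stated assumptions: ToyStore and its
methods are hypothetical, the metadata layout in with_copy_meta is my
understanding of how filelog frames copy info, and the prefix/suffix delta is
only a stand-in. The point it illustrates is that the identifier is always
computed over "copy + full revision", even though only "copy + delta" is kept.

```python
import hashlib

NULLID = b"\0" * 20


def node_hash(text, p1=NULLID, p2=NULLID):
    """Revlog-style identifier: SHA-1 over the sorted parents, then the text."""
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()


def with_copy_meta(path, source_hex, body):
    """Build "copy + full revision" (the metadata layout is an assumption)."""
    meta = b"copy: " + path + b"\ncopyrev: " + source_hex + b"\n"
    return b"\x01\n" + meta + b"\x01\n" + body


def strip_meta(text):
    """Drop a leading metadata block, if any, and return the file body."""
    if text.startswith(b"\x01\n"):
        return text[text.index(b"\x01\n", 2) + 2:]
    return text


def make_delta(base, new):
    """Toy delta: shared prefix length, shared suffix length, changed middle."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[len(base) - 1 - s] == new[len(new) - 1 - s]:
        s += 1
    return p, s, new[p:len(new) - s]


def apply_delta(base, delta):
    p, s, middle = delta
    return base[:p] + middle + base[len(base) - s:]


class ToyStore:
    """Hypothetical store: keeps 'copy + delta', keyed by the hash of
    'copy + full revision'."""

    def __init__(self):
        self.blobs = {}  # node -> (source node or None, payload)

    def add_full(self, text):
        node = node_hash(text)
        self.blobs[node] = (None, text)
        return node

    def add_copy(self, source_path, source_node, body):
        # Hash first, over the text old clients would have stored.
        logical = with_copy_meta(source_path, source_node.hex().encode(), body)
        node = node_hash(logical)
        base = strip_meta(self.read(source_node))
        self.blobs[node] = (source_node, (source_path, make_delta(base, body)))
        return node

    def read(self, node):
        source, payload = self.blobs[node]
        if source is None:
            text = payload
        else:
            path, delta = payload
            body = apply_delta(strip_meta(self.read(source)), delta)
            text = with_copy_meta(path, source.hex().encode(), body)
        # Check the reconstructed "copy + full revision" against the identifier.
        if node_hash(text) != node:
            raise ValueError("integrity check failed")
        return text


store = ToyStore()
image = b"x" * 1000
f_node = store.add_full(image)
g_node = store.add_copy(b"f", f_node, image)
```

In this toy the rename of f to g stores an empty delta, yet g_node is the same
identifier an old-style "copy + full revision" entry would have produced.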
|
| msg5788 (view) |
Author: Omnifarious |
Date: 2008-04-01.16:18:11 |
|
My feeling is that it's possible to make this happen without changing the
essential meaning of either the index or data files.
One rather unsubtle and probably bad idea would be to allow index files to
reference other data files via a combination of numerical linkrev (referencing a
changeset) and filerev hash (referencing a manifest entry in that changeset).
If the filerev hash were null then the information would be ignored. If not
they would be taken as a 'base' on which to build the current file image, along
with the delta range stuff from the main data file that's already there.
Keeping the wire protocol unaffected after doing so will be tricky but I
definitely think it's doable. If the wire protocol is unchanged though,
divining the need for the new way of storing references to other data files for
incoming changesets is going to be a pain. Incoming changes will have to be
scanned for copies.
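As a sketch only, an index entry carrying such a reference might look like the
following. ToyIndexEntry and its field names are hypothetical and do not
reflect the real revlog index layout; the only behaviour shown is that a null
filerev hash means "no external base".

```python
from dataclasses import dataclass

NULLID = b"\0" * 20  # null filerev hash


@dataclass(frozen=True)
class ToyIndexEntry:
    """Hypothetical index entry with an optional cross-file base."""

    offset: int                    # where the delta data starts in the data file
    length: int                    # how many bytes of delta data to read
    base_linkrev: int = -1         # changeset holding the base file revision
    base_filenode: bytes = NULLID  # filerev hash of the base in that changeset

    def external_base(self):
        """Return (linkrev, filenode) of the base, or None if the hash is null."""
        if self.base_filenode == NULLID:
            return None
        return self.base_linkrev, self.base_filenode


plain = ToyIndexEntry(offset=0, length=64)
copied = ToyIndexEntry(offset=64, length=8, base_linkrev=5,
                       base_filenode=b"\x11" * 20)
```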
|
| msg5745 (view) |
Author: mpm |
Date: 2008-03-28.00:13:04 |
|
Incompatibility with old clients is a non-starter, so viewing it as two problems
is as well.
Current clients have file revision hashes that include the current metadata for
the copy info. If we change what we store, we break the hash -> old clients
break. So we've either got to fake the contents (and destroy the concept of
revlog id = hash of contents) or break compatibility.
|
| msg5744 (view) |
Author: TakeyMcTaker |
Date: 2008-03-27.21:21:03 |
|
@mpm: I would argue that the two problems -- revlog index cross-references, and
the wire protocol -- could be viewed as 2 completely separate problems.
One of the main problems right now in Mercurial seems to be the lack of a viable
cross-path-rev referencing method in the revlog index scheme. If the index
scheme were allowed to reference URIs from other paths (internal or external),
instead of just revlog data with a matching name, that would be a simple fix for
a whole list of issues.
This reminds me of the discussion in the mailing list about combining
HistoryTrimming, PartialCloning, Overlays, and Obliterate methods. An in-place
replacement of revlog data with its hash value, and a "reason for missing data"
that includes a URI for a third-party data source, could be a combined fix for
all of these features/issues. That "third-party data source URI" could just as
easily reference paths and revs inside the same repository as external
repository URIs.
Now, separating the wire protocol, so that older clients get what they expect,
rather than what data is actually held locally by the revlog, is not necessarily
easy. It is possible, provided all the requested data is online *somewhere*.
Attempts to push-pull revlog data that isn't available online could be a defined
failure condition for the "old client" wire protocol. So I would say that
internal repository reference URIs are probably the easiest to translate into
this "old client" wire protocol.
Does Mercurial already have any way of signaling current repository version,
and/or available extensions, on each end of a push-pull connection? That would
be an easy way of signaling which wire protocol can be used optimally, in any
given transfer. If it doesn't already exist, maybe a push/pull flag or attribute
could be added, like a "wire protocol version specifier"?
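To make the idea concrete, here is a sketch of such a negotiation. It assumes
each side can advertise a whitespace-separated capability list; the capability
name "copydelta" and both function names are made up for illustration.

```python
def parse_caps(line):
    """Split an advertised capability line into a set of names."""
    return set(line.split())


def choose_encoding(local_caps, remote_caps):
    """Send copy deltas only when both ends advertise support for them."""
    if "copydelta" in local_caps and "copydelta" in remote_caps:
        return "copydelta"
    # Old clients never advertise the capability, so they get full revisions.
    return "full"


new_side = parse_caps("lookup copydelta")
old_side = parse_caps("lookup")
to_old_client = choose_encoding(new_side, old_side)
to_new_client = choose_encoding(new_side, new_side)
```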
|
| msg5498 (view) |
Author: mpm |
Date: 2008-03-09.22:48:17 |
|
Ok, here's a proposed fix and the problems that subsequently crawl out from
under the rock:
In filelog, override revlog.revision. Add metadata that says "the revision
returned by revlog is not a full revision as promised but a revision of file
x@rev + the body here treated as a delta." Then filelog.revision can instantiate
a temporary filelog object for x, get the specified revision, and apply the
delta. Do the appropriate steps in filelog.add to make this work.
Now with a little luck, getting the -next- revision from the filelog will just
work. Otherwise, we'll need to hack revlog.revision to call itself (and thereby
filelog.revision) to grab the base revision.
So now we've got a scheme that mostly does away with the layering violations as
revlog doesn't have to have any special knowledge about other revlogs (it's all
in the filelog class, which already knows how to find and open revlog from a
pathname). It even gets the case where c@z is a copy of b@y which is a copy of
a@x right automatically.
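A toy version of that layering, as a sketch rather than real Mercurial code:
ToyFilelog, ToyRepo and the prefix/suffix delta are all hypothetical. Only the
filelog knows how to open another filelog by pathname, and the a -> b -> c
chain resolves through plain recursion.

```python
def make_delta(base, new):
    """Toy delta: shared prefix length, shared suffix length, changed middle."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[len(base) - 1 - s] == new[len(new) - 1 - s]:
        s += 1
    return p, s, new[p:len(new) - s]


def apply_delta(base, delta):
    p, s, middle = delta
    return base[:p] + middle + base[len(base) - s:]


class ToyFilelog:
    """Hypothetical filelog whose copies are stored as deltas against x@rev."""

    def __init__(self, path, opener):
        self.path = path
        self.opener = opener  # callable: pathname -> ToyFilelog
        self.revs = []        # list of ((source path, source rev) or None, payload)

    def add(self, text, copied_from=None):
        if copied_from is None:
            self.revs.append((None, text))
        else:
            src_path, src_rev = copied_from
            base = self.opener(src_path).revision(src_rev)
            self.revs.append((copied_from, make_delta(base, text)))
        return len(self.revs) - 1

    def revision(self, rev):
        source, payload = self.revs[rev]
        if source is None:
            return payload
        # Open a temporary filelog for the copy source and apply the delta.
        src_path, src_rev = source
        return apply_delta(self.opener(src_path).revision(src_rev), payload)


class ToyRepo:
    """Hypothetical repository that maps pathnames to filelogs."""

    def __init__(self):
        self.filelogs = {}

    def filelog(self, path):
        if path not in self.filelogs:
            self.filelogs[path] = ToyFilelog(path, self.filelog)
        return self.filelogs[path]


repo = ToyRepo()
x = repo.filelog("a").add(b"hello world\n")
y = repo.filelog("b").add(b"hello world\n", copied_from=("a", x))
z = repo.filelog("c").add(b"hello brave world\n", copied_from=("b", y))
text_c = repo.filelog("c").revision(z)
```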
But we've also got a huge compatibility problem. An old client can't just pull
this data and expect it to work. Instead, we've got to add a new version of the
wire protocol that allows us to send these sorts of deltas to new clients, but
sends full revisions to old clients. And a new client would like to take old
client data and deltify the copies, which may not be possible at pull time (for
instance, if the destination revlog is sent before the source revlog). Also,
hashes at the revlog layer and at the filelog layer no longer agree. Ouch.
In short: not an easy problem.
Marking deferred.
|
| msg5058 (view) |
Author: vadiml |
Date: 2008-01-31.16:38:30 |
|
In response to mpm's:
"Currently fixing the renaming issue would present a layering
violation. That is, individual revlogs have no knowledge of any other
revlog. So when we ask a revlog to retrieve version <x> of some file,
it has to have all the data internally."
Actually we can store in revlog a reference to a generic external object,
identified by some kind of "url" and (maybe) hash.
Initially it can be used to implement renames and copies, but it can evolve
into some kind of super svn:external facility later (like an hg repo which
retrieves files directly from an external svn or git).
|
| msg4656 (view) |
Author: ThomasAH |
Date: 2007-12-20.16:15:41 |
|
Generally, having support for referencing other revlogs could allow for other
usages too, e.g. splitting revlogs if they grow too big, either to circumvent fs
or backup limitations, or to prevent new changes breaking hard links for already
huge revlogs.
|
| msg4649 (view) |
Author: jglick |
Date: 2007-12-19.21:11:31 |
|
When files or dirs are renamed in Hg, repository size is increased, I guess by
about the compressed size of those files:
$ hg init
$ cp /boot/vmlinuz-2.6.22-14-generic f
$ hg add f
$ hg ci -m 1
$ du --si
1.8M ./.hg/store/data
1.8M ./.hg/store
1.8M ./.hg
3.6M .
$ hg ren f g
$ hg ci -m 2
$ du --si
3.5M ./.hg/store/data
3.5M ./.hg/store
3.5M ./.hg
5.3M .
$ ls -Rl .hg/store/data
.hg/store/data:
total 3328
-rw-r--r-- 1 jglick jglick 1692145 2007-11-09 05:42 f.d
-rw-r--r-- 1 jglick jglick 64 2007-11-09 05:42 f.i
-rw-r--r-- 1 jglick jglick 1692204 2007-11-09 05:42 g.d
-rw-r--r-- 1 jglick jglick 64 2007-11-09 05:42 g.i
For a repository which is already hundreds of megabytes, doing major source
reorganizations is out of the question for this reason. This is a serious
drawback compared to Subversion; or even arguably to CVS, where moving a dir
means you only pay a penalty in history, not future usage.
mpm has written regarding implementation:
"Currently fixing the renaming issue would present a layering
violation. That is, individual revlogs have no knowledge of any other
revlog. So when we ask a revlog to retrieve version <x> of some file,
it has to have all the data internally."
|
|
| Date | User | Action | Args |
| 2008-06-29 11:47:14 | kirr | set | nosy: + kirr; messages: + msg6437 |
| 2008-04-24 02:52:26 | tksoh | set | nosy: + tksoh |
| 2008-04-02 14:26:59 | Omnifarious | set | nosy: mpm, Omnifarious, ThomasAH, kupfer, pmezard, mathieu.clabaut, jglick, djc, abuehl, vadiml, TakeyMcTaker; messages: + msg5807 |
| 2008-04-02 14:22:45 | Omnifarious | set | nosy: mpm, Omnifarious, ThomasAH, kupfer, pmezard, mathieu.clabaut, jglick, djc, abuehl, vadiml, TakeyMcTaker; messages: + msg5806 |
| 2008-04-01 20:00:00 | mpm | set | nosy: mpm, Omnifarious, ThomasAH, kupfer, pmezard, mathieu.clabaut, jglick, djc, abuehl, vadiml, TakeyMcTaker; messages: + msg5794 |
| 2008-04-01 16:18:12 | Omnifarious | set | nosy: mpm, Omnifarious, ThomasAH, kupfer, pmezard, mathieu.clabaut, jglick, djc, abuehl, vadiml, TakeyMcTaker; messages: + msg5788 |
| 2008-03-28 00:13:06 | mpm | set | nosy: mpm, Omnifarious, ThomasAH, kupfer, pmezard, mathieu.clabaut, jglick, djc, abuehl, vadiml, TakeyMcTaker; messages: + msg5745 |
| 2008-03-27 21:21:04 | TakeyMcTaker | set | nosy: + TakeyMcTaker; messages: + msg5744 |
| 2008-03-09 22:48:19 | mpm | set | status: chatting -> deferred; nosy: + mpm; messages: + msg5498 |
| 2008-02-14 18:49:58 | mathieu.clabaut | set | nosy: + mathieu.clabaut |
| 2008-02-14 18:29:30 | abuehl | set | nosy: + abuehl |
| 2008-02-05 08:41:49 | djc | set | nosy: + djc |
| 2008-02-05 00:24:24 | kupfer | set | nosy: + kupfer |
| 2008-01-31 16:38:31 | vadiml | set | nosy: + vadiml; messages: + msg5058 |
| 2007-12-25 19:32:38 | pmezard | set | nosy: + pmezard |
| 2007-12-24 21:45:19 | Omnifarious | set | nosy: + Omnifarious |
| 2007-12-20 16:15:45 | ThomasAH | set | status: unread -> chatting; nosy: + ThomasAH; messages: + msg4656 |
| 2007-12-19 21:11:33 | jglick | create | |
|