Fwd: Using a DVCS to distribute files across a cluster
Bill Barry
after.fallout at gmail.com
Tue Oct 21 12:50:03 CDT 2008
John McGowan wrote:
> My #3 option below was assuming we still do everything the same way we
> do it now with respect to SVN and ant, but instead of just using scp
> to copy the new files to the master server, we would copy them to the
> master server, and then issue the commands to "commit" those files.
> After committing those files, we would call another script that would
> trigger the "update" on each of the slave servers.
>
This sounds almost exactly like what I am doing.
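Roughly, that sequence could be sketched as below. The host names, the
repository path, and the use of ssh to reach the slaves are placeholders
of mine, not details from either setup. The functions only build the
command lines, so they can be inspected before anything is run.

```python
"""Sketch: commit on the master, then have each slave pull and update."""

# Hypothetical values, purely for illustration.
REPO_PATH = "/var/www/site"
SLAVES = ["web1", "web2", "web3"]


def master_commands(repo, message):
    """Commands to run on the master after the new files are copied in."""
    return [
        # Record added and removed files, then commit the new tree state.
        ["hg", "-R", repo, "addremove"],
        ["hg", "-R", repo, "commit", "-m", message],
    ]


def slave_commands(repo, slaves):
    """One command per slave: pull from its default path and update."""
    return [["ssh", host, "hg", "-R", repo, "pull", "-u"] for host in slaves]


if __name__ == "__main__":
    for cmd in master_commands(REPO_PATH, "deploy") + slave_commands(REPO_PATH, SLAVES):
        print(" ".join(cmd))
```

This assumes each slave was cloned from the master, so that its default
pull path already points there.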
> My core question is whether or not I can expect significant
> performance increase by doing an "hg pull" from each of the slave
> servers as opposed to an rsync. It seems to me that since rsync has
> to do checksums, etc... to determine what files to copy, that the
> rsync algorithm is O(n + m) where n is the number of files to analyze
> and m is the amount of changes to copy over the network.
>
The performance of hg pull is roughly comparable to rsync (at least in
my very poor wall clock tests; somebody on this list probably has decent
performance comparisons). The advantage of using an SCM for this is not
the performance; it is the available workflows (and having history in
case somebody makes a mistake or somebody compromises the server).
Instead of having all slaves rsync against the server, you can update
one slave and offload the majority of the file operations to it. (Many
file comparisons can be IO intensive and potentially even CPU intensive;
we noticed that all of the file operations were impacting our webapp
performance when it needed to do file intensive operations.)
Additionally, you can use parallel algorithms to distribute the
rsync/update work out to multiple machines to ensure that no machine is
heavily impacted by IO tasks. A VCS also makes more complex server
setups possible, like having multiple write heads (whereas rsync
wouldn't be able to figure out things like whether a file was deleted on
one server or added on a different one).
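One way to picture the offloading is a fan-out plan: the master serves
only the first slave, and from then on slaves that are already up to
date serve the ones that are not. The sketch below is my own
illustration of that idea, not a description of any existing tool; the
host names are made up.

```python
"""Sketch: plan a fan-out so the master serves only one slave."""


def fanout_rounds(master, slaves):
    """Return rounds of (source, target) pairs.

    The master serves only the first slave. In every later round each
    slave that already has the new revision serves at most one slave
    that does not, so the number of up-to-date slaves can double.
    """
    waiting = list(slaves)
    if not waiting:
        return []
    first = waiting.pop(0)
    rounds = [[(master, first)]]
    have = [first]
    while waiting:
        batch = []
        for source in list(have):
            if not waiting:
                break
            batch.append((source, waiting.pop(0)))
        have.extend(target for _, target in batch)
        rounds.append(batch)
    return rounds


if __name__ == "__main__":
    for number, batch in enumerate(fanout_rounds("master", ["s1", "s2", "s3", "s4"]), 1):
        print(number, batch)
```

Each pair would translate to an "hg pull -u" run on the target with the
source as the pull location, and the pairs within one round are
independent, so they could run in parallel.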
> Is it safe to say that Mercurial would be O(m) only? Is the cost of
> determining what files have to be changed/pulled negligible? Even
> with large numbers of files?
>
I have no idea if the cost is negligible or not with such a large number
of files. I also don't know if either Hg or rsync happens to be
optimized for such sparse changes (I'd imagine that neither has any
built-in optimizations for a task like this, though given their designs
such optimizations may be pointless).
> And now for a quick hg question. Is there an hg command to prune the
> history from a repository? If we started versioning everything in the
> web tree, every time a graphic designer updates an image, we're going
> to have a history of that. Which is good, until we start running out
> of disk space. I'd just like to know if it's easy to delete history
> older than a certain revision number, or based on date.
>
I am pretty sure you could prune history with a tool like Tailor, but
for what you want it is probably easier to simply delete the .hg
directory on the master and everything on the slaves, initialize a new
repository (perhaps keeping a backup of the previous one somewhere),
commit, and then pull the new repository (which would now have only
1 revision in it). Both of these are rather manual processes and
would likely require, at a minimum, turning off the ability to modify
these files on the server for the duration, while taking the servers
offline one at a time until all have the new repository.
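A rough sketch of the re-initialize approach follows. The paths and the
commit message are hypothetical, and on the slaves I have swapped the
"pull" for a fresh "hg clone", which seemed simpler once the old copy is
gone. As above, the functions only build the command lines.

```python
"""Sketch: replace a repository's history with a single fresh revision."""


def reinit_master_commands(repo, backup_dir):
    """Commands for the master: set the old history aside, start over."""
    return [
        # Keep the old history around instead of deleting it outright.
        ["mv", f"{repo}/.hg", backup_dir],
        ["hg", "init", repo],
        ["hg", "-R", repo, "addremove"],
        ["hg", "-R", repo, "commit", "-m", "Fresh start: history reset"],
    ]


def reinit_slave_commands(repo, master_url):
    """Commands for one slave: drop the old copy, clone the new repository."""
    return [
        # Destructive: removes the slave's whole working copy and history.
        ["rm", "-rf", repo],
        ["hg", "clone", master_url, repo],
    ]


if __name__ == "__main__":
    # Hypothetical paths and URL, for illustration only.
    plan = reinit_master_commands("/var/www/site", "/backup/site-hg-old")
    plan += reinit_slave_commands("/var/www/site", "ssh://master//var/www/site")
    for cmd in plan:
        print(" ".join(cmd))
```

Note that moving .hg away also moves .hg/hgrc, so any paths or hooks
configured there would have to be copied into the new repository by
hand.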