Fwd: Using a DVCS to distribute files across a cluster

Bill Barry after.fallout at gmail.com
Tue Oct 21 12:50:03 CDT 2008


John McGowan wrote:
> My #3 option below was assuming we still do everything the same way we
> do it now with respect to SVN and ant, but instead of just using scp
> to copy the new files to the master server, we would copy them to the
> master server, and then issue the commands to "commit" those files.
> After committing those files, we would call another script that would
> trigger the "update" on each of the slave servers.
>   
This sounds almost exactly like what I am doing.
> My core question is whether or not I can expect significant
> performance increase by doing an "hg pull" from each of the slave
> servers as opposed to an rsync.  It seems to me that since rsync has
> to do checksums, etc... to determine what files to copy, that the
> rsync algorithm is O(n + m) where n is the number of files to analyze
> and m is the amount of changes to copy over the network.
>   
The performance of hg pull is roughly comparable to rsync (at least in 
my very poor wall clock tests; somebody on this list probably has decent 
performance comparisons). The advantage of using an SCM for this is not
the performance; it is the available workflows (and having history in
case somebody makes a mistake or somebody compromises the server).
Instead of all slaves rsyncing to the server (many file comparisons can
be IO intensive and potentially even CPU intensive; we noticed that all
of the file operations were impacting our webapp performance when it
needed to do file intensive operations), you can update 1 slave and
offload the majority of the file operations to it. Additionally, you can
use parallel algorithms to distribute the rsync/update work out to
multiple machines to ensure that no machine is heavily impacted by IO
tasks. A VCS also provides the ability to do more complex server
setups, like having multiple write heads (whereas rsync wouldn't be able
to figure out things like whether a file was deleted on one server or
added on a different one).
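
As a rough illustration of spreading the update work out, here is a
minimal Python sketch that tells each slave to run "hg pull -u" over ssh,
with a cap on how many run at once. The host names, the repository path,
and the helper names (pull_command, update_slaves) are made up for the
example, and it assumes each slave already has a clone whose default
path points at the master and that the path contains no spaces.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pull_command(host, repo_path):
    """Build the ssh command that makes one slave pull and update."""
    # "hg pull -u" pulls new changesets and updates the working directory.
    return ["ssh", host, "hg", "--cwd", repo_path, "pull", "-u"]


def run_command(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def update_slaves(hosts, repo_path, run=run_command, max_parallel=4):
    """Update every slave, at most max_parallel at a time.

    Returns a dict mapping each host to the exit code of its update.
    """
    commands = [pull_command(host, repo_path) for host in hosts]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps results in the same order as the input commands.
        codes = list(pool.map(run, commands))
    return dict(zip(hosts, codes))


# Hypothetical usage:
#   failed = [h for h, code in
#             update_slaves(["web1", "web2"], "/srv/www").items() if code]
```

Lowering max_parallel trades total rollout time for less simultaneous IO
load, which is the knob that mattered most for us.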
> Is it safe to say that Mercurial would be O(m) only?  Is the cost of
> determining what files have to be changed/pulled negligible?  Even
> with large numbers of files?
>   
I have no idea if the cost is negligible or not with such a large number
of files. I also don't know if either Hg or rsync happens to be
optimized for such sparse changes (I'd imagine that neither has any
built-in optimizations for a task like this, but due to their designs
such optimizations may be pointless).
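
One way to get a rough answer is to time each tool when nothing has
changed, since that run is dominated by the cost of working out what
needs to be transferred. Below is a minimal Python sketch of such a
wall clock harness; time_command is a made-up helper, and the commands
you pass it (an "hg pull -u" in a slave clone, an rsync of the same
tree) are up to you.

```python
import subprocess
import time


def time_command(cmd, repeats=3):
    """Return the best wall clock time in seconds over several runs."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is
        # never reported as a fast one.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


# Hypothetical usage, run once the slave is already up to date:
#   time_command(["hg", "--cwd", "/srv/www", "pull", "-u"])
```

This only measures elapsed time on one machine, so it says nothing about
the IO load placed on the master.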
> And now for a quick hg question.  Is there an hg command to prune the
> history from a repository?  If we started versioning everything in the
> web tree, every time a graphic designer updates an image, we're going
> to have a history of that.  Which is good, until we start running out
> of disk space.  I'd just like to know if it's easy to delete history
> older than a certain revision number, or based on date.
>   
I am pretty sure you could prune history with a tool like Tailor, but
for what you want to do it is probably easier to simply delete the .hg
directory on the master and everything on the slaves, initialize a new
repository (perhaps keeping a backup of the previous one somewhere),
commit, and then pull the new repository (which would now only have
1 revision in it) onto the slaves. Both of these are rather manual
processes and would likely require, at a minimum, turning off the
ability to modify these files on the server for the duration, while
taking the servers offline one at a time until all of them are on the
new repository.
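
To make the order of operations concrete, here is a minimal Python
sketch that only builds the command lists for such a reset and does not
run anything. The paths, the commit message, and the helper names
(master_reset_plan, slave_reset_plan) are made up for the example. On
the slaves it uses "hg clone" in place of a pull, since there is no
repository left to pull into once everything has been deleted.

```python
def master_reset_plan(tree, backup_dir, message="Reset history"):
    """Commands to run on the master, in order, as argument lists."""
    return [
        # Move the old history aside instead of deleting it outright.
        ["mv", f"{tree}/.hg", backup_dir],
        # Start an empty repository in the same working tree.
        ["hg", "init", tree],
        # Add every file and record the single new revision.
        ["hg", "--cwd", tree, "commit", "-A", "-m", message],
    ]


def slave_reset_plan(tree, master_url):
    """Commands to run on one slave while it is out of rotation."""
    return [
        # Remove the old working copy and its history.
        ["rm", "-rf", tree],
        # Fetch the new one-revision repository from the master.
        ["hg", "clone", master_url, tree],
    ]


# Hypothetical usage: print the plan for review before running it by hand.
#   for cmd in master_reset_plan("/srv/www", "/backup/www-hg"):
#       print(" ".join(cmd))
```

Keeping the plan as data makes it easy to review the destructive steps
before anything is executed.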


