Using a DVCS to distribute files across a cluster

Bill Barry after.fallout at gmail.com
Tue Oct 21 09:27:03 CDT 2008


I am doing something very much like this. Our sites have a "forms"
section where users can upload documents for clients. When an upload
occurs, we commit it. We only ever have one primary server and the
secondaries are all read-only, so the primary is guaranteed to always
be at tip. Currently we serve that directory via hgweb and all the
other servers pull from it:
Primary (user uploads to)
secondary 1 (pulls from primary every 5 mins)
secondary 2 (pulls from primary every 5 mins)
secondary 3 (pulls from primary every 5 mins)
secondary 4 (pulls from primary every 5 mins)
...

but we are considering:
Primary (user uploads to)
secondary 1 (pulls from primary every 5 mins)
secondary 2 (pulls from secondary 1 every 5 mins)
secondary 3 (pulls from secondary 1 every 5 mins)
secondary 4 (pulls from secondary 1 every 5 mins)
...

just to reduce load on the primary server.
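
The two layouts differ only in which server each secondary pulls from. Here is a rough Python sketch of that mapping, plus the command a cron job on each secondary might run every 5 minutes. The host names, the repository path and the URL are made up for illustration; this is not our actual script.

```python
import subprocess

# Hypothetical host names; substitute the real ones.
PRIMARY = "primary"
SECONDARIES = ["secondary1", "secondary2", "secondary3", "secondary4"]


def pull_sources(secondaries, primary, relay=False):
    """Map each secondary to the server it pulls from.

    relay=False: every secondary pulls from the primary (current layout).
    relay=True: the first secondary pulls from the primary and all the
    others pull from that first secondary (the layout being considered).
    """
    sources = {}
    for index, name in enumerate(secondaries):
        if relay and index > 0:
            sources[name] = secondaries[0]
        else:
            sources[name] = primary
    return sources


def pull_command(repo_dir, source_url):
    """Build the hg call that pulls from source_url and updates the working copy."""
    return ["hg", "-R", repo_dir, "pull", "--update", source_url]


def run_pull(repo_dir, source_url):
    """Run the pull once; meant to be invoked from cron on a secondary."""
    subprocess.run(pull_command(repo_dir, source_url), check=True)
```

Note that in the relayed layout secondary 1 has to serve its copy as well (for example via hgweb), and a new upload can take roughly two pull intervals to reach the other secondaries instead of one.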

If we had more than one primary server we would have to do a little more 
work to ensure that the repo was up to date before committing (perhaps 
before loading the forms contents as well).
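
As a sketch of what that might look like, the upload handler could build its hg commands along these lines. The function names and the repository path are hypothetical, and the pull-first step is only my guess at the "little more work"; it does not deal with merging if two primaries commit at the same time.

```python
import subprocess


def commit_commands(repo_dir, message, pull_first=False):
    """Return the hg commands that record an upload in repo_dir.

    pull_first=True adds a pull-and-update from the default path before
    the commit, which a setup with more than one primary would need.
    """
    commands = []
    if pull_first:
        commands.append(["hg", "-R", repo_dir, "pull", "--update"])
    # --addremove picks up newly uploaded files and records deleted ones.
    commands.append(["hg", "-R", repo_dir, "commit", "--addremove", "-m", message])
    return commands


def record_upload(repo_dir, message, pull_first=False):
    """Run the commands in order, stopping at the first failure.

    hg commit exits non-zero when nothing changed, so a caller may want
    to catch subprocess.CalledProcessError for that case.
    """
    for command in commit_commands(repo_dir, message, pull_first):
        subprocess.run(command, check=True)
```

With a single primary, record_upload("/srv/forms", "user upload") would be enough; pull_first=True is the part that only matters once there is a second writer.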

John McGowan wrote:
> Hi,
>
> I'm brand new to this list. In fact, I just heard about this project
> yesterday.  Mercurial came up because I was discussing a problem I
> have to come up with a solution for, and a VCS isn't the first thing
> we thought of.  When we started to think about it though, it seemed
> like *maybe* a vcs could be a really slick solution to the problem we
> have.  Here's the problem.
>
> We have a cluster of web servers, and we currently use rsync to keep
> all the content across them synchronized.  Right now, our rsync cron
> job is "dumb": it just syncs everything every 5 minutes or so.  This
> worked nicely when we had 3 servers in the cluster (1 primary, and 2
> secondary) but now that we have 9 servers in the cluster, I'm not so
> happy with the rsync solution.  The CPU cost of doing the rsyncs is
> high.  The usability of the cluster is going down: since we can't do
> too many simultaneous rsyncs, it now takes over half an hour before a
> new file makes it to all the servers.
>
> We thought about solving this a couple of ways.
>
> 1. use a centralized file server  - I don't want to complicate the
> setup and worry about network traffic, or adding another potential
> failure point/bottleneck to the equation.
> 2. continue to use rsync, but be smarter about it.  Don't ever rsync
> the entire tree, just rsync individual files as you find out they are
> updated.
> 3. use a (D)VCS to hold all of the files that are getting synced, and
> do commits/updates (pushes/pulls?) to keep the other machines in the
> cluster up to date
>
> I don't know anything about Mercurial.  I could imagine a way to do
> what I need to do with SVN, because I'm familiar with SVN.  However,
> I'm under the impression that Git and Mercurial (the new "cool" kids
> on the block) are designed with speed, second only to correctness.
> I'm not really interested in working with Git, so here I am.
>
> A side effect of using the VCS would of course be that changes to the
> tree would be versioned.  Which is pretty cool.  Of course, we use VCS
> on our code, that gets added to the web tree, but versioning the tree
> itself, would give us some functionality we've never had in the
> past....  Client calls and says "uh... I accidentally deleted a
> directory full of images using ftp, or the CMS... can you get it back
> for me?...."
>
>
> The directory tree I'm thinking about doing this with is around 65 GB,
> around (320,000 files).  I could also create a separate repo for each
> of the sites in that tree if I needed to.
>
> Thoughts?  Is this a feasible solution to this problem?  Is anybody
> doing anything like this now?
>
>
> /John


