Can one do RAID 1 over a network? Alternatively, one might ask “Has any progress been made in file systems in the 20 years since AFS?” We’re trying to build a more capacious and reliable photo sharing system for photo.net and would like to avoid giving a lot of money to EMC (since we don’t have any money) for one of those fancy shared fiber channel disk arrays. http://www.photo.net/doc/design/photodb-arch lays out our goals. Comments from sysadmin geniuses would be appreciated!
29 thoughts on “Can one do RAID 1 over a network?”
“””If a disk drive dies in the live cluster, we do not want rsync noticing that a whole series of directories have disappeared and deleting those from the backup!”””
Can’t you make that a non-problem by keeping stubs for deleted photos, so they don’t resemble file-not-found photos or drive-momentarily-unmounted photos? E.g., put a 7-byte file that says ‘deleted’ where the image was. Then you can turn off --delete on rsync entirely; the stub propagates to the backup, so image data leaves the backup only when the original was deliberately deleted.
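A rough sketch of that idea, assuming the photos live under /var/photos and the backup host is reachable over ssh (paths and hostnames here are invented):

    # when a photo is deliberately deleted, leave a stub in its place
    # instead of unlinking the file
    echo deleted > /var/photos/2006/12/0345.jpg

    # back up without --delete, so a directory that vanishes because a
    # disk died (or is momentarily unmounted) is never purged from the backup
    rsync -a /var/photos/ backup.photo.net:/var/photos/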
Disclaimer: I’m not a full-time system administrator, I’m a developer with too many servers on my hands 😉
Like you mention in your document, there’s DRBD, but that only provides fail-over, not concurrent reads or writes (writes wouldn’t matter though, I guess), so it’s probably not for you. The fastest way to go is probably csync2, but only if the slight possibility of data loss or conflicts (arising from not-fully-copied files) due to network outages is acceptable to you. You’d also have to write some code to account for files that are not yet replicated. It might be a very good match for your backup-copy-tool requirement, as it correctly identifies removed files.
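For what it’s worth, a minimal csync2 configuration for two hosts might look roughly like this (hostnames and paths are placeholders; double-check the syntax against the csync2 docs for your version):

    # /etc/csync2.cfg -- replicate the photo tree between two servers
    group photocluster
    {
        host server1.photo.net server2.photo.net;
        key  /etc/csync2.key;   # shared secret, generate with: csync2 -k /etc/csync2.key
        include /var/photos;
        exclude *.tmp;
        auto younger;           # on conflict, keep the newer copy
    }

You’d then run ‘csync2 -x’ from cron (or an inotify hook) on each host to push changes out.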
CODA improves on AFS2 and might be a good match. You should check it out.
If you want the best recovery possibilities, run ext3 in full data-journaling mode. If speed is an issue, look at ReiserFS (the stable one; the new FS might be in trouble since Hans Reiser might go to jail). It’s still faster than ext3 (of course, I’m talking about ext3 in writeback mode; ReiserFS3 doesn’t support ‘ordered’ journaling). I’d say RFS is not worth the trouble. If you can afford a replicating file system performance-wise (writing to disk *will* be slower than before), I’d guess that the journaling won’t really impact your performance numbers.
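For reference, full data journaling is just a mount option; an /etc/fstab entry might look like this (device and mount point are made up):

    # journal file data as well as metadata: slower writes, better crash recovery
    /dev/md0   /var/photos   ext3   data=journal,noatime   0 2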
Linux-HA gives you all the tools (heartbeat in particular) that you should need to keep the site up and running.
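As a rough sketch (hostnames and the floating service address are invented, and you’d also need an /etc/ha.d/authkeys file), a two-node heartbeat v1 setup needs little more than:

    # /etc/ha.d/ha.cf -- identical on both nodes
    bcast eth1                      # interface used for heartbeats
    keepalive 2
    deadtime 30
    auto_failback on
    node server1.photo.net server2.photo.net

    # /etc/ha.d/haresources -- preferred node, floating IP, services to start
    server1.photo.net IPaddr::192.0.2.10/24/eth0 httpd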
Looks like MogileFS is what you need. It was developed and used for the same purpose by the LiveJournal (now Six Apart) folks.
NetApp has some offerings that could be useful (accessible via NFS or FC). They also have a subsidiary called StoreVault that offers a more basic version of their product; not sure whether replication is available, though (you would have to ask one of their resellers). The machines tend to be very reliable.
Another option is to use Sun Solaris’s ZFS, which is designed to keep data safe using inexpensive disks. Sun’s X4500 (which is overkill for you) gives you two dual-core AMD CPUs and 12TB or 24TB in 4U. Solaris 10 does support NFSv4 (as well as v2 and v3). Solaris 10 also runs on quite a bit of other vendors’ (IBM, Dell, HP) hardware if you want something more “normal”, though it would be prudent to check the HCL.
So while you may not go with Sun’s hardware, the power of ZFS (on any hardware) allows you to use cheap disks without having to worry about losing data. Keeping two machines sync’d with ZFS is a matter of periodically snapshotting and running ‘zfs send volume/myfs@snap | ssh othermachine zfs recv othervol/otherfs’.
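In practice zfs send operates on snapshots, so periodic replication (pool and host names invented) would look more like this, with ‘-i’ shipping only the changes since the previous snapshot:

    # take a snapshot, send the delta, apply it on the standby machine
    zfs snapshot volume/myfs@2007-01-02
    zfs send -i volume/myfs@2007-01-01 volume/myfs@2007-01-02 | \
        ssh othermachine zfs recv othervol/otherfs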
Just a couple observations.
Solaris ZFS is a pretty interesting piece of work.
http://www.sun.com/software/solaris/zfs.jsp
Sun sells a mass storage box at claimed low cost per gigabyte.
http://blogs.sun.com/jonathan/date/20060710
Sun even offers a “Try and Buy”.
Of course, if your time is of no value, then engineering your own solution is probably better.
I just realized that I should leave you with this link to the limitations of the CODA file system. Reading through it, CODA might not match your expectations. I have no experience with it, unfortunately.
Brad Fitz and co at SixApart/Danga are doing interesting things in this space.
If you’re willing to treat photos as objects, rather than insisting on POSIX file semantics, then scalable stuff is cheap.
All the code is Open Source (good)
All the code is perl (ymmv)
The mailing list appears to be alive (good)
http://www.danga.com/mogilefs/
As far as I know all the large media for LiveJournal runs on this code, pictures and audio mostly.
Here they keep buying more FC disk and shovelling tapes into robotic libraries, but that’s the difference between HPC and commodity web.
Good luck
Your proposed architecture sounds very similar to what we use at the UK Mirror Service (see http://www.mirrorservice.org/) — a set of moderately smart frontend load-balancing machines, and another set of redundant backend storage machines that the load-balancers talk HTTP or our SFTP-like filesystem protocol to.
The backends have a few big cheap ATA disks in a RAID0, and each backend machine is completely replicated so we can take one out of service without users needing to care (usually to swap disks, although we’ve had significantly fewer disk failures since moving away from our old SCSI arrays). Replicating content around disks (including rebuilding disks when one’s replaced) is handled by our mirroring software.
There are some slides here from a talk I gave a couple of years ago about our old setup; we’ve since implemented the “making it cheaper” plan described in the slides:
http://www.mirrorservice.org/help/ukms-seminar-redux.pdf
1) Have a look at Lustre from CFS. It’s a clustered filesystem product; the code is open source and you pay for support. You can do Lustre on direct-attached storage, though you should probably wait until their networked/Lustre RAID is available, or do RAID 6 locally if you’re not going to do fiber channel.
2) Have a look at the Linux network block device. It’s entirely possible to do RAID 1 across the network (see the sketch after this list), but I don’t think the supporting tools are really all that evolved.
3) Back in the clustered filesystem space, you might have a look at PVFS2 or GFS. They’re both open source and coming along well from what I hear. I don’t know about their ability to use direct-attached storage, though.
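To make option 2 concrete, here is a hedged sketch of RAID 1 over the network using nbd plus Linux software RAID (device names and the port number are arbitrary, and older nbd-server releases take ‘port path’ arguments as shown rather than a config file):

    # on the remote machine: export a raw partition over TCP
    nbd-server 2000 /dev/sdb1

    # on the local machine: attach the remote disk, then mirror a local
    # partition against it with md
    nbd-client remotehost 2000 /dev/nbd0
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/nbd0
    mkfs.ext3 /dev/md0
    mount /dev/md0 /var/photos

The catch, as noted, is the tooling around failure and resync when the network drops.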
You might want to check out MogileFS (http://www.danga.com/mogilefs/) from the folks who created/run LiveJournal. It seems to be geared toward a similar problem: it stores immutable files in a distributed way to eliminate single-points-of-failure. All servers can be involved in actively serving files, and it’s all open source.
I think it was originally designed as a back-end for photo hosting needs, so it even has features that could be right up your alley: for example, it can be set up to know the difference between the original photo and the thumbnail, and you can say things like, “store the original photo on three different physical servers, but you only need to keep the thumbnail on one physical server because if that server dies we can re-generate the thumbnail from the original.”
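That policy maps onto MogileFS “classes”. With the stock mogadm tool the setup might look roughly like this (the domain and class names, and the tracker address, are just examples):

    # one domain for the site, two classes with different replication counts
    mogadm --trackers=127.0.0.1:7001 domain add photo_net
    mogadm --trackers=127.0.0.1:7001 class add photo_net original  --mindevcount=3
    mogadm --trackers=127.0.0.1:7001 class add photo_net thumbnail --mindevcount=1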
I believe Linux provides DRBD, which implements a remote block device over a network link and could be configured as part of a mirrored disk arrangement.
The direct answer is: yes, RAID 1 is possible over a network (see http://www.drbd.org/).
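For the curious, the heart of a DRBD setup is a single resource stanza in /etc/drbd.conf; a minimal sketch (hostnames, devices, and addresses are placeholders) looks like:

    resource photos {
        protocol C;                  # synchronous replication
        on server1.photo.net {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on server2.photo.net {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

You make a filesystem on /dev/drbd0 and mount it on the primary; the secondary just follows along.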
However, I think that this is a PERFECT place to implement Perlbal (see http://www.danga.com/perlbal/) and MogileFS (see http://www.danga.com/mogilefs/).
I would think that with a solution involving these you would be able to store well beyond 2TB worth of photos on commodity hardware. MogileFS has redundancy built in via a minimum number of duplicate copies within the storage cluster.
Layer A [ Physical Network (WAN) ]
Layer B [ Front Facing Infrastructure (FW/HA/Load Balancer) ]
Layer C [ Physical Network (DMZ) ]
Layer D [ Traditional Web Servers (application) ]+[ Perlbal|MogileFS (static content) ]
Layer E [ Physical Network (LAN) ]
Layer F [ Database Servers ]
You might consider asking Kevin Closson. He knows Oracle and he knows storage.
http://kevinclosson.wordpress.com/
Another option would be ATAoE (ATA over Ethernet). It does not fit exactly with what you propose, but it is a decent alternative to a SAN and those EMC thingies you don’t really want to spend money on.
Cluster file systems still have a serious performance hit, but iSCSI is coming of age. If you have a limited number of servers, a bunch of Apple XServe RAIDs with a cheap Fibre Channel switch is comparatively cheap (I’m assuming you don’t have the manpower to deal with the unreliability of white-box PC storage).
Consider also the Sun X4500 server, $36K list for 16TB of storage (48 SATA drives). This puppy will break 1GBps of I/O throughput without breaking a sweat, and when combined with ZFS, will give you all the manageability of an EMC apart from remote mirroring (and even that could be done using the iSCSI target in OpenSolaris).
Fazal: I went over to the Sun Web site. That X4500 was closer to $70,000 list when fully populated with hard drives and RAM. It did look like pretty much the right thing and that ZFS seems kind of nice.
CoRAID makes kernel drivers for Linux and Solaris that make the OS think it is talking to a local ATA drive when in fact it is talking to an ATA drive over Ethernet (*raw* Ethernet, actually). The advantage is that you can do all kinds of mirroring if you want, and speed is as fast as your network card will go. Note that this is a very different solution from most NAS-oriented ones because it operates at such a low level: you get a lot of stuff for free and don’t have to deal with complicated software. I know some people who work there, and if you want I could probably hook you guys up with a trial rack.
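On the Linux side it really is just the aoe driver plus the aoetools package; a hedged sketch (the shelf/slot address e0.0 depends on how the CoRAID box is configured):

    modprobe aoe                   # load the ATA-over-Ethernet driver
    aoe-discover                   # broadcast for exported devices
    aoe-stat                       # list what was found, e.g. e0.0
    mkfs.ext3 /dev/etherd/e0.0     # then treat it like any local disk
    mount /dev/etherd/e0.0 /var/photos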
If you want to know how ZFS was designed, there’s a series of videos available on Sun’s site:
http://www.sun.com/software/solaris/zfs_learning_center.jsp
Two other features of Solaris 10 could be useful in addition to ZFS: DTrace and zones.
I’ll probably get torn apart for suggesting this, but Dell makes reasonable SAN/NAS devices and they are going to be a lot cheaper than Sun / EMC.
The cool kids – including MySpace – are using Isilon (http://www.isilon.com) gear for clustered storage these days.
Sorry about that. I guess that was a promotional price that’s no longer available except to educational customers.
If you Google around, you can still see references to the 12TB model for $32,995 that was introduced in July, fully populated with 48x 250GB drives. That was *not* a promotional price. I guess Sun decided it was too cheap and too popular for them to meet demand, or perhaps continued availability of 250GB SATA drives is becoming a problem.
Even if it is no longer available as a standard off-the-shelf configuration (the minimum is now 24TB), you may still be able to order it from a VAR. Or complain to Chief Blogging Executive Officer Jonathan Schwartz. For many apps, I/O throughput is much more important than storage capacity and 2 Thumpers with 12TB each are way preferable to one with 24TB.
ZFS on a Sun Thumper (http://www.sun.com/servers/x64/x4500/)
I should add, talk to the guys over at Joyent (http://joyent.com) if you want opinions on the Thumpers. They’ve launched a product around them (http://bingodisk.com).
I would love to see a cheap (free), mature, clustered file system. It doesn’t exist. Lustre is the closest, but I don’t think it’s fully baked yet. There are some decent commercial products from Panasas and Isilon… but they aren’t cheap.
Solving Phil’s problem doesn’t require a clustered file system. It needs some sort of cluster-wide persistent state store used by the web applications. This is commonly implemented as application-level services and a “file system”-like set of interfaces provided to the web apps via a library. Examples of this include the previously mentioned MogileFS, the Google File System, Hadoop DFS, and, I would bet, whatever is sitting under Amazon’s S3.
Several people have mentioned ZFS / Thumper. ZFS is a great *local* file system, but it doesn’t let you spread data across multiple systems. Some simple scripts would let you perform basic replication between machines, which could give you a write master with read-only copies, provided you separated the read and write paths to the data.
In the research community there have been a number of interesting projects such as Farsite from MSR which takes Castro’s work on Byzantine fault tolerance to the next level, OceanStore from Berkeley, and Dstore from Stanford. None of these are available for general use.
One final note. I haven’t taken the time to look at S3’s pricing again, but when Amazon first announced S3 it was clear that it would be possible to store the data more cheaply than their pricing; the bandwidth charges (especially for a bursty application), however, looked like a pretty good deal that would make up for the storage costs.
Stay the heck away from Linux and use Solaris plus ZFS on a supported SATA controller. Export everything you want via NFS v3 or v4. Use “zfs send” and related tools (including rsync and zsync) to back up the system.
For 2 TB, get 6x 500GB drives ($1200) and put them in a 2U system as mentioned in the article. ZFS “RAIDZ” (similar to RAID 5) on 5x 500GB drives = 2TB of space, with the extra 500GB drive serving as hot spare or boot disk.
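A hedged sketch of that layout on Solaris (the cXtYd0 disk names are illustrative; yours will differ):

    # five disks in a RAIDZ vdev plus one hot spare, then share over NFS
    zpool create photos raidz c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 spare c1t6d0
    zfs create photos/images
    zfs set sharenfs=on photos/images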
I can’t stand all these Linux weenies who think 3x the hardware needed plus Perl programs which run 3x slower equals a “solution”.
I don’t see why you can’t consolidate your current 2 racks of hardware into about 30U of space total and give back the second rack, saving yourself some $4-6K per year right there in unneeded rack costs.
Given what I know of your setup, I could fit all the servers you actually describe in your link, in about 8U of space, as follows:
1U of space: mini 1U firewall plus mini 1U front end proxy ($1000 total)
1U gigabit switch (unsure of price)
2U : dual-Opteron, dual-core system with 16GB or 32GB of RAM and aforementioned SATA setup configured as server1.photo.net
2U : identical system configured as server2.photo.net
2x 1U database servers set up with cluster or failover code (not sure how to do that with Oracle, but I am sure it can be done)
That is 8U right there.
Add space for a monitor, keyboard, remote reboot switch, a separate machine to do log processing or other tasks, etc., and you can keep it under half a rack (21U), except you will need enough power to drive everything, so you may as well buy a full rack.
- Sun’s x4500 is a very nice “storage brick”, and if you ran Linux on 2 or more of them with Lustre you would have a very scalable solution; however, you will need to wait if you want their “LAID-1” (Lustre RAID 1) to be fully implemented. From a purely financial point of view it’s very hard to beat, and the performance we’ve seen from our in-house tests shows that this box breaks some of our assumptions 🙂 (note: talk to Jason Hoffman from TextDrive if you need more information about ZFS/x4500; they’ve been using them in production for a while now)
- Isilon: has all the features you are looking for: redundancy, scalability, and ease of administration. It’s not the cheapest in $/GB compared to x4500s + Lustre; however, it’s a different beast.
- Lustre: I never cease to be impressed by this filesystem; not only is it open source, but the company and developers working on it are very clued in. Colleagues of mine at other facilities have been using Lustre for several years now and their experience is very positive. We haven’t used it internally except for proofs-of-concept yet, but we are in the process of evaluating it as part of our storage infrastructure.
Hope this helps with your research. 🙂
/B
Here’s a simple software-based RAID 1 product that can back up live across the network, called MirrorFolder:
http://www.techsoftpl.com/backup/
I’ve tried it and it works pretty well.
One more idea: a replicated mysqlfs FUSE rig. The nice thing would be that existing admin skills and practices could be leveraged if you already have other MySQL components in the system; backup, failover, and recovery should be analogous. You wouldn’t be the first person to try it:
http://www.nabble.com/Re:-An-idea:-rsyncfs,-an-rsync-based-real-time-replicated-filesystem-p6508554.html
However, I don’t believe this would be update-synchronous. Also, I don’t have a lot of experience with FUSE…
Lustre sounds interesting…
I don’t know if I am a sysadmin ‘genius’ but I deal with about 200TB of storage every day. While I am a software developer by training, I’ve learned enough about being a sysadmin to have a lot of respect for what it entails.
First of all, what kind of staffing are you planning for photo.net? As in, how many full-time developers/sysadmins do you plan on keeping? That has a lot of bearing on what kind of solution you should choose. Once you get over a few TBs of storage, and you want fault tolerance, and you insist on managing the storage and infrastructure yourself, you’re really looking at one full-time developer/sysadmin just to keep everything happy, if you can find someone competent at both.
I’m not sure the architecture and analysis you linked to factor in all the costs of managing the storage yourself, including power, colocation, replacement drives (about a 30-50 percent failure rate per year for heavily used SATA drives), and the manpower to keep everything operational. For your storage and bandwidth needs, I’d forget about the headache of managing the storage yourself and go with S3. Amortized over a few years, you will pay more if you do it yourself, and end up with a system that is less reliable and performs worse. SmugMug seems to be very happy with their decision to use S3.