We’ve bought a couple of very nice computers from Silicon Mechanics recently. They are cheap, compact, and hold lots of hard drives. We bought them mostly for scratchpad NFS servers and also figured that we’d use them somehow for our photo database. Trying to save some $$, we configured each with only 4 GB of RAM. As it happens, we were able to replace a very complex collection of multiple computers, each with multiple processes, with just a single Silicon Mechanics machine with eight hard drives in rat-simple hardware RAID 5 and two dual-core Opteron CPUs. The machine does pretty much nothing except run 8 lighttpd processes, each of which runs a single thread and serves multiple Web clients via async I/O. Jin set all of this up based mostly on comments here in this Weblog on an earlier posting.
Now that we’ve disabled the operating system’s habit of writing a “most recently read” time (“atime”) for every file served, the server is acceptably fast. However, it is running disk-bound, with the RAID drives at 65 percent utilization, i.e., not much room for growth.
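For anyone curious, the atime change is just a mount option; a minimal sketch, assuming the photo volume is ext3 (the device and mount point here are made up for illustration):

    # remount the photo volume without access-time updates
    mount -o remount,noatime /web/photos
    # and make it stick across reboots in /etc/fstab:
    # /dev/sda1  /web/photos  ext3  defaults,noatime  0  2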
Our old SQUID server had 8 GB of RAM and was limited to serving thumbnail images. I suspect that if we did the obvious simple thing and filled this new computer up with 16 or 32 GB of RAM, we would have superb performance. But I’m wondering if we can’t save a few $$ and cheat a bit by controlling the Linux Virtual File System (VFS) cache. Right now we are not doing any caching in lighttpd, nor does the underlying ext3 system cache. This is by design. We want to keep only one copy of any given photo in RAM cluster-wide. I.e., we want the only copy to be in Linux VFS. We think we could get by with less RAM if we prevented Linux from pushing thumbnails out of the cache when pulling a medium or large JPEG from the RAID. In other words, we’d like to say “preserve 2 GB of the machine’s RAM for caching files of the form *-sm.jpg”, in effect having two caches for files, one for thumbnails and one for all other files.
Can this be done without massive amounts of C-code hacking and making it hard to upgrade the OS version?
I seriously doubt that second-guessing the Linux implementation and splitting the buffer cache into two pools, one for thumbnails and one for large images, is going to help. If there is one thing I remember from sleeping through queueing theory classes, it’s that a single fast queue offers better performance than two small ones. If thumbnails really are accessed frequently, the cache’s LRU policy would ensure they stay high in the freshness rankings and are not very likely to be replaced.
Now, if you’re afraid the Linux cache implementation is weak or that memory fragmentation may become an issue, that’s another matter, but that would suggest experimenting with other OSes like FreeBSD or Solaris, not tinkering with split caches.
A couple of things off the top of my head:
1. Create a ~2G ramdisk and copy the thumbnails there – this assumes all of the thumbnails will fit in that space, and that the URLs can be trivially remapped to that part of the filesystem. Of course you’d want to make sure you had another copy of the thumbnail elsewhere, since the ramdisk goes away when the server restarts.
2. Apache has the functionality, through mod_file_cache, to cache files at start time; however, I’d imagine this would not be dynamic enough for your site (the webserver must be reloaded to cache new files). All of my experience is with Apache, so I don’t know whether there’s a lighttpd analogue.
3. I’ve recently encountered some performance problems on Linux, especially with SATA drives on certain controllers (3ware in particular, maybe others). What kind of disk throughput is giving you 65% utilization of the array? In scattered read/write situations here, we only manage about 2.5 MB/second of throughput to our 4-disk RAID 5 arrays at 100% disk utilization. In some real-world application benchmarks, we had a machine running Nexenta turn in about 5x the I/O throughput of a Debian Linux machine. (Nexenta is an OpenSolaris-based operating system with Debian-style package management.)
-A
Philip,
I don’t have any clever answers for the LVFS question, but I was curious to know what vendor selection process you went through to pick Silicon Mechanics. I hadn’t heard about them until now, and in the briefest of preliminary Google searches to see if anyone had written up a review of them and their machines, I discovered Microsoft’s TerraServer-USA site is also backed by Silicon Mechanics machines:
http://www.terraserver.microsoft.com/About.aspx?n=AboutTechSilMech
Is there anything you or Jin could tell us about the decision to buy from Silicon Mechanics?
Thanks in advance!
I’m just a Unix weenie, not a kernel hacker, but I suspect that would be pretty hard: the cache should be working with inodes directly rather than accessing files by name (e.g., because you only want to cache a hard-linked file once), so a change like that would be big and ugly and touch several distant places in the kernel: bad idea.
Slightly better might be to check in the actual file contents for a JPEG magic number and, if present and the size is small enough, make it stickier. (Here’s where we get into stuff I don’t know about, like whether there’s even a way to make some pages stickier than others, so I’ll stop.)
Use a RAM disk.
Hi Phil. Have you confirmed that most of your db io traffic is reads and not writes? No amount of caching will help you if your traffic is writes. I’m guessing you checked iostat/sar, but I wanted to make sure.
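E.g., something along these lines (assuming sysstat’s iostat is installed; device names will differ):

    # extended per-device stats every 5 seconds
    iostat -x 5
    # sar can show the same history if sysstat's data collection is enabled
    sar -b

Watch the reads-per-second vs. writes-per-second columns (and %util) for the RAID device.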
If a single cache server is good enough to hold what you want then a squid or apache node is the fastest, cheapest, easiest to maintain answer.
Where memcached shines is the ability to cheaply collect multiple machines together, treating them as a single logical cache (both for performance and availability).
Only by creating a tmpfs volume and using that as a small cache. Playing with the kernel block cache isn’t easy. You can have atime turned on for the tmpfs volume and purge the least recently used thumbs to keep some room.
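A rough sketch of the tmpfs idea, assuming the thumbnails match *-sm.jpg and you can spare 2 GB (the paths are made up):

    # create a 2 GB RAM-backed volume for thumbnails
    mkdir -p /var/cache/thumbs
    mount -t tmpfs -o size=2g,mode=0755 tmpfs /var/cache/thumbs
    # prime it with the current thumbnails; they vanish on reboot,
    # so keep the originals on the RAID
    find /web/photos -name '*-sm.jpg' -exec cp {} /var/cache/thumbs/ \;

Keep in mind that tmpfs pages come out of the same memory pool as the page cache and can even be swapped, so this only approximates a dedicated pool.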
Another approach would be to use memcached to keep a networked cache of thumbnails. That’s what I would try.
hth
Try valuing the developer time at a reasonable rate and compare and contrast those costs to the cost of using something like Amazon S3. I know this was brought up the last time you asked about solutions but it might be time for another look.
You can repurpose or sell your hardware on ebay. Try to look at this from a business perspective instead of a technical one.
It’s not exactly what you’re after, but might memcached do the trick?
What you are looking for is probably a direct I/O threshold. I think you want small files (thumbnails) to go through the system buffers but not bigger files. You would set it to slightly more than your typical thumbnail size. I’m not sure of the exact tunable for ext3, but for VxFS it’s discovered_direct_iosz. BTW, these are bread-and-butter questions for performance engineers 🙂
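For what it’s worth, on VxFS that tunable is set per mount point with something like the following (syntax from memory, so check vxtunefs(1M); ext3 has no equivalent knob as far as I know, and the mount point is hypothetical):

    # VxFS only: I/O larger than this size bypasses the buffer cache
    vxtunefs -s -o discovered_direct_iosz=65536 /web/photos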
I don’t know the answer you’re looking for specifically, but I’ll try to reframe the question in case you already know part of the answer.
What if you used one fileserver to serve thumbnails, loaded it up with RAM and turned caching on, and used the other fileserver to serve full-sized images? That would leave you with an imbalance of load (in volume and pattern) on the two servers, but it sure should be easy to set up and try, and if you can keep most of the thumbnails in RAM, the imbalance of load shouldn’t hurt you (as I presume, like most sites, you serve more bits of thumbnails over the wire than you do full images).
Fazal: Good point on the single queue. You are probably right that we just need to suck it up and get more RAM.
Amiller: We don’t want to use mod_file_cache because then we’d need twice as much of the RAM that we can’t afford, since we would have one copy in the Linux file system RAM cache and one in Apache’s memory image, no? What kind of throughput are we getting? The Seagate Web site says a single disk drive should do 300 MB/second all by itself. Jin used hdparm to get 6 MB/second on our array, max. Our actual load is 2.5-3 MB/second.
Dan: Mike Sisk, an old friend and Unix expert, told us that Wikipedia was running on Silicon Mechanics and recommended that we try them. The packaging is much better than Dell’s, at least for us. Instead of a box and a separate disk array with expensive SCSI disks, we get a smallish box with room for a bunch of cheap SATA disks. Everyone that we’ve dealt with at Silicon Mechanics seems to know a lot more about hardware than we do, and they are very accessible.
Neil: Shocked to see that you’re still alive and hacking Unix. It was an even mix of reads and writes until we turned off the atime updating. Now it is truly all reads.
Phil: Noted that you just LOVE Amazon S3! I think we need to keep at least one copy of the user data, though. Maybe not. If we can ever finish cleaning up our own little cluster of Unix horrors, we will turn our attention to rebuilding everything to use S3 more extensively. I think we should get our house in order first, though. Remember that one day Amazon may discontinue S3.
Actually, the max theoretical load on the array is 125-140 MB/s according to hdparm -t. Or rather, that’s on a second, identical, but quiescent machine. The 6 MB/s mentioned above was measured on the same machine that was getting hammered by reads, so it’s not a very accurate measure. We were seeing maybe 2-4 MB/s at 100% utilization with a lot of scattered reads and writes, which sounds about like all we can hope for according to amiller above.
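For anyone repeating the measurement, the numbers came from something like this, run against whatever device node the 3ware card exposes (/dev/sda here is just an example):

    # -t times raw sequential reads from the device; -T times reads
    # that are satisfied from the Linux buffer cache
    hdparm -tT /dev/sda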
Jim: Ideally you want to place large files on an unbuffered filesystem (i.e., one doing direct I/O) and small files on a normal one. Unfortunately, from what I’m reading, the direct I/O option on Linux is system-wide and there doesn’t seem to be a way to enable direct I/O only for large reads. See http://www.linux.com/howtos/SCSI-Generic-HOWTO/dio.shtml
Can you do some trick with mounting the filesystem via a raw device? http://www.linux.com/howtos/SCSI-2.4-HOWTO/rawdev.shtml
1. Seeing that you have two systems, just take 2GB (you should have gotten 1GB DIMMs and you have at least 8 RAM slots in each system) from one system and bump one of the machines up to 6GB and see if there is a performance difference.
2. I am buying RAM for my Opterons at about $85/GB. Even assuming $100/GB you can see that it may be cost effective to just take each system to 8GB or even 16GB.
3. My guess is that you are throwing away a lot of disk bandwidth in terms of reads, since for a small thumbnail of, say, 5K you may actually be reading 64 KB or 128 KB or even more due to the stripe size. Double-check the stripe size on your RAID 5 controller, and check the “stride” parameter on your ext3 filesystem.
4. You have a very large RAID 5 setup that looks to the OS like one disk, correct? Read the mke2fs man page very carefully (e.g., http://www.die.net/doc/linux/man/man8/mkfs.ext3.8.html); remaking the filesystem with different options than the defaults (see dir_index, for example) will probably increase performance. There’s a sketch after this list.
5. You don’t say which Linux kernel version you are running. You may wish to try a custom compile of a newer kernel (my guess is you are running 2.6.9 at the latest). You can copy the .config file from the current kernel, almost eliminating the need to know anything about Linux kernel details.
6. Did you create the RAID 5 as two separate 4-disk sets, or as one large 7-with-one-spare or 8-no-spare RAID? If you have 3Ware, you are probably running into a performance limitation with their hardware RAID cards.
7. The SGI-contributed XFS will probably give you better performance than ext3.
8. There are different I/O schedulers available in the Linux kernel. Try using one other than the default (all this requires is a reboot to test; see the sketch after this list). See http://www.wlug.org.nz/LinuxIoScheduler for some more info.
9. You have an excellent opportunity to test a different OS since you have 2 identical hardware systems and a known load. Why not throw FreeBSD or even Solaris 10 on one system and see how it does vs. Linux?
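To make items 3, 4, and 8 concrete, here is a sketch under made-up assumptions (a 64 KB RAID stripe, 4 KB filesystem blocks, and the array showing up as /dev/sda1; remaking the filesystem of course destroys whatever is on it):

    # item 4: rebuild ext3 with a stride matching the stripe
    # (stride = stripe size / block size = 64 KB / 4 KB = 16)
    # and with hashed directory indexes
    mkfs.ext3 -b 4096 -E stride=16 -O dir_index /dev/sda1

    # item 8: see which I/O scheduler is active and switch it at
    # runtime (on 2.6 kernels that expose this); no rebuild needed
    cat /sys/block/sda/queue/scheduler
    echo deadline > /sys/block/sda/queue/scheduler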
You might be able to simulate a separate pool of memory for caching a set of files by implementing the cache semantics in user-mode. This would be done by doing mmap() and mlock() on a file to bring it into the pool; and munlock(), madvise(.., .., MADV_DONTNEED) and munmap() to release it. You’d also need to implement the cache replacement policy.
Caveats: you’d be limited to 3GB of cache space in a 32-bit user mode (or you could spread the cache across multiple processes), and the MADV_DONTNEED may have strange interactions if you’re writing to the files in the cache.
Why RAID5? Do you need the extra disk space that much?
Chris: Why RAID5? We thought we had 99% reads, so write performance wasn’t an issue. We wanted data security so we needed some kind of RAID. We have no money so we wanted to milk as much as possible from one box (cheap to buy, cheap to admin) with SATA disks.
Patrick: We do have a 3Ware 9650 RAID controller. We let Silicon Mechanics set up the RAID with no spare. I think it is just one big one. The RAM is (4 x 1GB) DDR2-667. Where can we get more of this at $85/GB? The cheapest that I’ve seen is around $150.
hmm. How many requests per second are you serving? Are you limited by seek rate or data rate?
One quick hack to achieve what you propose would be to open the non-thumbnail files O_DIRECT (probably feasible by a trivial hack to lighttpd), but that screams out “bad solution” to me….
Try using XFS. It was developed at SGI to serve multimedia content.
“””The machine does pretty much nothing except run 8 lighttpd processes, each of which runs a single thread and serves multiple Web clients via async I/O.”””
Note that you might well have been confused about what “AIO” means in lighttpd (http://illiterat.livejournal.com/2989.html); this is very likely not helping you and could easily be causing you harm (eating some of your RAM with extra buffering).
Before you spend money on the problem, try using the normal “Linux sendfile” backend and tweaking the number of lighttpd processes you have running, to cancel out any NFS latency. This might not help enough, but it’s worth trying.
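A minimal sketch of what that might look like in lighttpd.conf (option names as I remember them from the 1.4.x series; check the docs for your build):

    # use plain sendfile() instead of the "AIO" backends
    server.network-backend = "linux-sendfile"
    # several worker processes help hide per-request disk latency
    server.max-worker = 8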
There is a more complicated, better solution, but I can’t fit it in the margin ;).
I would think it would be cheaper to buy lots of RAM (e.g., 32 GB) than to spend time trying to get by with less RAM, when you factor in the engineering costs.