Any VMware experts reading this blog?

In the primitive old days we would get a new server with a 400 MHz CPU, 512 MB of RAM, install Unix, Oracle, and AOLserver, and be up and running after one long miserable day of system administration. Fortunately we live in the modern age. Two months ago, I bought a server with 4 GB of RAM, a processor with multiple gerbils running at the speed of light, and handed it over to a couple of young whiz kids. They laughed at my idea of simply installing Linux, Oracle, and AOLserver. “You’re going to use this box for multiple development servers,” they said. I replied that Unix was all set up for this with users, groups, and file/directory permissions. We would just create one group per development server and add people to the group. This response resulted in peals of laughter. Didn’t I know about VMware? They would create a virtual machine for each development server and entirely separate user account bases for each virtual machine. The poor little pizza box would now be burdened with running four copies of Linux, one for the underlying machine and one for each of the three development servers. This seemed like a waste of the gift that the brilliant hardware engineers had given us of 4 GB of RAM, but isn’t it the job of programmers to render worthless the accomplishments of hardware engineers?

I let them do it their way. Two months later, the box still isn’t up and running. What are the issues? We have one minor issue with time keeping. VMware supposedly lets you have the underlying box look up the time from NTP servers and set the system clock and then the virtual machines are supposed to get their time from there. That isn’t working for some reason and the virtual machines always have the wrong time. (We can laugh at Windows Vista, but I have never seen a Vista machine that was off by more than a second or two.)

The more serious issue is that the machine simply hangs up and won’t respond to keystrokes for several seconds out of every minute or two. At first I figured that the problem was virtual machines being paged out to disk so I asked the whiz kids to disable swap for each and every one of the four Linux installations. They did that and the machine is still halting temporarily. Here’s a description of the problem from one of the whiz kids (they will remain nameless so that they don’t need to be more ashamed than they already should be)…

We're running VMware Server on Linux and have noticed that the virtual machines will hang for several seconds at a time. This is creating serious performance problems and making even the most basic console interactions painful. The disk access light in the status bar of the management console is lit for the duration of the freeze. When the light goes out the VM resumes running smoothly. At first we thought this was a paging problem but upon adjusting memory allocation neither the VMware host nor any of the virtual machines report any swap use whatsoever. The problem continues to occur even with swapping disabled. Checking the vmware.log file reveals hundreds of disk timeouts. Here's a five minute span from the log file:


Jul 07 14:57:30: vmx| DISK: DISK/CDROM timeout of 13.156 seconds on ide0:0 (ok)
Jul 07 14:58:00: vmx| DISK: DISK/CDROM timeout of 7.435 seconds on ide0:0 (ok)
Jul 07 14:58:35: vmx| DISK: DISK/CDROM timeout of 9.563 seconds on ide0:0 (ok)
Jul 07 14:59:13: vmx| DISK: DISK/CDROM timeout of 9.763 seconds on ide0:0 (ok)
Jul 07 14:59:45: vmx| DISK: DISK/CDROM timeout of 5.775 seconds on ide0:0 (ok)
Jul 07 15:00:21: vmx| DISK: DISK/CDROM timeout of 8.015 seconds on ide0:0 (ok)
Jul 07 15:00:57: vmx| DISK: DISK/CDROM timeout of 9.939 seconds on ide0:0 (ok)
Jul 07 15:01:31: vmx| DISK: DISK/CDROM timeout of 10.320 seconds on ide0:0 (ok)
Jul 07 15:02:00: vmx| DISK: DISK/CDROM timeout of 3.345 seconds on ide0:0 (ok)
Jul 07 15:02:37: vmx| DISK: DISK/CDROM timeout of 6.182 seconds on ide0:0 (ok)
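
To see how bad the stalls are over a whole log rather than a five minute excerpt, the timeout lines can be tallied per device. The following is a minimal sketch in Python, assuming every timeout line in vmware.log has exactly the shape shown above; the function name and the regular expression are illustrative and not part of any VMware tool.

```python
import re
from statistics import mean

# Matches lines shaped like the excerpt above; any other log line is ignored.
# Assumption: the timestamp and message layout never vary from this sample.
TIMEOUT_RE = re.compile(
    r"^\w{3} \d{2} \d{2}:\d{2}:\d{2}: vmx\| DISK: DISK/CDROM timeout of "
    r"(?P<seconds>\d+(?:\.\d+)?) seconds on (?P<device>\S+)"
)


def summarize_timeouts(lines):
    """Return {device: (count, mean_seconds, max_seconds)} for timeout lines."""
    per_device = {}
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            per_device.setdefault(match["device"], []).append(float(match["seconds"]))
    return {dev: (len(v), mean(v), max(v)) for dev, v in per_device.items()}


# The first three lines of the excerpt above, used as sample input.
SAMPLE = """\
Jul 07 14:57:30: vmx| DISK: DISK/CDROM timeout of 13.156 seconds on ide0:0 (ok)
Jul 07 14:58:00: vmx| DISK: DISK/CDROM timeout of 7.435 seconds on ide0:0 (ok)
Jul 07 14:58:35: vmx| DISK: DISK/CDROM timeout of 9.563 seconds on ide0:0 (ok)
"""

if __name__ == "__main__":
    for device, (count, avg, worst) in summarize_timeouts(SAMPLE.splitlines()).items():
        print(f"{device}: {count} timeouts, mean {avg:.3f} s, worst {worst:.3f} s")
```

To run it against a real log, pass the open file object in place of SAMPLE.splitlines().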

Any advice from a VMware hero? When is it time to wipe the machine and run it like we would have in 1978?

43 thoughts on “Any VMware experts reading this blog?”

  1. I’ve found that the linux host is the weakest of all the vmware hypervisors. Whether that is VMWare’s fault or Linux’s fault doesn’t really matter, but I suspect there’s plenty of blame to go around. (And unfortunately VMWare doesn’t support using FreeBSD as a host.)

    I’d suggest using either (horror of horrors) Windows Server (which is what I do, and run mostly FreeBSD VMs underneath), or use one of their bare-metal hypervisors, ESX / ESXi.

  2. I know almost nothing about VMWare. However, I can thoroughly recommend OpenVZ. In less than an hour you’d be up and running with as many virtual machines as you want, whilst only running one kernel. This gives you the best of both worlds – you would get your almost zero overhead horsepower, your techie whiz kids would get an easy to manage virtual machine environment. Create one image with all your Oracle, AOLServer, stuff and deploy new copies in a couple of minutes.

    We’ve been using it for a while on about 30 servers and it hasn’t crashed or done anything weird yet.

  3. Phil,

    I would point my finger at disk contention being the issue. On the VMware host, each virtual machine is represented by a set of files. Any activity in the virtual machines result in the host updating these files. When multiple virtual machines are in use simultaneously, multiple sets of files also need to be updated simultaneously. This is where you should imagine disk heads spinning frantically and the controller becoming dizzy from monitoring them. The solution to this problem obviously is to minimize disk contention. The simplest way to do this is to put each VM’s fileset on a separate disk and controller.

    This is my theory. You’ll need to test it to (in)validate it.

  4. Not knowing much about VMWare on Linux, I’d say try pointing the VMWare CDROM device on each of the four VMs at an ISO rather than the hardware CDROM device.

  5. You don’t provide a whole lot of specifics on your configuration so I am guessing at a few things.

    Firstly, it appears you’re using the free VMware Server product. It works, but honestly it’s years behind ESX and Xen performance-wise, and I/O performance and timekeeping are its biggest Achilles’ heels. All can be worked around, but after running it for several years I’ve given up on it for all but the most trivial tasks.

    Secondly, I’m guessing you’re running on a slightly older linux variant, like CentOS 4, on your host. Older linux kernels have a major problem with starving read IO anytime heavy write activity hits the disk. Obviously a virtualized environment compounds the problem badly. Try doing dd if=/dev/zero of=/tmp/big.fil bs=1M count=4096& and then running ls /usr/bin to see the problem in action. Upgrading to a newer linux kernel, playing with the IO scheduler, adding RAM, or adding disk spindles (ideally a mirrored pair for each VM!) could all help in their own way.
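
The dd-then-ls test above shows the stall by eye. To put a number on it, one option is to time repeated directory listings while the dd write runs in the background. This is a rough sketch in Python, not a benchmark: the function name is made up for illustration, and a listing that is already in the page cache may return quickly even on a starved disk, so a small worst case does not prove the disk is healthy.

```python
import os
import time


def worst_listing_latency(directory, repeats=20, pause=0.0):
    """List `directory` `repeats` times and return the slowest listing in seconds."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        os.listdir(directory)  # roughly what `ls` has to do, minus the sorting and stat calls
        worst = max(worst, time.perf_counter() - start)
        time.sleep(pause)      # spread the samples out over time
    return worst


if __name__ == "__main__":
    # Quick demonstration on the current directory.
    print(f"slowest listing: {worst_listing_latency('.', repeats=5):.4f} s")
```

For the test described in the comment, call worst_listing_latency("/usr/bin", pause=0.5) while the dd command is still writing.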

    Personally, I’ve spent too much time working around VMWare Server issues in production to recommend it for anything other than playing around. If you want to virtualize, fine, but you probably will want more ram and a real virtualization solution like VMWare ESX or maybe Xen if you’re feeling cheeky. Plan to get a lot of disks as well. You can get a nice rackmount server from dell including their new integrated ESXi hypervisor quite cheaply.

    Virtualization is great in that you insulate your application environment from your hardware, and you could do that by running a SINGLE virtual machine on your hardware, and eliminate most of the goofy problems you’re seeing. If you’re running redhat (centos) and want to use free virtualization, I’d consider using Xen over vmware server. It’s a radically more efficient linux-on-linux solution once you get it up and running.

    My recommendation? Just do it the old fashioned way.

    Feel free to contact me if you need, as I may or may not remember to check comments here. I’ve gained enough from your writing over the years to be happy to give back a bit if I can.

  6. Phil,

    Give each Virtual Machine its own Hard Disk.
    Give each Virtual Machine its own network interface and REAL ip address. You will notice a difference.

  7. Craig: We are using the free version of VMware, I believe. We are using software RAID on the Linux box.

    Dragonfly: We can’t give each virtual machine its own hard drive because this is a pizza box with just two drives (mirrored so that they look like one).

    Everyone who says to use something other than CentOS/VMware underneath the virtual machines: Can we do this without wiping the hard drive? If not, I think it would make sense to go back to a known solution (one Linux, multiple users and groups).

  8. I wonder if you could split your primary Linux host partition into two, install Windows on the new partition, copy the images from old partition to new partition (or just mount your extfs partition on Windows), and test to see if it resolves the problem.

    How big are your VMWare images? If you have somewhere to temporarily store them, you could wipe the drive, reinstall the OS, then copy the images back to the drive.

    We run VMWare on Windows Server, haven’t seen this issue.

  9. Any chance you simply have a disk with a bad spot in one of the VMWare files, and the delay is caused by the disk’s firmware doing some heavy duty recovery? A simple dd from the current drive(s) to a set of blank drives (and then booting with the blank drives) would help eliminate that possibility.

    Does the disk activity/lockup occur only with all four VMs running? If you remove the VMs one at a time, can you narrow it down to one specific VM? If so, orphan it (don’t delete it), and create a new replacement for it.

  10. To answer people’s questions:

    The kernel is now passed clock=pit (it was previously running fast). The drift was slow so it remains to be seen if this keeps the drift in check.

    The host is running CentOS 5.1 as are all of the VMs.

    The VMware version is in fact the free VMware Server. Perhaps a box with a single pair of mirrored disks just isn’t beefy enough in the I/O department to handle this load? I can see how this could be problematic if the application was heavy on the disk but the fact that things are grinding to a halt under almost no load (just a couple of ssh sessions) makes me suspicious.

    cthrall: That thread is in fact very interesting. We have seen it and we are using software RAID. Can you offer an explanation for the apparent link between RAID and the performance hit?

    All: If we were to use ESX would it be able to offer us the same RAID capability we get from CentOS? Would it be able to read the VM files from their current home (an ext3 partition on the mirrored disks)?

  11. Ha ha– samir and dragonfly are very helpful but you have to love the irony. VMWare Server allows you to partition a single machine into multiple virtual servers — except you need a separate hard drive and IP for each server. Woops. So basically you’re just sharing the processor and RAM, except it’s probably a multicore CPU so you’re not even really sharing that.

    Maybe you could be careful to only use one server for disk intensive stuff, ie database work. But I like the “just use unix as God intended” way better.

  12. First, I don’t see any particularly good reason to do VMs for this unless the development environments are different in some way. If you have a bunch of different projects, or multiple tiers of software development environments (dev, qa, staging) I’ve found VM setups valuable. In this particular case it seems people wanted to play with fun new stuff, as opposed to picking the right tool for the job.

    Regarding time, I have seen this as well. It’s the VMware Tools software installed in the guest OS that’s supposed to synchronize time between the host and the clients. So a) you need to have VMware Tools installed, and b) you need to have it setup to synchronize the time.

    If you have that and it’s still not working, I haven’t had any issues with having VMs all synchronize themselves over NTP with a known good NTP host (or several) on the local network, or even go out to the Tubes if you have to.

    Regarding the timeouts issue, I’d check the following
    – SATA vs SCSI/SAS drives – SATA is not good for multiple simultaneous access, and when you get more than 3-4 processes running at once will slow to a crawl.
    – hardware vs software RAID – software RAID is likely to be slower, though shouldn’t be giving this level of problem unless you have something misconfigured
    – ask VMware – it’s a free product, but it can’t hurt to ask

  13. Erm, in that previous comment the SATA comment about “3-4 processes running at once” should have read “3-4 processes doing lots of disk I/O at once”.

  14. This is a weird post. I have used VMWare quite a bit and not encountered anything like this.

    Firstly, you don’t provide nearly enough information. What is the OS running on the host server? It needs to be pretty recent to have good performance with VMs of any stripe.

    Secondly, disabling swap on the VMs does absolutely nothing. If the machines are being swapped out, that is being done by the host OS. The fact that these “whiz kids” seemingly didn’t realise this makes me wonder whether they know anything at all about how this all works.

    Thirdly you mention disk a lot but the log file posted mentions the CDROM drive. This makes me immediately suspect either a hardware problem with the CDROM or that the VMWare instances are trying to access it as a disk, freezing while it checks itself and then coming back. This seems the most likely explanation; it does take several seconds for a CDROM to check itself for a CD and return, which fits with the several seconds wait time described.

    My initial diagnosis would be faulty server config, especially regarding the CD drive, and/or faulty IDE hardware. Mix that with an old kernel version and it sounds like that could cause the problems you are seeing.

    The issues with the NTP are harder to guess at but my guess would be host server config, especially if it’s by the same “whiz kids” who can’t see the word CDROM in a log file. Can the VMs ping the outside? Have you installed the vmware tools on them? Is tools.syncTime set to true in the .vmx file? This should all Just Work.

    My suggestion: wipe the server and install a recent linux like centos/RH 5. Why are you so averse to wiping the box? Seems like it would take no time at all compared to the time already wasted. Copy off the vmware containers first if you want to keep them. The ability to easily do that kind of thing is half the selling point about VMs!

    Another option is to use RH5’s inbuilt VM system, although I’d prefer the portability of VMWare, honestly.

    A final option is just to run Windows, like another commenter said – I am not a big Windows fan but can’t deny it runs VMWare very reliably and is easy to set up. I personally ran something very similar to your desired setup on Server 2003 for several years without any problems whatsoever.

    VMWare is usually a fine product and these issues are pretty anomalous. Redo the server, without messing with it, using a new OS, and try again.

    However, since you don’t seem to have any compelling reason to be using VMs at all apart from the idea that they’re new and cool, maybe just going back to a setup you’re familiar with is the best solution.

  15. The VMware Knowledge Base article on clock issues mentions that starting with kernel version 2.6.18 the clock should no longer run too fast. We are running 2.6.18 and even with clock=pit passed to the kernel by GRUB (in the virtual machines) the clock continues to run fast.

  16. Phil,

    IIRC this sounds like an issue I had 3 or 4 months ago. I have to agree with boozedog. As an alternative to pointing each of the VMs to an ISO for the CDROM, run the VMs with the CDROM disconnected. Not knowing more details, we could both be completely wrong. The change suggested shouldn’t take more than a few minutes in the VMware console. If it doesn’t work you haven’t lost anything.

    HTH,

    Stephen

  17. You’re talking to the wrong whiz kids. Get Xen or switch to OpenSolaris zones, if you’re going to muck around with virtual machines.

    But yeah, you haven’t described any functional requirement that suggests virtualization in the first place.

  18. Phil, I would first, disable any virtual CDROM access from within the VMs, as you don’t need it anyways. In fact, strip down each VM instance to the minimum needed (basically RAM, CPU, disk space, network interface).

    Second, I don’t see how software RAID would harm/help anything, that is, I don’t think that is the problem.

    Third, while VMWare Server may not be as fast as ESX, it is currently serving a URL you may find familiar, http://www.openacs.org/ (AOLserver and Postgres), with little problem other than a few teething problems due to migration from another server.

    We are currently running a dual-Opteron with 8GB RAM and it is lightly loaded.

    If problems persist, install Solaris10 and set up zones.

  19. I doubt that you need vmware. The whiz kids just have to create 3 schemas in one database or at worst run 3 oracle instances with different databases. Imho, it’s a bad idea to run Oracle on VM, especially the free vmware server which is slow and not certified. In former days, Oracle would have been configured to use raw devices to bypass the fs layer of the operating system. Now you put a vm layer between the db and the hardware? Seems like a waste of money.

  20. The most likely culprit with respect to the timekeeping issue would be the host CPU speed changing based on load. On our system, disabling cpuspeed (chkconfig cpuspeed off) and rebooting fixed it.

    Here’s how to find out if you have that problem:
    On an idle system:
    cat /proc/cpuinfo
    Look at what it says for “cpu MHz” for all of the cores. You may notice that your super-fast CPU is actually running much slower than what it is clocked for (normally this is a good thing, the system is idle so it consumes less power and produces less heat)
    Now run something like “perl -e '1 while 1'” which will consume 100% CPU on one of those cores
    Now cat /proc/cpuinfo again and you may find that one of those cores is reporting running at the full speed of the CPU.
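
The before-and-after comparison above can also be done by script. Here is a small Python sketch that pulls the “cpu MHz” figures out of two /proc/cpuinfo snapshots and reports whether any core changed speed. The sample snapshots and their MHz values are invented for illustration, and the function names are hypothetical.

```python
def cpu_mhz_values(cpuinfo_text):
    """Extract the 'cpu MHz' figure reported for each core, in file order."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def frequency_changed(idle_text, busy_text, tolerance_mhz=1.0):
    """True if any core's reported speed differs between the two snapshots."""
    idle = cpu_mhz_values(idle_text)
    busy = cpu_mhz_values(busy_text)
    return any(abs(a - b) > tolerance_mhz for a, b in zip(idle, busy))


# Made-up snapshots: core 0 speeds up under load, core 1 stays throttled.
IDLE_SAMPLE = "processor\t: 0\ncpu MHz\t\t: 1000.000\nprocessor\t: 1\ncpu MHz\t\t: 1000.000\n"
BUSY_SAMPLE = "processor\t: 0\ncpu MHz\t\t: 2333.000\nprocessor\t: 1\ncpu MHz\t\t: 1000.000\n"

if __name__ == "__main__":
    print("idle:", cpu_mhz_values(IDLE_SAMPLE))
    print("busy:", cpu_mhz_values(BUSY_SAMPLE))
    print("scaling suspected:", frequency_changed(IDLE_SAMPLE, BUSY_SAMPLE))
```

On a real host, read /proc/cpuinfo once while idle and once while the perl loop is running, and pass the two strings to frequency_changed.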

  21. First, I am using both VMware Workstation (on my Ubuntu desktop box), and VMware Server (on a Ubuntu 64-bit 4GB dual-CPU server box). All are current versions, and all work very well. Since VMware Server allows remote connections, I can connect to the VMs running on the server box from my Windows laptop (when out in the backyard!) or my Linux desktop – very useful.

    But … my main use is to host various versions of Windows for testing and development. For that purpose I need broad support of Windows versions, and VMware is pretty much the best tool.

    Note that the free VMware Server is not the most optimal version for performance. Common disk performance is often faster (via the Linux disk cache), but network performance is slower. For my usage, this is an acceptable trade-off.

    For your use, there is still some point in configuring virtual machines. The ability to completely configure a server for some specific purpose (within a VM), then later move the VM to entirely different hardware with a simple file copy – can be a large time saver. But I would also guess that for your use, trading away performance is not good.

    In your situation – where you are using Unix-on-Unix – I would be looking at the virtualization support that seems to be mainstream in the current version of Ubuntu:

    https://help.ubuntu.com/8.04/serverguide/C/virtualization.html

    I might also consider Solaris, as Sun seems to have excellent support for virtualization. Since the last Sun I used heavily was a (then new) Sun 3 … my first choice is going to be what I use every day (Ubuntu).

    For CentOS you are on your own. 🙂

  22. Stephen: The device mentioned in the log is the hard drive image not the CDROM. Despite that I implemented your suggestion but it didn’t change anything. As you said, can’t hurt to try.

    Andrew: We actually first attempted this with Xen (I do agree that this would be a better solution for Linux on Linux virtualization) but at some point ran into an issue with one of the software packages (can’t remember which). The only solution we could find was to change to a non Xen kernel.

    Patrick: As I said to Stephen the issue does not seem to be with the CDROM (despite the wording of the error message). Removing the CDROM device all together did not solve anything. It’s good to hear that OpenACS is running Server. At least someone (with much more traffic) has managed to make this work in production.

    Chris: I don’t dispute your claim as to the necessity of VMware. To be fair the story of the whiz kids peddling VMware as a gateway drug to the dark world of virtualization could use a bit of tweaking. If my memory serves me correctly the decision to use virtualization came about from a suggestion from Philip.

    That being said I really don’t think that the concept of virtualization is the problem here. Sure a virtualized server isn’t going to be as fast as running on bare metal. Perhaps a bit of background as to this box’s role could prove enlightening. It was conceived as a development server for philip.greenspun.com and http://www.greenspun.com. All it does is provide a playground to test changes before going live and should never have more than 3 or 4 simultaneous users. It’s just a check-in station for our distributed version control. The third VM is a web server for a very simple site. The site is for a company that is just in the concept stage and gets very little traffic. The fourth VM is just a sandbox to play around with Apache and RoR. As you can see we really don’t need “Enterprise Grade” stuff here. Our needs aren’t exactly demanding.

    I can agree with you as to Oracle as well. It is clearly designed to be as close to the hardware as possible so running in a VM might be seen as insulting. That being said the load being placed on this particular Oracle install is so minimal that I just can’t see this being the bottleneck. To make sure this wasn’t the result of Oracle thrashing about I shutdown the Oracle VM. The remaining VMs continue to fill up their log files with the error message. Not sure how this is a waste of money since the only money we spent was on hardware. Maybe it could be argued that we’re getting lower performance than we’d get from doing it the “good ol’ UNIX way” but the higher level of isolation and useful management interface offset that in my opinion as we’re not even approaching the capabilities of this bottom of the line 1U Dell. I really do believe that if we weren’t encountering these configuration problems we’d have more than adequate performance.

    Adam: Good call. I overlooked that as I’d always associated that with laptops. I suppose power management made its way through the product line quite some time ago. Silly me for not thinking about that. Your test was right on the money. The CPU does appear to be scaling. I will disable this and report back on how well the clocks are behaving.

  23. Your “whiz kids” can’t get virtualization to work in two months? And I agree, virtual machines are a waste of resources. If you absolutely must use them, how about getting rid of vmware and using kvm instead. Works like a charm in our CentOS environment and you won’t have to wipe the hard drive either.

  24. Not to be a pain in the rear, but really, do you still have a good reason to buy and maintain your own dev hardware? If you can do dev on EC2, for instance, each developer really can have their own dedicated machine. There are now several levels of machine performance to choose from, and you can use RightScale’s tools (http://www.rightscale.com) to manage the server configurations. Then, not only do you not have to buy and own any pizza boxes, you also don’t have to spend days in sysadmin hell or fooling with VMWare’s foibles.

  25. Sho: I must have missed your post when I replied earlier. Responding:

    The OS info is in fact supplied above. It is CentOS 5.1 for every installation. As for memory I can’t say I agree with you. If the VMs are swapping their own memory that would be generating disk access. Nobody claimed that disabling paging within a VM would affect the paging of the physical host. The VMs themselves are not being swapped out either because VMware has been configured to keep the VMs entirely in RAM and throw an error if it can’t fit them. They are entirely in RAM and the physical host reports 0k swap used. I’m not sure where you got the impression that someone thought that changing settings within a VM would affect the host.

    As for the CDROM thing I also think I’ve been pretty clear about explaining that the error message references ide0:0. This is the hard drive disk image not the CDROM. The physical machine doesn’t even have an optical drive at all. The virtual machines are either pointing at ISOs or have the CDROM drive disabled. The CDROM really has nothing to do with this.

    The kernel version is 2.6.18 which isn’t that old at all.

    Again I believe your NTP diagnosis to be quite off the mark as the host is keeping perfect time which it gets from NTP. The VMs all have VMware tools installed and do sync time but this only works when the VMs have a clock that runs slow. Simply reading the VMware timekeeping whitepaper you will see that it doesn’t help if the clock is running fast. My current belief is that the time issue is related to CPU power management features. I will post an update when I figure it out.

    I’m unsure how to feel about your frequent use of quotes around whiz kids (not that I really like the sound of that term in the first place). If you’re going to come on here and insult us about misreading a log file it would be good to make sure that you actually read the log properly. I really must say that this has nothing to do with the CDROM.

    The VMs can certainly ping the outside. VMware tools is installed. tools.syncTime is set to true. It all doesn’t “Just Work.”

    Running CentOS 5 would be a downgrade as we’re currently running CentOS 5.1. I’m averse to wiping the box because I don’t see how that will help anything. I just don’t see anything in your proposal that would change things were I to wipe the box.

    RH5’s inbuilt system as far as I can tell is Xen. Please correct me if I’m wrong.

    I think Philip would rather run this on punch cards than Windows. It seems strange to me that this can’t be accomplished with a free OS. Not sure how to redo the server since I don’t believe anything has been messed with in the first place.

  26. bryanf: We tried just running NTP in every VM but that didn’t help. The clock gets so skewed that NTP gives up and dies.

    The harddrives are SATA but again the disk load is quite minimal. This isn’t 3-4 processors this is 2 cores which barely touch the disk after startup (when everything gets loaded into RAM).

  27. PS – I realize I misread processes as processors. I still don’t believe we’re anywhere near the capability of the disks. I just can’t believe that you could get Dell to sell you a harddrive that times out for 13 seconds when you ask to read a few blocks just because another process asked first.

  28. I’d do it the old way. You only need virtual machines if you really must have different development machines. I can’t see why that would be needed for a server. Someone may come along and tell you that you shouldn’t run the database, the web server, and so on, on one machine, but what will happen if you do? Nothing, really. If your security requirements are higher, then this might be different, but we still have chroot and the like.

    However, I’ve been using VMware since 1998 with no bigger trouble than installing a few tools required for compiling the modules. VMware itself has never crashed on me, but the Xorg server sometimes has….

    I’ve run Windows XP, Windows 2000 and Windows Server 2003 (both 32 and 64 bit) in VMware Workstation and it simply “works”.

    If you want to be adventurous you may like to give qemu a try 😉

    Good luck
    Friedrich

  29. @John

    I ran into the timing problem like you. As suggested by the other commenters on this post, you should disable cpu scaling and other such things that throttle the speed of your CPU.

    I have a desktop with a dual core AMD CPU, disabling Cool n’ Quiet from the BIOS was one of the things that fixed it for me. Not sure if there is such a setting on your pizzabox.

    I haven’t encountered the ide error from your logs yet. Maybe it’s because I don’t have my data on virtual hard drives. I allocate enough space on the virtual hard drive to install the OS, web server, database server and whatever software I need but I put all the data files (e.g. html, tcl, pg data files) in the host OS.

    My host OS has an NFS server so I just configure my virtual machines to mount the exported directories where my data files are from the host OS.

    Maybe you can give the above setup a try.

    Like you, I use vmware for development, testing and playing around, and I really do a lot of playing around trying new stuff and breaking things, so virtualization is a god send. For me it boosts productivity and lessens the time it takes to set things up which in turn gives me more time to code.

    Given the intention for the pizzabox, I think the “whiz kids” made a sound recommendation going with VMware server.

  30. Philip,
    I initially ran into similar issues with VMware Server 1.05 on Ubuntu 7.10. After quite a bit of research, here are the changes that eventually got things running well for me:

    VMware Console -> Host Settings -> Memory -> Fit all virtual machine memory into reserved host RAM
    CD-ROM in each VM is set to not “Connect at power on”
    VMware Tools scripts are configured to run for all actions

    Added/changed following entries in /etc/fstab:
    devshm /dev/shm tmpfs size=4G 0 0
    usbfs /proc/bus/usb usbfs auto 0 0

    Added following lines to /etc/vmware/config:
    MemTrimRate=0
    sched.mem.pshare.enable = "FALSE"
    mainMem.useNamedFile = "FALSE"
    tmpDirectory = "/dev/shm"
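
A quick way to confirm that a host’s /etc/vmware/config actually contains the four entries listed above is to parse it and compare. This Python sketch assumes the file is a flat list of key = value lines with optional double quotes around the value and # for comments; that format is an assumption based on the lines quoted in this comment, and the function names are hypothetical.

```python
# The four entries quoted in the comment above.
EXPECTED = {
    "MemTrimRate": "0",
    "sched.mem.pshare.enable": "FALSE",
    "mainMem.useNamedFile": "FALSE",
    "tmpDirectory": "/dev/shm",
}


def parse_vmware_config(text):
    """Parse 'key = value' lines into a dict, dropping surrounding quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not an assignment
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_settings(text, expected=EXPECTED):
    """Return the expected entries that are absent or set to a different value."""
    found = parse_vmware_config(text)
    return {
        key: value
        for key, value in expected.items()
        if found.get(key, "").upper() != value.upper()
    }


if __name__ == "__main__":
    sample = 'MemTrimRate=0\nsched.mem.pshare.enable = "FALSE"\n'
    print(missing_settings(sample))
```

An empty result means all four entries are present with the values given in the comment.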

  31. You might want to take a look at the disks. I almost purchased a bunch of the “green” WD 1TB drives until I found out they can act wonky in RAID configurations. Those drives will retry on a read failure for 5-15 seconds before giving up… of course in a RAID setup you’d rather it fail quick and just use the redundant info to recreate the missing data. Turns out WD makes a different version of those disks (RE2-GP) which is optimized for RAID environs, it’s been working great for me:
    http://techreport.com/articles.x/13578

    If this is for research/educational use, you can get ESX standard (two physical CPU) for free, I’d recommend it over Linux/Win + VMWare Server.
    http://www.vmware.com/partners/academic/

  32. To update everybody on the current status:

    I disabled cpuspeed on the physical host and now the VMs appear to run their clocks at the proper speed. The issue that remains is that they appear to be getting the wrong time on bootup. The time they do display seems to be fairly random but generally a few hours and minutes into the future. I thought about perhaps installing NTP in the VMs as well to get a good time on boot and then let VMware tools takeover from there but if NTP sets the time even a second ahead of what VMware tools sees on the host the time will not be synced by VMware. I suppose this would only be an issue if the clock ran fast. Do you guys think it’s safe to assume that the clock will no longer run too quickly and if it runs slow VMware tools will pick up the slack?
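
The question above turns on the one-way behavior described earlier in this thread: the VMware Tools sync moves a lagging guest clock forward but never pulls a fast one back. A toy Python model makes the consequence concrete. The 60 second sync interval and the clock rates are made-up illustration values, not VMware’s actual parameters.

```python
def simulate_guest_clock(rate, seconds, sync_every=60):
    """Return the guest-minus-host offset in seconds after `seconds` of host time.

    `rate` is guest seconds per host second (above 1 means the guest runs fast).
    The sync step only ever moves the guest forward to the host time.
    """
    host = 0.0
    guest = 0.0
    for tick in range(1, seconds + 1):
        host += 1.0
        guest += rate
        if tick % sync_every == 0 and guest < host:
            guest = host  # catch up; a guest that is ahead is left alone
    return guest - host


if __name__ == "__main__":
    for rate in (0.9, 1.1):
        offset = simulate_guest_clock(rate, seconds=600)
        print(f"rate {rate}: offset after 10 minutes = {offset:+.1f} s")
```

In this model a slow guest is corrected at every sync, while a fast guest drifts further ahead without limit, which is why a boot-time clock set even slightly ahead of the host would never be corrected.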

    Moving on to the hard drive problems: I ran diagnostic checks on the drives themselves and they passed just fine. Intrigued by the suggestion that this was due to software RAID, I broke down the RAID and booted the server from one of the drives. The log files no longer contain the error message and, as far as I can tell, the system has stopped freezing for seconds at a time. I think it’s now time to evaluate the situation and decide what to do (we’ll have to decide whether we can live without RAID, whether we want to move to hardware RAID, or whether, despite having figured out this problem, we’d rather wipe everything so that we can use software RAID). Regardless of the direction we take, I think this page serves as a testament to the fact that VMware Server and software RAID on Linux do not play well together. Let this be a warning to others thinking of such a configuration.

  33. I’ll also add that running Oracle from within a VM doesn’t seem to be causing any problems. The performance is a bit lower than if it were running natively, but I’ve yet to encounter any functional issues.

    PS – Turning off the software RAID cut the time needed to execute a 32,000 row update in half.

  34. John,
    I was and still am running software RAID (md RAID 1) in the configuration I mentioned previously. I was seeing the same kind of VM freezes that you described, but for even longer periods of time, before making the changes I mentioned. The other difference between your configuration and mine is that I have configured all of my virtual disks as SCSI (lsilogic controller), while yours are apparently IDE.

    For more background on the changes that I think were key to getting my system running well, see

    http://vmfaq.com/?View=entry&EntryID=25

    and

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=844

  35. I had exactly the same problems with VMware workstation as you described.
    Setting the three parameters:
    MemTrimRate=0
    sched.mem.pshare.enable = "FALSE"
    mainMem.useNamedFile = "FALSE"
    in the vmx-files finally solved the issue.

    The thing that I still do not understand is why and when the problems started. I have been using these virtual machines for 3 or 4 years now, and suddenly (maybe 6 months ago) these problems started. I just cannot figure out when and why.
    Maybe it was the switch to a tickless kernel?

  36. I disagree with changing host OS, hypervisor, etc just for the sake of changing things. CentOS is awesome, and frankly so is VMware Server 1. Granted, I would have picked ESXi…

    I would also check for disk contention. First use a tool like bonnie++ to see what the array can handle for disk I/O. Then, with all 4 VMs under load, run iostat to see whether you’ve hit the practical limit for random reads/writes.
    That will let you know: if I/O is the issue, you either need fewer VMs or faster storage (e.g. more spindles).
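
    If iostat is not installed on the host, the busy percentage it reports can be approximated by hand. The sketch below assumes a Linux host whose `/proc/diskstats` has the usual layout, where the 13th field of each line is the cumulative number of milliseconds the device has spent doing I/O; the function names are made up for illustration.

```python
def io_ms_by_device(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Lines with fewer fields (e.g. partitions on old kernels) are skipped.
        if len(fields) >= 14:
            result[fields[2]] = int(fields[12])
    return result


def utilization(before, after, interval_s):
    """Percent of the interval each device was busy, from two snapshots."""
    first = io_ms_by_device(before)
    second = io_ms_by_device(after)
    return {
        name: 100.0 * (second[name] - first[name]) / (interval_s * 1000.0)
        for name in second
        if name in first
    }
```

    To use it, read `/proc/diskstats`, wait a few seconds while the VMs are under load, read it again, and pass both texts plus the elapsed seconds to `utilization`. A device that sits near 100% for long stretches is most likely the bottleneck.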

  37. VMWare can be a great choice, if you’re doing something that it provides extra leverage for.

    You’re clearly not doing anything for which VMWare is a help. Therefore, you are (as you clearly suspect) just wasting computrons on it, and adding unnecessary complexity as well.

    VMWare is great if you have a proper computer on which you need to run a couple of applications that only work with a garbage OS. (I’m using it because my fiancee is dependent on a Windoze application. I run VMWare on one of my servers, with TightVNC running inside the virtual machine, and she can connect to it and do her thing whenever she needs to. And when the Windoze instance gets gorked, as they all inevitably do eventually, I can just restore a snapshot I took shortly after I finished the configuration, and everything is fine again.)

    VMWare is less great than that (but still very much a viable option worth considering) if you’re virtualizing a large distributed application in such a way as to provide for more dynamic resource allocation and more effective redundancy. If you’re devoting a meaningful fraction of a data center to your application, and wondering about whether you might be able to structure it more efficiently by using virtualization, the answer is probably “yes”, and VMWare can be a good part of the response.

    But if you’re only throwing one physical server at the problem, and the problem itself is one that is implemented entirely in a modern unix, there’s absolutely no benefit to VMWare that can’t more easily be achieved by a lighter-weight solution that doesn’t require heavyweight resource partitioning and virtualization overhead. If you think it can be done with standard unix filesystem tools, it probably can, and that’s probably the best way. If you think it can’t, you’re probably wrong. And if you KNOW that you can’t, you’re probably still better off with something like UML under Linux or jail under FreeBSD.

  38. In my experience, running Oracle or another DB application on virtual infrastructure causes delays, and performance is bad.
    I’d put no more than 1 virtual server per physical platform.
    In that case, use VMware for backup, not to reduce hardware.
