Monitoring Direct Attached Storage Under ESXi

One of the first things I wanted to do with my ESXi lab box was to simulate a hard drive failure to see what alarms ESXi would raise. This exercise doesn’t serve much purpose in the “real world”, where ESXi hosts are likely to be using shared storage in all but the most esoteric of installations, but since my lab box isn’t using shared storage I wanted to make sure I understood the behavior of ESXi during a drive failure. This post is also a guide to my future self should a drive fail for real :-).

Scenario

My home ESXi box has two drives in a mirror set connected to an LSI 9260-4i. These tests were all done with the default LSI CIM provider that comes with ESXi 4.1u1. I simulated a drive failure by pulling out one of the hot swap drive trays.

T=0: Everything Normal

This is the view within the vSphere Client when there are no failures. Drives 0 and 1 on enclosure 252 (the LSI 9260-4i) are both ONLINE; the RAID 1 logical volume is OPTIMAL.

VMWare Sensors All Green

T=1: Drive Fails

Here I’ve pulled a drive out of the enclosure.

VMWare RAID Degraded

Notice Drive 1 is no longer showing and that the RAID 1 logical volume is showing a DEGRADED state. Also notice how the yellow status indicator rolls up at each level so you can see there is a fault without even drilling down all the way.
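The roll-up behavior described above amounts to "each level shows the worst status found beneath it". Here is a minimal sketch of that idea in Python. The `Health` and `Node` names and the tree layout are my own illustration, not anything ESXi exposes, and the real sensor tree has more levels than this.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Health(IntEnum):
    """Ordered so that a higher value means a worse condition."""
    GREEN = 0
    YELLOW = 1
    RED = 2


@dataclass
class Node:
    """One row in a (hypothetical) hardware status tree."""
    name: str
    own: Health = Health.GREEN
    children: List["Node"] = field(default_factory=list)

    def rolled_up(self) -> Health:
        """Return the worst health at this node or anywhere beneath it."""
        return max([self.own] + [child.rolled_up() for child in self.children])


def example_tree() -> Node:
    """Build a tree resembling the T=1 screenshot: one drive gone, volume degraded."""
    drive0 = Node("Drive 0 on enclosure 252: ONLINE")
    volume = Node("RAID 1 logical volume: DEGRADED", own=Health.YELLOW)
    storage = Node("Storage", children=[drive0, volume])
    return Node("Host", children=[storage])


if __name__ == "__main__":
    host = example_tree()
    print(host.name, host.rolled_up().name)
```

Because the statuses are ordered, a single `max` at each level is enough for the warning on the logical volume to surface at the top of the tree.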

T=2: Drive is Replaced

The drive has now been plugged back into the enclosure.

VMWare Drive Replaced

The LSI card does not automatically rebuild the mirror onto the newly replaced drive. The drive is put into the UNCONFIGURED BAD state and requires manual intervention to initiate a rebuild. The LSI CIM provider that comes with ESXi 4.1u1 offers no way to initiate an array rebuild (or do any array maintenance, for that matter), so a reboot into the LSI BIOS is necessary.

MegaRAID Drive Replaced

Even though the drive has been physically replaced, the BIOS shows that there is a “PD Missing” on backplane 252, slot 1.  By switching to the “physical view” and selecting the drive that’s shown as “Unconfigured Bad”, the drive can be changed to “unconfigured good” by marking the radio button and clicking Go.

MegaRAID Physical View

MegaRAID Make Unconf Good

Now that the drive is in a “good” state, it can be added into the array by marking the radio button beside Replace Missing PD and hitting Go.

MegaRAID Replace Missing PD

After that, choose to Rebuild Drive and away you go.

Back in the vSphere Client, the host status now shows the drive in a REBUILD state and the RAID volume in a DEGRADED state. Once the rebuild is complete, everything goes green again.

Proactive Monitoring

The vSphere Client has no capabilities for generating an alarm via email or SNMP when a hardware fault occurs. You have to fire up the client and inspect the hardware status manually or employ a tool that uses the API within ESXi to poll the hardware status and generate its own reports.
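As a rough illustration of what such a polling tool would do once it has the sensor states in hand, here is a small Python sketch. It assumes the readings have already been fetched into a plain dictionary; how you fetch them (CIM, the vSphere API, something else) is deliberately left out. The function names, the sample readings, and the email addresses are hypothetical, and the list of healthy states is taken only from the states seen in this post.

```python
from email.message import EmailMessage

# States this post treats as healthy. Anything else (DEGRADED, REBUILD,
# UNCONFIGURED BAD, ...) is reported. This set is an assumption.
HEALTHY_STATES = {"ONLINE", "OPTIMAL"}


def find_faults(readings):
    """Return only the sensors whose state is not in HEALTHY_STATES."""
    return {
        name: state
        for name, state in readings.items()
        if state.upper() not in HEALTHY_STATES
    }


def build_alert(host, faults, sender, recipient):
    """Build (but do not send) an email message listing the faulty sensors."""
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] {len(faults)} storage sensor(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{name}: {state}" for name, state in sorted(faults.items())]
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    # Hypothetical readings resembling the T=1 screenshot.
    readings = {
        "Drive 0 on enclosure 252": "ONLINE",
        "RAID 1 logical volume": "DEGRADED",
    }
    faults = find_faults(readings)
    if faults:
        print(build_alert("esxi-lab", faults, "esxi@example.com", "me@example.com"))
```

Actually sending the message (for example with `smtplib`) and running the check on a schedule are left to whatever fits your environment.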

41 thoughts on “Monitoring Direct Attached Storage Under ESXi”

  1. I have an Intel RAID controller. Two days ago one disk in my RAID 10 failed, and I used Replace Missing PD. What happened is that the disk with ID 0 moved to port 1 and the disk with ID 1 moved to port 0, because I removed the cable from the failed disk, plugged the same cable into the new disk, and selected the Replace Missing PD option. All 4 disks then came online, but ESXi still does not boot. I also tried a consistency check with data protection. How can I get the old RAID working again, and what further steps are required?

    1. Hi Ashwin,

      I’m really not sure on that one. I would try asking in the Intel support forums or even contacting Intel directly if that’s an option they provide.

  2. I found that LSI makes this controller, the 9260-4i, which can use the MegaRAID utility, and they also make a controller with the same hardware called the 3ware 9750-4i.
    The big difference? the 3ware model has firmware that lets you manage the RAID from a web interface, no rebooting into BIOS. Plus it will email you an alert if a drive fails.
    I have already sent back one 9260-4i in exchange for the 3ware 9750-4i and I am contemplating doing the same with the other 9260-4i.
    To me it’s mind-boggling that LSI offers both products when this monitoring and notification is so critical and Megaraid software can’t even support it in VMWare or Ubuntu!

    1. I found out today that ESXi does not (yet) have native support for the LSI 3ware 9750 controller; the drivers need to be compiled into the ESXi installation…

      1. Joe, thanks for posting this info. Would you mind posting where the drivers are located and a link to any info you have on how to install them?

  3. Hi, we have just set up an EXP2512 using a RAID controller (MR10m) linking it to a server.
    But for some reason all 9 hard drives are coming up as “Unconfigured Bad”.
    It is new and hasn’t been configured before, but we can’t change it to Good.
    We have used the CLI and the WebBIOS, but no luck.
    I know this may not be the right page for this, but any help or advice on where I could find help would be appreciated.

    Cheers

  4. Hi, I started a rebuild of the disk, but the progress bar has stayed at 0% for almost 3 hours. What can I do about this? Can I abort the rebuild process?

  5. Hi, excellent article!
    Is there a way to get information on the physical disks in the RAID (like S.M.A.R.T., S/N, …) with the vSphere Client GUI or the CLI?
    Thanks!

    1. Hi Paul, and thanks.

      I looked pretty hard for a way to see information on the individual disks but was unsuccessful. If you find anything, please post the information back here!

  6. Many thanks!!! Your article saved me time. I did it the way you explained and now my drive is rebuilding. Let me see what happens.
    Thanks to Hardforums they guided me here.

  7. I have my drive in the rebuild state and I think it will take quite some time depending on the data size. I will wait, and I think the beep will go away along with the amber light.

    I cannot access my CIMC through a web browser, though I can ping the CIMC address. When I restart the server it spends more than 10 minutes on the testing and configuring hardware screen, and then after the next few screens it gives me a CIMC address and MAC error. Any idea what needs to be done?

    1. Hi. Do you have support on that server? I would put a call into TAC (cisco.com/support) and get them to diagnose the issue for you.

  8. I have a UCSC-C220-M3SBE server with VMware ESXi 5.1.0, 1065491 installed, running a CUCM subscriber that has hung twice within a month. I keep getting these messages from the vSphere Client, on average every 4 days or so. How do I stop these messages? Is there any relation between the subscriber hanging and these messages?

    Lost access to volume 521d7b5d-3611a8a0-3ad8-b0faeb975d1e (datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

    After 1 minute I will get another message:
    Successfully restored access to volume 521d7b5d-3611a8a0-3ad8-b0faeb975d1e (datastore1) following connectivity issues.

    1. Hi. Yeah absolutely that could cause issues with UCM.

      Is datastore1 on a SAN/NAS? That error message reads as if the connectivity between ESXi and the remote storage is being interrupted. That’s bound to cause bad things to happen on the VMs. I would troubleshoot this as a connectivity issue and not as a UCM issue.

  9. Hi, great article I found it very helpful but I am still having issues that you might be able to help with. I have a Dell PE715 running esxi4.1. A HD failed in a DAS unit attached to that server. I replaced the failed drive, but I cannot tell if it is rebuilding or not. How can I tell if it is rebuilding?

    I see in the hardware status that there are some warnings in Storage saying Partially Degraded, but nothing saying rebuilding.

    Any ideas?

    1. Hey Michael. That sounds like the same thing I see with the card in my server at home too. I had to reboot and go into the card’s firmware to see the actual status of the rebuild. Since it’s a Dell box, does it have an on-board administrator/DRAC? That interface would typically tell you the status of the hardware.

  10. Hi Joel,

    I did get into the DRAC but it didn’t contain any RAID info. I’ve tried just about everything I can think of to view that info and it seems the only way to do that is to reboot and go into the RAID properties of the card through the boot screen.

  11. Hello friend, I have a problem with an IBM System X3300 m4 server. The orange “check log” LED is lit, and the RAID 0 disk has an orange LED as well. The hard drive is not recognized. I want to recover the data because Windows Server 2008 R2 will not start. Thanks for your help.
    PD missing : Enclosure 65: slot 255.

    1. Hey Jhon,

      You should get on-site help with that. I wouldn’t recommend soliciting help on a blog. If you have support from IBM, call them.

  12. I continue to get “PD Missing” and “Failed to start operation on drive” on a Lenovo TD200 ThinkServer and I made sure to replace the drive with a Lenovo–same FRU. Have tried it with a similar drive too, light in front of unit stays amber no matter what I try.

      1. Hm, no we don’t, but it’s the same MegaRAID software at reboot that you are using, and it fails on “Make Unconfig Bad”. What else could the problem be?

        1. Can you be more explicit when describing the problem? What are the exact steps you’re following? What are the precise errors/outputs you see? Screenshots would really help a lot too.

    1. I agree with David’s response: you’ve got to keep troubleshooting. I understand that you’ve tried some things, but if they had no effect you’ve got to keep trying until you find the faulty part. Replace the cable. Try a different port/slot. Think of the most unlikely cause and troubleshoot that.
