September 2nd, 2016 09:00

ScaleIO - SDS on Windows 2012 R2 - Problem after complete power outage simulation

Hello,

We are evaluating ScaleIO on two different infrastructures: on the first, ScaleIO is installed on Windows 2012 R2; on the second, on Ubuntu 14.04.

What is relevant for this post is the SDS role.

While simulating an unexpected power outage, we noticed a major issue on the SDS nodes installed on Windows 2012 R2 when power came back.

This is a description of the problem.

We have 5 SDSs installed on 4 different Windows 2012 R2 servers. To every SDS we assigned a 100 GB disk initialized with a GUID partition table (GPT), left unformatted, with a drive letter assigned to it.

This is how the disk looks in Windows Disk Management after setup:

sds_disk.JPG.jpg

We simulated a power outage in which all nodes lost power simultaneously. When the SDSs came back, we noticed that only one SDS was working correctly, while the other 3 nodes were unavailable because ScaleIO couldn't access their disks.

In Windows Disk Management the disk appears as "Not Initialized", and a pop-up immediately asks whether we want to initialize the disk (previously used by the SDS). It seems that the partition table has been lost.

This is how the disk looks in Windows Disk Management after the power outage:

sds_after_power_outage.JPG.jpg

On Linux all the tests ran smoothly: after the power outage ScaleIO came back as if nothing had happened.

Has anyone noticed similar behaviour with ScaleIO on Windows?

Thanks in advance,

Davide

September 3rd, 2016 09:00

Can you please download Sector Inspector (SecInspect.exe) from the official Microsoft Download Center, save it to a temp folder on the C: drive, and run the following command:

c:\temp>secinspect.exe > secoutput.txt

Then upload the output file here, and let us know the physical disk number. We can check the disk's MBR information from this output.

FYI, I have seen issues in the past where, after a power outage on a Windows server, the MBR on some disks is zeroed out. It can happen on physical hardware as well as on VMs, and it is unrelated to ScaleIO.
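For readers who want to check for a zeroed MBR by hand, here is a minimal Python sketch. It is not what Sector Inspector does internally; it only classifies the first sector of a disk as all zeros, as carrying the standard 0x55AA boot signature, or as something else. The function names and the 512-byte sector size are assumptions made for illustration.

```python
SECTOR_SIZE = 512             # assumption: the disk uses 512-byte logical sectors
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a valid MBR sector


def classify_sector0(sector: bytes) -> str:
    """Classify the first sector of a disk (hypothetical helper)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError(f"expected {SECTOR_SIZE} bytes, got {len(sector)}")
    if sector == bytes(SECTOR_SIZE):
        return "zeroed"             # every byte is 0x00
    if sector[510:512] == BOOT_SIGNATURE:
        return "signature-present"  # looks like an MBR or a protective MBR
    return "unknown"                # non-zero data without the boot signature


def read_sector0(path: str) -> bytes:
    """Read the first sector from a raw device or from a disk image file."""
    with open(path, "rb", buffering=0) as disk:
        return disk.read(SECTOR_SIZE)
```

On Windows a raw disk can usually be opened from an elevated prompt with a path such as `\\.\PhysicalDrive1` (the disk number here is only an example), so `classify_sector0(read_sector0(r"\\.\PhysicalDrive1"))` would report the state of that disk's first sector. The sketch only reads and never writes to the disk.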

September 2nd, 2016 12:00

Did you power on the MDMs in the ScaleIO cluster first, or the SDSs first? If you power on the node with the Primary MDM first, do you still see the same behavior?

Also, did you verify that the SDS service was running after boot?


September 2nd, 2016 14:00

Hello SanjeevMalhotra,

I powered on all nodes at the same time, so, since it's a 4-node ScaleIO cluster, there is only a small chance that the first node to come up was the primary MDM.

I checked the services and the SDS service was running after boot. I also tried restarting the service manually, but if the backend disk no longer has a drive letter, because the partition table was lost at the Windows level, the SDS service can't use it as a backend disk.

I want to point out that the disk that became unavailable after the power outage, the one that lost its partitioning, is not the volume mounted on the SDC but the backend disk on 3 SDSs (out of 4). That is strange: the problem seems related to a Windows issue with RAW disks rather than to ScaleIO. I suppose ScaleIO has nothing to do with the partitioning of the backend disk; it uses the RAW disk to write blocks to. The initialization and the drive letter assignment only serve to attach the disk to the SDS.

What led me to write to the EMC community is that I lost partitioning only on the backend disks used by ScaleIO, while the OS disk kept its partitioning. But maybe the problem is due to a bug in Windows with initialized but unformatted disks (maybe in some scenarios they can lose their "initialization" after a power outage).

Thanks,

Davide


September 3rd, 2016 18:00

Hello SanjeevMalhotra,

I can confirm what you said. I downloaded Sector Inspector and analyzed the output: the MBR is zeroed out after the power outage. I had some trouble replicating the issue because it isn't systematic: today I had to simulate 6 power outages before it happened again.
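Because the backend disks in this thread were initialized with a GUID partition table, sector 0 would normally hold a protective MBR (a partition entry of type 0xEE) and the primary GPT header would sit in the next sector, starting with the ASCII signature "EFI PART". The sketch below shows one way to tell whether only the protective MBR was lost or the GPT header as well. It assumes 512-byte logical sectors, does not validate the GPT checksums, and uses function names invented for illustration.

```python
SECTOR_SIZE = 512             # assumption: 512-byte logical sectors
GPT_SIGNATURE = b"EFI PART"   # first 8 bytes of a GPT header
PROTECTIVE_TYPE = 0xEE        # partition type used by a protective MBR


def has_protective_mbr(sector0: bytes) -> bool:
    """True if sector 0 has the boot signature and a type-0xEE entry."""
    if sector0[510:512] != b"\x55\xaa":
        return False
    # Four 16-byte partition entries start at offset 446;
    # the partition type is byte 4 of each entry.
    return any(sector0[446 + 16 * i + 4] == PROTECTIVE_TYPE for i in range(4))


def has_gpt_header(sector1: bytes) -> bool:
    """True if the sector starts with the GPT header signature."""
    return sector1[:8] == GPT_SIGNATURE


def diagnose(first_two_sectors: bytes) -> str:
    """Summarize the state of the first two sectors (hypothetical helper)."""
    if len(first_two_sectors) < 2 * SECTOR_SIZE:
        raise ValueError("need at least two sectors of data")
    sector0 = first_two_sectors[:SECTOR_SIZE]
    sector1 = first_two_sectors[SECTOR_SIZE:2 * SECTOR_SIZE]
    mbr = has_protective_mbr(sector0)
    gpt = has_gpt_header(sector1)
    if mbr and gpt:
        return "protective MBR and GPT header both present"
    if gpt:
        return "GPT header present, protective MBR missing"
    if mbr:
        return "protective MBR present, GPT header missing"
    return "neither protective MBR nor GPT header found"
```

Feeding `diagnose` the first 1024 bytes of an affected disk (or of a disk image) would show which of the two structures survived, which may help narrow down what was actually overwritten during the outage.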

I should add that the test rig we are using to test ScaleIO can't be considered high end: all ScaleIO roles are installed on 4 Hyper-V virtual machines.

The production scenario will be different: we are planning to build a disaggregated storage block (not hyperconverged) with 5 ScaleIO nodes on physical Cisco UCS servers. We are planning to build an all flash array with every single disk in RAID 0 (to take advantage of controller memory cache).

We would like to use Windows because even inexperienced technicians can manage the LSI array (adding new disks, for example) from the LSI controller GUI, while managing it from Linux can be quite difficult because of MegaCLI's exotic syntax (it is only for experienced users).

Right now, because of this Windows issue, we are considering going into production with Linux, but we will test ScaleIO on Windows again on the UCS servers so that we can make an informed decision about the best OS for our specific environment.

Thanks a lot for your support,

Davide
