
9 Posts


July 18th, 2016 08:00

Small file (55KiB) Mirroring instead of FEC

Hello,

Please see my previous answered question for a bit of history:

Best, most efficient, protection level for small files - 55KB

After the help and feedback I received from the community and my EMC reps, we are looking at setting the protection level to 2x mirroring on specific directories to cope with our overhead issues. Before we pull the trigger, I wanted to run it by the experts here.

For example:

In a 6-node, single-pool NL cluster (8 TB drives), the DPL will be set to +3d:1n1d.

Specific directories with billions of small files will be set to 2x mirroring

Since everything is replicated to an alternate site with the same protection scheme, would I be sufficiently protected from data loss?

Is a small file guaranteed to be placed on 2 different nodes and not just on 2 different disks?

After a node outage, will the files be created/copied to a different node to keep the 2x status?

If both copies of the file are gone (double node outage), will the system try to recover from the replicated data set?

Is this feasible?
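For context, the back-of-envelope math pushing us toward mirroring looks roughly like this. The 8 KiB block size, the "small files end up effectively mirrored at the FEC-equivalent level" behavior, and the ~4x figure are my understanding/assumptions rather than official numbers, and metadata overhead is ignored -- please correct me if any of it is off:

# Back-of-envelope only -- not an official OneFS sizing formula.
# Assumptions (my understanding, please correct me if wrong):
#   - 8 KiB filesystem block size
#   - a file smaller than a 128 KiB stripe unit can't be FEC-striped
#     efficiently, so at +3d:1n1d it effectively ends up ~4x mirrored
#   - metadata/inode overhead is ignored
import math

BLOCK_KIB = 8
file_kib = 55
data_kib = math.ceil(file_kib / BLOCK_KIB) * BLOCK_KIB   # 7 blocks = 56 KiB

for label, copies in [("+3d:1n1d (small file, ~4x)", 4),
                      ("3x mirroring", 3),
                      ("2x mirroring", 2)]:
    on_disk = data_kib * copies
    print(f"{label}: {on_disk} KiB on disk ({on_disk / file_kib:.1f}x logical)")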

Regards,

Oz

125 Posts

July 18th, 2016 17:00

> In a 6-node, single-pool NL cluster (8 TB drives), the DPL will be set to +3d:1n1d. Specific directories with billions of small files will be set to 2x mirroring.

> Since everything is replicated to an alternate site with the same protection scheme, would I be sufficiently protected from data loss?


Going from +3d:1n1d to (essentially) +1n is a huge difference, and those directories will be at risk.  Two drive failures (within the same failure domain) mean data loss.  Since you have the data copied to a second location, perhaps one could argue that the files really aren't 2x but 3x.  That's mostly true; however, dealing with data loss on a cluster isn't fun and can cause other issues not directly related to the missing data, e.g. Job Engine jobs that fail when they encounter inaccessible files, workloads that expect files to be in a certain place (and fail when they're not), etc.  So it's not necessarily as easy as "just restore the damaged file from the DR site".

Since you said you spoke with your account team, what did they say about the supportability of such a plan?  Are you sure that Isilon Support will help you in the event you do actually lose data (I'm not saying you will, but let's think worst case)?  Will they help clean up the filesystem?  I don't know these answers nor your threshold for risk, but you should fully understand the ramifications of doing what you're suggesting.

> Is a small file guaranteed to be placed on 2 different nodes and not just on 2 different disks?

Yes.  2x, or +1n, guarantees you can lose a single node without losing data, which means the two copies must be placed on different nodes, not just different disks.

> After a node outage, will the files be created/copied to a different node to keep the 2x status?

Yes, this is what the FlexProtect Job Engine job does.  Understand, though, that this is a cluster-wide job and while it's running (scanning your billions of files) you're in a "window of risk" where a second device failure means data loss (for your under-protected data).
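To put a rough shape on that window, here's a purely illustrative sketch. The AFR, pool size, and repair time are made-up example numbers; plug in your own:

# Rough "window of risk" sketch -- illustrative numbers only.
# Assumes independent drive failures spread evenly over the year.
afr = 0.03           # assumed 3% annualized failure rate per drive
pool_drives = 36     # drives in the affected disk pool (example)
repair_days = 5      # assumed FlexProtect runtime over billions of files

p_one = 1 - (1 - afr) ** (repair_days / 365)    # one given drive fails in the window
p_any = 1 - (1 - p_one) ** (pool_drives - 1)    # any surviving drive in the pool fails
print(f"~{p_any:.1%} chance of another drive failure in that pool during repair")
# Note: that's the chance the pool degrades further, not the chance a
# specific 2x file is lost -- but at 2x it only takes the wrong drive.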

> If both copies of the file are gone( double node outage )

To be clear here, you have data loss when 2 nodes OR 2 drives fail (assuming those drives are in the same failure domain [disk pool]).  Drive failures are much more common than node failures.  So don't think that this is just about node failures.

> will the system try to recover from the replicated data set?

No.  At 2x with two failures, there are parts of the filesystem that are lost.  Fortunately your directories should all be intact, since we protect directories one level higher than the protection policy (so directories will be 3x).  But any file data within that 2x directory tree that had data on both of the failed devices will be inaccessible.  OneFS will not be able to repair these files, and will NOT take any action to restore them either from tape backup or the DR cluster.  Both of those processes would require manual intervention by you.


1.2K Posts

July 19th, 2016 01:00

>> After a node outage, will the files be created/copied to a different node to keep the 2x status?

> Yes, this is what the FlexProtect Job Engine job does.  Understand, though, that this is a cluster-wide job and while it's running (scanning your billions of files) you're in a "window of risk" where a second device failure means data loss (for your under-protected data).

Plus the time between the occurrence of the outage and the start of the FlexProtect job!

Which can be arbitrarily long, AFAIK, because there is no timeout after which OneFS decides to give up on an offline node and start rebuilding the protection; unlike the behavior for drive failures. With node outages it is always required for an admin and/or support to check the cluster and either fix the node in question (could be "just" a crash, a network issue, or a failure of a replaceable part) or otherwise confirm the final loss of that node, and then issue a stopfail action.

The earlier the node gets investigated and eventually stopfailed, the earlier FlexProtect will kick in...
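If your monitoring doesn't already shout when a node stops answering, even a trivial external reachability check shortens the "nobody noticed" gap. A completely generic sketch, nothing Isilon-specific, and the hostnames are placeholders:

#!/usr/bin/env python3
# Generic node-reachability check -- nothing Isilon-specific here.
# Hostnames are placeholders; wire the alert into whatever you use.
import subprocess

NODES = ["isilon-node1.example.com", "isilon-node2.example.com"]  # placeholders

def is_reachable(host: str) -> bool:
    """Single ICMP ping; returns False if the host doesn't answer."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

down = [n for n in NODES if not is_reachable(n)]
if down:
    # Replace the print with an email/page so someone can investigate
    # and, if needed, stopfail the node promptly.
    print(f"ALERT: unreachable nodes: {', '.join(down)}")

Anything that pages a human faster means the investigate/stopfail decision, and thus FlexProtect, can happen sooner.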

Cheers

-- Peter

125 Posts

July 19th, 2016 07:00

> Which can be arbitrarily long afaik, because there is no timeout after which OneFS decides to give up on an offline node and to start rebuilding the protection; unlike the behavior for drive failures.

Exactly right and good point.  If you somehow didn't notice the failure of an entire node for a week, your cluster is going to sit there degraded for a week waiting for someone to take action.  This just increases the window of risk.

In OneFS, there is NO automatic repair job started when a node is lost.  It requires manual intervention.  In the past, OneFS did have a concept of "down for time", which was essentially a timeout after which FlexProtect would start in the presence of a down node.  This didn't work well in practice given the transient nature of some node failures (plus maintenance, etc), and ended up causing more repair work to get done (initial repair, plus the "un-repair" when the node was returned to the group).  Additionally, with newer nodes having swappable journals and with disk-tango a more commonplace function, actually fixing a node and returning it to service is more realistic nowadays.

9 Posts

July 20th, 2016 08:00

Thank you Kip and Peter.

I have set up another meeting with my EMC reps to discuss....

From the info presented above, I will be speaking with them about a 3x local and 2x remote mirroring scheme. I believe that will take care of some of the concerns and still give me some needed capacity relief.

"To be clear here, you have data loss when 2 nodes OR 2 drives fail (assuming those drives are in the same failure domain [disk pool]).  Drive failures are much more common than node failures.  So don't think that this is just about node failures."


I'm a bit confused by the above statement. If I run 3x mirroring on specific exports in a 6-node FEC pool and I experience 3 drive failures, what's the likelihood that a particular file is on those 3 specific disks, or does that even matter? How would things be different if I had a 20-node pool with the same setup?


Thanks,

Oz

125 Posts

July 20th, 2016 11:00

> I'm a bit confused by the above statement. If I run 3x mirroring on specific exports in a 6-node FEC pool and I experience 3 drive failures, what's the likelihood that a particular file is on those 3 specific disks, or does that even matter? How would things be different if I had a 20-node pool with the same setup?


The number of nodes really doesn't matter in this example.  I was talking about disk pools, which take (at most) 6 drives per node and span "across" the nodes in the cluster (up to some node count, which I believe is 20).  So in a chassis with 36 HDDs, the drives are split into six 6-drive groups, giving you six disk pools.  With 6 nodes in the cluster (using my 36-HDD example), each disk pool contains 36 HDDs (6 HDDs from each chassis, 6 nodes total).
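If it helps to see the carving-up as arithmetic, here it is for the example above (the numbers come straight from that example; check them against your actual node pool layout):

# Disk-pool arithmetic for the 6-node / 36-HDD-per-node example above.
nodes = 6
drives_per_node = 36
drives_per_node_per_pool = 6   # (at most) 6 drives per node per disk pool

pools = drives_per_node // drives_per_node_per_pool   # -> 6 disk pools
drives_per_pool = nodes * drives_per_node_per_pool    # -> 36 drives each
print(f"{pools} disk pools, {drives_per_pool} drives in each")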


When OneFS writes a file, it writes it to one of the disk pools -- the file won't span disk pools.  So in the failure case when running 3x, where you have 3 copies of the file on 3 different disks on 3 different nodes, you'd have to be unlucky enough to lose each of those specific drives in that disk pool in order to lose data.  If one of the three failures was in another disk pool, then technically your data is still protected, since you have 2 failures in one disk pool and 1 failure in another disk pool.
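To put a very rough number on "unlucky": assuming exactly three simultaneous drive failures, all landing in the same 36-drive disk pool that holds your 3x file, the combinatorics alone (ignoring failure timing and any FlexProtect repair between failures) look like this:

from math import comb

pool_drives = 36
failures = 3     # three concurrent failures within that one disk pool
                 # (the 3x file lives on 3 specific drives in the pool)

ways = comb(pool_drives, failures)   # all possible failure triplets
p = 1 / ways                         # chance the triplet is exactly your 3 drives
print(f"1 in {ways:,} (~{p:.3%}) for a given file")

Of course, with billions of files and only a few thousand possible drive triplets per pool, nearly every triplet will hold some files, so in practice the question is whether any three drives in one pool fail close enough together at all.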


I don't really have a "likelihood feeling" for you on risk, unfortunately.  Personally, I'd still want to be really sure before running 2x on my data, regardless of disk pools.  There are very, very few cluster configurations where 2x/+1n is actually recommended from an MTTDL standpoint (where disk pools are taken into consideration), and none of those configurations use the denser drives.  Just be fully aware of all the ramifications of your choice.

9 Posts

July 20th, 2016 11:00

Thanks, that clears it up for me.

Oz
