*REALLY* Failed 6010 array

Question

Hello all, we are trying to fire up a used 6010 array and it seems to be stuck in a boot loop. Needless to say, it's not on support. We have tried pulling out enough drives to fail it, pulling each individual power supply, and running with a single controller (more on this in a moment); every time it gets stuck in a boot loop. If you time it right you can get to a login prompt and log in, and I've even managed to START a reset before it crashes and reboots. I've also tried the password reset via the boot prom. I'm wondering if there are other options to completely wipe the configuration / reinstall the OS from the boot prom side of things. Thanks

Here is what the log spits back on the primary controller:

#########################################################
# #
# Dell (tm), Inc. Storage Array #
# Copyright 2001-2009 #
# Part=70-0300 Rev=A05 SN=XXXXXXXXXXXXXX ECO=C00 #
# #
#########################################################
Bootloader Version 2.3.3 (SWINT Rev:1)
Compiled on Wed Mar 31 16:37:31 EDT 2010
(type h for help)
Enter Ctrl-P for boot prompt

Executing bootcmd0 [dload sd primary/eqlstor.gz]
0X80231000/10400640 0X81a6bb20/35510764 entrypt 80231000

Executing bootcmd1 [run]
cpu_online_map=ffff, userapp_cpu_map ffff
psb_os_active_mask=0, psb_os_mask=0
boot1_info: userapp_cpu_map=ffff, psb_os_cpu_map=0
cpu_online_map = 0xffff
Jumping to the application... 0x80231000
------------------------------------------------------------
Preparing ffff bitmask of cpus to run
No network device to cleanup
count = 16, total = 16
All slave cpus (16) ack'ed userapp init
count = 4, total = 4
All slave cpus (4) ack'ed message ring init

Dell, Inc. Storage Array

Copyright 2001-2010 Dell, Inc.
SP:12.51:mips_pss_init.c:296:INFO:28.2.107:Control module in slot 1 with serial number XXXXXXXXXXXX
is designated as active.
SP:1434998416.01:ppool_nvram.c:330:ERROR:15.4.1:NVRAM contains valid data. This is a PANIC RECOV
ERY due to a panic on a NetBSD processor.
SP:1434998416.04:ppool_nvram.c:185:ERROR:15.4.5:Saved CPU registers, CPU 12
at 0000000000000000 v0 0000000004010000 v1 ffffffffbef08010
a0 0000000000000104 a1 0000000000000000 a2 0000000000000104 a3 ffffffffd25f7a48
t0 0000000000000001 t1 ffffffffd25f7a48 t2 0000000027bd0000 t3 000000000000000c
t4 000000000000002b t5 000000000000003f t6 0000000003bf0000 t7 0000000003ff0000
s0 ffffffff808ed850 s1 ffffffffd25f7a48 s2 ffffffff808ed850 s3 0000000000000001
s4 ffffffffc003b7a0 s5 ffffffffc003ba58 s6 0000000000000001 s7 0000000000000032
t8 ffffffffbfffffff t9 ffffffffd6d70000 k0 ffffffff804d0bfc k1 0000000000000000
gp ffffffff8003e000 sp ffffffffd25f78c0 s8 ffffffffc0169380 ra ffffffff804c1a1c
SP:1434998416.07:ppool_nvram.c:189:ERROR:15.4.6:Saved CP0 registers, CPU 12
sr 0048c005 badva c0133270 epc 806b3a3c errorepc 804c19f0
cause 00000000 errctl 00000000 cacheeri 00000000 cacheerd 00000000
buserr 0000000000000000 cacheerrdpa 0000000000000000
SP:1434998416.09:ppool_nvram.c:195:ERROR:15.4.7:Saved function call stack, CPU 12
804c19f0 804c1a1c 804c1be0 8065e964 80660934 8066ac2c 8061fe3c 80616ca4
804d15ac 00000000 00000000 00000000 00000000 00000000 00000000 00000000
SP:1434998416.12:eqllog_mbuf_Q.c:996:ERROR:2.4.0:Panic recovery from CPU0 with reason 'NEED READ
CAPACITY 16!!'.
MFS set up
Building databases...
SP:1434998420.74:emm.c:1239:INFO:28.2.6:Enclosure serial number: XXXXXXXXXXXXXX.
Mon Jun 22 14:40:26 EDT 2015
ipctunnel: 127.0.0.1: Undefined error: 0
Jun 22 14:40:26 init: kernel security level changed from 0 to 1

PS Series Storage Arrays
Unauthorized Access Prohibited

login: [PFAIL]
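
For anyone trying to read the console output above: the `SP:` lines appear to be colon-separated fields (timestamp, source file, line number, severity, event ID, message). Below is a minimal Python sketch for pulling those fields apart. The field layout is inferred only from this paste, not from any Dell documentation, and treating the large leading number as Unix epoch seconds is an assumption. `parse_sp_line` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Assumed layout, inferred from the pasted console lines only:
# SP:<seconds>.<fraction>:<source file>:<line>:<level>:<event id>:<message>
SP_LINE = re.compile(
    r"^SP:(?P<secs>\d+)\.(?P<frac>\d+):(?P<file>[^:]+):(?P<line>\d+):"
    r"(?P<level>[A-Z]+):(?P<event>[\d.]+):(?P<msg>.*)$"
)

def parse_sp_line(text):
    """Split one 'SP:' console line into fields, or return None if it does not match."""
    m = SP_LINE.match(text.strip())
    if m is None:
        return None
    secs = int(m.group("secs"))
    # Assumption: large values are Unix epoch seconds; small ones (e.g. 12.51)
    # look like seconds since boot, so no wall-clock time is derived for them.
    when = datetime.fromtimestamp(secs, tz=timezone.utc) if secs > 1_000_000_000 else None
    return {
        "when": when,
        "file": m.group("file"),
        "line": int(m.group("line")),
        "level": m.group("level"),
        "event": m.group("event"),
        "msg": m.group("msg"),
    }

if __name__ == "__main__":
    sample = ("SP:1434998416.12:eqllog_mbuf_Q.c:996:ERROR:2.4.0:"
              "Panic recovery from CPU0 with reason 'NEED READ")
    print(parse_sp_line(sample))
```

Under the epoch assumption, 1434998416 works out to 2015-06-22 18:40:16 UTC, which is 14:40:16 EDT and lines up with the `Mon Jun 22 14:40:26 EDT 2015` date printed a few lines later in the same boot.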

====

when trying just the second controller it simply outputs this:

[ENTRY][BIST][EQL][WDOG][TWDOG][WFP]
Abort the Boot, Board Pull or Pfail in progress

===

this is as far as I get trying a cli reset being super quick:

CLI> reset
Warning: This command resets an array to the factory
defaults (original condition). The result is the
elimination of all group and volume configuration
information and any volume data residing on the array.
Before resetting an array that is a member of a group,
it is recommended that you delete the member from the
group.
Reset this array to factory defaults? [n/DeleteAllMyDataNow] DeleteAllMyDataNow
Resetting system, this will take a few minutes.
Deleting backup password files
Deleting agent.cnf
Zeroing drives, [PFAIL]
[ENTRY][BIST][EQL][TWDOG][WFP][210][MESR][SCAN][BOOT2][PROC][BOARD][SMP][IOBUS][FLASH][HT]
*****************************************************
POST WARNING - 06/22/2015 17:25:23 Temp 047C
FPGA/XRCP/05130004
Power Supply 0 failed. Input: bad DC output: bad
*****************************************************
[ETH0][ETH1][ETH2]

BorisT15 · Answer

Ha, please disregard. I did a Google image search and see it's on the bottom, the one place I didn't check. Thanks :)

BorisT15 · Answer

Thanks so much, I will attempt this tomorrow and update. I should mention I was mistaken: it's actually a 6000 with 2 x type 10 controllers. While I was working on this today I remembered something about the CF. Silly question: where on the type 10 controller is the flash card? I recall briefly looking today and it didn't seem obvious. Thanks again.

BorisT15 · Answer

I will attempt this tomorrow. I should add that it's actually a 6000 with type 10 controllers. I pulled one out looking for the flash card today. Any pointers as to where it's located? It didn't seem obvious to me for some reason. Thank you

BorisT15 · Answer

Some progress: with no drives installed, either single controller and both controllers together boot and don't crash. I did a reset command with no drives, which it took, but obviously that isn't going to accomplish much. Any drive in any bay causes the same crash loop I pasted. It crashes within 15 seconds, right after the drive-inserted message:

1426:79:MgmtExec:23-Jun-2015 12:58:15.160080:emdEcd.cc:107:INFO:7.2.0:Disk 0 has been inserted.

The same happens if I boot with even 1 drive in any bay.

Any thoughts on the best way to proceed? I suppose I could try all new drives (the drives themselves are new, but obviously have a bad EQ config written to them?). Maybe erase one on a PC? Thanks

BorisT15 · Answer

also.. running diags with no drives, will see how that goes. thanks

BorisT15 · Answer

OK, so of course the USB adapter we have on hand is SATA only, so I will try a SAS <-> SATA adapter on a computer tomorrow and see how it goes. This is a used unit with all brand new out-of-the-box drives. Someone else installed them, and I suspect it was a combo of an old raid config on the controllers + dirty shutdown after drive install == mulch. But yes, thankfully no data at all. Will update Thursday probably, thanks

BorisT15 · Answer

strange doesn't show my reply, anyway, any drive insertion in any slot crashes the system and it goes into the same reboot loop I originally posted, this afternoon I'm going to erase a drive via a USB<->sata on a computer and see what happens

BorisT15 · Answer

Here is the update: the SATA to SAS adapter I got was actually useless, as it required a controller that could talk SAS (which my $20 USB to SATA adapter certainly wouldn't). We ordered a single brand new drive of the same model as the others, and as soon as we popped in just that drive, the unit crashed as well. So I'm rather stumped. I did find the microSD card that the controller uses, but I'm not sure at this point what an erase plan would be. Thanks

BorisT15 · Answer

I tried a few other things just for curiosity, such as putting a clean newest OS on the microSD card. Is there a proper way to actually wipe the nvram on these controllers? This is as close as I got with that experiment:

The nvram version currently running on the array does not match
the version saved in nvram.
Current version: 16
Saved version: 12
This situation can occur if the array experienced an abnormal shutdown
during a firmware update. Contact your customer support provider for
instructions on how to proceed.
The array has been halted.
You can safely power off the array or type any key to restart the array.

Which makes sense, I suppose.

Thanks

BorisT15 · Answer

Did a little more digging in bash mode. It seems the controllers are running 5.0.2, if that matters. I tried FTPing from the array to download 5.1.2 to update, but no dice: there never seems to be enough disk space anywhere on the SD card for some reason. Anyway, I'm stumped. Thanks for your help

BorisT15 · Answer

I got that error message by overwriting (I think) the current folder with the contents of the kit. After copying 5.0.2 back, it booted fine again (without drives). I guess there is no way to just reset the nvram and start this clean? I guess not.

BorisT15 · Answer

absolutely, this was a system with brand new drives and no data I was trying to get online - just chasing a lot of ghosts - I also suspect these drives may be too large to be supported by 5.0.2? reading release notes sounds like 5.0 introduced support for sas drives larger than 2tb but I know the early 5 code had some issues. thanks very much for your help on this.

BorisT15 · Answer

I tried a (single) Dell 3TB SAS I had, and some OEM 4TB SAS (which work fine in 2 other 6010 units I have, but those were all at 6.x). I am really leaning towards a backplane failure. I can't definitively find anything saying 3/4TB drives wouldn't work in 5.0.2, and certainly not that they would cause the OS to crash. Will replace and advise, thanks again.
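
A side note on the drive-size question: the panic reason in the first log was 'NEED READ CAPACITY 16!!'. In SCSI, READ CAPACITY (10) reports the last block address in a 32-bit field, and a device too large for that field has to be queried with READ CAPACITY (16). The sketch below works out where that boundary falls and checks the 3TB and 4TB sizes mentioned above against it. It assumes 512-byte logical sectors and decimal terabytes, neither of which is confirmed for these particular drives, and the function name is just for illustration.

```python
# READ CAPACITY (10) returns the last logical block address in 32 bits.
# A value of 0xFFFFFFFF means "too big, use READ CAPACITY (16)".
RC10_SENTINEL_LBA = 0xFFFFFFFF

# Assumption: 512-byte logical sectors (not confirmed for these drives).
SECTOR_BYTES = 512


def needs_read_capacity_16(capacity_bytes, sector_bytes=SECTOR_BYTES):
    """Return True if a drive of this size cannot be described by READ CAPACITY (10)."""
    last_lba = capacity_bytes // sector_bytes - 1
    return last_lba >= RC10_SENTINEL_LBA


if __name__ == "__main__":
    # Largest size that still needs only 32 bits of block address, in bytes.
    limit_bytes = (2 ** 32) * SECTOR_BYTES
    print(f"32-bit LBA limit at {SECTOR_BYTES}-byte sectors: {limit_bytes} bytes")
    for tb in (2, 3, 4):  # decimal terabytes, for illustration
        size = tb * 10 ** 12
        print(f"{tb} TB drive needs READ CAPACITY (16): {needs_read_capacity_16(size)}")
```

With 512-byte sectors the boundary is 2^32 x 512 = 2,199,023,255,552 bytes (about 2.2 TB), so a 2TB drive fits under it while 3TB and 4TB drives do not.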

BorisT15 · Answer

I have two units that work flawlessly, but both started with 6.x, so I'm 50/50 on whether this is a 5.x issue or the backplane. I will get a replacement backplane this week; it may have just been that.

BorisT15 · Answer

Thanks for the help, here is the final update: the issue in the end was that it was running 5.0.2. I put in 8 SATA drives and configured a group (didn't even need to create a raid), then did the step-by-step upgrade to the latest 7.x and installed my new drives. They all show up fine. I had to re-create the group, obviously. I have built the array and it is verifying the raid now.
