Dirty Cache – Dell, Equallogic Storage Array

I hope you never encounter this issue, but if you find yourself searching for a way to get your array back online, you’re in luck.

Symptoms:

  • Equallogic Storage Array no longer responds to pings
  • iSCSI attached volumes have all gone offline
  • Unable to access the Equallogic Storage Array using SAN HQ
  • Unable to access the Equallogic Storage Array via its web interface
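The symptoms above can be checked quickly from a host on the same network. The following is a minimal sketch, not a vendor tool: it only tests whether a TCP connection can be opened. Port 3260 is the standard iSCSI port; port 443 for the web interface and the function names are assumptions made for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_array(host: str, ports=(3260, 443)) -> dict:
    """Map each port to whether it accepted a connection.

    3260 is the standard iSCSI port; 443 is an assumed web-interface port.
    """
    return {port: tcp_port_open(host, port) for port in ports}


# Example (replace with the group or member IP of your own array):
# print(check_array("192.0.2.10"))
```

If every port reports False and pings also fail, the array matches the symptoms listed above and the serial console is the remaining way in.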

By this point you have probably been alerted and know that your Equallogic Storage Array is offline.

If you have already connected to the console with the serial cable, you will see a message like the following: Logger daemon is losing messages because offline disks are generating more events than the daemon can handle.

Actions to Take:

  • Connect Serial Interface Cable
  • Have your grpadmin password ready
  • Have PuTTY or your terminal emulator of choice ready for use

Now that you are ready, connect to the system via the serial interface on one of the controllers.

Log into the SAN using the grpadmin account.

You will see the following message:

Login to account grpadmin succeeded, using local authentication. User privilege is group-admin.

It appears that the storage array has not been configured.
Would you like to configure the array now ? (y/n) [n]

Choose n.

The following message will be displayed:

Please run setup before executing management commands
It appears that the storage array has not been configured. Please run setup before executing management commands

Do not run setup here: doing so will destroy your data.

Now that we are logged into the Equallogic Storage Array, we need to drop into the BASH command shell.

To do this, type: su ex sh

You will see the following message:

You are running a support command, which is normally restricted to PS Series Technical Support personnel. Do not use a support command without instruction from Technical Support.

Run the following command: raidtool

In my case the following message was displayed:

Driver Status: *Admin Intervention Requested*
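If you saved the console session with your terminal emulator's text capture, the status line can be pulled out of the capture afterwards. This is a small sketch that assumes the line has the same "Driver Status:" shape as the one quoted above; the function name is hypothetical, and it does not know what the other status values look like.

```python
import re
from typing import Optional

# Matches a line such as: Driver Status: *Admin Intervention Requested*
# The surrounding asterisks are optional and are not part of the result.
DRIVER_STATUS_RE = re.compile(
    r"^\s*Driver Status:\s*\*?(.*?)\*?\s*$", re.MULTILINE
)


def driver_status(raidtool_output: str) -> Optional[str]:
    """Return the text of the first 'Driver Status:' line, or None if absent."""
    match = DRIVER_STATUS_RE.search(raidtool_output)
    return match.group(1).strip() if match else None


sample = "Driver Status: *Admin Intervention Requested*"
print(driver_status(sample))
```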

Next, drop into the ecli by typing: ecli

In the ecli, type: hs

The following message may be displayed:

Health Status (0x0000000800000000): RED Conditions:
RAID_LOST_CACHE_CONDITION

* This confirms that the RAID cache is corrupted (a lost cache condition).
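The health status output can also be parsed from a saved capture, which is handy when you need to keep a record of what the array reported before clearing anything. The sketch below assumes the two-line layout quoted above (a header line followed by one condition name per line); the class and function names are hypothetical, and it makes no claim about what the individual bits in the hex code mean.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HealthStatus:
    code: int              # the hex value from the header, as an integer
    level: str             # e.g. "RED"
    conditions: List[str]  # e.g. ["RAID_LOST_CACHE_CONDITION"]


HEADER_RE = re.compile(
    r"Health Status \((0x[0-9A-Fa-f]+)\):\s*(\w+) Conditions:"
)
CONDITION_RE = re.compile(r"[A-Z][A-Z0-9_]*")


def parse_health_status(text: str) -> Optional[HealthStatus]:
    """Parse the first 'Health Status' block found in captured console text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        header = HEADER_RE.search(line)
        if header is None:
            continue
        conditions = []
        # Condition names follow the header, one per line, until a line
        # that does not look like an upper-case identifier.
        for follow in lines[index + 1:]:
            name = follow.strip()
            if not CONDITION_RE.fullmatch(name):
                break
            conditions.append(name)
        return HealthStatus(int(header.group(1), 16), header.group(2), conditions)
    return None


capture = (
    "Health Status (0x0000000800000000): RED Conditions:\n"
    "RAID_LOST_CACHE_CONDITION\n"
)
status = parse_health_status(capture)
print(status.level, status.conditions)
```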

Type quit to exit back to the CLI> prompt.

Then issue the following command: clearlostdata

This will display the following:

The clearlostdata command will gather information about the
state of this array for support and troubleshooting purposes.
No user information will be included in this data.

E-mail notification is not available, so you must retrieve the results
by using the “text capture” feature of your terminal emulator
or Telnet program.

You will be given information to help you do this at the end of this procedure.

Finally, please remember to include your Dell Technical Support case or incident number in the subject line of any e-mail that you send to Dell Support. This will help ensure that the message is routed correctly.

Do you wish to proceed with data collection? (y/n) [y]:

Select y.

Next you will see:

Starting data collection on …

Section 1 of 1: ..
Finished in 2 seconds

You also have the option to capture the output by using the “text capture” feature of your Telnet or terminal emulator program.
Do you wish to do this (y/n) [n]: y

The configuration data will now be sent to the console. Please enable text capture in your terminal emulator or Telnet program, and submit the resulting file with your problem report.

Please press the Enter key when you are ready to proceed.

When this completes, your system will come back online.
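For reference, the answers given at each interactive prompt in this walkthrough can be summarised as a small lookup table. This is only a sketch of the choices described in this post, written so you can check a prompt against it while working at the console; the names are hypothetical and it does not talk to the array.

```python
from typing import Optional

# Prompt fragment -> answer used in this walkthrough.
PROMPT_ANSWERS = [
    ("Would you like to configure the array now", "n"),   # never run setup
    ("Do you wish to proceed with data collection", "y"),
    ("Do you wish to do this", "y"),                       # text capture
]


def answer_for(prompt_line: str) -> Optional[str]:
    """Return the answer this walkthrough gives to a console prompt, if any."""
    for fragment, answer in PROMPT_ANSWERS:
        if fragment in prompt_line:
            return answer
    return None  # unknown prompt: stop and read it carefully


print(answer_for("Would you like to configure the array now ? (y/n) [n]"))
```

Returning None for anything unrecognised is deliberate: a prompt that is not in the table is one this post never covered.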

I can’t stress this enough: get your data off that system now.

In my case we replaced both controllers and the issue still occurred. Be on the safe side and evacuate your data NOW.

Other Tech Info:

Model: 70-0011
Family: PS100
Chassis: 1403
Disks: SATA HDD
Firmware: V5.2.4 (R255063)

15 replies on “Dirty Cache – Dell, Equallogic Storage Array”

I have experience with these arrays. Applying a firmware update very nearly resulted in a primary AND secondary array failure due to the firmware marking drives as bad with no recovery options. Needless to say we moved to a different vendor immediately.

Glad to hear you were able to recover from this.

extremesanity’s comment isn’t really related to what’s going on with your array (though it’s not completely off-topic, as multiple drive failures can sometimes result in cache errors too). Also, his past experience with EQL arrays isn’t at all typical, especially with the more recent firmware generations.

If the problem keeps returning after it has been cleared and the controllers have been replaced, you’ve likely still got a hardware problem hiding somewhere. Pinpointing it would likely require digging through logs including dmesg, and a good deal of expertise…

Update: after replacing both controllers, the power supplies, and (a waste of money) all the HDDs, it failed again. We then replaced the backplane with another, and it has not failed since.

Hi,

I work for Dell, and I urge you: please do not follow this guide.

If for any reason your array becomes unresponsive, please call technical support. Even out of warranty, we’ll do our best to help you.

When you see “You are running a support command, which is normally restricted to PS Series Technical Support personnel. Do not use a support command without instruction from Technical Support.”, we mean it.

Also please be aware that taking such actions before calling us can void your warranty.

Thank you,

The issue is this: “void your warranty.” When you do not have a warranty or an active support subscription, your support teams do not offer assistance. In my case the SAN was out of warranty and support. In fact it was listed as end of life and no one would provide proper support. I offer my readers these steps because I needed to invoke them several times to get to my data so I could recover it and offload it to another system.

The only reason I have approved your comment on this post is to be 100% transparent. If your statements are true: I still have the SAN, which is offline and out of warranty. Will you support getting it operational and, more to the point, will you send me a replacement if you can’t? I doubt that your answer can or will ever be “YES”.

Thank you, and if you can, please feel free to respond with contact info. If you are able to fix this issue and replace the SAN, I will remove my post.

– Jermal

Hi,

To be a bit clearer: the lost cache condition can happen for multiple reasons. Clearing it is not the first step we take, because sometimes the cache can be recovered.

Then, the steps you give in your post are about getting the array online and removing data from it. If a customer calls us in a situation where data is not available, even with array out of warranty, we’ll deal with it and, if possible depending on the technical situation, bring his/her array online so he can move data away from it.

Of course we won’t offer warranty replacement of hardware if hardware is faulty after end of warranty, but we can quote it if necessary, or we can quote warranty extensions as well.

The main point is: if these steps are to be taken, they have to be done under the supervision of technical support.

I don’t know if your situation is hardware or firmware related, but if you want me to take a look, feel free to send me an email and I will try to help you as much as possible.

Thank you for your response on this. I understand your main point.
I thank you for your offer to help, but I have been down that road and didn’t get far.

Best Regards

Hello,

Three years later, I can confirm that I have used this procedure (and it works very, very well).

@Thibaut, for your information (if you see my post): I had a problem on our SAN, which was under warranty. I called Technical Support and the engineer worked on the case, but 24 hours later told us that we had lost data! For fun, I ran the procedure that jermsmit describes here, and all the data was recovered (access is fine and no data was lost).

So, @Thibaut, please, when you see a post on a wiki (for free), especially one that works, please leave us alone and do not scare people.

Sorry for my English (I’m French).

@jermsmit, Thank you very much.

Well, the goal is not to scare people, but to make them realize that this approach is the wrong one. If you have an issue, you really SHOULD CONTACT SUPPORT. If you want to do stuff without knowing the impact on your array, then go ahead.

Again, we do NOT charge customers for EQL arrays out of warranty, and I help customers with out of warranty arrays on a weekly basis.

@Reda, I am very interested in having the case number for what you’re describing, because normally, after we’ve tried to recover the cache, if we can’t do it, we discard it the way described in the article. It’s just that we have more steps to try and save it. So if you could drop me an email, I’d be happy to review the case and try to explain what happened, because we didn’t do our jobs properly.

So it’s not a matter of a free wiki article. One of the customers I worked with recently had run that command and lost all his data, even after we spent a weekend trying to save it.

I’ve been working on this product for five and a half years now, and I still stand by my post from 3 years ago: DO NOT DO THAT FROM SCRATCH, CALL TECHNICAL SUPPORT. PLEASE.

I have an EqualLogic PS100 that says 3 drives have failed.

I contacted Dell and this is their response: basically no support.
I have a Dell PS100 RAID.
I have backups of all the drives.
I have terminal and network access.
I am looking to pay someone for some consulting help,
such as how to force the config online.
Skype me:
wayne_horner

====================
Hello Dear Customer,

I hope your day is going well.

Unfortunately, your account is unauthorized because our records indicate that the unit PS100 has reached the end of their serviceable life, and are no longer eligible for new service contracts. In an effort to align with the EqualLogic terms and conditions for Firmware Downloads, we are required to remove customers download access once their service programs expire.

My apologies for the inconvenience. Dell has a strict policy regarding EQL Support Site Access and firmware download privileges on the EqualLogic support site. In order for an email account to be eligible for Support Site Access (firmware, documentation), the email account must meet the following criteria:

– The asset tag, it must be owned by the company requesting download access.

– The asset tag, it must have valid, active service contracts from Dell.

Our End User License Agreement (EULA) stipulates that access to firmware and software downloads on the support site requires at least one EqualLogic array be currently under warranty. You can find it here.

I apologize for the inconvenience and trust you understand our reasoning.

Please contact me if you require any further assistance

Regards,

XXXXXXXXXXXXXXXXXX

Global Tag Team

Dell EMC | Services

Hi,

The PS100 was retired before I joined the EQL team back in 2012. No wonder this is the answer you got 😀

What type of consulting are you looking for and what’s the current situation ?

Also, you contacted customer service and not technical support, according to the person’s name you disclosed in your transcript. (Jermal, I don’t know if you could remove the name in the email signature or not, but I think it would be a good idea, considering that person probably doesn’t want his name thrown out on the internet :))

Could you reply to this comment to tell me what your situation is?

Thanks,
