
Error after Power outage

Hi,

Yesterday our PetaSAN system suffered two power outages within 30 minutes. Once I got back onto the controller node I saw a few errors:
- pgs inconsistent (I tried to repair them with pg repair + deep scrubbing; some got fixed but some are still inconsistent, and some of them just come back after a while). The commands I ran are listed after the JSON output below.

PG_DAMAGED Possible data damage: 3 pgs inconsistent
pg 3.8 is active+clean+inconsistent, acting [11,0]
pg 3.4c is active+clean+inconsistent, acting [10,4]
pg 3.75 is active+clean+inconsistent, acting [9,6]

The inconsistent objects:

root@petasan-node01:~# rados --cluster DSP-T11 list-inconsistent-obj 3.75 --format=json-pretty
{
    "epoch": 2750,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.46f59327b23c6.0000000000008d70",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 0
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
            "shards": [
                {
                    "osd": 6,
                    "primary": false,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304
                },
                {
                    "osd": 9,
                    "primary": true,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304
                }
            ]
        }
    ]
}
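
For reference, the repair/scrub commands I ran were roughly the standard ones, once per inconsistent PG (shown here for 3.75):

ceph --cluster DSP-T11 pg deep-scrub 3.75   # re-scrub to refresh the inconsistency report
ceph --cluster DSP-T11 pg repair 3.75       # ask the primary to repair the PG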

- objects unfound (I marked them lost with "revert" and they are all gone now)
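
The revert was roughly the standard mark_unfound_lost call (substitute the PG that actually reported unfound objects; 3.4c is only an example here):

ceph --cluster DSP-T11 pg 3.4c mark_unfound_lost revert   # revert unfound objects to their last known version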

- stuck requests (5 stuck requests on my osd.2)

REQUEST_STUCK 5 stuck requests are blocked > 4096 sec. Implicated osds 2
5 ops are blocked > 67108.9 sec
osd.2 has stuck requests > 67108.9 sec

- cannot stop an iSCSI disk (in order to restart the iSCSI target image-00011 I planned to stop and then start it, but the button text is frozen at "stopping"); I tried unmapping and mapping the image again, but no luck
log:

22/05/2019 21:31:53 ERROR LIO error deleting Target None, maybe the iqn is not exists.
22/05/2019 21:31:53 ERROR Cannot unmap image image-00011. error, no mapped images found.
22/05/2019 21:31:57 ERROR Could not find ips for image-00011

Could you please help us out on this case?

Best regards,

Tom

This link may help

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

You can also look into ceph-objectstore-tool, which allows you to export and import a complete pg from/to an osd. You need to stop the osd service for the tool to work.
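
Rough command form for the export (the data path here is an assumption; on a cluster with a custom name it may be /var/lib/ceph/osd/<cluster>-<id>, and the service unit name can differ under PetaSAN):

systemctl stop ceph-osd@9                   # stop the OSD holding the copy first
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid 3.75 --op export --file /root/pg3.75.export
systemctl start ceph-osd@9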

I would focus on fixing the pgs; put aside the stuck requests and iSCSI. If Ceph is fixed, they will work.

Good luck.

Hi,

I've checked the link and tried repairing, deep-scrubbing and backfilling, but the system keeps saying "read error" and "0 fixed"...

Therefore the PGs are still inconsistent and the iSCSI target won't fire up.

I just want to understand why a simple power outage can hurt the whole system that much. We are actually planning to deploy PetaSAN as our main storage, so I need to prepare for such a scenario.

Would x3 replication help? Or do you have any suggestions for making the system survive an unexpected power loss?

Best regards,

Tom

As suggested, try to use ceph-objectstore-tool to export the entire pg from an osd with a good copy. Look at the log files of the acting osds for the pg (for pg 3.75 these are osd 9 and 6) in /var/log/ceph/osd and try to find which one has the good copy. If it is the primary (9), then a pg repair should do; otherwise the export/import should be done. It is always prudent to make a pg backup export from both osds anyway, even from the one believed bad and even if you will do a pg repair. I assume the osds are up and you have no kernel dmesg errors.
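
A rough outline of that workflow, with assumed paths and file names (here the good copy is assumed to be on osd.9 and the bad one on osd.6 for pg 3.75; adjust to what the logs actually show, and note that the data path and log file names may include your cluster name instead of "ceph"):

# 1. look for read errors on the object in the acting OSDs' logs
grep -i 'read.error' /var/log/ceph/*osd.9.log /var/log/ceph/*osd.6.log

# 2. stop both OSDs and take a safety export of the pg from each
systemctl stop ceph-osd@9 ceph-osd@6
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid 3.75 --op export --file /root/pg3.75.osd9
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --pgid 3.75 --op export --file /root/pg3.75.osd6

# 3. remove the bad copy and import the good export in its place
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --pgid 3.75 --op remove --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --pgid 3.75 --op import --file /root/pg3.75.osd9

# 4. restart the OSDs and ask Ceph to re-check the pg
systemctl start ceph-osd@9 ceph-osd@6
ceph --cluster DSP-T11 pg repair 3.75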

Using a replica of 3 will make this better, but it is also worth looking at the hardware used for potential power-outage susceptibility. Some consumer SSDs have no power-loss protection (google to get a list), cache controllers without battery backing are a risk, and generally so is any disk write cache without proper backup; in some cases the hdparm command can help you see if caching is on. The best advice is to test all possible failures, including power loss, in a pre-production cluster to see how well your hardware will respond.
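
A quick sketch of checking and disabling a SATA drive's volatile write cache with hdparm (the device name is only an example, and on many drives the setting does not persist across reboots):

hdparm -W /dev/sdb      # query: "write-caching =  1 (on)" means the volatile cache is enabled
hdparm -W 0 /dev/sdb    # turn the drive's write cache off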

Hi,

In addition to the admin's answer:

We did a power-cable pull test in our environment and discovered that our HP RAID controller (RAID 0) with battery was ignoring direct I/Os, i.e. it did not deactivate the on-disk write cache. Each time we pulled the cable while writes were in progress we could reproduce data loss. Some OSDs did not even come up afterwards.

We replaced them with SAS HBAs and checked with sdparm that the on-disk write cache was off. Using Filestore (self-created) we did not see a performance loss, and the power-cable pull tests were successful after that.
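
For reference, a minimal sdparm check on a SAS/SCSI disk might look like this (device name is an example; --save keeps the setting across power cycles if the drive supports saved pages):

sdparm --get=WCE /dev/sdc            # WCE 1 = write cache enabled, 0 = disabled
sdparm --clear=WCE --save /dev/sdc   # disable the on-disk write cache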

Another recommendation is to put your nodes in different rooms/buildings with different power supplies (with corresponding CRUSH rules to separate the replicas!). That way you will lose only one copy if power is lost and the other will stay online. Never use x2 replication, because it makes identifying the bad copy a real challenge. We use x4 replication with 2 copies in each of 2 server rooms, and a management node in a third room with a direct connection (no additional switch, but only because the third room is not suitable as a room for OSD nodes).
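
A minimal sketch of such a room split in CRUSH (bucket, host and pool names are made up; adapt them to your cluster):

# create room buckets and move the hosts into them
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket room2 room
ceph osd crush move room1 root=default
ceph osd crush move room2 root=default
ceph osd crush move node01 room=room1
ceph osd crush move node02 room=room2

# replicated rule that spreads copies across rooms, then apply it to the pool
ceph osd crush rule create-replicated rep_by_room default room
ceph osd pool set mypool crush_rule rep_by_room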

Hope that helps,

Dennis

Hi,

Thank you admin and Dennis. It turned out our Intel 545s does not support power-loss data protection, and I had not turned off the on-disk write cache.

So for now I strongly believe the best practice for us is to do a clean installation and then use x3 replication with the write cache disabled, as Dennis suggested. Unfortunately we do not have that many rooms to put servers in, so we will have to make do with what we currently have. Another thing: I will use EC instead of x2 replication to save space.

And I had better test the system several more times before actually putting anything on it.

Your answers saved my day. Thanks a lot, especially Admin for pointing out the SSD power-loss data protection feature.

Best regards,

Tom.

> and then using x3 replica

Be sure to set min_size = 2. That is another stumbling block that can break a cluster...
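
For example, assuming a pool named "mypool" (the name is just a placeholder):

ceph osd pool get mypool min_size    # show the current value
ceph osd pool set mypool min_size 2  # refuse I/O unless at least 2 replicas are available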

Regards,

Dennis