Ceph Health Errors
erazmus
40 Posts
September 25, 2017, 4:50 pm
Hello,
My control panel is showing "Error" for Ceph Health. When I click on it, it says:
Any suggestions on what this means and how I debug further?
erazmus
40 Posts
September 25, 2017, 5:00 pm
To answer my own question, I found this:
http://ceph.com/geen-categorie/ceph-manually-repair-object/
I did the 'ceph pg repair' command and it fixed it. I'd still like to understand what is happening, and possibly suggest a GUI method of debugging/repair.
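Roughly, the sequence from that article looked like this (the PG id and cluster name below are placeholders, not my actual values):
ceph health detail --cluster CLUSTER_NAME    # lists the inconsistent PG, e.g. "pg 1.50 is active+clean+inconsistent"
ceph pg repair 1.50 --cluster CLUSTER_NAME   # asks the primary OSD to repair that PG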
admin
2,930 Posts
September 26, 2017, 1:37 pm
Although you were able to repair the error, you should investigate this further. The inconsistent error means a difference was found when comparing replicas of the same object. PetaSAN uses block storage, so in our case the object is a 4 MB stripe from your LUN. The difference could be a mismatched data checksum, or mismatched metadata such as a different size or a zero size among your replicas.
This could be due to several things:
A disk going bad, causing 'bit rot'. You can view the disks involved with the PG you repaired via
ceph pg dump --cluster CLUSTER_NAME | grep PG_NUMBER
It will display the OSDs responsible for each replica in brackets, such as [2,5,8].
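For example, assuming the repaired PG was 1.50 (just a placeholder):
ceph pg dump --cluster CLUSTER_NAME | grep ^1.50
# the acting set column shows the replica OSDs in brackets, e.g. [2,5,8]; the first entry is the primary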
You may want to install and run smartmontools (apt-get install smartmontools) and monitor the health of these disks.
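For example, to check the disk behind one of those OSDs (the device name below is just an example, use your OSD's device):
smartctl -H /dev/sdb    # quick overall health verdict
smartctl -a /dev/sdb    # full SMART attributes, watch for reallocated or pending sectors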
On the node running the primary OSD for this PG (the first number in the brackets), view the log /var/log/ceph/CLUSTER_NAME.log and look for scrub errors to see why it failed: data_digest !=, size !=, size 0.
Example of a digest error:
2017-09-26 02:33:00.468390 osd.2 10.0.4.11:6802/1610 10 : cluster [INF] 1.50 deep-scrub starts
2017-09-26 02:33:00.472609 osd.2 10.0.4.11:6802/1610 11 : cluster [ERR] 1.50 shard 5: soid 1:0a0ff85c:::testobj:head data_digest 0x9c455797 != data_digest 0x9b87effd from shard 2, data_digest 0x9c455797 != data_digest 0x9b87effd from auth oi 1:0a0ff85c:::testobj:head(64'1 client.64352.0:1 dirty|data_digest|omap_digest s 15 uv 1 dd 9b87effd od ffffffff alloc_hint [0 0]), size 5 != size 15 from auth oi 1:0a0ff85c:::testobj:head(64'1 client.64352.0:1 dirty|data_digest|omap_digest s 15 uv 1 dd 9b87effd od ffffffff alloc_hint [0 0]), size 5 != size 15 from shard 2
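A quick way to pull such lines out of the log (cluster name is a placeholder) would be something like:
grep -i scrub /var/log/ceph/CLUSTER_NAME.log | grep ERR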
Besides bit rot, another possibility is that you are using a controller with a write-back cache without battery backup and the node failed while there was I/O in flight; do not use write-back caching unless it is battery backed. Ceph uses direct/sync write flags to make sure a successful write actually reached the disk, but some low-end consumer SSDs may report data as saved to media while it is still sitting in cache, and they have no power-loss protection if the node goes down.
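If you want to check whether a drive's own volatile write cache is enabled (device name again is just an example):
hdparm -W /dev/sdb    # reports whether the drive write cache is on
# for disks behind a RAID controller, check the cache/BBU settings with the controller's own tool instead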
One strong recommendation is to use a replica size of 3 if you do not already; it helps when you do get inconsistencies. If your current level is 2 you can switch to 3, but this will cause a lot of background traffic to create the third replica. Although it runs at lower priority than client I/O it will still be felt, so you may want to do it on a weekend.
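A sketch of the commands to check and change the replica count (the pool name is just an example, adjust it to your pool):
ceph osd pool get POOL_NAME size --cluster CLUSTER_NAME        # current replica count
ceph osd pool set POOL_NAME size 3 --cluster CLUSTER_NAME      # switch to 3 replicas
ceph osd pool set POOL_NAME min_size 2 --cluster CLUSTER_NAME  # commonly paired with size 3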
Lastly, some official Ceph documentation claims that clock skew may cause PGs to be reported as inconsistent. That would be the most benign scenario, but I would not count on it.
Last edited on September 26, 2017, 3:44 pm by admin · #3