iSCSI Disk Freeze
moh
16 Posts
December 16, 2020, 11:13 am
We have a cluster of 3 nodes with 2 iSCSI disks, and the data has no redundancy. Both disks respond to ping and to telnet on the iSCSI port, but on our iSCSI initiator one disk shows a "reconnecting" status. When we try to stop that disk, it stays stuck in the stopping state, so we have to restart it every time, and this happens every day.
How can I resolve this?
admin
2,930 Posts
December 16, 2020, 3:01 pm
Can you provide more info? Did you rule out hardware and network issues? What is the hardware configuration? How many OSDs? Is the cluster status OK or is it in error? Do the PG Status charts show all PGs active at the time of failure? Do the charts for mons and OSDs show any down at the time of failure? Does it happen every day under some specific load condition, such as backup jobs? Are the disk, CPU and memory % utilization charts OK at the time of failure, or saturated? Do you see any errors in the logs (PetaSAN/syslog/Ceph)?
What do you mean by "the data has no redundancy"?
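For example, something along these lines could be run on one of the nodes to gather that info (the log paths assume a default PetaSAN/Ceph install; adjust if yours differ):

# overall cluster health and any problem PGs
ceph status
ceph health detail

# OSD and monitor state
ceph osd stat
ceph osd tree

# recent errors around the time of the iSCSI failure
grep -i error /opt/petasan/log/PetaSAN.log | tail -n 50
grep -iE "error|fail" /var/log/syslog | tail -n 50
tail -n 100 /var/log/ceph/ceph.log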
Last edited on December 16, 2020, 3:02 pm by admin · #2
moh
16 Posts
December 17, 2020, 10:01 am
No, we have not ruled out hardware and network. We have 48 OSDs. The ceph status is below.
ceph health shows an error: "1/2129038 objects unfound (0.000%)
Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 1/2129088 objects degraded (0.000%), 1 pg degraded
1 pgs not deep-scrubbed in time
1 pgs not scrubbed in time
1 slow ops, oldest one blocked for 62569 sec, mon.NODEO2 has slow ops
cluster:
id: bf167be6-46ed-4c6d-bb3e-72e466994805
health: HEALTH_ERR
1/2129038 objects unfound (0.000%)
Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 1/2129088 objects degraded (0.000%), 1 pg degraded
1 pgs not deep-scrubbed in time
1 pgs not scrubbed in time
1 slow ops, oldest one blocked for 63836 sec, mon.NODEO2 has slow ops
services:
mon: 3 daemons, quorum NODEO3,NODEO1,NODEO2 (age 17h)
mgr: NODEO2(active, since 17h), standbys: NODEO3, NODEO1
mds: cephfs:1 {0=NODEO3=up:active} 2 up:standby
osd: 48 osds: 48 up (since 17h), 48 in (since 17h)
data:
pools: 4 pools, 1216 pgs
objects: 2.13M objects, 8.0 TiB
usage: 8.0 TiB used, 78 TiB / 86 TiB avail
pgs: 1/2129088 objects degraded (0.000%)
1/2129038 objects unfound (0.000%)
1215 active+clean
1 active+recovery_unfound+degraded
io:
client: 6.4 KiB/s rd, 7 op/s rd, 0 op/s wr
Last edited on December 17, 2020, 10:02 am by moh · #3
admin
2,930 Posts
December 17, 2020, 11:03 am
1) What do you mean by "the data has no redundancy"?
2) The cluster state is HEALTH_ERR: there is 1 PG with recovery_unfound, and this is preventing the iSCSI disks from working. Can you trace from the PG Status charts when this happened, and do you recall anything that occurred at that time?
3) Find the PG with this error:
ceph health detail
then show the output of:
ceph pg PG list_unfound
ceph pg PG query
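For example, assuming ceph health detail reports the problem on a PG with an ID such as 1.2f (your actual PG ID will be different), the commands would look like:

ceph health detail | grep unfound
ceph pg 1.2f list_unfound
ceph pg 1.2f query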
Last edited on December 17, 2020, 11:03 am by admin · #4