
config issue and cannot map rbd into linux filesystems

Hi,

I have an issue with the petasan-config-upload service, shown below. How can I recreate it?

Oct 10 12:52:18 ae-pst-10-01 systemd[1]: Started PetaSAN Config Upload.
Oct 10 12:52:19 ae-pst-10-01 petasan_config_upload.py[169023]: no such key 'config/global/public_network'
Oct 10 12:52:20 ae-pst-10-01 petasan_config_upload.py[169093]: no such key 'config/global/cluster_network'
Oct 10 12:52:20 ae-pst-10-01 petasan_config_upload.py[169145]: no such key 'config/global/bluestore_block_db_size'
Oct 10 12:52:21 ae-pst-10-01 petasan_config_upload.py[169266]: no such key 'config/mon/setuser_match_path'

 

I also have an issue with iSCSI: it always shows an iSCSI login timeout to the network portal. How can I mount an RBD image into a Linux filesystem, so I can export the content and move it to NFS in the PetaSAN cluster?

root@ae-pst-10-01:/opt/petasan/config# rbd showmapped --cluster PETASAN-SSD-CLUSTER
did not load config file, using default settings.
2022-10-10T13:41:46.244+0400 7f326e9f8380 -1 Errors while parsing config file!
2022-10-10T13:41:46.244+0400 7f326e9f8380 -1 parse_file: filesystem error: cannot get file size: No such file or directory [PETASAN-SSD-CLUSTER.conf]
2022-10-10T13:41:46.244+0400 7f326e9f8380 -1 Errors while parsing config file!
2022-10-10T13:41:46.244+0400 7f326e9f8380 -1 parse_file: filesystem error: cannot get file size: No such file or directory [PETASAN-SSD-CLUSTER.conf]

I tried running:

systemctl stop petasan-iscsi

rbd map image-0001 (the mapped device shows up in lsblk)

Since the filesystem is VMFS, I tried mounting it with the vmfs6-fuse command.

After mounting I can see the content, but as soon as I try to copy it, the server hangs and I cannot do anything.
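
For completeness, the rough sequence of commands was something like this (the device name, mount point, and the config-file hint are illustrative, not my exact values):

systemctl stop petasan-iscsi        # stop the PetaSAN iSCSI service so the image is not in use by a target
rbd map image-0001                  # map the image; add "-c /path/to/<cluster>.conf" if the default config is not found
lsblk                               # the mapped device (e.g. /dev/rbd0) appears here
mkdir -p /mnt/vmfs                  # example mount point
vmfs6-fuse /dev/rbd0 /mnt/vmfs      # mount the VMFS volume (read-only) with vmfs6-tools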

 

Please advise.

I would not worry about the config issue. It is complaining that it cannot move some entries from the config file to the central monitor config, a feature we added around version 2.4; some entries are no longer supported.
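
If you want to see which of those entries actually made it into the central monitor config, you can dump it and grep for them (standard Ceph CLI, just for checking):

ceph config dump                                                        # everything stored in the monitor central config
ceph config dump | grep -E 'public_network|cluster_network|bluestore'   # the keys the upload service complained about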

- What version are you using? What version did you start with?

- When did you start seeing iSCSI login errors? From the start, suddenly, or after some changes to the cluster?

- Is the cluster health OK?

- Are you meeting the hardware requirements for the system?

- Do you see any high % utilization on memory, CPU, or disks in the charts?

 

Hi, below are the answers:

 

Version: 3.1.0, upgraded from 2.8.0.

I have seen iSCSI login errors for a long time and ignored them, but now they happen very often and I cannot connect at all.

Cluster health shows multiple errors. There is a failed physical disk; I already stopped the OSD on the failed disk, but I have not replaced it yet, only removed it as an OSD. I already ran ceph pg repair, but health still shows the same errors.

Hardware is dual Xeon with 256 GB of RAM and a 24-bay chassis, with 4 SSDs as journal and the remaining 18 disks as OSDs.

No high utilization.

Cluster health:

root@ae-pst-10-02:~# ceph health detail
HEALTH_ERR 2 scrub errors; Reduced data availability: 35 pgs inactive, 1 pg down, 34 pgs incomplete; Possible data damage: 2 pgs inconsistent; 5 pgs not deep-scrubbed in time; 5 pgs not scrubbed in time
[ERR] OSD_SCRUB_ERRORS: 2 scrub errors
[WRN] PG_AVAILABILITY: Reduced data availability: 35 pgs inactive, 1 pg down, 34 pgs incomplete
pg 10.14 is incomplete, acting [1,35]
pg 10.5d is incomplete, acting [32,45]
pg 10.6a is incomplete, acting [10,51]
pg 10.125 is incomplete, acting [48,32]
pg 10.14b is incomplete, acting [18,47]
pg 10.15d is incomplete, acting [27,45]
pg 10.17d is incomplete, acting [0,10]
pg 10.1d4 is incomplete, acting [51,8]
pg 10.22e is incomplete, acting [32,30]
pg 10.265 is incomplete, acting [30,7]
pg 10.26f is incomplete, acting [41,37]
pg 10.27a is incomplete, acting [45,37]
pg 10.2e0 is incomplete, acting [49,35]
pg 10.32e is incomplete, acting [0,31]
pg 10.357 is incomplete, acting [3,26]
pg 10.35c is incomplete, acting [9,40]
pg 10.38b is incomplete, acting [39,3]
pg 10.445 is incomplete, acting [30,10]
pg 10.44d is incomplete, acting [48,32]
pg 10.4af is incomplete, acting [37,49]
pg 10.4c5 is incomplete, acting [50,31]
pg 10.4f9 is incomplete, acting [45,36]
pg 10.533 is inconsistent+incomplete, acting [38,52]
pg 10.53d is incomplete, acting [39,0]
pg 10.594 is incomplete, acting [27,1]
pg 10.62f is incomplete, acting [52,8]
pg 10.645 is incomplete, acting [27,2]
pg 10.66d is incomplete, acting [36,45]
pg 10.6c3 is down, acting [40,11]
pg 10.711 is incomplete, acting [49,8]
pg 10.722 is incomplete, acting [11,52]
pg 10.736 is incomplete, acting [33,46]
pg 10.748 is incomplete, acting [49,11]
pg 10.79a is incomplete, acting [45,35]
pg 10.7c5 is inconsistent+incomplete, acting [37,30]
[ERR] PG_DAMAGED: Possible data damage: 2 pgs inconsistent
pg 10.533 is inconsistent+incomplete, acting [38,52]
pg 10.7c5 is inconsistent+incomplete, acting [37,30]
[WRN] PG_NOT_DEEP_SCRUBBED: 5 pgs not deep-scrubbed in time
pg 10.6c3 not deep-scrubbed since 2022-09-02T01:21:36.412664+0400
pg 10.1d4 not deep-scrubbed since 2022-09-04T13:37:48.327137+0400
pg 10.14b not deep-scrubbed since 2022-09-15T13:27:18.272612+0400
pg 10.5d not deep-scrubbed since 2022-09-06T01:51:34.877560+0400
pg 10.265 not deep-scrubbed since 2022-09-27T09:27:07.302964+0400
[WRN] PG_NOT_SCRUBBED: 5 pgs not scrubbed in time
pg 10.6c3 not scrubbed since 2022-09-02T01:21:36.412664+0400
pg 10.1d4 not scrubbed since 2022-09-04T13:37:48.327137+0400
pg 10.14b not scrubbed since 2022-09-17T22:40:50.194025+0400
pg 10.5d not scrubbed since 2022-09-06T01:51:34.877560+0400
pg 10.265 not scrubbed since 2022-09-28T15:32:30.786157+0400
root@ae-pst-10-02:~#

 

The problem is that 35 PGs are inactive. This is a problem at the lower Ceph layers, not iSCSI.

The reason they are inactive is that they are incomplete. This is a bad error and can lead to data loss in some cases. Incomplete means the different OSDs serving the PG cannot form a continuous sequence log/history of the changes. This can happen if min_size for the pool was set to 1 (so writes can be accepted by a single OSD) and the OSDs serving the PG were going up and down, so each has a different sequence than the others and they cannot agree on what the final data should be. Apart from min_size = 1, it could also happen if multiple media errors occur at the same time.

Fixing this is tricky. Search the net for fixing incomplete PGs and also for the flag osd_find_best_info_ignore_history_les, but be careful, as it could lead to data loss. Good luck.
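
As a starting point for diagnosis only, you can query one of the incomplete PGs to see which OSDs it is waiting on and why peering is blocked (pg 10.533 and its acting set are taken from your health output above):

ceph pg 10.533 query      # look at "recovery_state", "peering_blocked_by" and "down_osds_we_would_probe"
ceph pg map 10.533        # shows the up/acting OSD sets for the PG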

I already changed 4 failed disks. The result: there is 1 PG, 10.445, that is always in creating state; after 2 days it never finishes and may be stalled. Do you have a clue?

root@ae-pst-10-03:~# ceph health detail
HEALTH_ERR 1 scrub errors; Reduced data availability: 33 pgs inactive, 33 pgs incomplete; Possible data damage: 1 pg inconsistent; 9 pgs not deep-scrubbed in time; 5 pgs not scrubbed in time; 15 slow ops, oldest one blocked for 83221 sec, daemons [osd.10,osd.11,osd.18,osd.27,osd.33,osd.44] have slow ops.
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[WRN] PG_AVAILABILITY: Reduced data availability: 33 pgs inactive, 33 pgs incomplete
pg 10.14 is incomplete, acting [44,35]
pg 10.5d is incomplete, acting [32,44]
pg 10.6a is incomplete, acting [10,44]
pg 10.125 is incomplete, acting [44,32]
pg 10.14b is incomplete, acting [18,44]
pg 10.15d is incomplete, acting [27,44]
pg 10.17d is incomplete, acting [44,10]
pg 10.1d4 is incomplete, acting [44,8]
pg 10.22e is incomplete, acting [32,44]
pg 10.265 is incomplete, acting [44,7]
pg 10.26f is incomplete, acting [44,37]
pg 10.27a is incomplete, acting [44,37]
pg 10.2e0 is incomplete, acting [44,35]
pg 10.32e is incomplete, acting [44,31]
pg 10.357 is incomplete, acting [44,53]
pg 10.35c is incomplete, acting [9,44]
pg 10.445 is creating+incomplete, acting [44,10]
pg 10.44d is incomplete, acting [44,32]
pg 10.4af is incomplete, acting [37,44]
pg 10.4c5 is incomplete, acting [44,31]
pg 10.4f9 is incomplete, acting [44,36]
pg 10.53d is incomplete, acting [39,44]
pg 10.594 is incomplete, acting [27,44]
pg 10.62f is incomplete, acting [44,38]
pg 10.645 is incomplete, acting [27,44]
pg 10.66d is incomplete, acting [36,44]
pg 10.6c3 is incomplete, acting [44,11]
pg 10.711 is incomplete, acting [44,8]
pg 10.722 is incomplete, acting [11,52]
pg 10.736 is incomplete, acting [33,44]
pg 10.748 is incomplete, acting [44,11]
pg 10.79a is incomplete, acting [44,35]
pg 10.7c5 is inconsistent+incomplete, acting [37,44]
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 10.7c5 is inconsistent+incomplete, acting [37,44]
[WRN] PG_NOT_DEEP_SCRUBBED: 9 pgs not deep-scrubbed in time
pg 10.6c3 not deep-scrubbed since 2022-09-02T01:21:36.412664+0400
pg 10.53d not deep-scrubbed since 2022-10-01T14:07:16.560552+0400
pg 10.4f9 not deep-scrubbed since 2022-09-29T20:31:55.576463+0400
pg 10.4af not deep-scrubbed since 2022-09-30T14:22:39.965113+0400
pg 10.1d4 not deep-scrubbed since 2022-09-04T13:37:48.327137+0400
pg 10.14b not deep-scrubbed since 2022-09-15T13:27:18.272612+0400
pg 10.6a not deep-scrubbed since 2022-09-30T18:56:18.451517+0400
pg 10.5d not deep-scrubbed since 2022-09-06T01:51:34.877560+0400
pg 10.265 not deep-scrubbed since 2022-09-27T09:27:07.302964+0400
[WRN] PG_NOT_SCRUBBED: 5 pgs not scrubbed in time
pg 10.6c3 not scrubbed since 2022-09-02T01:21:36.412664+0400
pg 10.1d4 not scrubbed since 2022-09-04T13:37:48.327137+0400
pg 10.14b not scrubbed since 2022-09-17T22:40:50.194025+0400
pg 10.5d not scrubbed since 2022-09-06T01:51:34.877560+0400
pg 10.265 not scrubbed since 2022-09-28T15:32:30.786157+0400
[WRN] SLOW_OPS: 15 slow ops, oldest one blocked for 83221 sec, daemons [osd.10,osd.11,osd.18,osd.27,osd.33,osd.44] have slow ops.

 

After I ran ceph pg 10.445 query, it showed peering_blocked_by_history_les_bound. From googling, the suggested fix is:

osd_find_best_info_ignore_history_les = true
osd_recovery_sleep_hdd = 0

But I don't know how to apply them; maybe you can help, so I can export the VMs from the iSCSI RBD image.
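
From what I found, the procedure would be roughly the following; I have not run it yet, the OSD id is just the primary from pg 10.445's acting set, and I understand restarting OSDs with this flag can risk data loss:

ceph config set osd osd_find_best_info_ignore_history_les true   # let peering ignore the last_epoch_started history check
ceph config set osd osd_recovery_sleep_hdd 0                     # remove the recovery throttle on HDD OSDs
systemctl restart ceph-osd@44                                     # restart the primary OSD of pg 10.445 so it re-peers
ceph config rm osd osd_find_best_info_ignore_history_les         # remove the dangerous flag once the PG is active again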

Please advise.