Replace failed drive
khopkins
96 Posts
September 2, 2020, 7:23 pm
The disk will be in Friday, so we will remove the failed drive and replace it, then see if the UI is working.
Thanks,
khopkins
96 Posts
September 23, 2020, 2:10 pm
Got the bad drive in and spun up, and the GUI shows the status of OSD 9 as down. Do I need to delete the disk and add it back? If I do delete the disk, will the GUI change to let me add it back?
Deleted the disk and it was removed from the GUI. The replacement disk does not show up to add; how do we add the disk?
The new drive is not showing up in the GUI.
khopkins
96 Posts
September 23, 2020, 4:39 pm
Looks like some things failed when prepping the disk.
23/09/2020 10:19:53 ERROR sync_replication_node called on non-backup node
23/09/2020 10:19:52 INFO Update roles.
23/09/2020 10:19:52 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
23/09/2020 10:19:52 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
23/09/2020 10:19:52 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
23/09/2020 09:35:21 ERROR ceph-volume.zap failed
23/09/2020 09:35:18 INFO Starting ceph-volume --log-path /opt/petasan/log lvm zap /dev/not_set --destroy
23/09/2020 09:35:18 INFO Start cleaning disk not_set
23/09/2020 09:35:18 INFO osd.9 is deleted from crush map
23/09/2020 09:35:17 INFO Start delete osd.9 from crush map
23/09/2020 09:35:17 WARNING osd.9 is removed from crush map
23/09/2020 09:35:17 WARNING The osd still up you need to stop osd service of osd.9
23/09/2020 09:35:16 ERROR Error executing ceph auth del osd.9
23/09/2020 09:35:15 ERROR Error executing ceph osd crush remove osd.9
23/09/2020 09:35:13 ERROR Error executing ceph osd out osd.9
23/09/2020 09:35:12 INFO Start remove osd.9 from crush map
23/09/2020 09:35:08 INFO Start delete osd job 1349670
23/09/2020 09:35:08 INFO Start delete job for osd.9
23/09/2020 09:35:08 INFO -id 9 -disk_name not_set
23/09/2020 09:35:08 INFO params
23/09/2020 09:35:08 INFO /opt/petasan/scripts/admin/node_manage_disks.py delete-osd
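From the log, the delete job ran while osd.9 was still up and with the disk name recorded as not_set, so the ceph osd out, ceph osd crush remove, and ceph auth del steps errored out. A rough manual cleanup sketch, using standard Ceph commands and assuming the usual ceph-osd@<id> systemd service name (this is generic Ceph CLI usage, not PetaSAN's own tooling):

# stop the OSD daemon first, since the log warns that osd.9 is still up
systemctl stop ceph-osd@9
# mark the OSD out, remove it from the CRUSH map, delete its auth key, then remove the id
ceph osd out osd.9
ceph osd crush remove osd.9
ceph auth del osd.9
ceph osd rm osd.9
# confirm osd.9 is gone
ceph osd tree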
admin
2,930 Posts
September 23, 2020, 9:15 pm
You can delete the bad OSD either before it is physically removed or after.
If you do not see the new drive listed in the UI, there could be an issue with the hardware. Check that it is listed in /sys/block and as /dev/sdX; if not, look at the kernel logs via dmesg, as there could be a problem with the drive.
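As a sketch of that check (the device name sdc is only an assumption based on the earlier posts; adjust it to your system):

# list the block devices the kernel currently sees
ls /sys/block
lsblk
# check whether the expected device node exists
ls -l /dev/sdc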
khopkins
96 Posts
September 24, 2020, 3:14 pm
Replaced the drive with a new one, same results: no disk showing up in the GUI to add.
root@PS-Node1:/sys/block# ls
dm-0 dm-2 dm-4 rbd0 sda sdd sdf sr0
dm-1 dm-3 dm-5 rbd1 sdb sde sdg
No sdc
/dev/sdc did not show up
/var/lib/ceph/osd# ls
ceph-10 ceph-11 ceph-6 ceph-7 ceph-8 ceph-9
Loads of these in dmesg
[31534765.299824] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534766.619981] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534784.503301] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534784.643601] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534784.799517] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534784.965121] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534785.127645] Buffer I/O error on dev dm-5, logical block 976486384, async page read
[31534787.274066] Buffer I/O error on dev dm-5, logical block 976486384, async page read
Just a thought: do these drives need to be initialized? These are disks straight out of the package.
If this one node were updated to 2.6, would it have any issues, given that the other nodes are on 2.3.1? Just until node 1 is good, then we'll update the others.
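If a hot-swapped disk is not picked up automatically, forcing a SCSI bus rescan sometimes makes it appear without a reboot. A minimal sketch, assuming the controller exposes the usual /sys/class/scsi_host entries (host numbers vary per machine); this is a general Linux technique, not something the thread confirms for this particular hardware:

# ask every SCSI host adapter to rescan its bus for newly inserted devices
for scan in /sys/class/scsi_host/host*/scan; do
    echo "- - -" > "$scan"
done
# then check whether the new disk showed up
ls /sys/block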
admin
2,930 Posts
September 24, 2020, 6:26 pm
It depends on the time of the error; run dmesg -T to make sure.
You do not need to do anything for the new drives to show up, but they are not detected in /sys/block, which could be a hardware issue.
You can upgrade if the health status of the cluster is OK.
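A hedged example of that timestamp check; the grep pattern is only an illustration:

# dmesg -T prints wall-clock timestamps, so you can tell whether the
# Buffer I/O errors are recent or left over from the old failed disk
dmesg -T | grep -i 'i/o error' | tail -n 20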
khopkins
96 Posts
September 24, 2020, 6:56 pm
Really appreciate the responses; we'll look more into it being a backplane issue. Thanks.