ESXi Server Freeze.
msalem
87 Posts
February 7, 2019, 8:03 am
Hello Admin,
The network issue we had was that the storage switch dropped, which caused the PetaSAN node to shut down; a few other users here reported the same issue. It seems to be more of an Ubuntu issue than a PetaSAN one.
So let's recap here to get things across:
1 - We have some OSDs dropping randomly:
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 09:00:00.000121 mon.srocceph1 mon.0 10.228.72.101:6789/0 285839 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 10:00:00.048938 mon.srocceph1 mon.0 10.228.72.101:6789/0 286030 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 11:00:00.000144 mon.srocceph1 mon.0 10.228.72.101:6789/0 286228 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 12:00:00.000136 mon.srocceph1 mon.0 10.228.72.101:6789/0 286423 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 13:00:00.000126 mon.srocceph1 mon.0 10.228.72.101:6789/0 286612 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 14:00:00.000152 mon.srocceph1 mon.0 10.228.72.101:6789/0 286806 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 15:00:00.000170 mon.srocceph1 mon.0 10.228.72.101:6789/0 287000 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 16:00:00.000127 mon.srocceph1 mon.0 10.228.72.101:6789/0 287214 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 17:00:00.000078 mon.srocceph1 mon.0 10.228.72.101:6789/0 287406 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 18:00:00.000125 mon.srocceph1 mon.0 10.228.72.101:6789/0 287605 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 19:00:00.000150 mon.srocceph1 mon.0 10.228.72.101:6789/0 287800 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 20:00:00.000145 mon.srocceph1 mon.0 10.228.72.101:6789/0 287996 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 21:00:00.000179 mon.srocceph1 mon.0 10.228.72.101:6789/0 288193 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 22:00:00.000143 mon.srocceph1 mon.0 10.228.72.101:6789/0 288391 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-16 23:00:00.000097 mon.srocceph1 mon.0 10.228.72.101:6789/0 288575 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-17 00:00:00.000132 mon.srocceph1 mon.0 10.228.72.101:6789/0 288765 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-17 01:00:00.000161 mon.srocceph1 mon.0 10.228.72.101:6789/0 288955 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-17 02:00:00.000146 mon.srocceph1 mon.0 10.228.72.101:6789/0 289141 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2019-01-17 03:00:00.000166 mon.srocceph1 mon.0 10.228.72.101:6789/0 289335 : cluster [ERR] overall HEALTH_ERR Reduced data availability: 16 pgs inactive, 16
pgs incomplete; 3 stuck requests are blocked > 4096 sec. Implicated osds 4,43
2 - Now we have a few LUNs stuck, and we cannot delete them from PetaSAN.
3 - The iSCSI service is stuck.
4 - "16 pgs inactive, 16 pgs incomplete" - this error is showing in the GUI, and we need a way to fix it.
So I would assume the steps would be:
1 - In the CLI, I need the commands to delete the bad LUNs.
2 - Fix this Ceph issue and bring it back to a clean state.
3 - Reproduce the issue again to identify the problem and fix it.
Thanks Admin
msalem
87 Posts
February 7, 2019, 8:06 am
Hello,
This is the new error now - from the GUI.
Reduced data availability: 16 pgs inactive, 16 pgs incomplete
10 stuck requests are blocked > 4096 sec. Implicated osds 26
admin
2,930 Posts
February 7, 2019, 10:21 am
The network issues I was referring to are:
http://www.petasan.org/forums/?view=thread&id=395&part=2#postid-2350
Make sure you have solved these. Also, you may want to use bonded NICs with 2 switches.
As stated, your current issue is that your Ceph is down. It is not a Consul or iSCSI service issue. I am not sure what you mean by deleting bad LUNs, but you cannot do anything at the iSCSI layer if your Ceph storage is down. Once Ceph is up, your iSCSI will work again and you will be able to add or delete LUNs.
For OSDs dropping randomly: you need to look at the OSD log for a failed OSD in:
/var/log/ceph
It should give you detail on why they fail; if not, increase the log level in the conf file to above 5.
The following link will help you; look at the flapping OSD section:
http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-osd/
For the 16 stuck PGs:
you need to use CLI commands as in:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
If you need us for professional support, do not hesitate to contact us.
Good luck.
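For reference, a minimal sketch of the kind of CLI checks the linked troubleshooting guides walk through; osd.4 is taken from the "Implicated osds 4,43" messages above, while the PG id in the last command is only a placeholder:
# Overall health and which PGs/OSDs are implicated
ceph health detail
ceph osd tree
# Why did a specific OSD drop? Check its log on the node hosting it
tail -n 200 /var/log/ceph/ceph-osd.4.log
# Temporarily raise the OSD debug level without editing ceph.conf
ceph tell osd.4 injectargs '--debug-osd 5/5'
# List the stuck placement groups and query one of them for detail
ceph pg dump_stuck inactive
ceph pg 1.2f query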
msalem
87 Posts
February 7, 2019, 5:27 pm
I wanted to delete the pools, and they were deleted; however, the iSCSI disk is still present and does not want to delete.
07/02/2019 12:24:47 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:24:47 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:24:47 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:24:47 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:24:47 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:24:58 INFO Stopping disk 00005
07/02/2019 12:24:58 ERROR Could not find ips for image-00005
07/02/2019 12:24:58 ERROR LIO error deleting Target None, maybe the iqn is not exists.
07/02/2019 12:24:58 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 INFO Found pool:rbd for disk:00005 via consul
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
07/02/2019 12:25:00 ERROR Cannot unmap image image-00005. error, no mapped images found.
I know that iSCSI is a layer above Ceph, and to start clean I need to delete these iSCSI images.
Any idea how to start from there?
Thanks
admin
2,930 Posts
February 7, 2019, 5:48 pm
Since Ceph was down, I presume you deleted the pools via the CLI, not via the UI; in this case you need to remove the iSCSI disks from Consul via the CLI. Since you want to start clean, I recommend you re-install clean; else I will post the command to remove them from Consul.
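For reference, a minimal sketch of what removing the leftover disk entries from Consul could look like with the stock Consul KV CLI; the key prefix in the delete command is only an assumed example, not confirmed as the path PetaSAN actually uses, so inspect the tree first:
# Dump the whole key/value tree and locate the entries for the stuck disk
consul kv get -recurse
# Remove the matching keys (the prefix below is a placeholder; adjust to what the dump shows)
consul kv delete -recurse PetaSAN/Disks/00005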
msalem
87 Posts
February 7, 2019, 6:07 pm
Hello Admin,
I did not touch Ceph from the CLI; this is the whole point of PetaSAN 🙂 - so level-one and level-two support can manage it.
Since we have 6 nodes and it is really time-consuming to wipe the disks and re-install, I would rather have the commands to just delete the LUN images, create a new pool, and set up the cluster.
Thanks
admin
2,930 Posts
February 7, 2019, 7:18 pm
You do not need to wipe the disks. Just select them to be included during the deployment and they will be wiped automatically; if you do not select them, we do not wipe them.
It will probably be quicker too. All the iSCSI info, such as IPs and LIO target information, is stored in Ceph as image metadata, so if Ceph is down we cannot know these resources to automatically clean them, and it has to be done manually.
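As an illustration of the "stored in Ceph as image metadata" point, once Ceph is healthy again the stock rbd tooling should let you inspect and clean these by hand; the pool (rbd) and image name (image-00005) below are taken from the log earlier in the thread:
# List the disk images in the pool
rbd ls rbd
# Show the metadata attached to an image (iSCSI IPs, LIO target info, etc.)
rbd image-meta list rbd/image-00005
# Make sure the image is not mapped anywhere, then remove it
rbd showmapped
rbd rm rbd/image-00005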
msalem
87 Posts
February 11, 2019, 10:43 am
Hello Admin,
I managed to fix the Ceph issue. It seems that some images were created and could not be deleted afterwards; I have created a new pool and everything looks good now.
One thing I noticed - VM storage migration is very slow and it fails. I have created a few VMs on the LUNs and kept them busy by creating random files, to test the load.
But anything VMware-related - cloning, VM migrations - is crazy slow.
Any suggestions?
admin
2,930 Posts
February 11, 2019, 11:05 am
Did you re-install, or did you clean all the resources mentioned before?
Last edited on February 11, 2019, 11:06 am by admin · #19
msalem
87 Posts
February 11, 2019, 1:50 pm
I cleaned all the resources mentioned before.
It's clean now; I can create and delete LUNs on the fly, and it's stable so far.
But the performance is really concerning.