Disks unresponsive, slow requests are blocked
roy.lei
2 Posts
December 9, 2019, 3:21 am
We have 5 PetaSAN OSD VMs running on 5 Hyper-V hosts. The OSD VMs are not in the Hyper-V failover cluster. Each Hyper-V host has 12 spinning disks passed through to its PetaSAN OSD VM.
On 02/12/19 at around 09:40, Hyper-V host node 3 froze, and the 12 pass-through spinning disks stopped responding to the OSD VM's Ceph OSD processes (OSD 24-35).
In the cluster logs, slow requests were blocked, but the implicated OSDs were not limited to 24-35.
2019-12-02 09:42:37.216878 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 17500 : cluster [WRN] Health check failed: 8 slow requests are blocked > 32 sec. Implicated osds 2,9 (REQUEST_SLOW)
2019-12-02 09:42:42.623838 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 17501 : cluster [WRN] Health check update: 28 slow requests are blocked > 32 sec. Implicated osds 2,4,9,16,31,37,51 (REQUEST_SLOW)
2019-12-02 09:42:47.624156 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 17502 : cluster [WRN] Health check update: 82 slow requests are blocked > 32 sec. Implicated osds 1,2,4,7,9,11,16,22,31,37,43,49,50,51,52,53,56 (REQUEST_SLOW)
2019-12-02 09:42:46.553401 osd.9 osd.9 10.0.4.221:6806/2644 328 : cluster [WRN] 3 slow requests, 3 included below; oldest blocked for > 30.891256 secs
2019-12-02 09:42:46.553410 osd.9 osd.9 10.0.4.221:6806/2644 329 : cluster [WRN] slow request 30.891256 seconds old, received at 2019-12-02 09:42:15.662057: osd_op(client.10369681.0:776104831 3.77 3:ee191869:::rbd_data.9ca786b8b4567.00000000000b084a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1720320~4096] snapc 0=[] ondisk+write+known_if_redirected e9499) currently sub_op_commit_rec from 38
2019-12-02 09:42:46.553414 osd.9 osd.9 10.0.4.221:6806/2644 330 : cluster [WRN] slow request 30.891154 seconds old, received at 2019-12-02 09:42:15.662158: osd_op(client.10369681.0:776104832 3.77 3:ee191869:::rbd_data.9ca786b8b4567.00000000000b084a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1748992~4096] snapc 0=[] ondisk+write+known_if_redirected e9499) currently op_applied
2019-12-02 09:42:46.553419 osd.9 osd.9 10.0.4.221:6806/2644 331 : cluster [WRN] slow request 30.891098 seconds old, received at 2019-12-02 09:42:15.662214: osd_op(client.10369681.0:776104833 3.77 3:ee191869:::rbd_data.9ca786b8b4567.00000000000b084a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 2523136~4096] snapc 0=[] ondisk+write+known_if_redirected e9499) currently op_applied
At 09:43, OSD 24-35 were marked down.
2019-12-02 09:43:12.625916 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9499 we have enough reporters to mark osd.31 down
2019-12-02 09:43:25.358119 7f21dcb44700 1 mon.cmps-ps1-mgmt@0(leader).osd e9501 we have enough reporters to mark osd.33 down
2019-12-02 09:43:27.783766 7f21dcb44700 1 mon.cmps-ps1-mgmt@0(leader).osd e9503 we have enough reporters to mark osd.32 down
2019-12-02 09:43:28.258796 7f21dcb44700 1 mon.cmps-ps1-mgmt@0(leader).osd e9504 we have enough reporters to mark osd.35 down
2019-12-02 09:43:32.632478 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9506 we have enough reporters to mark osd.30 down
2019-12-02 09:43:42.638068 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9508 we have enough reporters to mark osd.34 down
2019-12-02 09:43:47.643609 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9510 we have enough reporters to mark osd.24 down
2019-12-02 09:43:47.643793 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9510 we have enough reporters to mark osd.25 down
2019-12-02 09:43:47.643949 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9510 we have enough reporters to mark osd.28 down
2019-12-02 09:43:47.644122 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9510 we have enough reporters to mark osd.29 down
2019-12-02 09:44:02.652213 7f21df349700 1 mon.cmps-ps1-mgmt@0(leader).osd e9512 we have enough reporters to mark osd.27 down
2019-12-02 09:44:24.566864 7f21dcb44700 1 mon.cmps-ps1-mgmt@0(leader).osd e9514 we have enough reporters to mark osd.26 down
And at 09:53, OSD 24-35 were marked out.
2019-12-02 09:53:17.724359 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18053 : cluster [INF] Marking osd.31 out (has been down for 605 seconds)
2019-12-02 09:53:27.733868 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18071 : cluster [INF] Marking osd.33 out (has been down for 601 seconds)
2019-12-02 09:53:32.740410 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18078 : cluster [INF] Marking osd.32 out (has been down for 604 seconds)
2019-12-02 09:53:32.740455 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18079 : cluster [INF] Marking osd.35 out (has been down for 603 seconds)
2019-12-02 09:53:37.743947 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18088 : cluster [INF] Marking osd.30 out (has been down for 605 seconds)
2019-12-02 09:53:47.748388 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18100 : cluster [INF] Marking osd.34 out (has been down for 605 seconds)
2019-12-02 09:53:52.752446 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18109 : cluster [INF] Marking osd.24 out (has been down for 605 seconds)
2019-12-02 09:53:52.752468 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18110 : cluster [INF] Marking osd.25 out (has been down for 605 seconds)
2019-12-02 09:53:52.752486 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18111 : cluster [INF] Marking osd.28 out (has been down for 605 seconds)
2019-12-02 09:53:52.752505 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18112 : cluster [INF] Marking osd.29 out (has been down for 605 seconds)
2019-12-02 09:54:07.759680 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18129 : cluster [INF] Marking osd.27 out (has been down for 605 seconds)
2019-12-02 09:54:27.764938 mon.cmps-ps1-mgmt mon.0 10.0.4.211:6789/0 18144 : cluster [INF] Marking osd.26 out (has been down for 602 seconds)
The strange thing was that OSD 36-47 were marked down and then recovered within 2 seconds, between 09:55:22 and 09:55:24. These OSDs are on Hyper-V host node 4.
2019-12-02 09:55:22.912324 osd.37 osd.37 10.0.4.224:6804/4191733 189 : cluster [WRN] Monitor daemon marked osd.37 down, but it is still running
2019-12-02 09:55:22.991172 osd.47 osd.47 10.0.4.224:6800/4191693 801 : cluster [WRN] Monitor daemon marked osd.47 down, but it is still running
2019-12-02 09:55:22.878230 osd.36 osd.36 10.0.4.224:6806/4191779 162 : cluster [WRN] Monitor daemon marked osd.36 down, but it is still running
2019-12-02 09:55:22.924994 osd.46 osd.46 10.0.4.224:6820/4191655 174 : cluster [WRN] Monitor daemon marked osd.46 down, but it is still running
2019-12-02 09:55:22.976335 osd.42 osd.42 10.0.4.224:6818/4191844 153 : cluster [WRN] Monitor daemon marked osd.42 down, but it is still running
2019-12-02 09:55:22.919094 osd.39 osd.39 10.0.4.224:6822/4191895 206 : cluster [WRN] Monitor daemon marked osd.39 down, but it is still running
2019-12-02 09:55:22.891007 osd.45 osd.45 10.0.4.224:6815/4191845 210 : cluster [WRN] Monitor daemon marked osd.45 down, but it is still running
2019-12-02 09:55:22.886398 osd.38 osd.38 10.0.4.224:6802/4191718 157 : cluster [WRN] Monitor daemon marked osd.38 down, but it is still running
2019-12-02 09:55:22.940353 osd.44 osd.44 10.0.4.224:6812/4191784 190 : cluster [WRN] Monitor daemon marked osd.44 down, but it is still running
2019-12-02 09:55:22.903140 osd.41 osd.41 10.0.4.224:6810/4191770 187 : cluster [WRN] Monitor daemon marked osd.41 down, but it is still running
2019-12-02 09:55:22.949890 osd.40 osd.40 10.0.4.224:6814/4191837 237 : cluster [WRN] Monitor daemon marked osd.40 down, but it is still running
2019-12-02 09:55:23.014003 osd.43 osd.43 10.0.4.224:6807/4191773 179 : cluster [WRN] Monitor daemon marked osd.43 down, but it is still running
2019-12-02 09:55:23.854427 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.37 10.0.4.224:6804/4191733 boot
2019-12-02 09:55:23.854601 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.40 10.0.4.224:6814/4191837 boot
2019-12-02 09:55:23.854702 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.45 10.0.4.224:6815/4191845 boot
2019-12-02 09:55:23.855050 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.39 10.0.4.224:6822/4191895 boot
2019-12-02 09:55:23.855169 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.46 10.0.4.224:6820/4191655 boot
2019-12-02 09:55:23.855316 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.38 10.0.4.224:6802/4191718 boot
2019-12-02 09:55:24.884610 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.42 10.0.4.224:6818/4191844 boot
2019-12-02 09:55:24.884982 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.44 10.0.4.224:6812/4191784 boot
2019-12-02 09:55:24.889932 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.41 10.0.4.224:6810/4191770 boot
2019-12-02 09:55:24.889996 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.36 10.0.4.224:6806/4191779 boot
2019-12-02 09:55:24.890049 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.43 10.0.4.224:6807/4191773 boot
2019-12-02 09:55:24.890142 7f21d8b3c700 0 log_channel(cluster) log [INF] : osd.47 10.0.4.224:6800/4191693 boot
At around 10:00, we found that Hyper-V host node 3 had crashed with a blue screen, and we rebooted the node.
After the reboot, OSD 24-35 rejoined the cluster with no issues and started backfilling.
Hyper-V does not seem to tolerate I/O delays/blocking over iSCSI well. As a result, the whole Hyper-V cluster went down, the iSCSI disks disappeared, and the VMs shut down unexpectedly.
Can you please help answer the following questions? Let me know if you need more logs.
1. Do you know why slow requests were blocked? In particular, some of the implicated OSDs did not seem to have any problems.
2. How can we minimize slow requests being blocked?
3. Should the slow requests stop being blocked once the non-responsive OSDs are marked down?
4. Do you know why OSD 36-47 were marked down for 2 seconds?
admin
2,930 Posts
December 9, 2019, 8:04 am
We do not support running PetaSAN as VMs.
Things to look into:
What is the %busy/utilization of the spinning disks?
Do you have enough RAM (4 GB per OSD)? Is the memory utilization good?
Are there any I/O queue depth limits imposed on the VMs? (A quick way to check these is sketched below.)
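For example, inside each OSD VM you could sanity-check these roughly as follows (just a sketch; it assumes the sysstat package is installed, and sdb is only an example device name):

# Per-disk %util, await and queue sizes, sampled a few times
iostat -x 2 5

# Free memory vs. the ~4 GB per OSD rule of thumb
free -g

# Block-layer queue depth the VM actually sees for a passed-through disk
cat /sys/block/sdb/queue/nr_requests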
Generally, if the system is resource limited this can happen: OSD heartbeats may not be acknowledged in time, which shows up as up/down flapping.
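If you suspect heartbeat issues, you could also look at one of the flapping OSDs directly. A rough sketch (osd.36 and the grace value are only examples, not recommendations):

# Heartbeat-related settings as seen by a flapping OSD (run on its node)
ceph daemon osd.36 config show | grep heartbeat

# Recent slow ops recorded by that OSD
ceph daemon osd.36 dump_historic_ops

# Temporarily relax the heartbeat grace on all OSDs while investigating
ceph tell osd.* injectargs '--osd_heartbeat_grace 30'

# During planned host maintenance, stop down OSDs from being marked out
ceph osd set noout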
Good luck.
Last edited on December 9, 2019, 8:05 am by admin · #2