OSDs stay up, can't fail them
admin
2,930 Posts
April 18, 2018, 11:37 am
Thanks for testing this. We need more time to look into this issue.
Just to clarify: with PetaSAN v 1.4 (kernel 4.4.38) and with kernel 4.4.120-1, the crash dump did not occur? If the OSD is not detected as down and you start writing (for example, using the benchmark tab), does this trigger the OSD to be detected as down? Does the benchmark (or any IO) complete successfully?
protocol6v
85 Posts
April 18, 2018, 11:45 am
I'm going to have to retest with a fully configured cluster on 1.4. The issue doesn't seem to occur if Ceph is not fully deployed, so I can't yet confirm whether it is present with 1.4: my earlier test was just installing 1.4 on one node and pulling a disk to see whether the dmesg output appeared. I'll let you know.
The 4.4.120-1 kernel definitely did not produce the dump, but the OSD is not automatically marked down. If I run a benchmark, it does complete, and the OSD is then marked down.
Is this expected behavior? If there is no activity, will it not detect the failed OSD?
admin
2,930 Posts
April 18, 2018, 12:13 pm
Thanks for the clarification. I now see no need to retest with v 1.4. Kernel 4.4.120-1 does seem to fix it and is the latest SUSE kernel, so we will update to it instead of going back. I will look into why the OSD is not detected as down right away when there is no IO, but this is almost harmless, as any IO will trigger the detection and Ceph will correct itself.
If you want to deploy PetaSAN now, I would recommend performing the installation and then upgrading to the 4.4.120 kernel as you did. We tested all 3 kernels we sent you, so it is safe to use. I can probably get you an ISO with this kernel built in if that would help with a large installation.
Last edited on April 18, 2018, 12:14 pm by admin · #23
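For anyone reproducing the disk-pull test above, one way to see exactly when the cluster marks the OSD down is to poll the OSD tree before and after starting IO. The following is only a minimal sketch, not part of PetaSAN: it assumes the `ceph` CLI is available on the node with admin credentials, that `ceph osd tree --format json` returns a `nodes` list whose OSD entries carry `name`, `type` and `status` fields, and the helper names `down_osds` and `fetch_osd_tree` are made up for illustration.

```python
import json
import subprocess
from typing import List


def down_osds(tree_json: str) -> List[str]:
    """Return the names of OSDs whose status is not 'up'.

    Expects the JSON text printed by `ceph osd tree --format json`
    (assumed layout: a top-level "nodes" list of dicts).
    """
    tree = json.loads(tree_json)
    return sorted(
        node["name"]
        for node in tree.get("nodes", [])
        # Skip CRUSH buckets (root, host, ...); only OSD entries have a status.
        if node.get("type") == "osd" and node.get("status") != "up"
    )


def fetch_osd_tree() -> str:
    """Run the ceph CLI and return its JSON output as text."""
    result = subprocess.run(
        ["ceph", "osd", "tree", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Run once after pulling the disk, then again after starting a benchmark.
    print("OSDs not up:", down_osds(fetch_osd_tree()))
```

Running it once right after pulling the disk and again after kicking off a benchmark should show whether the OSD only flips to down once IO reaches it, which is the behavior described in this thread.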
protocol6v
85 Posts
April 18, 2018, 12:26 pm
I only have four nodes at this point, so I won't waste your time on an ISO.
What's the release schedule looking like for the next update?
I'm probably going to hammer on this for another week or two before putting any production data on it. I've still got a lot of Ceph studying to do before I'm comfortable trusting Ceph and myself.
Could you also please send me info on a support contract?
Thanks for all your time and help, much appreciated.
admin
2,930 Posts
April 18, 2018, 12:41 pm
Our next release, v 2.1, is due at the end of June; it will include custom pool and CRUSH map support. If any major bugs turn up (none so far), we will release a 2.0.X update.
I will email you our support options.