
PetaSAN 3.3.0 Released!


Happy to announce our newest release, version 3.3.0!

New Features:
  • Centrally managed performance tuning profiles.
  • OSD Encryption using Linux dm-crypt.
  • Custom OSD Classes support from UI.
  • Creation of iSCSI disks with fast-diff for faster replication performance.
  • Deploy Wizard now uses HTTPS (same port 5001).
  • Single main install partition.
  • General bug fixes.
  • Ceph 17.2.7 (latest Quincy), Kernel based on SUSE SLE15 SP5.

For online upgrades, see the latest Online Upgrade Guide.

Great!
Thank you... starting download/install/upgrade 🙂

 

Hello admin,

Thanks for the new release; we are planning the upgrade in the coming weeks. Before we upgrade, we have a question on fast-diff:

What will be the impact on existing iSCSI disks related to the 'Creation of iSCSI disks with fast-diff for faster replication performance'?

Will it be enabled on existing iSCSI disks?

kr, Rbn

 

 

It can be set for new disks upon creation. It will make replication faster but client I/O slower.
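For reference, fast-diff is a standard RBD image feature (it depends on object-map and exclusive-lock), so you can inspect any existing backing image at the Ceph level. This is only a rough sketch: the pool/image name below is a placeholder, and enabling features outside the PetaSAN UI on existing iSCSI disks is not something confirmed here:

rbd info rbd/image-00001 | grep features
# expect something like: features: layering, exclusive-lock, object-map, fast-diff
rbd feature enable rbd/image-00001 object-map fast-diff   # Ceph-level only; requires exclusive-lock; placeholder image name
rbd object-map rebuild rbd/image-00001                    # rebuild the object map so fast-diff data is consistent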

Great news!

We're being affected by the following bug: https://patchwork.kernel.org/project/target-devel/patch/20230319015620.96006-10-michael.christie@oracle.com/

It causes ESXi to eventually lose all paths to an iSCSI target; the only way to recover is to reboot all the PetaSAN nodes that were serving the paths (as it's deadlocked in the kernel, moving the path does not work).

It's a gradual thing: we may lose a path every few weeks, eventually losing all 8. It seems to occur when the nodes are under load, e.g. during large backup jobs or backfilling operations, which isn't ideal because if the load persists we can lose all 8 paths overnight.

Currently we're on 3.2.1. Are you able to tell whether the kernel version in 3.3.0 includes this patch? (I can't see a changelog for the SUSE kernel, and I know the PetaSAN one is slightly different too.)

No, we do not include this patch in 3.3. We can try to include it in 4.0, but cannot commit to that as we are almost done with testing.

ESXi is not very tolerant of high latency; this could happen if you use HDDs and your disks reach near 100% busy (see your dashboard charts). ESXi will abort the path if it does not respond in time. Other clients like Windows and Linux are more tolerant. In this case the best solution is to tune the iSCSI settings to less aggressive values, such as lowering queue_depth and MaxOutstandingR2T.

Prior to version 3.3, you would need to edit the following file on all nodes:

/opt/petasan/config/tuning/current/lio_tunings

In version 3.3 you can centrally manage this from the UI via performance profiles.

Example profile:

{
  "storage_objects": [
    {
      "attributes": {
        "block_size": "512",
        "emulate_3pc": "1",
        "emulate_caw": "1",
        "queue_depth": "16"
      }
    }
  ],
  "targets": [
    {
      "tpgs": [
        {
          "attributes": {
            "default_cmdsn_depth": "64"
          },
          "parameters": {
            "DefaultTime2Retain": "20",
            "DefaultTime2Wait": "2",
            "FirstBurstLength": "1048576",
            "ImmediateData": "Yes",
            "InitialR2T": "No",
            "MaxBurstLength": "1048576",
            "MaxOutstandingR2T": "4",
            "MaxRecvDataSegmentLength": "1048576",
            "MaxXmitDataSegmentLength": "1048576"
          }
        }
      ]
    }
  ],
  "krbd": [
    {
      "osd_request_timeout": "40",
      "queue_depth": "16"
    }
  ]
}
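On nodes still running a pre-3.3 release, a quick way to see which of these values are currently in effect is to grep the tunings file directly (this assumes the same JSON layout as the profile above), for example:

grep -E '"queue_depth"|"MaxOutstandingR2T"' /opt/petasan/config/tuning/current/lio_tunings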

Okay, thanks. I was hoping that, since this patch had been merged into the kernel, it might have made its way naturally into a newer kernel version in 3.3. Nonetheless, hopefully you can manage to include it in 4.0; from the version numbers it looks like that would be the 6.4 kernel in SUSE SLE15 SP6, but it's hard to tell.

It's all SSDs we use, with NVMe cache, and you're right that it's only ESXi that seems affected; our Windows iSCSI connections for a SQL Server cluster never exhibit the issue.

Thanks for the information on the performance tuning. I'll go ahead and schedule the upgrade to 3.3 anyway, as the centrally managed tunings look like they'll make this easier.

Thanks again!

To apply the changed tuning parameters, your paths need to be moved/re-assigned, or the disk stopped and started.

We will review the patch; it is not clear it will directly affect the case you have.

Thanks. I've verified that this is the cause; the paths disappearing from ESXi correspond with the following messages on the PetaSAN nodes:

root@gl-san-02a:~# dmesg -e | grep "Unable to recover"
[ +1.283547] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:gl-vmh-02b-07ffc7ce,i,0x00023d000002,iqn.2016-05.com.petasan:00020,t,0x02
[ +0.002984] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:gl-vmh-02b-07ffc7ce,i,0x00023d000002,iqn.2016-05.com.petasan:00020,t,0x02
[ +2.047011] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:gl-vmh-02b-07ffc7ce,i,0x00023d000006,iqn.2016-05.com.petasan:00028,t,0x06

Then we also see:

[Oct14 01:21] INFO: task iscsi_np:1183052 blocked for more than 983 seconds.
[ +0.001905] Tainted: G E N 5.14.21-04-petasan #1
[ +0.001862] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.001876] task:iscsi_np state:D stack: 0 pid:1183052 ppid: 2 flags:0x00004004
[ +0.000008] Call Trace:
[ +0.000002] <TASK>
[ +0.000005] __schedule+0xa62/0x1270
[ +0.000011] ? pcpu_alloc_area+0x1d8/0x2e0
[ +0.000008] schedule+0x66/0xf0
[ +0.000005] schedule_timeout+0x20d/0x2a0
[ +0.000007] wait_for_completion+0x89/0xf0
[ +0.000008] iscsi_check_for_session_reinstatement+0x1e5/0x280 [iscsi_target_mod f68e86108ee7517458ee2e85cf0791e8ef328787]
[ +0.000033] iscsi_target_do_login+0x1dd/0x540 [iscsi_target_mod f68e86108ee7517458ee2e85cf0791e8ef328787]
[ +0.000022] iscsi_target_start_negotiation+0x52/0xd0 [iscsi_target_mod f68e86108ee7517458ee2e85cf0791e8ef328787]
[ +0.000022] iscsi_target_login_thread+0x840/0xe70 [iscsi_target_mod f68e86108ee7517458ee2e85cf0791e8ef328787]
[ +0.000021] ? iscsi_target_login_sess_out+0x150/0x150 [iscsi_target_mod f68e86108ee7517458ee2e85cf0791e8ef328787]
[ +0.000021] ? iscsi_target_login_sess_out+0x150/0x150 [iscsi_target_mod f68e86108ee7517458ee2e85cf0791e8ef328787]
[ +0.000020] kthread+0x158/0x190
[ +0.000007] ? set_kthread_struct+0x50/0x50
[ +0.000005] ret_from_fork+0x1f/0x30
[ +0.000007] </TASK>
[ +0.000060] INFO: task kworker/44:1:476496 blocked for more than 983 seconds.
[ +0.001896] Tainted: G E N 5.14.21-04-petasan #1
[ +0.001889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.001911] task:kworker/44:1 state:D stack: 0 pid:476496 ppid: 2 flags:0x00004000
[ +0.000006] Workqueue: events target_tmr_work [target_core_mod]
[ +0.000038] Call Trace:
[ +0.000002] <TASK>
[ +0.000009] __schedule+0xa62/0x1270
[ +0.000005] ? ttwu_do_wakeup+0x17/0x180
[ +0.000006] ? try_to_wake_up+0x22a/0x590
[ +0.000006] schedule+0x66/0xf0
[ +0.000004] schedule_timeout+0x20d/0x2a0
[ +0.000005] ? sysvec_apic_timer_interrupt+0xb/0x90
[ +0.000007] ? asm_sysvec_apic_timer_interrupt+0x4d/0x60
[ +0.000005] wait_for_completion+0x89/0xf0
[ +0.000005] target_put_cmd_and_wait+0x55/0x80 [target_core_mod 3892d3b4c218b9d9e4a5e2c43d88d84ca75a22e2]
[ +0.000029] core_tmr_abort_task.cold+0x159/0x180 [target_core_mod 3892d3b4c218b9d9e4a5e2c43d88d84ca75a22e2]
[ +0.000032] target_tmr_work+0xa3/0xf0 [target_core_mod 3892d3b4c218b9d9e4a5e2c43d88d84ca75a22e2]
[ +0.000030] process_one_work+0x21a/0x3f0
[ +0.000006] worker_thread+0x4a/0x3c0
[ +0.000005] ? process_one_work+0x3f0/0x3f0
[ +0.000005] kthread+0x158/0x190
[ +0.000005] ? set_kthread_struct+0x50/0x50
[ +0.000005] ret_from_fork+0x1f/0x30
[ +0.000007] </TASK>

This in turn led me to the following discussion:

https://www.spinics.net/lists/target-devel/msg23373.html

The patch is actually part of a set of 9 fixes to the kernel iSCSI target that were submitted by Oracle.
