
PetaSAN freeze

Hello, I have 3 nodes in my PetaSAN cluster, and at random times one of them freezes and loses connectivity. I did a bit of diagnosing and found the errors below on each of them after a freeze. Can you please help me solve this problem? Below are the errors from syslog:

  • Node 3

Apr  3 04:25:01 NODEBKO-03 CRON[435822]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  3 04:35:01 NODEBKO-03 CRON[437623]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  3 04:42:32 NODEBKO-03 kernel: [148367.320951] tcp_sendpage() failure: 2444

  • Node 01

Apr  4 05:05:01 NODEBKO-01 CRON[905200]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  4 05:15:01 NODEBKO-01 CRON[909704]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  4 05:17:01 NODEBKO-01 CRON[910608]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr  4 05:25:01 NODEBKO-01 CRON[914114]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

  • Node 02

Apr  4 10:15:01 NODEBKO-02 CRON[703426]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  4 10:17:01 NODEBKO-02 CRON[703816]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr  4 10:25:01 NODEBKO-02 CRON[705347]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  4 10:25:11 NODEBKO-02 smartd[572]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 81
Apr  4 10:25:11 NODEBKO-02 smartd[572]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 123 to 124
Apr  4 10:25:26 NODEBKO-02 ntpd[1117]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
Apr  4 10:25:26 NODEBKO-02 ntpd[1117]:  4 Apr 10:25:26 ntpd[1117]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
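The smartd lines above are worth following up on, since failing disks can cause both freezes and scrub errors. A minimal sketch of a per-disk SMART check; the device name /dev/sdb is taken from the log above, and the flag_value helper is purely illustrative, not a PetaSAN tool:

```shell
#!/bin/sh
# Overall SMART health self-assessment (requires smartmontools).
# Guarded so the snippet is harmless on hosts without smartctl.
for dev in /dev/sdb; do
    if command -v smartctl >/dev/null; then
        smartctl -H "$dev"
        # Raw attribute table; watch Reallocated_Sector_Ct and temperatures.
        smartctl -A "$dev" | grep -Ei 'realloc|temp'
    fi
done

# Illustrative helper: flag a raw value above a threshold.
flag_value() {
    if [ "$1" -gt "$2" ]; then echo "WARN"; else echo "OK"; fi
}
# A raw Temperature_Celsius of 124 would be alarming taken literally,
# though on many drives that raw field encodes more than the temperature.
flag_value 124 60   # prints WARN
```

Note that the raw value of attribute 194 on some drives packs min/max readings into one number, so compare it against the normalized value before panicking.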

I would check the hardware / network for problems.

It could also be due to the hardware being overloaded / underpowered. Make sure you have enough RAM/CPU/disk resources, and look at the resource utilization charts to see if they go above 90%.

Network card: 4x Intel 1 Gbps

Motherboard: X11SSL-F Micro ATX

Processor: Xeon E3-1230 v5 - 3.4 GHz - 4 cores

Memory: 10 GB DDR3 UDIMM ECC - 1600 MHz (2x4 GB, 2 GB)

RAID card: ARECA ARC-1882IX-16 - 16-port SAS2 PCIe

On each node we have RAID 5 across 16 disks of 2 TB each. Resource utilization was very low on each node.

In my Zabbix monitoring I got this alert from all nodes: "Lack of free swap space on PETASAN3"
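For what it's worth, the Zabbix swap alert can be cross-checked directly on the node with standard Linux tools, nothing PetaSAN-specific:

```shell
#!/bin/sh
# RAM and swap usage in MiB; "Swap: 0 0 0" means no swap is configured.
free -m
# Active swap devices (prints nothing if none are configured).
swapon --show
# Top 5 memory consumers by resident set size.
ps -eo pid,rss,comm --sort=-rss | head -n 5
```

If swap is genuinely exhausted on a 10 GB node running Ceph daemons, that alone can explain periodic freezes.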

 

 

Quote from admin on April 4, 2019, 12:12 pm

I would check the hardware / network for problems.

It could also be due to the hardware being overloaded / underpowered. Make sure you have enough RAM/CPU/disk resources, and look at the resource utilization charts to see if they go above 90%.

 

The "tcp_sendpage() failure: 2444" looks alarming. Can you run dmesg? What network interface do you use?

10 GB RAM is low. How many OSDs do you have? Why are you using RAID 5?

What is the health status in the dashboard? Is it OK? In the PG Status chart, do you see all active/clean, or do you see any status changes going back to when the freeze occurred?

Do you see any errors in /opt/petasan/log/PetaSAN.log?
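One way to pull just the recent errors from that log (path from the question above; plain grep, nothing PetaSAN-specific):

```shell
#!/bin/sh
LOG=/opt/petasan/log/PetaSAN.log
# Last 20 error lines, if the log exists on this node.
if [ -f "$LOG" ]; then
    grep -i error "$LOG" | tail -n 20
fi
```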

 

Quote from admin on April 4, 2019, 12:51 pm

The "tcp_sendpage() failure: 2444" looks alarming. Can you run dmesg? What network interface do you use?

Ethernet controller: Intel Corporation I350 Gigabit Network Connection

 

root@NODEBKO-03:~# dmesg | grep Network
[    1.646588] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
[    1.676550] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[    1.738934] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
[    1.759969] e1000e 0000:04:00.0 eth1: Intel(R) PRO/1000 Network Connection
[    1.803160] igb 0000:03:00.1: Intel(R) Gigabit Ethernet Network Connection
[    1.866400] igb 0000:03:00.2: Intel(R) Gigabit Ethernet Network Connection
[    1.887277] e1000e 0000:05:00.0 eth4: Intel(R) PRO/1000 Network Connection
[    1.928496] igb 0000:03:00.3: Intel(R) Gigabit Ethernet Network Connection

 

10 GB RAM is low. How many OSDs do you have? Why are you using RAID 5?

I will increase it for sure. I have 3 OSDs. The servers came with a RAID 5 configuration and I did not change it. Should we not use RAID 5?

What is the health status in the dashboard? Is it OK? In the PG Status chart, do you see all active/clean, or do you see any status changes going back to when the freeze occurred?

root@NODEBKO-03:~# ceph --cluster ClusterCCTVBKO -s
  cluster:
    id:     ed96e77e-1ff8-4e6a-aa02-3f5caed963a8
    health: HEALTH_ERR
            17 scrub errors
            Possible data damage: 17 pgs inconsistent

  services:
    mon: 3 daemons, quorum NODEBKO-01,NODEBKO-02,NODEBKO-03
    mgr: NODEBKO-01(active), standbys: NODEBKO-02
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   1 pools, 256 pgs
    objects: 3024k objects, 12098 GB
    usage:   24249 GB used, 59569 GB / 83818 GB avail
    pgs:     238 active+clean
             17  active+clean+inconsistent
             1   active+clean+scrubbing+deep

  io:
    client: 405 kB/s rd, 5759 kB/s wr, 1 op/s rd, 50 op/s wr

Do you see any errors in /opt/petasan/log/PetaSAN.log?

 

I see many errors in PetaSAN.log; the most frequent one is:

Exception: Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 3.13' `date +%s` |  nc -q0 10.1.5.115  2003
19/03/2019 17:01:38 ERROR    Node Stats exception.
19/03/2019 17:01:38 ERROR    Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 1.54' `date +%s` |  nc -q0 10.1.5.115  2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 1.54' `date +%s` |  nc -q0 10.1.5.115  2003
19/03/2019 17:02:41 ERROR    Node Stats exception.
19/03/2019 17:02:41 ERROR    Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 2.13' `date +%s` |  nc -q0 10.1.5.115  2003

You need to investigate the tcp_sendpage() error in more detail. Look in dmesg for errors, e.g. dmesg | grep -i error
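A slightly broader version of that dmesg scan, catching the tcp_sendpage message and NIC error counters in one pass (standard tools only; dmesg -T needs a reasonably recent util-linux and may require root):

```shell
#!/bin/sh
# Messages of interest in the kernel ring buffer around the freeze.
PATTERN='error|fail|tcp_sendpage'
dmesg -T 2>/dev/null | grep -iE "$PATTERN" | tail -n 20

# Per-interface RX/TX error and drop counters, to correlate with
# the tcp_sendpage failures (requires iproute2).
if command -v ip >/dev/null; then
    ip -s link show
fi
```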

The PetaSAN logs show errors sending stats data to the stats server; either the connection is down or the server is overloaded.
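The sender in node_stats.py is just a shell pipeline into the stats server's Graphite plaintext port, so that path can be tested by hand. A sketch, with the host and port taken from the error message in the log above; the metric name is an arbitrary test value, and -w 2 is added here only as a timeout alongside the -q0 the script uses:

```shell
#!/bin/sh
STATS_HOST=10.1.5.115   # from the error in PetaSAN.log
STATS_PORT=2003         # Graphite plaintext protocol port
# One metric line in the same "path value timestamp" format the script sends:
line="PetaSAN.NodeStats.test.heartbeat 1 $(date +%s)"
echo "$line"
# Try to push it; "send FAILED" means the connection is down or blocked.
if command -v nc >/dev/null; then
    echo "$line" | nc -q0 -w 2 "$STATS_HOST" "$STATS_PORT" \
        && echo "send OK" || echo "send FAILED"
fi
```

If this fails from a node, check the management network link and whether the stats server node is the one that froze.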

Another serious error is that the cluster itself is in an error state due to inconsistent data; the health should be shown on the dashboard. Is this something that just happened? Did you look at the PG Status chart to see how far back the cluster has been in error?

Inconsistent means that data stored in different replicas does not match, or does not match its CRC metadata. This can be due to bad media, or sometimes to power loss with disks/controllers that do not handle power loss gracefully. A link on dealing with inconsistent data:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
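For this cluster, the workflow on that page boils down to roughly the following sketch. The PG id 1.2f is a placeholder (take real ids from ceph health detail), the cluster name is the one used elsewhere in this thread, and the commands are guarded so the snippet is inert where ceph is absent:

```shell
#!/bin/sh
CLUSTER=ClusterCCTVBKO   # cluster name as used elsewhere in this thread
PGID=1.2f                # placeholder; substitute an id from 'ceph health detail'

if command -v ceph >/dev/null; then
    # 1. Which PGs are inconsistent, and on which OSDs:
    ceph --cluster "$CLUSTER" health detail | grep -i inconsistent
    # 2. Which object/replica fails its checksum inside one PG:
    rados --cluster "$CLUSTER" list-inconsistent-obj "$PGID" --format=json-pretty
    # 3. Repair: Ceph rewrites the bad replica from an authoritative copy.
    ceph --cluster "$CLUSTER" pg repair "$PGID"
fi
```

Note that on older Ceph releases pg repair can copy from the primary even when the primary holds the bad copy, so inspect before repairing; the linked page covers this.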

10 GB RAM is too low.

Typically you would not use RAID 5 with Ceph; you would split the 16 disks in each node into separate OSDs, but you do not have enough RAM for this.

 

Quote from admin on April 4, 2019, 3:45 pm

You need to investigate the tcp_sendpage() error in more detail. Look in dmesg for errors, e.g. dmesg | grep -i error

The PetaSAN logs show errors sending stats data to the stats server; either the connection is down or the server is overloaded.

Another serious error is that the cluster itself is in an error state due to inconsistent data; the health should be shown on the dashboard. Is this something that just happened? Did you look at the PG Status chart to see how far back the cluster has been in error?

Inconsistent means that data stored in different replicas does not match, or does not match its CRC metadata. This can be due to bad media, or sometimes to power loss with disks/controllers that do not handle power loss gracefully. A link on dealing with inconsistent data:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

10 GB RAM is too low.

Typically you would not use RAID 5 with Ceph; you would split the 16 disks in each node into separate OSDs, but you do not have enough RAM for this.

 

Thanks for your support.