petasan freeze
elwan
14 Posts
April 4, 2019, 11:29 am
Hello, I have 3 nodes in my PetaSAN cluster, and at random times one of them freezes and loses connection. I did a little diagnosis and found the errors below on each of them after a freeze. Can you please help me solve this problem? Below are the errors from the syslogs:
- Node 3
Apr 3 04:25:01 NODEBKO-03 CRON[435822]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 3 04:35:01 NODEBKO-03 CRON[437623]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 3 04:42:32 NODEBKO-03 kernel: [148367.320951] tcp_sendpage() failure: 2444
- Node 01
Apr 4 05:05:01 NODEBKO-01 CRON[905200]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 4 05:15:01 NODEBKO-01 CRON[909704]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 4 05:17:01 NODEBKO-01 CRON[910608]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 4 05:25:01 NODEBKO-01 CRON[914114]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
- Node 02
Apr 4 10:15:01 NODEBKO-02 CRON[703426]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 4 10:17:01 NODEBKO-02 CRON[703816]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 4 10:25:01 NODEBKO-02 CRON[705347]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 4 10:25:11 NODEBKO-02 smartd[572]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 81
Apr 4 10:25:11 NODEBKO-02 smartd[572]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 123 to 124
Apr 4 10:25:26 NODEBKO-02 ntpd[1117]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
Apr 4 10:25:26 NODEBKO-02 ntpd[1117]: 4 Apr 10:25:26 ntpd[1117]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
admin
2,930 Posts
April 4, 2019, 12:12 pm
I would check the hardware / network for problems.
It could also be due to the hardware being loaded / under-powered. Make sure you have enough RAM/CPU/disk resources, and look at the resource utilization charts to see if they go above 90%.
elwan
14 Posts
April 4, 2019, 12:35 pm
Network card: 4x Intel 1 Gbps
Motherboard: X11SSL-F Micro ATX
Processor: Xeon E3-1230V5 - 3.4 GHz - 4 cores
Memory: 10 GB RAM, DDR3 UDIMM ECC - 1600 MHz (2x4 GB, 1x2 GB)
RAID card: Areca ARC-1882IX-16 - 16-port SAS2 PCIe
On each node we have RAID 5 across 16 disks of 2 TB. Resource utilization was very low on each node.
In my Zabbix monitoring I got this error from all nodes: "Lack of free swap space on PETASAN3"
admin
2,930 Posts
April 4, 2019, 12:51 pm
The "tcp_sendpage() failure: 2444" looks alarming. Can you do a dmesg? What network interface do you use?
10 GB RAM is low. How many OSDs do you have? Why do you use RAID 5?
What is the health status in the dashboard, is it OK? In the PG Status chart do you see all active/clean, or do you see any status changes going back to when the freeze occurred?
Do you see any errors in /opt/petasan/log/PetaSAN.log?
elwan
14 Posts
April 4, 2019, 1:42 pm
Quote from admin on April 4, 2019, 12:51 pm:
The "tcp_sendpage() failure: 2444" looks alarming. Can you do a dmesg? What network interface do you use?
Ethernet controller: Intel Corporation I350 Gigabit Network Connection
root@NODEBKO-03:~# dmesg | grep Network
[ 1.646588] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
[ 1.676550] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[ 1.738934] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 1.759969] e1000e 0000:04:00.0 eth1: Intel(R) PRO/1000 Network Connection
[ 1.803160] igb 0000:03:00.1: Intel(R) Gigabit Ethernet Network Connection
[ 1.866400] igb 0000:03:00.2: Intel(R) Gigabit Ethernet Network Connection
[ 1.887277] e1000e 0000:05:00.0 eth4: Intel(R) PRO/1000 Network Connection
[ 1.928496] igb 0000:03:00.3: Intel(R) Gigabit Ethernet Network Connection
10 GB RAM is low. How many OSDs do you have? Why do you use RAID 5?
I will increase it for sure. I have 3 OSDs. The servers came with a RAID 5 configuration and I did not change it. Should we not use RAID 5?
What is the health status in the dashboard, is it OK? In the PG Status chart do you see all active/clean, or do you see any status changes going back to when the freeze occurred?
root@NODEBKO-03:~# ceph --cluster ClusterCCTVBKO -s
cluster:
id: ed96e77e-1ff8-4e6a-aa02-3f5caed963a8
health: HEALTH_ERR
17 scrub errors
Possible data damage: 17 pgs inconsistent
services:
mon: 3 daemons, quorum NODEBKO-01,NODEBKO-02,NODEBKO-03
mgr: NODEBKO-01(active), standbys: NODEBKO-02
osd: 3 osds: 3 up, 3 in
data:
pools: 1 pools, 256 pgs
objects: 3024k objects, 12098 GB
usage: 24249 GB used, 59569 GB / 83818 GB avail
pgs: 238 active+clean
17 active+clean+inconsistent
1 active+clean+scrubbing+deep
io:
client: 405 kB/s rd, 5759 kB/s wr, 1 op/s rd, 50 op/s wr
Do you see any errors in /opt/petasan/log/PetaSAN.log?
I see many errors in PetaSAN.log, and the most frequent was:
Exception: Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 3.13' `date +%s` | nc -q0 10.1.5.115 2003
19/03/2019 17:01:38 ERROR Node Stats exception.
19/03/2019 17:01:38 ERROR Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 1.54' `date +%s` | nc -q0
10.1.5.115 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 1.54' `date +%s` | nc -q0 10.1.5.115 2003
19/03/2019 17:02:41 ERROR Node Stats exception.
19/03/2019 17:02:41 ERROR Error running echo command :echo 'PetaSAN.NodeStats.NODEBKO-03.cpu_all.percent_util 2.13' `date +%s` | nc -q0
10.1.5.115 2003
admin
2,930 Posts
April 4, 2019, 3:45 pm
You need to investigate the tcp_sendpage() error in more detail. Look in dmesg for errors, e.g. dmesg | grep -i error
The PetaSAN logs show errors sending stats data to the stats server; either the connection is down or it is overloaded.
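To tell those two cases apart, you can test the same path the stats script uses by hand. A minimal sketch: the host/port (10.1.5.115:2003) are taken from the log lines above, and `stats_endpoint_reachable` is just an illustrative helper name, not part of PetaSAN:

```python
import socket

def stats_endpoint_reachable(host, port, timeout=2.0):
    """Plain TCP connect test, mimicking what the script's
    'nc -q0 host port' send needs in order to succeed.
    Returns True if the endpoint accepts the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On the affected node, check the Graphite/stats endpoint from the log:
# stats_endpoint_reachable("10.1.5.115", 2003)
```

If this returns False while the node is otherwise healthy, the problem is connectivity to the stats server rather than overload.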
Another serious error is that the cluster itself is in an error state due to inconsistent data; the health status should be on the dashboard. Is this something that just happened? Did you look at the PG Status chart to see how far back the cluster was in error?
Inconsistent means some data stored in different replicas does not match, or does not match its CRC metadata. This could be due to bad media, or sometimes to power loss with disks/controllers that do not handle power loss gracefully. A link for dealing with inconsistent data:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
10 GB RAM is too low.
Typically you would not use RAID 5 with Ceph; you would need to split the 16 disks in each node into separate OSDs, but you do not have enough RAM for this.
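As a rough sketch of working through the inconsistent PGs: extract the PG ids from ceph health detail and review the repair commands before running any of them. The PG ids below are made-up samples; on the cluster you would pipe in the real output of ceph --cluster ClusterCCTVBKO health detail instead:

```shell
# Sample 'ceph health detail' output; replace with the real thing:
#   ceph --cluster ClusterCCTVBKO health detail
health_output='HEALTH_ERR 17 scrub errors; Possible data damage: 17 pgs inconsistent
pg 1.2f is active+clean+inconsistent, acting [0,1,2]
pg 1.9a is active+clean+inconsistent, acting [2,0,1]'

# Pull out the PG ids (field 2 of each "pg X.Y is ..." line) and
# print the repair commands for review. Only run 'ceph pg repair'
# after reading the troubleshooting doc linked above, since repair
# copies from the primary and can propagate a bad replica.
echo "$health_output" \
  | awk '/is active\+clean\+inconsistent/ {print $2}' \
  | while read -r pgid; do
      echo "ceph --cluster ClusterCCTVBKO pg repair $pgid"
    done
```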
elwan
14 Posts
April 5, 2019, 10:11 am
Thanks for your support.