PetaSAN 2.2 Released!
moh
16 Posts
March 29, 2020, 10:57 am
ceph status --cluster MyCluster
root@NODE-01:~# ceph status --cluster MyCluster
cluster:
id: ed96e77e-1ff8-4e6a-aa02-3f5caed963a8
health: HEALTH_WARN
2 backfillfull osd(s)
1 pool(s) backfillfull
Reduced data availability: 175 pgs inactive, 175 pgs down
services:
mon: 3 daemons, quorum NODE-01,NODE-02,NODE-03
mgr: NODE-01(active), standbys: NODE-02
osd: 3 osds: 2 up, 2 in
data:
pools: 1 pools, 256 pgs
objects: 8.24M objects, 31.3TiB
usage: 70.9TiB used, 11.0TiB / 81.9TiB avail
pgs: 68.359% pgs not active
175 down
80 active+clean
1 active+clean+scrubbing+deep
ceph osd tree --cluster MyCluster
root@NODE-01:~# ceph osd tree --cluster MyCluster
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 81.85497 root default
-5 27.28499 host NODE-01
1 hdd 27.28499 osd.1 up 1.00000 1.00000
-7 27.28499 host NODE-02
2 hdd 27.28499 osd.2 down 0 1.00000
-3 27.28499 host NODE-03
0 hdd 27.28499 osd.0 up 1.00000 1.00000
admin
2,930 Posts
March 29, 2020, 3:15 pm
On node 2 there is a down OSD; try to start it with:
systemctl start ceph-osd@2
systemctl status ceph-osd@2
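For example, if the service reports active but osd.2 still shows down in the osd tree, the daemon log usually explains why; something like the following (assuming the cluster name MyCluster used in this thread):
journalctl -u ceph-osd@2 --since "10 minutes ago"   # recent messages from the osd.2 systemd unit
tail -n 50 /var/log/ceph/MyCluster-osd.2.log        # Ceph's own log file for osd.2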
moh
16 Posts
March 29, 2020, 4:37 pm
root@NODE-02:~# systemctl start ceph-osd@2
root@NODE-02:~# systemctl status ceph-osd@2.service
● ceph-osd@2.service - Ceph object storage daemon osd.2
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Drop-In: /etc/systemd/system/ceph-osd@.service.d
└─override.conf
Active: active (running) since Fri 2020-03-27 14:03:38 GMT; 2 days ago
Main PID: 1307 (ceph-osd)
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@2.service
└─1307 /usr/bin/ceph-osd -f --cluster MyCluster --id 2 --setuser ceph --setgroup ceph
Mar 29 16:22:13 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:13.413794 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:21:53.180851 (cutoff 2020-03-29 16:21:53.413759)
Mar 29 16:22:33 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:33.918031 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:22:13.685423 (cutoff 2020-03-29 16:22:13.918030)
Mar 29 16:22:33 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:33.918047 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:22:13.685423 (cutoff 2020-03-29 16:22:13.918030)
Mar 29 16:22:54 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:54.422182 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:22:34.190130 (cutoff 2020-03-29 16:22:34.422181)
Mar 29 16:22:54 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:54.422198 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:22:34.190130 (cutoff 2020-03-29 16:22:34.422181)
Mar 29 16:23:14 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:14.926329 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:22:54.694746 (cutoff 2020-03-29 16:22:54.926327)
Mar 29 16:23:14 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:14.926361 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:22:54.694746 (cutoff 2020-03-29 16:22:54.926327)
Mar 29 16:23:35 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:35.430487 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:23:15.199616 (cutoff 2020-03-29 16:23:15.430486)
Mar 29 16:23:35 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:35.430506 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:23:15.199616 (cutoff 2020-03-29 16:23:15.430486)
Mar 29 16:23:35 NODE-02 systemd[1]: Started Ceph object storage daemon osd.2.
What I can't understand is that the IP address displayed here is not mine, because mine is on 10.1.1.x.
admin
2,930 Posts
March 29, 2020, 4:48 pm
The OSD on node 2 is not able to communicate with node 1 (10.1.3.115) or node 3 (10.1.3.117).
Can you ping these addresses from node 2?
Not sure what you mean that the IP address is 10.1.1: note that each node has several subnets, with different IPs.
Can you show the contents of /opt/petasan/config/cluster_info.json?
On each node, do an "ip addr" to list its IPs on the different subnets (management, backend 1, backend 2) and make sure the nodes can ping one another on all 3 networks.
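For example, from NODE-02 something like the following would exercise both backend subnets (addresses as they appear in this thread; adjust if yours differ):
ip addr                  # list the IPs on management, backend 1 and backend 2
ping -c 3 10.1.3.115     # backend 1 -> NODE-01
ping -c 3 10.1.3.117     # backend 1 -> NODE-03
ping -c 3 10.1.4.115     # backend 2 -> NODE-01
ping -c 3 10.1.4.117     # backend 2 -> NODE-03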
moh
16 Posts
March 29, 2020, 7:38 pm
Yes, I understand that it is trying to communicate with these IPs, and not with the management IPs (10.1.1.115 and 10.1.1.117).
Yes, I can ping these addresses from node 2.
The output of /opt/petasan/config/cluster_info.json:
{
"backend_1_base_ip": "10.1.3.0",
"backend_1_eth_name": "eth2",
"backend_1_mask": "255.255.255.0",
"backend_1_vlan_id": "",
"backend_2_base_ip": "10.1.4.0",
"backend_2_eth_name": "eth3",
"backend_2_mask": "255.255.255.0",
"backend_2_vlan_id": "",
"bonds": [],
"eth_count": 6,
"iscsi_1_eth_name": "eth0",
"iscsi_2_eth_name": "eth1",
"jumbo_frames": [],
"management_eth_name": "eth0",
"management_nodes": [
{
"backend_1_ip": "10.1.3.115",
"backend_2_ip": "10.1.4.115",
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.1.1.115",
"name": "NODE-01"
},
{
"backend_1_ip": "10.1.3.116",
"backend_2_ip": "10.1.4.116",
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.1.1.116",
"name": "NODE-02"
},
{
"backend_1_ip": "10.1.3.117",
"backend_2_ip": "10.1.4.117",
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.1.1.117",
"name": "NODE-03"
}
],
"name": "MyCluster"
}
I am sure they can ping one another on all 3 networks.
admin
2,930 Posts
March 29, 2020, 8:50 pm
On all 3 nodes, do a:
systemctl restart ceph-osd.target
Then look at the latest OSD logs on node 2: do they still show "no reply" errors? You can also look at the logs in /var/log/ceph/.
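On node 2, for example, that could look something like this (log file name assumes the MyCluster cluster name from this thread):
systemctl restart ceph-osd.target              # restart all OSD daemons on this node
systemctl status ceph-osd@2 --no-pager         # confirm osd.2 came back up
tail -f /var/log/ceph/MyCluster-osd.2.log      # watch for new heartbeat_check / "no reply" errors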
moh
16 Posts
March 29, 2020, 11:17 pm
On all 3 nodes, I ran systemctl restart ceph-osd.target
cat /var/log/ceph/MyCluster-osd.2.log
2020-03-29 22:58:47.801499 7fec10176700 -1 osd.2 4160 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on either front or back, first ping sent 2020-03-29 22:58:27.347984 (cutoff 2020-03-29 22:58:27.801498)
2020-03-29 22:58:47.801515 7fec10176700 -1 osd.2 4160 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on either front or back, first ping sent 2020-03-29 22:58:27.347984 (cutoff 2020-03-29 22:58:27.801498)
2020-03-29 22:58:47.852431 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:47.852437 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:48.828816 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:48.828827 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:49.805214 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:49.805241 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:50.781628 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:50.781640 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:51.758030 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:51.758050 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:52.734426 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:52.734437 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:53.710809 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:53.710819 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:54.687202 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:54.687216 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:55.663603 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:55.663616 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:56.639964 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:56.639981 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:57.616369 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:57.616382 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:58.592759 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:58.592769 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:59.569155 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:59.569173 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:59:00.545566 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:59:00.545583 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
I see that node 2 is still down and is again trying to communicate with 10.1.3.117 and 10.1.3.115 without getting a reply.
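Since plain ping works but the OSD heartbeats do not, one thing worth checking (just a suggestion, assuming nc and ss are available on the nodes) is whether the exact TCP port from the log lines is reachable and whether the peer OSDs are listening on their backend IPs:
nc -zv -w 3 10.1.3.115 6801    # from NODE-02: can we open the port osd.1 is logged at?
nc -zv -w 3 10.1.3.117 6801    # from NODE-02: can we open the port osd.0 is logged at?
ss -tlnp | grep ceph-osd       # on NODE-01 / NODE-03: which IPs and ports are the OSDs bound to?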
moh
16 Posts
April 7, 2020, 12:53 pm
Hello Admin,
Finally, I will remove everything and install 2.5.0. I am testing, so I would like to do the following:
eth0+eth1 ==> bond0 for management + iscsi1
eth2+eth3 ==> bond1 for iscsi2
eth4 for backend.
https://help.ubuntu.com/community/UbuntuBonding
{
"backend_1_base_ip": "10.10.10.0",
"backend_1_eth_name": "eth4",
"backend_1_mask": "255.255.255.0",
"backend_1_vlan_id": "",
"backend_2_base_ip": "",
"backend_2_eth_name": "",
"backend_2_mask": "",
"backend_2_vlan_id": "",
"bonds": [],
"cifs_eth_name": "eth4",
"default_pool": "both",
"default_pool_pgs": "1024",
"default_pool_replicas": "3",
"eth_count": 4,
"iscsi_1_eth_name": "bond0",
"iscsi_2_eth_name": "bond1",
"jf_mtu_size": "",
"jumbo_frames": [],
"management_eth_name": "bond0",
"management_nodes": [
{
"backend_1_ip": "10.10.10.34",
"backend_2_ip": "",
"is_backup": false,
"is_cifs": true,
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.0.0.34",
"name": "NODE-01"
},
{
"backend_1_ip": "10.10.10.35",
"backend_2_ip": "",
"is_backup": false,
"is_cifs": false,
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.0.0.35",
"name": "NODE-02"
},
{
"backend_1_ip": "10.10.10.36",
"backend_2_ip": "",
"is_backup": false,
"is_cifs": false,
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.0.0.36",
"name": "NODE-03"
}
],
"name": "MyCluster",
"storage_engine": "bluestore"
In my lab, when I did this, my management interface became bond0, which stays down.
I also lose management access: pings to the management IP do not get through either.
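When a bond stays down like this, a few standard Linux checks (nothing PetaSAN-specific, just for illustration, assuming the bond is named bond0 as above) usually show whether the kernel created the bond and whether its slaves have link:
cat /proc/net/bonding/bond0    # bonding mode and per-slave MII/link status
ip -br link show               # bond0 and its slave NICs should show state UP
ip -br addr show bond0         # the management IP should be assigned to bond0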
Last edited on April 7, 2020, 1:07 pm by moh · #38
admin
2,930 Posts
April 7, 2020, 5:05 pm
It is not clear if you tried to create the bonds via PetaSAN or you already had a built cluster and are trying to add them manually. Also not sure why you sent the Ubuntu link.
If you used PetaSAN, it should work. If you want to change it after building the cluster, you need to edit the cluster_info.json correctly; from a quick look, the bonds [] list is empty, so my guess is you were doing it manually but the config is missing it.
Although the json format is quite easy, I suggest you build a temp VM: while deploying this test image, create a test cluster with the same configuration (bonds/vlans/jumbo frames, etc.) that you want, then bingo, grab its cluster_info.json.
If the config is correct, things should work without any further steps.
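If you do edit /opt/petasan/config/cluster_info.json by hand, a quick generic sanity check (not a PetaSAN tool, just standard Python) is:
python -m json.tool /opt/petasan/config/cluster_info.json > /dev/null && echo "JSON OK"   # fails loudly on syntax errors
You can also diff your hand-edited file against the one generated by the test cluster to spot missing keys such as the bonds section.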
Last edited on April 7, 2020, 5:06 pm by admin · #39