PetaSAN 2.2 Released!
moh
16 Posts
March 29, 2020, 10:57 am
ceph status --cluster MyCluster
root@NODE-01:~# ceph status --cluster MyCluster
cluster:
id: ed96e77e-1ff8-4e6a-aa02-3f5caed963a8
health: HEALTH_WARN
2 backfillfull osd(s)
1 pool(s) backfillfull
Reduced data availability: 175 pgs inactive, 175 pgs down
services:
mon: 3 daemons, quorum NODE-01,NODE-02,NODE-03
mgr: NODE-01(active), standbys: NODE-02
osd: 3 osds: 2 up, 2 in
data:
pools: 1 pools, 256 pgs
objects: 8.24M objects, 31.3TiB
usage: 70.9TiB used, 11.0TiB / 81.9TiB avail
pgs: 68.359% pgs not active
175 down
80 active+clean
1 active+clean+scrubbing+deep
ceph osd tree --cluster MyCluster
root@NODE-01:~# ceph osd tree --cluster MyCluster
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 81.85497 root default
-5 27.28499 host NODE-01
1 hdd 27.28499 osd.1 up 1.00000 1.00000
-7 27.28499 host NODE-02
2 hdd 27.28499 osd.2 down 0 1.00000
-3 27.28499 host NODE-03
0 hdd 27.28499 osd.0 up 1.00000 1.00000
admin
2,930 Posts
March 29, 2020, 3:15 pm
On node 2 there is a down OSD; try to start it with:
systemctl start ceph-osd@2
systemctl status ceph-osd@2
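For example, if the service reports active but osd.2 still shows down in the osd tree, the daemon log usually explains why; something like the following (assuming the cluster name MyCluster used in this thread):
journalctl -u ceph-osd@2 --since "10 minutes ago"   # recent messages from the osd.2 systemd unit
tail -n 50 /var/log/ceph/MyCluster-osd.2.log        # Ceph's own log file for osd.2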
moh
16 Posts
March 29, 2020, 4:37 pm
root@NODE-02:~# systemctl start ceph-osd@2
root@NODE-02:~# systemctl status ceph-osd@2.service
● ceph-osd@2.service - Ceph object storage daemon osd.2
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Drop-In: /etc/systemd/system/ceph-osd@.service.d
└─override.conf
Active: active (running) since Fri 2020-03-27 14:03:38 GMT; 2 days ago
Main PID: 1307 (ceph-osd)
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@2.service
└─1307 /usr/bin/ceph-osd -f --cluster MyCluster --id 2 --setuser ceph --setgroup ceph
Mar 29 16:22:13 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:13.413794 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:21:53.180851 (cutoff 2020-03-29 16:21:53.413759)
Mar 29 16:22:33 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:33.918031 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:22:13.685423 (cutoff 2020-03-29 16:22:13.918030)
Mar 29 16:22:33 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:33.918047 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:22:13.685423 (cutoff 2020-03-29 16:22:13.918030)
Mar 29 16:22:54 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:54.422182 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:22:34.190130 (cutoff 2020-03-29 16:22:34.422181)
Mar 29 16:22:54 NODE-02 ceph-osd[1307]: 2020-03-29 16:22:54.422198 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:22:34.190130 (cutoff 2020-03-29 16:22:34.422181)
Mar 29 16:23:14 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:14.926329 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:22:54.694746 (cutoff 2020-03-29 16:22:54.926327)
Mar 29 16:23:14 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:14.926361 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:22:54.694746 (cutoff 2020-03-29 16:22:54.926327)
Mar 29 16:23:35 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:35.430487 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on
either front or back, first ping sent 2020-03-29 16:23:15.199616 (cutoff 2020-03-29 16:23:15.430486)
Mar 29 16:23:35 NODE-02 ceph-osd[1307]: 2020-03-29 16:23:35.430506 7f2161d31700 -1 osd.2 4117 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on
either front or back, first ping sent 2020-03-29 16:23:15.199616 (cutoff 2020-03-29 16:23:15.430486)
Mar 29 16:23:35 NODE-02 systemd[1]: Started Ceph object storage daemon osd.2.
What I can't understand is that the IP address displayed here is not mine, because mine is on 10.1.1.x.
admin
2,930 Posts
March 29, 2020, 4:48 pm
The OSD on node 2 is not able to communicate with node 1 (10.1.3.115) or node 3 (10.1.3.117).
Can you ping these addresses from node 2?
Not sure what you mean that the IP address is 10.1.1: note that each node has several subnets, with different IPs.
Can you show the contents of /opt/petasan/config/cluster_info.json?
On each node, do an "ip addr" to list its IPs on the different subnets (management, backend 1, backend 2) and make sure the nodes can ping one another on all 3 networks.
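For example, from NODE-02 something like the following would exercise both backend subnets (addresses as they appear in this thread; adjust if yours differ):
ip addr                  # list the IPs on management, backend 1 and backend 2
ping -c 3 10.1.3.115     # backend 1 -> NODE-01
ping -c 3 10.1.3.117     # backend 1 -> NODE-03
ping -c 3 10.1.4.115     # backend 2 -> NODE-01
ping -c 3 10.1.4.117     # backend 2 -> NODE-03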
moh
16 Posts
March 29, 2020, 7:38 pm
Yes, I understand that it is trying to communicate with these IPs, and not with the management IPs (10.1.1.115 and 10.1.1.117).
Yes, I can ping these addresses from node 2.
The output of /opt/petasan/config/cluster_info.json:
{
"backend_1_base_ip": "10.1.3.0",
"backend_1_eth_name": "eth2",
"backend_1_mask": "255.255.255.0",
"backend_1_vlan_id": "",
"backend_2_base_ip": "10.1.4.0",
"backend_2_eth_name": "eth3",
"backend_2_mask": "255.255.255.0",
"backend_2_vlan_id": "",
"bonds": [],
"eth_count": 6,
"iscsi_1_eth_name": "eth0",
"iscsi_2_eth_name": "eth1",
"jumbo_frames": [],
"management_eth_name": "eth0",
"management_nodes": [
{
"backend_1_ip": "10.1.3.115",
"backend_2_ip": "10.1.4.115",
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.1.1.115",
"name": "NODE-01"
},
{
"backend_1_ip": "10.1.3.116",
"backend_2_ip": "10.1.4.116",
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.1.1.116",
"name": "NODE-02"
},
{
"backend_1_ip": "10.1.3.117",
"backend_2_ip": "10.1.4.117",
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.1.1.117",
"name": "NODE-03"
}
],
"name": "MyCluster"
}
I am sure they can ping one another on all 3 networks.
admin
2,930 Posts
March 29, 2020, 8:50 pm
On all 3 nodes, do a:
systemctl restart ceph-osd.target
Then look at the latest OSD logs on node 2: do they still show "no reply" errors? You can also look at the logs in /var/log/ceph/.
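On node 2, for example, that could look something like this (log file name assumes the MyCluster cluster name from this thread):
systemctl restart ceph-osd.target              # restart all OSD daemons on this node
systemctl status ceph-osd@2 --no-pager         # confirm osd.2 came back up
tail -f /var/log/ceph/MyCluster-osd.2.log      # watch for new heartbeat_check / "no reply" errors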
moh
16 Posts
March 29, 2020, 11:17 pm
On all 3 nodes, I ran systemctl restart ceph-osd.target
cat /var/log/ceph/MyCluster-osd.2.log
2020-03-29 22:58:47.801499 7fec10176700 -1 osd.2 4160 heartbeat_check: no reply from 10.1.3.117:6801 osd.0 ever on either front or back, first ping sent 2020-03-29 22:58:27.347984 (cutoff 2020-03-29 22:58:27.801498)
2020-03-29 22:58:47.801515 7fec10176700 -1 osd.2 4160 heartbeat_check: no reply from 10.1.3.115:6801 osd.1 ever on either front or back, first ping sent 2020-03-29 22:58:27.347984 (cutoff 2020-03-29 22:58:27.801498)
2020-03-29 22:58:47.852431 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:47.852437 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:48.828816 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:48.828827 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:49.805214 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:49.805241 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:50.781628 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:50.781640 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:51.758030 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:51.758050 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:52.734426 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:52.734437 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:53.710809 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:53.710819 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:54.687202 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:54.687216 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:55.663603 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:55.663616 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:56.639964 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:56.639981 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:57.616369 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:57.616382 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:58.592759 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:58.592769 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:58:59.569155 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:58:59.569173 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
2020-03-29 22:59:00.545566 7fec10977700 1 osd.2 4160 is_healthy false -- only 0/2 up peers (less than 33%)
2020-03-29 22:59:00.545583 7fec10977700 1 osd.2 4160 not healthy; waiting to boot
I see that node 2 is still down and is again trying to communicate with 10.1.3.117 and 10.1.3.115 without getting a reply.
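Since plain ping works but the OSD heartbeats do not, one thing worth checking (just a suggestion, assuming nc and ss are available on the nodes) is whether the exact TCP port from the log lines is reachable and whether the peer OSDs are listening on their backend IPs:
nc -zv -w 3 10.1.3.115 6801    # from NODE-02: can we open the port osd.1 is logged at?
nc -zv -w 3 10.1.3.117 6801    # from NODE-02: can we open the port osd.0 is logged at?
ss -tlnp | grep ceph-osd       # on NODE-01 / NODE-03: which IPs and ports are the OSDs bound to?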
moh
16 Posts
April 7, 2020, 12:53 pm
Hello Admin,
Finally, I will remove everything and install 2.5.0. I am testing, so I would like to do the following:
eth0+eth1 ==> bond0 for management + iscsi1
eth2+eth3 ==> bond1 for iscsi2
eth4 for backend.
https://help.ubuntu.com/community/UbuntuBonding
{
"backend_1_base_ip": "10.10.10.0",
"backend_1_eth_name": "eth4",
"backend_1_mask": "255.255.255.0",
"backend_1_vlan_id": "",
"backend_2_base_ip": "",
"backend_2_eth_name": "",
"backend_2_mask": "",
"backend_2_vlan_id": "",
"bonds": [],
"cifs_eth_name": "eth4",
"default_pool": "both",
"default_pool_pgs": "1024",
"default_pool_replicas": "3",
"eth_count": 4,
"iscsi_1_eth_name": "bond0",
"iscsi_2_eth_name": "bond1",
"jf_mtu_size": "",
"jumbo_frames": [],
"management_eth_name": "bond0",
"management_nodes": [
{
"backend_1_ip": "10.10.10.34",
"backend_2_ip": "",
"is_backup": false,
"is_cifs": true,
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.0.0.34",
"name": "NODE-01"
},
{
"backend_1_ip": "10.10.10.35",
"backend_2_ip": "",
"is_backup": false,
"is_cifs": false,
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.0.0.35",
"name": "NODE-02"
},
{
"backend_1_ip": "10.10.10.36",
"backend_2_ip": "",
"is_backup": false,
"is_cifs": false,
"is_iscsi": true,
"is_management": true,
"is_storage": true,
"management_ip": "10.0.0.36",
"name": "NODE-03"
}
],
"name": "MyCluster",
"storage_engine": "bluestore"
In my lab, when I did this, my management interface became bond0, which stays down.
I also lose management access: pings to the management IP do not get through either.
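When a bond stays down like this, a few standard Linux checks (nothing PetaSAN-specific, just for illustration, assuming the bond is named bond0 as above) usually show whether the kernel created the bond and whether its slaves have link:
cat /proc/net/bonding/bond0    # bonding mode and per-slave MII/link status
ip -br link show               # bond0 and its slave NICs should show state UP
ip -br addr show bond0         # the management IP should be assigned to bond0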
Last edited on April 7, 2020, 1:07 pm by moh · #38
admin
2,930 Posts
April 7, 2020, 5:05 pm
It is not clear if you tried to create the bonds via PetaSAN or you already had a built cluster and are trying to add them manually. Also not sure why you sent the Ubuntu link.
If you used PetaSAN, it should work. If you want to change it after building the cluster, you need to edit the cluster_info.json correctly; from a quick look, the bonds [] list is empty, so my guess is you were doing it manually but the config is missing it.
Although the json format is quite easy, I suggest you build a temp VM: while deploying this test image, create a test cluster with the same configuration (bonds/vlans/jumbo frames, etc.) that you want, then bingo, grab its cluster_info.json.
If the config is correct, things should work without any further steps.
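If you do edit /opt/petasan/config/cluster_info.json by hand, a quick generic sanity check (not a PetaSAN tool, just standard Python) is:
python -m json.tool /opt/petasan/config/cluster_info.json > /dev/null && echo "JSON OK"   # fails loudly on syntax errors
You can also diff your hand-edited file against the one generated by the test cluster to spot missing keys such as the bonds section.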
Last edited on April 7, 2020, 5:06 pm by admin · #39