CIFS services not up following cluster powerdown.
sparxalt
3 Posts
December 30, 2020, 2:50 pm
Hello, I'm really enjoying PetaSAN and digging into learning Ceph through its more approachable interface. I've set up an 8-node cluster with the MGMT/iSCSI/NFS/CIFS roles each spread across 3 of the 8 nodes, with some obvious overlaps. So far it's been working great, but last night I had an unexpected power outage. With very little battery remaining on the UPS, I decided to initiate shutdowns on all nodes via a power-button press. After power was restored, the cluster was brought back up and returned to all green within minutes. The one exception is the CIFS service, which remains unavailable. The CIFS Status page displays the red banner "Cannot get CIFS Status." and any attempt to add a CIFS share displays the banner "CIFS services not up."
The nodes which run the CIFS role repeat two events over and over in their logs: "ERROR WatchBase Exception :" and "INFO CIFSService key change action." The petasan-cifs service shows as running on the CIFS nodes, and I've placed the cluster in maintenance and cleanly rebooted each node one at a time. Where else can I begin troubleshooting this to restore the service?
admin
2,930 Posts
December 30, 2020, 3:31 pm
On a CIFS node, what is the output of:
ceph status
ceph fs status
mount | grep mnt
mount | grep shared
ctdb status
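For reference, a minimal sketch that bundles these checks into one pass on a CIFS node. The script name and structure are just an illustration; it assumes the ceph and ctdb client tools are on the PATH, as they normally are on a PetaSAN CIFS node.
#!/bin/bash
# Hypothetical helper: run the diagnostic commands above in one pass on a CIFS node.
# Assumes the standard PetaSAN tools (ceph, ctdb) and mounts are present.
set -u

for cmd in "ceph status" "ceph fs status" "mount | grep mnt" "mount | grep shared" "ctdb status"; do
    echo "=== $cmd ==="
    bash -c "$cmd" 2>&1    # keep going even if one check (e.g. ctdb) fails
    echo
done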
sparxalt
3 Posts
December 30, 2020, 3:54 pm
ceph status
cluster:
id: 987c9aea-fc2f-4a20-88a8-ac7bf626e7e9
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-8,ceph-1,ceph-4 (age 11h)
mgr: ceph-8(active, since 11h), standbys: ceph-4, ceph-1
mds: cephfs:1 {0=ceph-4=up:active} 2 up:standby
osd: 58 osds: 58 up (since 109m), 58 in (since 6d)
task status:
scrub status:
mds.ceph-4: idle
data:
pools: 7 pools, 576 pgs
objects: 423.80k objects, 1.5 TiB
usage: 4.4 TiB used, 159 TiB / 164 TiB avail
pgs: 576 active+clean
io:
client: 19 KiB/s rd, 3.8 KiB/s wr, 11 op/s rd, 3 op/s wr
ceph fs status
cephfs - 39 clients
======
+------+--------+--------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+--------+---------------+-------+-------+
| 0 | active | ceph-4 | Reqs: 0 /s | 19.8k | 13.5k |
+------+--------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
| Pool | type | used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 727M | 10.6T |
| cephfs_root | data | 12.1k | 10.6T |
| cephfs_ec_hdd | data | 247M | 76.9T |
| cephfs_ec_ssd | data | 2368G | 21.3T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| ceph-1 |
| ceph-8 |
+-------------+
MDS version: ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
mount | grep mnt
10.1.110.81,10.1.110.84,10.1.110.88:/ on /mnt/cephfs type ceph (rw,relatime,name=admin,secret=<hidden>,acl,mds_namespace=cephfs)
mount | grep shared
10.1.110.81:gfs-vol on /opt/petasan/config/shared type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
ctdb status
connect() failed, errno=2
Failed to connect to CTDB daemon (/var/run/ctdb/ctdbd.socket)
admin
2,930 Posts
December 30, 2020, 4:22 pm
On one of the CIFS nodes:
systemctl stop petasan-cifs
systemctl start ctdb
Wait 1 minute, then:
systemctl status smbd
systemctl status ctdb
ctdb status
If you get an error on screen, what error do you get?
You can get more logs from:
/var/log/samba/log.ctdb
/var/log/samba/log.smbd
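A minimal sketch that strings these recovery steps together on a CIFS node. The 60-second wait and the log paths follow the post above; treat it as a starting point under those assumptions, not an official PetaSAN procedure.
#!/bin/bash
# Hypothetical recovery helper based on the steps above: hand CIFS management
# back to ctdb directly, then report service state and recent CTDB/Samba logs.
set -u

systemctl stop petasan-cifs
systemctl start ctdb

sleep 60    # give ctdb a minute to start and join its cluster

systemctl --no-pager status smbd
systemctl --no-pager status ctdb
ctdb status

# Recent log lines, useful if anything above shows errors
tail -n 50 /var/log/samba/log.ctdb
tail -n 50 /var/log/samba/log.smbd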
sparxalt
3 Posts
December 30, 2020, 4:26 pm
Thanks Admin!
That was the nudge I needed. I found the ctdb service wasn't running, so I started it, which allowed the CIFS Status page to show "down" on all three nodes. Once I reapplied the CIFS settings from the Configuration section, all nodes came online and started serving. Now ctdb status displays:
Number of nodes:3
pnn:0 10.1.110.82 OK (THIS NODE)
pnn:1 10.1.110.85 OK
pnn:2 10.1.110.87 OK
Generation:63243887
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:2
the.only.chaos.lucifer
31 Posts
December 21, 2023, 8:33 am
@admin How do I avoid this? I notice it occurs every time there is a power-down. I am doing this in a homelab and testing environment, and this is definitely not ideal. Technically it should reconnect by itself. Is there anything I can do to resolve this? No worries if it is too much for you guys, but thanks a lot if there are any suggestions!
Last edited on December 21, 2023, 8:34 am by the.only.chaos.lucifer · #6
admin
2,930 Posts
December 21, 2023, 3:58 pm
Not really sure, since we do not see this. It could be related to your environment; if you can test in a different setup it would be great. Else make sure that ceph/cephfs have no issues when you restart and, if all is OK, look at the ctdb/samba logs.
the.only.chaos.lucifer
31 Posts
December 22, 2023, 6:36 pm
The suggested commands definitely work; it's just that every time a reboot occurs I need to run the commands below. Just wondering, is this because iSCSI, CIFS, NFS, and S3 are all on the same subnet? Should I split the subnet? The backend and management subnets are each on their own subnet, so in total I have 3 subnets.
systemctl stop petasan-cifs
systemctl start ctdb
Last edited on December 23, 2023, 9:06 am by the.only.chaos.lucifer · #8
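As a stopgap until the root cause is found, one possible workaround is to repeat those two manual steps once per boot. This is not an official PetaSAN mechanism; the script path is hypothetical, the 120-second delay is a guess at how long ceph/cephfs and the shared config mount need to settle, and it papers over the symptom rather than fixing the cause.
#!/bin/bash
# Hypothetical workaround, e.g. /usr/local/bin/ctdb-after-boot.sh, run once per boot.
# It simply repeats the manual recovery steps from this thread.
set -u

sleep 120                   # let ceph/cephfs and the shared config mount settle first
systemctl stop petasan-cifs
systemctl start ctdb
It could be wired up with a cron entry such as "@reboot root /usr/local/bin/ctdb-after-boot.sh" in /etc/cron.d (again, the path and filename are just placeholders).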