New Deployment Never Completes
distributed-mind
10 Posts
January 29, 2021, 3:47 pm
I've got 8 servers with 11 rotating disks and 3 SAS SSDs at PetaSAN's disposal: 1Gb management network, 10Gb backend network, 40Gb iSCSI network. When attempting to set up the third host, the deployment never completes. It has been stuck (for 3 days now) at the "Final Deployment Stage" page.
A tail of /opt/petasan/log/ceph-volume.log reads...
[2021-01-26 12:04:24,155][ceph_volume.main][INFO ] Running command: ceph-volume --log-path /opt/petasan/log lvm list --format json
[2021-01-26 12:04:24,155][ceph_volume.main][ERROR ] ignoring inability to load ceph.conf
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/ceph_volume/main.py", line 142, in main
    conf.ceph = configuration.load(conf.path)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/configuration.py", line 51, in load
    raise exceptions.ConfigurationError(abspath=abspath)
ConfigurationError: Unable to load expected Ceph config at: /etc/ceph/ceph.conf
[2021-01-26 12:04:24,158][ceph_volume.process][INFO ] Running command: /sbin/lvs --noheadings --readonly --separator=";" -a -S -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
Looking at the permissions of /etc/ceph/ceph.conf, I see 0644 with root:root as the owner. This seems like it should be acceptable, so I suspect the error is a red herring and the real problem lies elsewhere.
I also noticed there is nothing listening on tcp/3300 or tcp/6789, which is where monitors usually listen. This seemed sub-optimal.
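For reference, this is roughly how I checked both of those (assuming ss is available; netstat -tlnp would show the same thing):
# confirm the config file exists and is world-readable
ls -l /etc/ceph/ceph.conf
# look for monitor listeners on the usual msgr v2/v1 ports
ss -tlnp | grep -E ':(3300|6789)'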
Where else should I be looking to troubleshoot our deployment?
distributed-mind
10 Posts
January 29, 2021, 9:45 pm
I should have mentioned that ceph-mon processes are running on the servers I expect them to be running on. I've only got 4 servers up at the moment: 1, 2, 4, and 5. I'm having an unrelated, internal issue with 3. Currently 1, 2, and 4 have ceph-mon processes running... they just aren't listening on any ports.
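For what it's worth, this is how I'm poking at a monitor that is running but not listening (assuming the default admin socket; the mon ID is taken from the --id flag, which on these nodes is the short hostname):
# confirm the daemons and their command lines
pgrep -a ceph-mon
# ask the monitor what state it thinks it is in (this will hang if the daemon is wedged)
ceph daemon mon.$(hostname -s) mon_status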
admin
2,930 Posts
January 30, 2021, 9:45 am
Node 3 deployment builds the cluster, since the cluster requires at least 3 nodes to be up. If this failed, you should hold off adding nodes 4 and 5 until you fix the node 3 build issue.
I recommend you re-install; the 3rd node should build the cluster in about 10 minutes. In case of error, look at the logs on node 3.
distributed-mind
10 Posts
February 1, 2021, 2:40 pm
I've worked with larger Ceph deployments before, both stand-alone and with OpenStack, but honestly I think PetaSAN might be better suited for this organization's needs--certainly simpler on the surface. In my case, the 3rd physical machine has configuration problems in ESX. A ticket with VMware is open and a solution is being pursued.
Some of this is supposition based on my understanding of the technologies involved, but it looks to me like a Gluster block device (w/ MPIO) sits on top of a CephFS implementation in Ceph, the former providing MPIO iSCSI and the latter providing resiliency and IO scalability.
I'm still looking through the code to see what Flask is doing so I can get to the underlying actions, but there's no technical reason why I cannot add them in the order 1, 2, 5, 4, 6, 7, 8, 3 or any other order. As long as 3 nodes (with a minimum of 1 OSD each) are present and can communicate with each other, a quorum is present and the cluster should come online. To make this perfectly clear: PetaSAN has not been installed on what I am calling our 3rd node, so it should not impact deployment. I am referring to them by hostname/number, not by the order in which PetaSAN was installed.
On the first PetaSAN node, I see 14 instances of "consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind <this node's backend IP> -retry-join <node2's backend IP> -retry-join <node5's backend IP>" and around 20 instances of /usr/bin/ceph-mon -f --cluster ceph --id <node1 hostname> --setuser ceph --setgroup ceph. I would have thought one of those ceph-mon processes would have opened a couple of ports.
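In case it's useful, this is roughly how I'm counting those daemons and checking whether the three consul servers ever see each other (consul ships with PetaSAN; the output format may differ by version):
# count the duplicated daemons
pgrep -c consul
pgrep -c ceph-mon
# list the consul cluster members as seen from this node, over the backend network
consul members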
I can ping each node from management interface to management interface, and backend interface to backend interface.
admin
2,930 Posts
February 1, 2021, 6:20 pm
Not sure if I understand correctly, but do you have a PetaSAN cluster built? That is, you first deployed 3 nodes, this ran fine and the cluster was built, and you then added 5 other nodes, which was successful apart from the node you call the third host? If this is not the case and you have not built the cluster, then I suggest, as per my previous post, that you refrain from adding nodes until you successfully join the third node and have an up cluster.
distributed-mind
10 Posts
February 1, 2021, 8:44 pm
I think I may have added too much information so early on, so let's start over.
Let's say we have node1, node2, and node3, each with 3 SSDs and 11 HDDs. I have gone through the process to add these to build a cluster, but it never completes.
Management interfaces are on 10.0.43.1/22
Backend interfaces are on 10.0.36.1/24
Each can be pinged from the others on their management and backend interfaces. I can SSH to each and log in. At present, and for the last five days, I have been stuck at the screen below on the last server, node3.
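For completeness, the reachability check was nothing fancier than confirming the bound addresses and pinging across the backend subnet (10.0.36.2 below simply stands in for another node's backend IP):
# show what is actually bound on this node's interfaces
ip -br addr
# plain reachability across the backend subnet
ping -c 3 10.0.36.2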
admin
2,930 Posts
February 1, 2021, 9:28 pm
So please re-install as per my first post. If you still have an issue (wait no more than 15 min), look at the log file /opt/petasan/log/PetaSAN.log on the third node for errors. You can also post it here if you need help.
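For example, something along these lines is usually enough to spot the failing step (plain tail/grep, nothing PetaSAN-specific):
# watch the deployment log live on the third node
tail -f /opt/petasan/log/PetaSAN.log
# or pull out anything that looks like a failure after the fact
grep -iE 'error|exception|traceback' /opt/petasan/log/PetaSAN.log | tail -n 50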
distributed-mind
10 Posts
February 2, 2021, 3:24 pm
Tried to install twice; both times failed.
Attempt 1 this morning was with all the HDDs as OSDs, 1 SSD for journal, and 1 SSD for caching. It complained about an LVM VG that couldn't find a specific PV. I looked, and neither could I (see the LVM commands at the end of this post). Here are the logs from all 3 hosts for that attempt.
For attempt 2 this morning, I simplified the install: 3 HDD OSDs, 1 SSD for journal, 1 SSD for caching. It failed instantly upon clicking Next for the last time on the 3rd node. Here are the logs from all three hosts for that attempt.
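For the missing-PV complaint in attempt 1, these are the standard LVM views I used when I went looking for it (stock lvm2 tools, nothing PetaSAN-specific):
# list physical volumes, volume groups and logical volumes as LVM sees them
pvs
vgs
lvs -a -o lv_name,vg_name,lv_size,devices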
admin
2,930 Posts
February 2, 2021, 6:52 pm
Can you try without using write cache?
It seems write cache creation on the third drive on node 1 was stuck. Do you have enough RAM? The requirement, as indicated in the UI, is 2% of the SSD partition size used as RAM.
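As a rough worked example (the 500 GB figure is made up purely to show the arithmetic): a 500 GB cache partition at 2% needs about 10 GB of RAM for that one partition, so several cache partitions per node add up quickly.
# 2% of a 500 GB partition, in GB of RAM
echo $(( 500 * 2 / 100 ))    # prints 10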
distributed-mind
10 Posts
February 3, 2021, 2:32 pm
How is the SSD partition size calculated? Or did I miss that in the docs?