New Deployment Never Completes
admin
2,930 Posts
February 3, 2021, 7:56 pm
From the Management UI, after deployment, the Physical Disk page lets you choose the number of partitions (1 to 8) when you add a cache disk; 4 is recommended.
From the Deployment Wizard, if you add a cache device it will use the default of 4 partitions.
Each cache partition will serve 1 HDD OSD.
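To see what this partition-to-OSD mapping means for a node, here is a minimal back-of-the-envelope sketch; the 4-partition default and the 1-partition-per-HDD-OSD rule come from the post above, while the node layouts in the examples are just illustrative values.

```python
import math

def cache_ssds_needed(hdd_osds_per_node: int, partitions_per_cache_ssd: int = 4) -> int:
    """Each cache partition serves exactly one HDD OSD, so a cache SSD
    with N partitions can back at most N OSDs."""
    if not 1 <= partitions_per_cache_ssd <= 8:
        raise ValueError("cache disks take 1 to 8 partitions")
    return math.ceil(hdd_osds_per_node / partitions_per_cache_ssd)

# Example: a node with 8 HDD OSDs, as discussed later in this thread.
print(cache_ssds_needed(8))      # -> 2 cache SSDs at the default 4 partitions
print(cache_ssds_needed(8, 8))   # -> 1 cache SSD if you use 8 partitions
```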
distributed-mind
10 Posts
February 3, 2021, 8:40 pm
I've attempted the install 15 more times, and no permutation of journal/cache on SSD works if I use a cache drive at all, regardless of how many OSDs I have. So I'm guessing that if I want to use cache, I'll need to add it after the cluster is up. Since the cache is a non-standard add-on to Ceph, it seems there should be a test to see which drives qualify (or at least would succeed). I ended up going with 2 journal drives and 8 OSDs per node. I've got some issues with IO and PGs to work out (standard Ceph tuning), but otherwise I think I can call this cluster "up."
It's odd that this is happening. When I tested this in my dev environment, I had no problems at all. 8 servers: 3 SSD, 12 HDD, 2x Xeon E5-2683 v4, 256 GB RAM (32 GB for the PetaSAN instances). But dev is a collection of SuperMicro servers, and prod is Dell PowerEdge R730xd servers. I'm guessing there's something not quite right with the firmware on either the SAS controller (PERC H730) or the drives themselves (Toshiba) on the Dell boxes, because that's the only thing that's different besides BIOS firmware. I expect to have a little more time in a couple of months; I'll peruse the code and see if I can figure out why creating a cache drive fails. Is there a repo one could submit pull requests to?
admin
2,930 Posts
February 3, 2021, 9:13 pm
OK, it is clearer now that this is an issue with the cache, since you can build the system without it.
If you need to use write cache, try adding it from the Physical Disk list after you have built your cluster. You can build your cluster without OSDs to start with, or with OSDs that have journals only. That should give more info on when/why adding an OSD with cache fails, without a complete build failure, and your cluster will still be up.
I do not know why adding OSDs with cache fails on your Dell servers but worked on your Supermicro servers. The only guess I have, as posted earlier, is that you need RAM equal to about 2% of your total SSD cache size. How much RAM do you have on the Dell servers? How large is your SSD cache disk? How many OSDs per host did you try to add with cache?
Last edited on February 3, 2021, 9:17 pm by admin · #13
distributed-mind
10 Posts
February 4, 2021, 1:30 pm
What you're suggesting about there not being enough RAM is very plausible if 2% is in fact required. Our flash disks are 1,920 GB. The most I can feasibly give PetaSAN is 48 of the host's 256 GB; other VMs need RAM, too. Given that, it looks like flash cache is out of the question.
As it stands, I'm moving iSCSI targets to hosts that aren't monitors/managers. Per the hardware guide, with 8 OSDs we should be allocating a minimum of 34 GB, and our iSCSI targets should have 48 GB. This is actually pretty close to a typical Ceph install, so I conservatively gave each VM 32 GB. I'll be bumping that up shortly.
In the meantime, I have a new issue that is preventing me from completing this deployment: I cannot attach new nodes. And since Bash access is no longer available from the console, and I don't know what credentials to use to SSH to each host, I cannot troubleshoot this further. Should I create a new post or continue in this one?
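For reference, here is a rough RAM-budget sketch combining the ~2% cache guideline mentioned earlier in the thread with the figures in the post above (1,920 GB cache SSD, a 48 GB VM budget, the hardware guide's 34 GB minimum for 8 OSDs). The 2% figure and the exact overhead are assumptions, so treat this only as an estimate.

```python
# Rough RAM budget for one PetaSAN node, based on figures quoted in this thread.
# Assumption: the write cache needs roughly 2% of the SSD cache capacity in RAM.

cache_ssd_gb   = 1920    # SSD cache disk size (GB)
cache_ram_pct  = 0.02    # ~2% of cache size, per the guideline quoted above
osd_min_ram_gb = 34      # hardware-guide minimum for 8 OSDs
vm_budget_gb   = 48      # RAM the poster can give the PetaSAN VM

cache_ram_gb = cache_ssd_gb * cache_ram_pct     # ~38.4 GB just for the cache
total_needed = cache_ram_gb + osd_min_ram_gb    # ~72.4 GB once OSDs are included

print(f"cache RAM needed : {cache_ram_gb:.1f} GB")
print(f"total with OSDs  : {total_needed:.1f} GB (budget is {vm_budget_gb} GB)")
# With only 48 GB available, the cache alone nearly consumes the budget,
# which is consistent with dropping the flash cache in this setup.
```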
distributed-mind
10 Posts
February 4, 2021, 2:01 pm
I think I need to rephrase something said above: I can add a node to the cluster, but its OSDs do not show up in the list. During the process of configuring the node, I am left with the same screen as in post #6.
admin
2,930 Posts
February 4, 2021, 2:33 pm
- Did you re-install, did the cluster build successfully after the 3rd node, and was everything OK before you joined the new nodes?
- If so, was the cluster status OK, and do you now have some errors? What is the cluster status now; is it OK? If not, when/at what action did the status fail?
- The root SSH password is the same password you set for the cluster.
distributed-mind
10 Posts
February 4, 2021, 2:47 pm
The cluster was successfully built, without cache. I currently have a cluster with 3 nodes, all functional with no errors. It's complaining that there are too many PGs per OSD, but this is expected since the rest of the nodes and their OSDs haven't been added yet. I am trying to add a 4th node with 8 OSDs.
Last edited on February 4, 2021, 5:02 pm by distributed-mind · #17
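As a side note on the "too many PGs per OSD" warning: the ratio Ceph checks is roughly total PG replicas divided by the number of OSDs, so adding the remaining nodes' OSDs brings it back down. A minimal sketch follows; the pool size, PG count, and the mon_max_pg_per_osd default of 250 are assumptions for illustration, not values taken from this cluster.

```python
def pgs_per_osd(pg_num: int, replica_size: int, num_osds: int) -> float:
    """Average PG replicas each OSD carries; Ceph warns when this
    exceeds mon_max_pg_per_osd (250 by default in recent releases)."""
    return pg_num * replica_size / num_osds

# Hypothetical pool: 4096 PGs with 3 replicas, on nodes of 8 OSDs each.
print(pgs_per_osd(4096, 3, 3 * 8))  # 3 nodes -> 512.0, triggers the warning
print(pgs_per_osd(4096, 3, 8 * 8))  # 8 nodes -> 192.0, back under the limit
```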
admin
2,930 Posts
February 4, 2021, 5:57 pm
"I can add a node to the cluster, but its OSDs do not show up in the list."
Does the node report success when it is added, or does it report an error?
Do you select OSDs to add during node deployment, or are you trying to add them after you successfully added the node?
Are you referring to the Physical Disk list? Do the disks not show, or is the list totally blank?