General Questions
Pages: 1 2
rickbharper
11 Posts
September 7, 2017, 9:42 pm
I recently stumbled across your project and I'm very interested in trying it - I've already read the guides and I'm in the process of setting up a testing lab to run through the installers and start 'playing', but I have a couple of questions:
- How much traffic does the management interface generate? On a server with 10GbE and 1GbE interfaces would running the management interface over 1GbE be appropriate?
- Are link aggregations supported?
- What are the best practices for connecting physical disks to the system? Is it best to use hardware RAID cards and present logical volumes to the system or pass the disks directly through to the OS - I have a lot of IT mode LSI 2008 based HBA's in my environment that we use for ZFS file-systems, would these be a good fit for PetaSAN?
- How does the system handle individual disk failures? The administration guide mentions several email alerts, including complete node failures, but I don't see anything about drive failures. What is the process for replacing a failed drive?
- Are snapshots currently available or on the road-map?
- Can the system be upgraded without downtime by upgrading one node at a time or is it necessary for all nodes to be running the same version for the iscsi target to be active?
- Are there any recommendations for setting up PetaSAN with vmWare iscsi multipathing? Is the best performance gained by using round-robin policies? Are there recommendations for iops balancing settings?
- Is it possible to have different storage pools backed by separate nodes? IE: A high-performance pool backed by ssd nodes and a high-capacity pool backed by spinning disk nodes.
- What does your licensing model look like? Will the base product always be free with revenue coming from support contracts? Or will the platform eventually transition to a paid-model?
admin
2,930 Posts
September 8, 2017, 1:17 pm
How much traffic does the management interface generate? On a server with 10GbE and 1GbE interfaces would running the management interface over 1GbE be appropriate?
It generates negligible traffic; it is only used while you manage the system through the UI. A 1G link will certainly do, and you can also make the management subnet share the same NIC as iSCSI 1. We recommend having 2 (minimum) to 4 10G NICs.
Are link aggregations supported?
Yes, they were introduced in v1.3. It is set up once at cluster creation time and applies to all nodes. You can see it in the screenshots section.
What are the best practices for connecting physical disks to the system? Is it best to use hardware RAID cards and present logical volumes to the system or pass the disks directly through to the OS - I have a lot of IT mode LSI 2008 based HBA's in my environment that we use for ZFS file-systems, would these be a good fit for PetaSAN?
Ceph likes to see and handle many disks/OSDs itself, so do not RAID disks together. Configure your card for JBOD mode, or single-disk RAID-0 with write-back cache if it is battery backed. The more OSDs you add, the faster the system becomes, until you saturate your CPU; at that point you cannot add more disks and have to add more nodes instead. The cluster benchmark page will help with this tuning.
How does the system handle individual disk failures? The administration guide mentions several email alerts, including complete node failures, but I don't see anything about drive failures. What is the process for replacing a failed drive?
Ceph is self-healing: if you have 20 disks and 1 fails, the lost replicas are automatically re-created and stored across the remaining 19 disks. This ensures even balancing and, more importantly, each disk shares roughly 5% of the recovery load, so performance is not degraded.
The PetaSAN UI will show you the disk error and send you an email notification. You remove/delete the failed disk from the UI so it no longer shows up in red, but Ceph does its recovery regardless. You can, if you wish, add one or more disks later and Ceph will happily rebalance its data, giving better performance and capacity, but technically there is no 1-to-1 disk replacement.
If an entire node fails, assuming it is not one of the first 3 nodes, then from Ceph's point of view it has lost a number of disks and will do its rebalancing. If you add/join a new node, Ceph will use the new disks and rebalance again; it is not viewed as a replacement, and the new node need not have the same number of disks or the same IP addresses/hostname, etc. PetaSAN also sends email notifications in case of node failure.
The only exception requiring replacement is a failure of one of the first 3 nodes (i.e. a management node). These run the Ceph monitors and Consul servers as well as the PetaSAN management functions and are the brains of the system. The cluster can afford 1 management node failure but not 2, so such a node needs to be replaced as soon as possible. The deployment wizard offers to replace a management node as its third option when deploying a new node.
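For reference, recovery progress after a disk or node failure can also be followed from the command line with the standard Ceph status commands (nothing PetaSAN-specific here; replace CLUSTER_NAME with your cluster name):
ceph health detail --cluster CLUSTER_NAME    # lists degraded/undersized placement groups, if any
ceph status --cluster CLUSTER_NAME           # overall health plus recovery/rebalance progress
ceph osd tree --cluster CLUSTER_NAME         # shows which OSDs are up or down and on which host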
Are snapshots currently available or on the road-map?
Yes, we have a lot of new features planned on our road-map, and snapshots are among them.
Can the system be upgraded without downtime by upgrading one node at a time or is it necessary for all nodes to be running the same version for the iscsi target to be active?
When upgrading using the installer, the node will be down during the process, which should take about 5 minutes, but the cluster will remain functioning. All your disks and client IO will stay active; we take care of any version issues, and this is transparent to you.
Are there any recommendations for setting up PetaSAN with vmWare iscsi multipathing? Is the best performance gained by using round-robin policies? Are there recommendations for iops balancing settings?
It is all about better load balancing. If you have many ESX hosts with exactly the same IO load pattern, then an active/active policy probably will not buy you much. However, in many cases some ESX hosts have higher load requirements than others, and the IO patterns may be bursty and uneven. In such cases, round-robin with IOPS balancing will distribute the load better across the PetaSAN nodes as well as the network, and will result in better performance.
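As an illustration only (the device identifier below is hypothetical and defaults vary by ESXi version), round-robin with IOPS-based balancing is typically set per device from the ESXi shell:
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1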
Is it possible to have different storage pools backed by separate nodes? IE: A high-performance pool backed by ssd nodes and a high-capacity pool backed by spinning disk nodes.
It will be supported in the future.
What does your licensing model look like? Will the base product always be free with revenue coming from support contracts? Or will the platform eventually transition to a paid-model?
PetaSAN will always remain open source, and commercial support will be optional. It will be similar to the licensing of the Proxmox project, for example.
rickbharper
11 Posts
September 8, 2017, 1:34 pm
Thanks, I appreciate the information. One more question:
Can nodes be safely removed from the cluster once they've been added? In the case of hardware refreshes, is it possible to add a new node, have Ceph rebalance the load and then tag the old node for removal?
admin
2,930 Posts
September 8, 2017, 2:54 pm
You can remove a node. Ceph will re-create the lost replicas just as if the node died.
If your replica count is 3 and a node with all its disks dies, Ceph still has 2 running copies of every object that was present on the dead node and will create a third copy.
If your replica count is 2 and a node dies, Ceph has 1 running copy left of those objects and will recreate a second copy. In the unlucky event that another node dies before Ceph completes its recovery, data loss could happen if the disks have physically failed.
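If you want to check or change the replica count from the command line, the standard Ceph pool commands apply (the pool name below is just an example; use your actual pool and cluster name):
ceph osd pool get rbd size --cluster CLUSTER_NAME      # show the current replica count
ceph osd pool set rbd size 3 --cluster CLUSTER_NAME    # raise the replica count to 3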
Note that you can take OSD disks from one node and place them in other running nodes, and Ceph will happily use them. Ceph even supports hot-plugging OSDs (if not using external journal disks).
In PetaSAN, if you want to remove a node but keep using its OSDs with their data, you need to first move its OSDs to other running nodes and then remove the original node, since the act of node removal removes any locally running OSDs from the cluster.
Lastly, if you want to remove a node without reusing its data disks, but are concerned that other failures might happen during Ceph recovery, you can use the command line instead, so that the original replica is kept and not removed until its replacement copy is complete.
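As a rough outline only (standard Ceph commands, with OSD id 5 purely as an example), the per-OSD steps are to mark the OSD out, wait for recovery to complete, then remove it:
ceph osd out 5 --cluster CLUSTER_NAME              # data is migrated away while the OSD keeps serving its copies
ceph status --cluster CLUSTER_NAME                 # wait until health is OK and recovery has finished
# then stop the OSD service on that node before removing it from the cluster
ceph osd crush remove osd.5 --cluster CLUSTER_NAME
ceph auth del osd.5 --cluster CLUSTER_NAME
ceph osd rm 5 --cluster CLUSTER_NAME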
Last edited on September 8, 2017, 3:02 pm by admin · #4
rickbharper
11 Posts
September 26, 2017, 9:18 pm
I've been experimenting with PetaSAN in a virtual environment for the last few weeks and I've been quite impressed with the redundancy and ease of set-up. Today I started my testing on 'bare-metal' utilizing some spare servers I had laying around. If all goes well here, I'll most likely migrate several of my FreeNAS servers over to PetaSAN...
I've already hit a weird issue with some of my disks not being added to the cluster...
First my hardware layout:
- 3 Dell R310 servers
- 1 sata boot disk
- 3 sata 7.2k 500 gb storage disks
- 2 onboard 1GbE nics - one used for Management Network
- eth0 - management
- eth1 - not used
- 2 add-in dual port 10GbE nics (mix of Chelsio and Mellanox)
- eth2 & eth3 combined into lagg0 - used for both back-end networks
- eth4 - iscsi-1
- eth5 - iscsi-2
- Management network is part of my normal management subnet
- Back-end networks on running on a dedicated 10GbE switch
- iscsi networks are connected to my production iscsi switches so that I can test from my VMware hosts
Setup completed without an issue and my cluster was created successfully; however, my dashboard shows only 5 OSDs (I should have 9 - 3 servers x 3 storage disks). Digging around, I see that all 3 disks of node-3 were used, 2 disks on node-2 were used, and none of the disks on node-1 were used. I can see all the disks through the node/list/disk_list page.
I have verified that ping tests from all nodes on all management and back-end interfaces complete correctly so I'm fairly certain that there aren't any networking issues.
I have tried adding the disks to the cluster using the GUI - I click the + button and confirm, the GUI shows 'Adding' for the status, then it simply resets without any error messages. I have logged into node-1 and looked at the log, but I don't see any errors:
root@psNode01:/etc# tail /opt/petasan/log/PetaSAN.log
26/09/2017 14:54:36 INFO Start cleaning disks
26/09/2017 14:54:38 INFO Starting ceph-disk zap /dev/sdc
26/09/2017 15:05:00 INFO Success saving application config
26/09/2017 15:05:00 INFO Success saving application config
26/09/2017 15:05:00 INFO Success saving application config
26/09/2017 15:11:30 INFO Start job for add osd for disk sdb.
26/09/2017 15:11:30 INFO Start add osd job 22193
26/09/2017 15:11:31 INFO Start cleaning disks
26/09/2017 15:11:32 INFO Starting ceph-disk zap /dev/sdb
I have also looked at the disks using fdisk:
root@psNode01:/etc# fdisk -l
Disk /dev/sda: 232.9 GiB, 250000000000 bytes, 488281250 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 63 1975994 1975932 964.8M 83 Linux
/dev/sda2 1975995 31294619 29318625 14G 83 Linux
/dev/sda3 31294620 50845724 19551105 9.3G 83 Linux
/dev/sda4 50845725 488279609 437433885 208.6G 83 Linux
Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F76EDC6F-5B73-4EB6-89C4-3EBDDB8CD61D
Device Start End Sectors Size Type
/dev/sdb1 10487808 976773134 966285327 460.8G Ceph OSD
/dev/sdb2 2048 10487807 10485760 5G Ceph Journal
Partition table entries are not in disk order.
Disk /dev/sdc: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3529DD35-31F7-41B8-98FE-C15512B00A56
Device Start End Sectors Size Type
/dev/sdc1 10487808 976773134 966285327 460.8G Ceph OSD
/dev/sdc2 2048 10487807 10485760 5G Ceph Journal
Partition table entries are not in disk order.
Disk /dev/sdd: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 915F2B85-EAC5-48D7-A285-1C7202507918
Device Start End Sectors Size Type
/dev/sdd1 10487808 976773134 966285327 460.8G Ceph OSD
/dev/sdd2 2048 10487807 10485760 5G Ceph Journal
Partition table entries are not in disk order.
It seems that the disks are being partitioned correctly, but they simply aren't being added.
I ran a SMART test against one of the drives in question (thinking that perhaps PetaSAN wouldn't add a disk that was failing) - but that came back clear as well.
I would appreciate any help here...
admin
2,930 Posts
September 26, 2017, 10:00 pm
Hi,
On node-1, can you post the output of:
ceph-disk list
ceph osd tree --cluster CLUSTER_NAME
Also attempt to add the /dev/sdb OSD manually using the following command:
ceph-disk prepare --cluster CLUSTER_NAME --zap-disk --fs-type xfs /dev/sdb
Does it complete successfully? Any errors? Does the OSD get added to the cluster?
Lastly, if you can, please email me the PetaSAN.log file for node-1: admin @ petasan.org
Last edited on September 26, 2017, 10:02 pm by admin · #6
rickbharper
11 Posts
September 26, 2017, 10:10 pm
Here is the disk list:
root@psNode01:/etc# ceph-disk list
/dev/rbd0 other, unknown
/dev/sda :
/dev/sda2 other, ext4, mounted on /
/dev/sda1 other, ext4, mounted on /boot
/dev/sda4 other, ext4, mounted on /opt/petasan/config
/dev/sda3 other, ext4, mounted on /var/lib/ceph
/dev/sdb :
/dev/sdb1 other
/dev/sdb2 ceph journal
/dev/sdc :
/dev/sdc1 other
/dev/sdc2 ceph journal
/dev/sdd :
/dev/sdd1 other
/dev/sdd2 ceph journal
/dev/sr0 other, unknown
/dev/sr1 other, unknown
OSD Tree:
root@psNode01:/etc# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
This is the output when I try to manually add the disk:
root@psNode01:/etc# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
root@psNode01:/etc# ceph-disk prepare --cluster psCluster01 --zap-disk --fs-type xfs /dev/sdb
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.
Warning! One or more CRCs don't match. You should repair the disk!
****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sdb1 isize=2048 agcount=4, agsize=30196417 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=120785665, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=58977, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
It would seem that there is something wrong with the disk partition table - despite the last line stating the operation completed successfully, the OSD is not added...
UPDATE:
I used gdisk to wipe out the GPT and blank the MBR, then tried to manually add the disk again. This time there were no errors in the manual add process:
root@psNode01:/etc# ceph-disk prepare --cluster psCluster01 --zap-disk --fs-type xfs /dev/sdb
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sdb1 isize=2048 agcount=4, agsize=30196417 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=120785665, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=58977, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
But again, this disk is not added to the cluster:
root@psNode01:/etc# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
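For anyone hitting the same thing, the GPT/MBR wipe mentioned above can be done with something like the following (sgdisk ships with the gdisk package; the exact steps may vary):
sgdisk --zap-all /dev/sdb                      # destroy the GPT structures and protective MBR
dd if=/dev/zero of=/dev/sdb bs=512 count=1     # blank the first sector for good measure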
Last edited on September 26, 2017, 10:20 pm by rickbharper · #7
admin
2,930 Posts
September 26, 2017, 10:33 pm
Can you edit the ceph conf file /etc/ceph/CLUSTER.conf and comment out osd_mkfs_options_xfs and osd_mount_options_xfs
# osd_mkfs_options_xfs = -f -i size=2048
# osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
then re-run
ceph-disk prepare --cluster CLUSTER_NAME --zap-disk --fs-type xfs /dev/sdb
Last edited on September 26, 2017, 10:33 pm by admin · #8
rickbharper
11 Posts
September 26, 2017, 10:42 pm
root@psNode01:/etc/ceph# cat /etc/ceph/psCluster01.conf | grep -i osd
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_size = 2
osd_pool_default_min_size = 1
debug_osd = 0/0
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 500
[osd]
osd_crush_update_on_start = true
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_mkfs_type = xfs
#osd_mkfs_options_xfs = -m crc=0,finobt=0 -f -i size=2048
#osd_mkfs_options_xfs = -f -i size=2048
#osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_priority = 1
osd_recovery_op_priority = 1
osd_recovery_threads = 1
osd_client_op_priority = 63
osd_recovery_max_start = 1
osd_max_scrubs = 1
osd_scrub_during_recovery = false
osd_scrub_priority = 1
osd_scrub_begin_hour = 20
osd_scrub_end_hour = 8
osd_op_threads=4
root@psNode01:/etc/ceph# ceph-disk prepare --cluster psCluster01 --zap-disk --fs-type xfs /dev/sdb
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sdb1 isize=2048 agcount=4, agsize=30196417 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=120785665, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=58977, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
root@psNode01:/etc/ceph# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
Still the same result - no errors, and it claims to have completed successfully, but the OSD is not added...
admin
2,930 Posts
September 26, 2017, 11:16 pm
Can you please try the following command:
ceph-disk activate /dev/sdb1
and see if it succeeds.
What we see so far is that the Ceph tool ceph-disk is failing to prepare the disk. I will try tomorrow to add some logging to this tool and post it so we can identify in which area it is failing. If you have other disks you can try, it will help us know whether it is disk related or not.
Last edited on September 26, 2017, 11:18 pm by admin · #10
Pages: 1 2
General Questions
rickbharper
11 Posts
Quote from rickbharper on September 7, 2017, 9:42 pmI recently stumbled across your project and I'm very interested in trying it - I've already read the guides and I'm in the process of setting up a testing lab to run through the installers and start 'playing', but I have a couple of questions:
- How much traffic does the management interface generate? On a server with 10GbE and 1GbE interfaces would running the management interface over 1GbE be appropriate?
- Are link aggregations supported?
- What are the best practices for connecting physical disks to the system? Is it best to use hardware RAID cards and present logical volumes to the system or pass the disks directly through to the OS - I have a lot of IT mode LSI 2008 based HBA's in my environment that we use for ZFS file-systems, would these be a good fit for PetaSAN?
- How does the system handle individual disk failures? The administration guide mentions several email alerts, including complete node failures, but I don't see anything about drive failures. What is the process for replacing a failed drive?
- Are snapshots currently available or on the road-map?
- Can the system be upgraded without downtime by upgrading one node at a time or is it necessary for all nodes to be running the same version for the iscsi target to be active?
- Are there any recommendations for setting up PetaSAN with vmWare iscsi multipathing? Is the best performance gained by using round-robin policies? Are there recommendations for iops balancing settings?
- Is it possible to have different storage pools backed by separate nodes? IE: A high-performance pool backed by ssd nodes and a high-capacity pool backed by spinning disk nodes.
- What does your licensing model look like? Will the base product always be free with revenue coming from support contracts? Or will the platform eventually transition to a paid-model?
I recently stumbled across your project and I'm very interested in trying it - I've already read the guides and I'm in the process of setting up a testing lab to run through the installers and start 'playing', but I have a couple of questions:
- How much traffic does the management interface generate? On a server with 10GbE and 1GbE interfaces would running the management interface over 1GbE be appropriate?
- Are link aggregations supported?
- What are the best practices for connecting physical disks to the system? Is it best to use hardware RAID cards and present logical volumes to the system or pass the disks directly through to the OS - I have a lot of IT mode LSI 2008 based HBA's in my environment that we use for ZFS file-systems, would these be a good fit for PetaSAN?
- How does the system handle individual disk failures? The administration guide mentions several email alerts, including complete node failures, but I don't see anything about drive failures. What is the process for replacing a failed drive?
- Are snapshots currently available or on the road-map?
- Can the system be upgraded without downtime by upgrading one node at a time or is it necessary for all nodes to be running the same version for the iscsi target to be active?
- Are there any recommendations for setting up PetaSAN with vmWare iscsi multipathing? Is the best performance gained by using round-robin policies? Are there recommendations for iops balancing settings?
- Is it possible to have different storage pools backed by separate nodes? IE: A high-performance pool backed by ssd nodes and a high-capacity pool backed by spinning disk nodes.
- What does your licensing model look like? Will the base product always be free with revenue coming from support contracts? Or will the platform eventually transition to a paid-model?
admin
2,930 Posts
Quote from admin on September 8, 2017, 1:17 pmHow much traffic does the management interface generate? On a server with 10GbE and 1GbE interfaces would running the management interface over 1GbE be appropriate?
It requires very neglible traffic, it is used while managing the system during user operations. Sure a 1G will do, you can also make the management subnet share the same nic as iSCSI 1. We recommend having 2(min) to 4 10G nics.
Are link aggregations supported?
Yes they were introduced in v1.3, It is setup once at cluster creation time and apply to all nodes. You can see it in the screen shots section.
What are the best practices for connecting physical disks to the system? Is it best to use hardware RAID cards and present logical volumes to the system or pass the disks directly through to the OS - I have a lot of IT mode LSI 2008 based HBA's in my environment that we use for ZFS file-systems, would these be a good fit for PetaSAN?
Ceph likes to see and handle many disks/OSDs itself, so do not use disks RAIDed together. Configure your card for JBOD mode or single disk RAID-0 with write back cache if it is battery backed. The more you add OSDs the faster the system becomes, until you saturate your cpu % usage then you cannot add more disk and have to add more nodes, the cluster benchmark page will help with this tuning.
How does the system handle individual disk failures? The administration guide mentions several email alerts, including complete node failures, but I don't see anything about drive failures. What is the process for replacing a failed drive?
Ceph is self-healing, if you have 20 disks and 1 fails, the lost replicas will automatically be re-created and stored on all remaining 19 disks, this ensures even balancing and more importantly each disk will share 5% of the recovery load so it will not degrade performance.
The PetaSAN UI will show you the disks error and will send you email notification. You remove/delete the failed disk from the ui so it does not show up in red but Ceph does its recovery regardless. You can, if you wish, add 1 or more disk later and Ceph will happily rebalance its data and offer better performance and capacity, but technically there is no 1 to 1 disk replacement.
If an entire node fails, assuming it is not one of te first 3 nodes, then from Ceph's point of view it lost a couple of disks and will do its rebalancing. If you add/join a new node Ceph will use the new disks and rebalance again, it is not viewed as a replacement, the new node need not have the same number of disk or have the same ip addresses/hostname..etc. Also in case of node failure PetaSAN does send email notifications.
The only exception requiring replacement is a node failure that is one of the first 3 nodes (ie management node), they include the Ceph monitors and Consul servers as well as the PetaSAN management functions and are the brains of the system. The custer can offord 1 failure of the management nodes but to 2. So such a node failure needs to be substituted as soon as possibe. The deployment wizard offers to replace a management node as its third options when deploying a new node.Are snapshots currently available or on the road-map?
Yes we have a lot of new features we plan in our road-map, snapshots are among them.
Can the system be upgraded without downtime by upgrading one node at a time or is it necessary for all nodes to be running the same version for the iscsi target to be active?
When upgrading using the installer the node will be down during the process which should take 5 minutes, but the cluster will be functioning. All your disks and client io will be active, we take care of any version issues but this is transparent to you.
Are there any recommendations for setting up PetaSAN with vmWare iscsi multipathing? Is the best performance gained by using round-robin policies? Are there recommendations for iops balancing settings?
It is all for better load balancing. If you have many ESXs that have excatly the same io load pattern then probably an active/active policy will not buy much. However it many cases you may have some ESX with higher load requirements than others as well as the io patterns may have have high burst and not even. In such cases having round-round and iops balancing will distribute the load better on the PetaSAN nodes as well as on the network and will result in better performance.
Is it possible to have different storage pools backed by separate nodes? IE: A high-performance pool backed by ssd nodes and a high-capacity pool backed by spinning disk nodes.
It will be supported in the future
What does your licensing model look like? Will the base product always be free with revenue coming from support contracts? Or will the platform eventually transition to a paid-model?
PetaSAN will always remain open source, the commercial support will be optional. It will be similar to licensing of Proxmox project for example.
How much traffic does the management interface generate? On a server with 10GbE and 1GbE interfaces would running the management interface over 1GbE be appropriate?
It requires very neglible traffic, it is used while managing the system during user operations. Sure a 1G will do, you can also make the management subnet share the same nic as iSCSI 1. We recommend having 2(min) to 4 10G nics.
Are link aggregations supported?
Yes they were introduced in v1.3, It is setup once at cluster creation time and apply to all nodes. You can see it in the screen shots section.
What are the best practices for connecting physical disks to the system? Is it best to use hardware RAID cards and present logical volumes to the system or pass the disks directly through to the OS - I have a lot of IT mode LSI 2008 based HBA's in my environment that we use for ZFS file-systems, would these be a good fit for PetaSAN?
Ceph likes to see and handle many disks/OSDs itself, so do not use disks RAIDed together. Configure your card for JBOD mode or single disk RAID-0 with write back cache if it is battery backed. The more you add OSDs the faster the system becomes, until you saturate your cpu % usage then you cannot add more disk and have to add more nodes, the cluster benchmark page will help with this tuning.
How does the system handle individual disk failures? The administration guide mentions several email alerts, including complete node failures, but I don't see anything about drive failures. What is the process for replacing a failed drive?
Ceph is self-healing, if you have 20 disks and 1 fails, the lost replicas will automatically be re-created and stored on all remaining 19 disks, this ensures even balancing and more importantly each disk will share 5% of the recovery load so it will not degrade performance.
The PetaSAN UI will show you the disks error and will send you email notification. You remove/delete the failed disk from the ui so it does not show up in red but Ceph does its recovery regardless. You can, if you wish, add 1 or more disk later and Ceph will happily rebalance its data and offer better performance and capacity, but technically there is no 1 to 1 disk replacement.
If an entire node fails, assuming it is not one of te first 3 nodes, then from Ceph's point of view it lost a couple of disks and will do its rebalancing. If you add/join a new node Ceph will use the new disks and rebalance again, it is not viewed as a replacement, the new node need not have the same number of disk or have the same ip addresses/hostname..etc. Also in case of node failure PetaSAN does send email notifications.
The only exception requiring replacement is a node failure that is one of the first 3 nodes (ie management node), they include the Ceph monitors and Consul servers as well as the PetaSAN management functions and are the brains of the system. The custer can offord 1 failure of the management nodes but to 2. So such a node failure needs to be substituted as soon as possibe. The deployment wizard offers to replace a management node as its third options when deploying a new node.
Are snapshots currently available or on the road-map?
Yes we have a lot of new features we plan in our road-map, snapshots are among them.
Can the system be upgraded without downtime by upgrading one node at a time or is it necessary for all nodes to be running the same version for the iscsi target to be active?
When upgrading using the installer the node will be down during the process which should take 5 minutes, but the cluster will be functioning. All your disks and client io will be active, we take care of any version issues but this is transparent to you.
Are there any recommendations for setting up PetaSAN with vmWare iscsi multipathing? Is the best performance gained by using round-robin policies? Are there recommendations for iops balancing settings?
It is all for better load balancing. If you have many ESXs that have excatly the same io load pattern then probably an active/active policy will not buy much. However it many cases you may have some ESX with higher load requirements than others as well as the io patterns may have have high burst and not even. In such cases having round-round and iops balancing will distribute the load better on the PetaSAN nodes as well as on the network and will result in better performance.
Is it possible to have different storage pools backed by separate nodes? IE: A high-performance pool backed by ssd nodes and a high-capacity pool backed by spinning disk nodes.
It will be supported in the future
What does your licensing model look like? Will the base product always be free with revenue coming from support contracts? Or will the platform eventually transition to a paid-model?
PetaSAN will always remain open source, the commercial support will be optional. It will be similar to licensing of Proxmox project for example.
rickbharper
11 Posts
Quote from rickbharper on September 8, 2017, 1:34 pmThanks, I appreciate the information. One more question:
Can nodes be safely removed from the cluster once they've been added? In the case of hardware refreshes, is it possible to add a new node, have Ceph rebalance the load and then tag the old node for removal?
Thanks, I appreciate the information. One more question:
Can nodes be safely removed from the cluster once they've been added? In the case of hardware refreshes, is it possible to add a new node, have Ceph rebalance the load and then tag the old node for removal?
admin
2,930 Posts
Quote from admin on September 8, 2017, 2:54 pmYou can remove a node. Ceph will re-create the lost replicas just as if the node died.
If your replica count is 3 and a node with all its disks dies, Ceph still has 2 running copies of every object that was present on the dead node and will create a third copy.
If your replica count is 2 and a node dies, Ceph has 1 running copy left of those objects and will recreate a second copy. In the unlucky event that another node dies before Ceph completes its recovery, data loss could happen if the disks physically fail.
Note that you can take OSD disks from one node and place them in another running node (or nodes) and Ceph will happily use them. Ceph even supports hot plugging OSDs (if not using external journal disks).
In PetaSAN, if you want to remove a node but keep using its OSDs with their data, you need to first move its OSDs to other running nodes and then remove the original node, since the act of node removal removes any locally running OSDs from the cluster.
Lastly, if you want to remove a node without reusing its data disks but are concerned that other failures might happen during Ceph recovery, you can use the command line to keep the original replica in place and not remove it until its replacement copy is complete.
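As a rough sketch of that command line approach (generic Ceph commands, osd.7 is just an example id; adapt the cluster name and OSD numbers to your setup):
# mark the OSD out; Ceph starts creating replacement copies while the original data stays in place
ceph osd out osd.7 --cluster CLUSTER_NAME
# watch until all placement groups are back to active+clean
ceph -w --cluster CLUSTER_NAME
# only then remove it from the CRUSH map, auth list and OSD map
ceph osd crush remove osd.7 --cluster CLUSTER_NAME
ceph auth del osd.7 --cluster CLUSTER_NAME
ceph osd rm osd.7 --cluster CLUSTER_NAME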
rickbharper
11 Posts
Quote from rickbharper on September 26, 2017, 9:18 pmI've been experimenting with PetaSAN in a virtual environment for the last few weeks and I've been quite impressed with the redundancy and ease of set-up. Today I started my testing on 'bare-metal' utilizing some spare servers I had lying around. If all goes well here, I will most likely migrate several of my FreeNAS servers over to PetaSAN...
I've already hit a weird issue with some of my disks not being added to the cluster...
First my hardware layout:
- 3 Dell R310 servers
- 1 sata boot disk
- 3 sata 7.2k 500 gb storage disks
- 2 onboard 1GbE nics - one used for Management Network
- eth0 - management
- eth1 - not used
- 2 add-in dual port 10GbE nics (mix of Chelsio and Mellanox)
- eth2 & eth3 combined into lagg0 - used for both back-end networks
- eth4 - iscsi-1
- eth5 - iscsi-2
- Management network is part of my normal management subnet
- Back-end networks are running on a dedicated 10GbE switch
- iscsi networks are connected to my production iscsi switches so that I can test from my VMware hosts
Setup completed without an issue and my cluster created successfully; however my dashboard shows only 5 OSD's (I should have 9 - 3 servers x 3 storage disks) - digging around I see that all 3 disks of node-3 were used, 2 disks on node-2 were used, and none of the disks on node-1 were used. I can see all the disks through the node/list/disk_list page.
I have verified that ping tests from all nodes on all management and back-end interfaces complete correctly so I'm fairly certain that there aren't any networking issues.
I have tried adding the disks to the cluster using the GUI - I click the + button and confirm - the gui shows 'Adding' for the status - then it simply resets without any error messages. I have logged into node-1 and looked at the log, but I don't see any errors:
root@psNode01:/etc# tail /opt/petasan/log/PetaSAN.log
26/09/2017 14:54:36 INFO Start cleaning disks
26/09/2017 14:54:38 INFO Starting ceph-disk zap /dev/sdc
26/09/2017 15:05:00 INFO Success saving application config
26/09/2017 15:05:00 INFO Success saving application config
26/09/2017 15:05:00 INFO Success saving application config
26/09/2017 15:11:30 INFO Start job for add osd for disk sdb.
26/09/2017 15:11:30 INFO Start add osd job 22193
26/09/2017 15:11:31 INFO Start cleaning disks
26/09/2017 15:11:32 INFO Starting ceph-disk zap /dev/sdb
I have also looked at the disks using fdisk:
root@psNode01:/etc# fdisk -l
Disk /dev/sda: 232.9 GiB, 250000000000 bytes, 488281250 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 63 1975994 1975932 964.8M 83 Linux
/dev/sda2 1975995 31294619 29318625 14G 83 Linux
/dev/sda3 31294620 50845724 19551105 9.3G 83 Linux
/dev/sda4 50845725 488279609 437433885 208.6G 83 Linux
Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F76EDC6F-5B73-4EB6-89C4-3EBDDB8CD61D
Device Start End Sectors Size Type
/dev/sdb1 10487808 976773134 966285327 460.8G Ceph OSD
/dev/sdb2 2048 10487807 10485760 5G Ceph Journal
Partition table entries are not in disk order.
Disk /dev/sdc: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3529DD35-31F7-41B8-98FE-C15512B00A56
Device Start End Sectors Size Type
/dev/sdc1 10487808 976773134 966285327 460.8G Ceph OSD
/dev/sdc2 2048 10487807 10485760 5G Ceph Journal
Partition table entries are not in disk order.
Disk /dev/sdd: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 915F2B85-EAC5-48D7-A285-1C7202507918
Device Start End Sectors Size Type
/dev/sdd1 10487808 976773134 966285327 460.8G Ceph OSD
/dev/sdd2 2048 10487807 10485760 5G Ceph Journal
Partition table entries are not in disk order.
It seems that the disks are being partitioned correctly, but they simply aren't being added.
I ran a SMART test against one of the drives in question (thinking that perhaps PetaSAN wouldn't add a disk that was failing) - but that came back clear as well.
I would appreciate any help here...
admin
2,930 Posts
Quote from admin on September 26, 2017, 10:00 pmHi,
On node-1, can you post the output of:
ceph-disk list
ceph osd tree --cluster CLUSTER_NAME
Also attempt to add the /dev/sdb OSD manually using the following command:
ceph-disk prepare --cluster CLUSTER_NAME --zap-disk --fs-type xfs /dev/sdb
Does it complete successfully? Any errors? Does the OSD get added to the cluster?
Lastly, if you can, email me the PetaSAN.log file for node-1: admin @ petasan.org
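Also, while re-running the add, it can help to follow the PetaSAN log live in a second session (same file you tailed earlier in this thread):
tail -f /opt/petasan/log/PetaSAN.log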
rickbharper
11 Posts
Quote from rickbharper on September 26, 2017, 10:10 pmHere is the disk list:
root@psNode01:/etc# ceph-disk list
/dev/rbd0 other, unknown
/dev/sda :
/dev/sda2 other, ext4, mounted on /
/dev/sda1 other, ext4, mounted on /boot
/dev/sda4 other, ext4, mounted on /opt/petasan/config
/dev/sda3 other, ext4, mounted on /var/lib/ceph
/dev/sdb :
/dev/sdb1 other
/dev/sdb2 ceph journal
/dev/sdc :
/dev/sdc1 other
/dev/sdc2 ceph journal
/dev/sdd :
/dev/sdd1 other
/dev/sdd2 ceph journal
/dev/sr0 other, unknown
/dev/sr1 other, unknown
OSD Tree:
root@psNode01:/etc# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
This is the output when I try to manually add the disk:
root@psNode01:/etc# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
root@psNode01:/etc# ceph-disk prepare --cluster psCluster01 --zap-disk --fs-type xfs /dev/sdb
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.
Warning! One or more CRCs don't match. You should repair the disk!
****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sdb1 isize=2048 agcount=4, agsize=30196417 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=120785665, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=58977, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
It would seem that there is something wrong with the disk partition table - despite the last line stating the operation completed correctly the OSD is not added....
UPDATE:
I used gdisk to wipe out the gpt and blank the MBR then tried to manually add the disk again. This time there were no errors in the manual add process:
root@psNode01:/etc# ceph-disk prepare --cluster psCluster01 --zap-disk --fs-type xfs /dev/sdb
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sdb1 isize=2048 agcount=4, agsize=30196417 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=120785665, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=58977, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.But again, this disk is not added to the cluster:
root@psNode01:/etc# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
admin
2,930 Posts
Quote from admin on September 26, 2017, 10:33 pmCan you edit the ceph conf file /etc/ceph/CLUSTER.conf and comment out osd_mkfs_options_xfs and osd_mount_options_xfs
# osd_mkfs_options_xfs = -f -i size=2048
# osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
then re-run
ceph-disk prepare --cluster CLUSTER_NAME --zap-disk --fs-type xfs /dev/sdb
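If it still behaves the same way, one more thing worth trying (destructive to the disk contents) is running the mkfs step by hand against the data partition created by the previous prepare run, with the same options from the config file, to see whether mkfs itself reports any problem (this assumes /dev/sdb1 is the Ceph OSD data partition shown in your fdisk output):
mkfs.xfs -f -i size=2048 /dev/sdb1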
rickbharper
11 Posts
Quote from rickbharper on September 26, 2017, 10:42 pmroot@psNode01:/etc/ceph# cat /etc/ceph/psCluster01.conf | grep -i osd
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_size = 2
osd_pool_default_min_size = 1
debug_osd = 0/0
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 500
[osd]
osd_crush_update_on_start = true
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_mkfs_type = xfs
#osd_mkfs_options_xfs = -m crc=0,finobt=0 -f -i size=2048
#osd_mkfs_options_xfs = -f -i size=2048
#osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_priority = 1
osd_recovery_op_priority = 1
osd_recovery_threads = 1
osd_client_op_priority = 63
osd_recovery_max_start = 1
osd_max_scrubs = 1
osd_scrub_during_recovery = false
osd_scrub_priority = 1
osd_scrub_begin_hour = 20
osd_scrub_end_hour = 8
osd_op_threads=4
root@psNode01:/etc/ceph# ceph-disk prepare --cluster psCluster01 --zap-disk --fs-type xfs /dev/sdb
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sdb1 isize=2048 agcount=4, agsize=30196417 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=120785665, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=58977, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
root@psNode01:/etc/ceph# ceph osd tree --cluster psCluster01
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.24846 root default
-2 0.89938 host psNode03
0 0.44969 osd.0 up 1.00000 1.00000
1 0.44969 osd.1 up 1.00000 1.00000
-3 1.34908 host psNode02
2 0.44969 osd.2 up 1.00000 1.00000
3 0.44969 osd.3 up 1.00000 1.00000
4 0.44969 osd.4 up 1.00000 1.00000
Still same result - no errors and it claims to have completed successfully but the OSD is not added....
admin
2,930 Posts
Quote from admin on September 26, 2017, 11:16 pmCan you please try the following command:
ceph-disk activate /dev/sdb1
and see if it succeeds.
What we see so far is that the Ceph tool ceph-disk is failing to prepare the disk. I will try tomorrow to add some logging to this tool and post it so we can identify in which area it is failing. If you have other disks you can try, that will help tell whether the problem is disk related or not.
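In the meantime you can also try re-running the prepare step with ceph-disk's verbose flag, which may show the step at which it stops (same command as before, just with the extra flag):
ceph-disk -v prepare --cluster CLUSTER_NAME --zap-disk --fs-type xfs /dev/sdb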