All 3 monitors down !

Hi, I went back to work after a two-week holiday and found all 3 of my PetaSAN monitors down: two were completely switched off and the third was hung. I rebooted all of them, but they get stuck right after POST, when they enter the grub menu (see picture). What could have happened? :-O It cannot be a hardware issue on 3 different servers at the same time! The last node of the cluster (the 4th) is still up, but it's not a monitor, so I can't access the dashboard. How can I restart my cluster (first), and how can I investigate what happened (second)? PetaSAN version = 2.7.3

Thanks in advance.

[Image: grub_menu]

This means boot / os disk failure or boot sector error. Can't say why this would happen on multiple nodes.

So I can't boot by issuing commands at the grub prompt? The boot disks are SSDs and I can see them in the BIOS.

How can I fix this problem without losing the contents of the volumes? Is it possible to reinstall without reformatting the OSDs?

Thanks, Ste.

I forced booting with the following commands:

grub> set prefix=(hd10,3)/grub
grub> set root=(hd10,3)
grub> insmod linux
grub> insmod normal
grub> normal
grub> linux /boot/vmlinuz-4.12.14-28-petasan
grub> initrd /boot/initrd.img-4.12.14-28-petasan
grub> boot

but the boot fails with the following error, and /etc/fstab is empty on all nodes... I'm afraid it is a software issue.

On an Ubuntu forum I found this: "This bug sounds like a core problem with Ubuntu. It appears that your init file is missing or corrupt." Actually, it is missing... Has anybody else experienced this bug?

I would recommend you boot from a live CD and check the drives on all 3 hosts, both the OS drives and the OSD drives, to assess the damage and see which files are missing: is it only the boot record and the /etc/fstab file on the OS disks, or more?
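As a rough aid for that check, here is a minimal sketch that lists a few essential files missing under a mounted root filesystem. The function name check_root and the file list are assumptions chosen for illustration, not anything PetaSAN ships; extend the list as needed.

```shell
# Sketch: report essential files missing under a mounted root filesystem.
# check_root and the file list below are illustrative assumptions.
check_root() {
    root="$1"
    missing=0
    for f in etc/fstab boot/grub/grub.cfg sbin/init; do
        # -L also accepts symlinks whose absolute target only resolves
        # once the system is booted from that disk (e.g. /sbin/init).
        if [ ! -e "$root/$f" ] && [ ! -L "$root/$f" ]; then
            echo "MISSING: /$f"
            missing=1
        fi
    done
    return "$missing"
}
```

From the live CD this could be used as: mount -o ro /dev/sdXN /mnt && check_root /mnt, where /dev/sdXN is a placeholder for the root partition of the OS disk. Mounting read-only avoids changing anything while assessing the damage.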

Hey Ste, I'm also running into this on 3 of our 5 nodes, but it happened after updating to 2.8.0 and rebooting (I like to reboot one at a time after an update). I also ran into the fstab issue while trying a method similar to yours.

I used the instructions in the first answer at the link below to get them up and running again. The commands don't persist after a reboot, even after running update-grub, so I've just been running them after each reboot. Use at your own risk of course, as with anything online lol.

https://unix.stackexchange.com/questions/329926/grub-starts-in-command-line-after-reboot

Nodes 1-3 are slightly older and Intel-based; nodes 4-5 are AMD EPYC and about a year old, and they boot without issues.

I don't have a permanent fix for it yet, though.

Good Luck!!

Quote from Brandon on August 25, 2021, 4:26 pm

The commands don't persist after a reboot, even after running update-grub, so I've just been running them after each reboot.

Hi Brandon, in the end I found out how to start my servers. The correct commands were the following (the first line was wrong in my earlier post):

grub> set prefix=(hd10,3)/boot/grub
grub> set root=(hd10,3)
grub> insmod linux
grub> insmod normal
grub> normal

But as you said, I must issue the commands again at every reboot, even after running "update-grub". Do you have any idea what happened and why all servers have this issue?

Hi admin, do you think this could be fixed if I upgrade PetaSAN from 2.7.3 to the latest release?

Thanks and bye, Ste.

PS: In the end the 4th and last node was affected as well. They are all of the same type: i7 CPU, 32 GB RAM, 120 GB SSD for the OS, Broadcom MegaRAID SAS 9361-16i controller, 10x OSD, 1x NVMe journal disk.

Generally, it looks like a boot loader installation issue; maybe a more recent version of grub failed to update during an apt upgrade. I do not believe the PetaSAN update itself is involved, but the apt upgrade to a newer grub package may be. The good thing is that, as I understood from the previous posts, no files such as /etc/fstab have actually gone missing, which would probably be a more serious issue.

Have you changed any boot options in the BIOS, from EFI to Legacy boot or vice versa, or any secure boot options? Make sure you have secure boot disabled. Also, do the OS drives have another OS installed in an EFI multiboot setup?

Can you also show the output of:

parted /dev/sdXX print

where sdXX is your OS drive (the whole disk, not a partition).

Actually I did not perform any upgrade; I was on vacation. Several files are missing only when the system does not boot correctly; after the five grub commands in my previous post the system boots normally and every file is in the right place. I made absolutely no changes in the BIOS, and only PetaSAN is installed on these servers. Here is the output log:

root@petasan01:~# fdisk -l /dev/sdk
Disk /dev/sdk: 111.8 GiB, 120034123776 bytes, 234441648 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 875AFCA8-F7D5-4BDD-B829-B91AA16523D9

Device        Start       End   Sectors  Size Type
/dev/sdk1      2048      4095      2048    1M BIOS boot
/dev/sdk2      4096    266239    262144  128M EFI System
/dev/sdk3    266240  31723519  31457280   15G Linux filesystem
/dev/sdk4  31723520  94638079  62914560   30G Linux filesystem
/dev/sdk5  94638080 234441614 139803535 66.7G Linux filesystem

root@petasan01:~# parted /dev/sdk2 print
Model: Unknown (unknown)
Disk /dev/sdk2: 134MB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End    Size   File system  Flags
1      0.00B  134MB  134MB  fat32

root@petasan01:~# parted /dev/sdk3 print
Model: Unknown (unknown)
Disk /dev/sdk3: 16.1GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
1      0.00B  16.1GB  16.1GB  ext4

root@petasan01:~# parted /dev/sdk4 print
Model: Unknown (unknown)
Disk /dev/sdk4: 32.2GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
1      0.00B  32.2GB  32.2GB  ext4

root@petasan01:~# parted /dev/sdk5 print
Model: Unknown (unknown)
Disk /dev/sdk5: 71.6GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
1      0.00B  71.6GB  71.6GB  ext4
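The parted runs above were pointed at individual partitions (/dev/sdk2 to /dev/sdk5), which is why each one reports "Partition Table: loop"; the disk-level view comes from running parted /dev/sdk print on the whole disk. As a small sketch, assuming the usual Linux device naming, a partition name can be mapped back to its parent disk like this (parent_disk is a hypothetical helper; lsblk -no PKNAME <partition> is the more robust way to ask the kernel):

```shell
# Sketch: derive the parent disk from a partition device name.
# Name-based heuristic only; parent_disk is an illustrative helper.
parent_disk() {
    case "$1" in
        # NVMe / MMC partitions look like nvme0n1p2, mmcblk0p1
        /dev/nvme*|/dev/mmcblk*) printf '%s\n' "$1" | sed 's/p[0-9][0-9]*$//' ;;
        # SCSI/SATA partitions look like sdk3; a whole disk is left unchanged
        *) printf '%s\n' "$1" | sed 's/[0-9]*$//' ;;
    esac
}
```

For example, parted "$(parent_disk /dev/sdk3)" print would inspect /dev/sdk as a whole.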

 

This morning I again found one monitor powered off (node #3), and this is what appears on the console of nodes #1 and #4:

All hosts have 32 GB of RAM! Why this "out of memory" error?

Moreover, the dashboard is not reachable (504 Gateway timeout) even on the two surviving monitors.
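One way to look into the "out of memory" messages is to pull the matching lines out of the kernel log on each node. The sketch below is only a filter around grep; the name oom_lines is made up for illustration, and the exact wording of the kernel messages may differ between kernel versions.

```shell
# Sketch: keep only kernel log lines that mention "out of memory".
# Reads the log from stdin; oom_lines is an illustrative name.
oom_lines() {
    grep -i 'out of memory'
}
```

Typical use would be dmesg | oom_lines (or journalctl -k | oom_lines), together with free -h to see how much memory is in use at that moment.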

- For RAM: 32 GB is not a lot, depending on the number of OSDs and services; can you make sure you meet the recommendations?

- For the missing file(s) such as /etc/fstab: maybe you are not booting from the correct drive / root partition?

- You could try to re-install grub.

If you boot using BIOS:
grub-install --target=i386-pc --no-floppy /dev/sdXX # where sdXX is the drive name

If you boot using EFI (you should see a /sys/firmware/efi directory present):

grub-install --target=x86_64-efi --efi-directory=/boot/efi --no-floppy --bootloader-id=petasan
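The choice between the two commands above can be sketched as a small helper that only prints the matching command instead of running it. The function name suggest_grub_install and its optional second argument (an override for the firmware directory, used for testing) are assumptions for illustration.

```shell
# Sketch: print (do not run) the grub-install command for the current boot mode.
# $1 = OS disk (used in BIOS mode only), $2 = optional override of the EFI
# firmware directory; both the helper and the override are illustrative.
suggest_grub_install() {
    disk="$1"
    efi_dir="${2:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        # Booted via EFI: the firmware directory exists
        echo "grub-install --target=x86_64-efi --efi-directory=/boot/efi --no-floppy --bootloader-id=petasan"
    else
        # Booted via legacy BIOS: install to the disk itself
        echo "grub-install --target=i386-pc --no-floppy $disk"
    fi
}
```

Running suggest_grub_install /dev/sdXX on a node shows which variant applies, so it can be reviewed before being executed by hand.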

 
