
iSCSI not starting


Good afternoon,

 

I have a problem starting the iSCSI disks after a power outage.

When I click on the iSCSI disk list, it does not open.
Can you help us with this problem? Is there a way to copy the files from the servers?

ceph health
2018-12-01 18:26:22.724485 7efe3ad8f700 -1 Errors while parsing config file!
2018-12-01 18:26:22.724506 7efe3ad8f700 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2018-12-01 18:26:22.724508 7efe3ad8f700 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2018-12-01 18:26:22.724509 7efe3ad8f700 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

consul members
Node    Address              Status  Type    Build  Protocol  DC
node41  192.168.184.41:8301  alive   server  0.7.3  2         petasan
node42  192.168.184.42:8301  alive   client  0.7.3  2         petasan
node44  192.168.184.44:8301  alive   server  0.7.3  2         petasan
node48  192.168.184.48:8301  failed  server  0.7.3  2         petasan

Thanks

It is probably a Ceph layer issue rather than iSCSI.

You need to check Ceph: if Ceph is recovering, it may take time before it becomes active again; if it is stuck, you need to find out why via CLI commands. Note that you need to add the --cluster XX parameter to your commands, where XX is the name of your cluster.

ceph health detail --cluster XX
For PGs that are stuck, try to find out why via
ceph pg XX query --cluster XX
and look at the “recovery_state” sections.
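
If there are many stuck PGs, a small shell loop can print just those sections for each of them. This is a rough sketch, not a PetaSAN-specific tool: it assumes jq is installed and simply reuses the PG ids that ceph health detail reports as down/incomplete:

for pg in $(ceph health detail --cluster XX | awk '/^ *pg .* is (down|incomplete|stuck inactive)/ {print $2}' | sort -u); do
    echo "== pg $pg =="
    ceph pg "$pg" query --cluster XX | jq '.recovery_state'
done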

What type/model of disks do you have?

Hi,

 

Thank you for your answer.

The disks are mostly SATA/Seagate drives.

Here is the output of the commands (ceph does not start, I don't know why):

 

ceph health detail --cluster loks
HEALTH_ERR Reduced data availability: 55 pgs inactive, 27 pgs down, 28 pgs incomplete; Degraded data redundancy: 55 pgs unclean; 7 stuck requests are blocked > 4096 sec; 1/3 mons down, quorum node41,node44
PG_AVAILABILITY Reduced data availability: 55 pgs inactive, 27 pgs down, 28 pgs incomplete
pg 1.7 is incomplete, acting [5,6]
pg 1.a is down, acting [5,6]
pg 1.10 is down, acting [0,5]
pg 1.15 is incomplete, acting [6,1]
pg 1.1c is incomplete, acting [5,6]
pg 1.22 is down, acting [5,6]
pg 1.24 is down, acting [5,0]
pg 1.2d is incomplete, acting [6,5]
pg 1.2e is down, acting [5,6]
pg 1.32 is down, acting [4,1]
pg 1.34 is incomplete, acting [4,6]
pg 1.38 is down, acting [4,0]
pg 1.41 is incomplete, acting [1,4]
pg 1.42 is incomplete, acting [4,0]
pg 1.4e is incomplete, acting [4,1]
pg 1.55 is incomplete, acting [4,1]
pg 1.56 is down, acting [1,4]
pg 1.5e is incomplete, acting [6,4]
pg 1.64 is down, acting [4,1]
pg 1.6a is down, acting [6,4]
pg 1.6e is down, acting [5,6]
pg 1.70 is down, acting [5,0]
pg 1.72 is down, acting [0,4]
pg 1.81 is incomplete, acting [4,1]
pg 1.84 is incomplete, acting [4,0]
pg 1.8d is incomplete, acting [4,6]
pg 1.92 is incomplete, acting [5,0]
pg 1.94 is incomplete, acting [1,5]
pg 1.97 is down, acting [0,4]
pg 1.9d is incomplete, acting [4,6]
pg 1.a1 is incomplete, acting [5,0]
pg 1.a3 is incomplete, acting [4,6]
pg 1.a4 is incomplete, acting [5,6]
pg 1.a7 is incomplete, acting [4,0]
pg 1.ab is down, acting [0,5]
pg 1.ac is down, acting [5,0]
pg 1.b2 is stuck inactive for 147747.228969, current state down, last acting [5,6]
pg 1.c8 is down, acting [5,6]
pg 1.ca is down, acting [5,6]
pg 1.ce is incomplete, acting [5,1]
pg 1.d3 is down, acting [5,0]
pg 1.d4 is down, acting [5,6]
pg 1.d5 is incomplete, acting [4,1]
pg 1.d7 is down, acting [1,5]
pg 1.d8 is down, acting [1,4]
pg 1.d9 is incomplete, acting [1,5]
pg 1.e4 is incomplete, acting [4,6]
pg 1.e5 is incomplete, acting [4,1]
pg 1.e8 is down, acting [0,6]
pg 1.ea is incomplete, acting [6,5]
pg 1.fc is incomplete, acting [1,5]
PG_DEGRADED Degraded data redundancy: 55 pgs unclean
pg 1.7 is stuck unclean since forever, current state incomplete, last acting [5,6]
pg 1.a is stuck unclean for 177816.248200, current state down, last acting [5,6]
pg 1.10 is stuck unclean since forever, current state down, last acting [0,5]
pg 1.15 is stuck unclean since forever, current state incomplete, last acting [6,1]
pg 1.1c is stuck unclean for 179338.763846, current state incomplete, last acting [5,6]
pg 1.22 is stuck unclean for 176573.873492, current state down, last acting [5,6]
pg 1.24 is stuck unclean for 176804.662961, current state down, last acting [5,0]
pg 1.2d is stuck unclean since forever, current state incomplete, last acting [6,5]
pg 1.2e is stuck unclean for 177810.929634, current state down, last acting [5,6]
pg 1.32 is stuck unclean for 178030.266384, current state down, last acting [4,1]
pg 1.34 is stuck unclean for 176565.995475, current state incomplete, last acting [4,6]
pg 1.38 is stuck unclean for 177097.827288, current state down, last acting [4,0]
pg 1.41 is stuck unclean since forever, current state incomplete, last acting [1,4]
pg 1.42 is stuck unclean for 176556.469928, current state incomplete, last acting [4,0]
pg 1.4e is stuck unclean since forever, current state incomplete, last acting [4,1]
pg 1.55 is stuck unclean for 176513.554485, current state incomplete, last acting [4,1]
pg 1.56 is stuck unclean since forever, current state down, last acting [1,4]
pg 1.5e is stuck unclean since forever, current state incomplete, last acting [6,4]
pg 1.64 is stuck unclean for 190478.451929, current state down, last acting [4,1]
pg 1.6a is stuck unclean since forever, current state down, last acting [6,4]
pg 1.6e is stuck unclean for 176582.464287, current state down, last acting [5,6]
pg 1.70 is stuck unclean for 176585.050426, current state down, last acting [5,0]
pg 1.72 is stuck unclean since forever, current state down, last acting [0,4]
pg 1.81 is stuck unclean for 176697.163538, current state incomplete, last acting [4,1]
pg 1.84 is stuck unclean for 176569.100444, current state incomplete, last acting [4,0]
pg 1.8d is stuck unclean for 176522.513870, current state incomplete, last acting [4,6]
pg 1.92 is stuck unclean for 176783.135021, current state incomplete, last acting [5,0]
pg 1.94 is stuck unclean since forever, current state incomplete, last acting [1,5]
pg 1.97 is stuck unclean since forever, current state down, last acting [0,4]
pg 1.9d is stuck unclean for 176524.074997, current state incomplete, last acting [4,6]
pg 1.a1 is stuck unclean for 176515.894133, current state incomplete, last acting [5,0]
pg 1.a3 is stuck unclean for 177392.298000, current state incomplete, last acting [4,6]
pg 1.a4 is stuck unclean for 176521.983648, current state incomplete, last acting [5,6]
pg 1.a7 is stuck unclean for 178949.188383, current state incomplete, last acting [4,0]
pg 1.ab is stuck unclean since forever, current state down, last acting [0,5]
pg 1.ac is stuck unclean since forever, current state down, last acting [5,0]
pg 1.b2 is stuck unclean for 176572.482961, current state down, last acting [5,6]
pg 1.c8 is stuck unclean for 176523.644703, current state down, last acting [5,6]
pg 1.ca is stuck unclean since forever, current state down, last acting [5,6]
pg 1.ce is stuck unclean for 177798.955780, current state incomplete, last acting [5,1]
pg 1.d3 is stuck unclean for 177763.726652, current state down, last acting [5,0]
pg 1.d4 is stuck unclean for 177420.197246, current state down, last acting [5,6]
pg 1.d5 is stuck unclean since forever, current state incomplete, last acting [4,1]
pg 1.d7 is stuck unclean since forever, current state down, last acting [1,5]
pg 1.d8 is stuck unclean since forever, current state down, last acting [1,4]
pg 1.d9 is stuck unclean since forever, current state incomplete, last acting [1,5]
pg 1.e4 is stuck unclean for 176567.382597, current state incomplete, last acting [4,6]
pg 1.e5 is stuck unclean for 177381.259159, current state incomplete, last acting [4,1]
pg 1.e8 is stuck unclean since forever, current state down, last acting [0,6]
pg 1.ea is stuck unclean since forever, current state incomplete, last acting [6,5]
pg 1.fc is stuck unclean since forever, current state incomplete, last acting [1,5]
REQUEST_STUCK 7 stuck requests are blocked > 4096 sec
7 ops are blocked > 134218 sec
osd.5 has stuck requests > 134218 sec
MON_DOWN 1/3 mons down, quorum node41,node44
mon.node48 (rank 2) addr 192.168.184.48:6789/0 is down (out of quorum)
root@node41:~# ceph pg loks query --cluster loks
no valid command found; 10 closest matches:
pg force_create_pg <pgid>
pg set_nearfull_ratio <float[0.0-1.0]>
pg set_full_ratio <float[0.0-1.0]>
pg map <pgid>
pg ls {<int>} {<states> [<states>...]}
pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} {<int>}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg dump_pools_json
pg ls-by-pool <poolstr> {<states> [<states>...]}
Error EINVAL: invalid command
root@node41:~# ceph pg  query --cluster loks
no valid command found; 10 closest matches:
pg force_create_pg <pgid>
pg set_nearfull_ratio <float[0.0-1.0]>
pg set_full_ratio <float[0.0-1.0]>
pg map <pgid>
pg ls {<int>} {<states> [<states>...]}
pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} {<int>}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg dump_pools_json
pg ls-by-pool <poolstr> {<states> [<states>...]}
Error EINVAL: invalid command
root@node41:~# ceph query --cluster loks
no valid command found; 10 closest matches:
mon dump {<int[0-]>}
mon stat
fs set_default <fs_name>
fs set-default <fs_name>
fs add_data_pool <fs_name> <pool>
fs rm_data_pool <fs_name> <pool>
fs set <fs_name> max_mds|max_file_size|allow_new_snaps|inline_data|cluster_down|allow_multimds|allow_dirfrags|balancer|standby_count_wanted <val> {<confirm>}
fs flag set enable_multiple <val> {--yes-i-really-mean-it}
fs ls
fs get <fs_name>
Error EINVAL: invalid command

 

ceph pg loks query
2018-12-02 13:20:07.052092 7faabef80700 -1 Errors while parsing config file!
2018-12-02 13:20:07.052097 7faabef80700 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2018-12-02 13:20:07.052098 7faabef80700 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2018-12-02 13:20:07.052099 7faabef80700 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

How many OSDs do you have? How many are up? Any down?

run:

ceph pg 1.10  query --cluster loks
ceph pg 1.15  query --cluster loks

look into the “recovery_state” sections to see what is stuck
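
If the full query output is too long to read, you can print just that section; a minimal sketch, assuming jq is installed (the query output is JSON):

ceph pg 1.10 query --cluster loks | jq '.recovery_state'
ceph pg 1.15 query --cluster loks | jq '.recovery_state'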

Are the disks SSDs or HDDs?

 

Hi,

We have 5 OSDs up and 1 down.

All disks are SATA HDDs.

ceph pg 1.10  query --cluster loks

"recovery_state": [

{
"name": "Started/Primary/Peering/Down",
"enter_time": "2018-12-01 18:08:56.573719",
"comment": "not enough up instances of this PG to go active"
},
{
"name": "Started/Primary/Peering",
"enter_time": "2018-12-01 18:08:56.573649",
"past_intervals": [
{
"first": "7293",
"last": "7963",
"all_participants": [
{
"osd": 0
},
{
"osd": 1
},
{
"osd": 2
},
{
"osd": 5
},
{
"osd": 6
}
],
"intervals": [
{
"first": "7613",
"last": "7625",
"acting": "2"
},
{
"first": "7957",
"last": "7958",
"acting": "0"
},
{
"first": "7961",
"last": "7963",
"acting": "5"
}
]
}
],
"probing_osds": [
"0",
"1",
"5",
"6"
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
2
],
"peering_blocked_by": [
{
"osd": 2,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}
]
},
{
"name": "Started",
"enter_time": "2018-12-01 18:08:56.573578"
}
],

ceph pg 1.15  query --cluster loks

"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2018-12-01 18:27:49.462944",
"comment": "not enough complete instances of this PG"
},
{
"name": "Started/Primary/Peering",
"enter_time": "2018-12-01 18:27:49.422286",
"past_intervals": [
{
"first": "7293",
"last": "7963",
"all_participants": [
{
"osd": 1
},
{
"osd": 3
},
{
"osd": 5
},
{
"osd": 6
}
],
"intervals": [
{
"first": "7613",
"last": "7620",
"acting": "3"
},
{
"first": "7672",
"last": "7673",
"acting": "5"
},
{
"first": "7817",
"last": "7819",
"acting": "1,5"
},
{
"first": "7962",
"last": "7963",
"acting": "6"
}
]
}
],
"probing_osds": [
"1",
"5",
"6"
],
"down_osds_we_would_probe": [
3
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]
},
{
"name": "Started",
"enter_time": "2018-12-01 18:27:49.422188"
}
],

 

From the logs you sent, it seems 2 out of 6 OSDs are down: OSDs 2 and 3.
Can you check:

ceph status --cluster loks
ceph osd tree --cluster loks

The best thing to try now is to start these 2 OSDs. On their nodes:

systemctl restart ceph-osd@2
systemctl restart ceph-osd@3

systemctl status ceph-osd@2
systemctl status ceph-osd@3
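
If the restart fails, the reason is usually also visible in the journal; for example (a generic sketch, not PetaSAN-specific):

journalctl -u ceph-osd@2 --no-pager -n 50
journalctl -u ceph-osd@3 --no-pager -n 50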

I noticed you also have 1 monitor down. If the 2 OSDs are not starting, edit the conf file on their nodes:
/etc/ceph/loks.conf
and temporarily modify the "mon_host = " line to exclude the IP address of the failed mon, then try again to restart the 2 failed OSDs. Note that you should later revert the changes to the conf file.
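
For example, if the mon_host line currently lists all three monitor addresses, the temporary edit would look roughly like this (a hypothetical sketch, assuming the monitors use the 192.168.184.x addresses shown earlier in this thread; keep a copy of the original line so it can be restored):

# original line in /etc/ceph/loks.conf
mon_host = 192.168.184.41,192.168.184.44,192.168.184.48
# temporary line with the failed mon (node48) removed
mon_host = 192.168.184.41,192.168.184.44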

If they still do not start, try to start them manually and see what error you get on the console:

/usr/lib/ceph/ceph-osd-prestart.sh --cluster loks --id 2
/usr/bin/ceph-osd -f --cluster loks --id 2 --setuser ceph --setgroup ceph
/usr/lib/ceph/ceph-osd-prestart.sh --cluster loks --id 3
/usr/bin/ceph-osd -f --cluster loks --id 3 --setuser ceph --setgroup ceph

You can also find additional error logs in:

/var/log/ceph/loks-osd.2.log
/var/log/ceph/loks-osd.3.log
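
For example, to see only the most recent errors without reading the whole files (a generic sketch):

grep -i error /var/log/ceph/loks-osd.2.log | tail -n 20
grep -i error /var/log/ceph/loks-osd.3.log | tail -n 20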

Hi,

Is it possible to just remove these OSDs from the cluster? Because we have 3 other PetaSAN instances that we can work with.

Right now I just need to get the files from the iSCSI storage, because they are clients' VMs.

The output of the commands:

root@node44:~#  systemctl restart ceph-osd@3
Job for ceph-osd@3.service failed because the control process exited with error code. See "systemctl status ceph-osd@3.service" and "journalctl -xe" for details.

root@node44:~# /usr/lib/ceph/ceph-osd-prestart.sh --cluster loks --id 3
OSD data directory /var/lib/ceph/osd/loks-3 does not exist; bailing out.

root@node44:~#  /usr/bin/ceph-osd -f --cluster loks --id 3 --setuser ceph --setgroup ceph
2018-12-03 08:51:50.169527 7fde0ecb9e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/loks-3: (2) No such file or directory

The output is the same for both OSDs.

Thanks!

No, you cannot move the 2 OSDs to another cluster. You can remove them from their hosts and put them in other hosts within the same PetaSAN cluster if you think the problem is with the host node rather than the disks themselves.

Can you send me the output of the first 2 commands in my previous post? Are OSDs 2 and 3 on the same host or on different hosts?

Edit /etc/ceph/loks.conf
and temporarily modify the "mon_host = " line to exclude the IP address of the failed mon, as described above.

Find the disks /dev/sdX that OSD 2 and OSD 3 are on, via the UI or via the CLI:

ceph-disk list

Mount their first/metadata partition:

mount /dev/sdX1 /var/lib/ceph/osd/loks-2
where sdX is the disk for OSD 2

mount /dev/sdY1 /var/lib/ceph/osd/loks-3
where sdY is the disk for OSD 3
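
Since the prestart script earlier reported that /var/lib/ceph/osd/loks-3 does not exist, you may need to create the mount points first, and you can check the result after mounting; a minimal sketch:

mkdir -p /var/lib/ceph/osd/loks-2 /var/lib/ceph/osd/loks-3
ls /var/lib/ceph/osd/loks-2    # after mounting, this typically contains files such as whoami, fsid and keyring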

After mounting, try to start the OSDs manually and see what error you get on the console:

/usr/lib/ceph/ceph-osd-prestart.sh --cluster loks --id 2
/usr/bin/ceph-osd -f --cluster loks --id 2 --setuser ceph --setgroup ceph
/usr/lib/ceph/ceph-osd-prestart.sh --cluster loks --id 3
/usr/bin/ceph-osd -f --cluster loks --id 3 --setuser ceph --setgroup ceph

You can also find additional error logs in:

/var/log/ceph/loks-osd.2.log
/var/log/ceph/loks-osd.3.log

Hi,

I replaced the node that was giving problems (node 48).

Here is the output:

root@node41:~# ceph status --cluster loks
cluster:
id:     c8497f53-4077-4359-8d50-7439c0d2760f
health: HEALTH_WARN
Reduced data availability: 55 pgs inactive, 55 pgs incomplete
Degraded data redundancy: 55 pgs unclean

services:
mon: 3 daemons, quorum node41,node44,node48
mgr: node44(active), standbys: node41, node48
osd: 7 osds: 7 up, 7 in

data:
pools:   1 pools, 256 pgs
objects: 151k objects, 604 GB
usage:   1216 GB used, 5482 GB / 6699 GB avail
pgs:     21.484% pgs not active
198 active+clean
55  incomplete
2   active+clean+scrubbing+deep
1   active+clean+scrubbing

 

root@node41:~#  ceph osd tree --cluster loks
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       6.54225 root default
-7       1.81918     host node41
4   hdd 0.90959         osd.4       up  1.00000 1.00000
5   hdd 0.90959         osd.5       up  1.00000 1.00000
-9       0.90709     host node42
6   hdd 0.90709         osd.6       up  1.00000 1.00000
-3       1.81619     host node44
0   hdd 0.90810         osd.0       up  1.00000 1.00000
1   hdd 0.90810         osd.1       up  1.00000 1.00000
-5       1.99979     host node48
3   hdd 0.99989         osd.3       up  1.00000 1.00000
7   hdd 0.99989         osd.7       up  1.00000 1.00000
root@node41:~#

 

Thanks!

Hi

What happened to osd.2? Did you try to start OSDs 2 and 3 from the command line? What was the output? As stated, starting these OSDs is the best thing to do; even if they do not start, we can retrieve the data as long as the physical disks are not corrupt.

Now we need to see why there are inactive/incomplete PGs. Can you do what was done previously by querying such PGs, look at the recovery_state section as before, and post a couple of them if they are different? The commands below may help you pick which PGs to query.
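
To pick which PGs to query, you can list the ones that are still inactive/incomplete; based on the pg subcommands shown in your earlier output, something like this should work:

ceph pg dump_stuck inactive --cluster loks
ceph pg ls incomplete --cluster loks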
