
replication job: is it working?

Hi all,

I configured a replication job between two PetaSAN clusters on the same LAN. The job starts at the scheduled time, but it does not seem to do anything, as the throughput graph sticks to 0 (zero) B/s on the destination cluster. Here is the log:

2022-10-12 18:00:27 - Job 00001 | Job instance (2022-10-12 18:00:01) | Executing replication job.

2022-10-12 18:00:23 - Job 00001 | Job instance (2022-10-12 18:00:01) | Creating new snapshot at source disk.

2022-10-12 18:00:07 - Job 00001 | Job instance (2022-10-12 18:00:01) | Source disk has no snapshots , start clearing destination disk.

2022-10-12 18:00:07 - Job 00001 | Job instance (2022-10-12 18:00:01) | Started.

2022-10-12 18:00:07 - Job 00001 | Getting source disk snapshots list.

2022-10-12 18:00:05 - Job 00001 | Getting destination disk snapshots list.

2022-10-12 18:00:04 - Job 00001 | Getting metadata of source disk.

2022-10-12 18:00:02 - Job 00001 | Getting metadata of destination disk.

2022-10-12 18:00:01 - Job 00001 | Stopping destination disk.

2022-10-12 18:00:01 - Job 00001 | Run replication job ...

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2022-10-12 17:04:24 - Job 00001 has been updated.

2022-10-12 16:58:31 - Job 00001 has been updated.

2022-10-12 16:56:30 - Job 00001 has been started.

2022-10-12 16:56:28 - Job 00001 has been created.

Am I missing something? Or is it maybe just taking a long time to create the snapshot of the 10 TB disk on the source cluster?
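
(I suppose I could check myself whether the snapshot actually exists on the source image with something like the following, using my real pool and image names in place of the placeholders:)

# list the snapshots of the source image (pool/image names are placeholders)
rbd snap ls <pool>/<image-name>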

Thanks, Ste

Can you run

ps aux | grep rbd

on the source cluster, on the node running the replication service, as well as on the destination cluster node that has the destination IP?

I assume you followed the replication guide to set up the replication user, etc.
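
For reference, a quick sanity check of that setup (assuming the replication user is client.root) would be to confirm the keyring exists and works on the destination replication node, e.g.:

# on the destination replication node: is the keyring the job expects actually there?
ls -l /etc/ceph/ceph.client.root.keyring

# can that user reach the destination pool with it? (pool name is a placeholder)
rbd ls <destination-pool> -n client.root --keyring=/etc/ceph/ceph.client.root.keyring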

Hello, here is the output from the source node:

root      280584  0.0  0.0   6896  2308 ?        S    Oct12   0:00 /bin/bash -c rbd export-diff rbd/image-00005@rep-snap-00001-00001 - | tee >(md5sum | cut -d ' ' -f 1 > /opt/petasan/config/replication/00001-1665590402_md5_1.txt) | /opt/petasan/scripts/backups/pipe_reader.py /opt/petasan/config/replication/00001-1665590402_progress_uncomp.json | ssh -o StrictHostKeyChecking=no -i /opt/petasan/config/replication/00001-1665590402_1665590425_474326_sshkey.txt root@10.220.12.133 "tee >(md5sum | cut -d ' ' -f 1 > /opt/petasan/config/replication/00001-1665590402_md5_2.txt) | rbd import-diff - Solo_su_HDD/image-00003 -n client.root --keyring=/etc/ceph/ceph.client.root.keyring" 2> /opt/petasan/config/replication/00001-1665590402_progress_import.txt
root      280585  0.0  0.1 950836 83396 ?        Sl   Oct12   0:04 rbd export-diff rbd/image-00005@rep-snap-00001-00001 -
root      280589  0.0  0.0   6896   228 ?        S    Oct12   0:00 /bin/bash -c rbd export-diff rbd/image-00005@rep-snap-00001-00001 - | tee >(md5sum | cut -d ' ' -f 1 > /opt/petasan/config/replication/00001-1665590402_md5_1.txt) | /opt/petasan/scripts/backups/pipe_reader.py /opt/petasan/config/replication/00001-1665590402_progress_uncomp.json | ssh -o StrictHostKeyChecking=no -i /opt/petasan/config/replication/00001-1665590402_1665590425_474326_sshkey.txt root@10.220.12.133 "tee >(md5sum | cut -d ' ' -f 1 > /opt/petasan/config/replication/00001-1665590402_md5_2.txt) | rbd import-diff - Solo_su_HDD/image-00003 -n client.root --keyring=/etc/ceph/ceph.client.root.keyring" 2> /opt/petasan/config/replication/00001-1665590402_progress_import.txt
root     1124148  0.0  0.0   6436   656 pts/0    S+   11:28   0:00 grep --color=auto rbd

And this is from the destination node:

root@petaIBIS03:~# ps aux | grep rbd
root        3523  0.0  0.0      0     0 ?        S<   Oct12   0:00 [rbd]
root     1058592  0.0  0.0   6436  2604 pts/0    S+   11:29   0:00 grep --color=auto rbd

I followed the Admin Guide version 1.8, completing all of these steps (in this order):

  • added the backup/replication role to one node in the source cluster and one in the destination cluster
  • added the destination cluster -> connection test successful (root user)
  • added a replication job -> successfully selected nodes and disks
  • added a daily schedule

The job has started, but the log is still at the same point as yesterday (my first message in this thread).

Thanks and bye, S.

 

On the source cluster replication node:

1) Can you show the output of:

consul kv get PetaSAN/Replication/Active_Jobs/00001
cat /opt/petasan/config/replication/00001-1665590402_progress_import.txt

2) Do you see progress updating when you run:

rbd export-diff rbd/image-00005@rep-snap-00001-00001 - | md5sum

Otherwise, if it is stuck, add the following to /etc/ceph/ceph.conf:

[client]
debug_rbd = 20
debug_rados = 20
log_file = /var/log/ceph/client_rbd.log

then rerun the command and check the log.
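
If pv happens to be installed on the node, another rough way to see whether the export is actually moving data is to watch its throughput directly:

# discard the stream and just watch the bytes/s of the export on the terminal
rbd export-diff rbd/image-00005@rep-snap-00001-00001 - | pv > /dev/null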

Point 1):

root@petasan03:~# consul kv get PetaSAN/Replication/Active_Jobs/00001
{
"job_id": "00001-1665590402",
"job_name": "prova-replica",
"node_name": "",
"uncompressed_transferred_bytes": "0.0 GB",
"uncompressed_transferred_rate": "0.0 MB/s",
"compressed_transferred_bytes": "",
"compressed_transferred_rate": "",
"compression_ratio": "",
"elapsed_time": "22:19",
"pid": "279341",
"progress": "",
"start_time": "2022-10-12 18:00:01"
}

root@petasan03:~# cat /opt/petasan/config/replication/00001-1665590402_progress_import.txt
2022-10-12T18:00:27.422+0200 7f83d1480380 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.root.keyring: (2) No such file or directory
2022-10-12T18:00:27.422+0200 7f83d1480380 -1 AuthRegistry(0x55f08b4a1ea0) no keyring found at /etc/ceph/ceph.client.root.keyring, disabling cephx
2022-10-12T18:00:27.422+0200 7f83d1480380 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.root.keyring: (2) No such file or directory
2022-10-12T18:00:27.422+0200 7f83d1480380 -1 AuthRegistry(0x7ffd531652a0) no keyring found at /etc/ceph/ceph.client.root.keyring, disabling cephx
rbd: couldn't connect to the cluster!

Point 2): the rbd command still hangs even after adding the [client] section to /etc/ceph/ceph.conf. Here is a tail of the log file:

root@petasan03:~# tail -50 /var/log/ceph/client_rbd.log
2022-10-13T16:21:11.996+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x7f5a00008818
2022-10-13T16:21:11.996+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705ebfac58 finish_op
2022-10-13T16:21:11.996+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000c2e0 handle_read_object: rbd_data.b24cfb871d634b.00000000000004b4 r=0
2022-10-13T16:21:11.996+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000c2e0 finish: rbd_data.b24cfb871d634b.00000000000004b4 r=0
2022-10-13T16:21:11.996+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a00007380 finish: C_ObjectReadRequest: r=0
2022-10-13T16:21:11.996+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a00007380 finish:  got {0=4194304} for [0,4194304] bl 4194304
2022-10-13T16:21:11.996+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ebec470 complete_request: cb=1, pending=0
2022-10-13T16:21:11.996+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ebec470 finalize: r=4194304
2022-10-13T16:21:11.996+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x55705ebff928
2022-10-13T16:21:11.996+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705ebec568 finish_op
2022-10-13T16:21:12.000+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a000048a0 handle_read_object: rbd_data.b24cfb871d634b.00000000000004b6 r=0
2022-10-13T16:21:12.000+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a000048a0 finish: rbd_data.b24cfb871d634b.00000000000004b6 r=0
2022-10-13T16:21:12.000+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a0000ca30 finish: C_ObjectReadRequest: r=0
2022-10-13T16:21:12.000+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a0000ca30 finish:  got {0=4194304} for [0,4194304] bl 4194304
2022-10-13T16:21:12.000+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705eb9b700 complete_request: cb=1, pending=0
2022-10-13T16:21:12.000+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705eb9b700 finalize: r=4194304
2022-10-13T16:21:12.000+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x55705eba3908
2022-10-13T16:21:12.000+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705eb9b7f8 finish_op
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000b500 handle_read_object: rbd_data.b24cfb871d634b.00000000000004b3 r=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000b500 finish: rbd_data.b24cfb871d634b.00000000000004b3 r=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a0000c1b0 finish: C_ObjectReadRequest: r=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a0000c1b0 finish:  got {0=4194304} for [0,4194304] bl 4194304
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ebec280 complete_request: cb=1, pending=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ebec280 finalize: r=4194304
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x7f5a0000bd58
2022-10-13T16:21:12.008+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705ebec378 finish_op
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a00008790 handle_read_object: rbd_data.b24cfb871d634b.00000000000004b5 r=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a00008790 finish: rbd_data.b24cfb871d634b.00000000000004b5 r=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a00007450 finish: C_ObjectReadRequest: r=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a00007450 finish:  got {0=4194304} for [0,4194304] bl 4194304
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ec00be0 complete_request: cb=1, pending=0
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ec00be0 finalize: r=4194304
2022-10-13T16:21:12.008+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x7f5a0000acc8
2022-10-13T16:21:12.008+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705ec00cd8 finish_op
2022-10-13T16:21:12.012+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000a120 handle_read_object: rbd_data.b24cfb871d634b.00000000000004b0 r=0
2022-10-13T16:21:12.012+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000a120 finish: rbd_data.b24cfb871d634b.00000000000004b0 r=0
2022-10-13T16:21:12.012+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a0000ab00 finish: C_ObjectReadRequest: r=0
2022-10-13T16:21:12.012+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a0000ab00 finish:  got {0=4194304} for [0,4194304] bl 4194304
2022-10-13T16:21:12.012+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ea88890 complete_request: cb=1, pending=0
2022-10-13T16:21:12.012+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ea88890 finalize: r=4194304
2022-10-13T16:21:12.012+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x7f5a0000a618
2022-10-13T16:21:12.012+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705ea88988 finish_op
2022-10-13T16:21:12.020+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000bac0 handle_read_object: rbd_data.b24cfb871d634b.00000000000004af r=0
2022-10-13T16:21:12.020+0200 7f5a0effd700 20 librbd::io::ObjectRequest: 0x7f5a0000bac0 finish: rbd_data.b24cfb871d634b.00000000000004af r=0
2022-10-13T16:21:12.020+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a00008e20 finish: C_ObjectReadRequest: r=0
2022-10-13T16:21:12.020+0200 7f5a0effd700 10 librbd::io::ReadResult: 0x7f5a00008e20 finish:  got {0=4194304} for [0,4194304] bl 4194304
2022-10-13T16:21:12.020+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ebff7b0 complete_request: cb=1, pending=0
2022-10-13T16:21:12.020+0200 7f5a0effd700 20 librbd::io::AioCompletion: 0x55705ebff7b0 finalize: r=4194304
2022-10-13T16:21:12.020+0200 7f5a0effd700 20 librbd::io::ReadResult: 0x7f5a0eff7d60 operator(): moved resulting 4194304 bytes to bl 0x55705ec00778
2022-10-13T16:21:12.020+0200 7f5a0e7fc700 20 librbd::io::AsyncOperation: 0x55705ebff8a8 finish_op

Bye, S.

  1. Are the clusters healthy on both the source and the destination?
  2. Not sure if this is causing the issue, but using root as the replication user could be a problem: we create a key pair for the destination user, which may conflict with the existing source root keys. Not certain, but I would recommend you try creating a dedicated replication user.
  3. It is strange if the rbd export-diff command at the source hangs. I am not sure whether the tail of the log you showed is stuck or keeps growing; if it keeps growing, the job may be working but very slowly. Does the progress stay at 0% even if you leave it for a few minutes?
  4. Can you do a test with a small image, say 100 GB in size, and see if it is any better? You can also try to run the rbd export-diff command on this smaller image; you can find the snapshot name with the rbd snap ls command (see the sketch below).
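
For points 1 and 4, a minimal manual test could look like the sketch below (the test image name is only an example, and the snapshot name is whatever rbd snap ls reports):

# 1) check overall health on both the source and destination clusters
ceph status

# 4) list the snapshots of a small test image (pool/image name is an example)
rbd snap ls rbd/test-image

# manually export one snapshot and checksum it, to measure raw read speed
rbd export-diff rbd/test-image@<snapshot-name> - | md5sum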