Backup hanging after client renaming

DariaKe
July 25th, 2019 03:00
Hello, we are running NetWorker 9.2.1.5 on all servers/clients. We are moving clients to a backup VLAN (adding an interface with an IP address from the backup LAN range). So far all clients have been renamed without issues. Now we have some clients (same configuration) where the backup just won't run in this network, although it runs fine in the shared network (using a different NIC on both server and client).

It seems that these clients have a problem communicating back with the server: the backup starts (a session is created on the client) but then it just hangs with no errors and eventually times out. All nslookups, pings, traceroutes and telnet tests on the NetWorker ports work just fine. Also, when I run nsradmin -p nsrexec -s clientsbackupvlanname it connects right away, but as soon as I try to print something or enter visual mode it hangs and times out.

I have tried reinstalling the client software, restarting NetWorker on the backup server, adding a server network interface, etc., but nothing has helped so far. Any ideas?



DariaKe
July 25th, 2019 04:00
The nsrladb was deleted when the client was reinstalled (I actually removed the whole /nsr content).
Manual backups are hanging on:
save -vvv -s backupserver-bk -c client-bk -b FS /etc/hosts
174411:save: Step (1 of 7) for PID-181806: save has been started on the client 'client'.
175313:save: Step (2 of 7) for PID-181806: Running the backup on the client 'client-bk' for the selected save sets.
70342:save: save: got prototype for /
70342:save: save: got prototype for /
70342:save: save: got prototype for /dev/
70342:save: save: got prototype for /dev/
70342:save: save: got prototype for /proc/sys/fs/
70342:save: save: got prototype for /var/lib/nfs/
175318:save: Identifed a save for the backup with PID-181806 on the client 'client-bk'. Updating the total number of steps from 7 to 6.
174920:save: Step (3 of 6) for PID-181806: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set '/etc/hosts'.
I tried running the same command with -D9 and it gets to the same point:
07/25/19 13:18:50.643235 Bound TCP/IPv4 socket descriptor 11 to port 0
07/25/19 13:18:50.654041 RPC Authentication: Client failed to obtain RPCSEC_GSS credentials: Authentication error; why = Server rejected credential
07/25/19 13:18:50.654081 Could not get a session key for GSS authentication. Perhaps this authentication method is not allowed/supported by both the local and remote remote machines.
07/25/19 13:18:50.654222 Failed to create lgto_auth credential for RPCSEC_GSS: Could not get session key from client for GSS authentication with backupserver-bk: Authentication error; why = Server rejected credential
07/25/19 13:18:50.654275 RPC Authentication: Client successfully obtained AUTH_LGTO credentials
07/25/19 13:18:50.654744 Skipping setup for 'Backup renamed directories' attribute: level=full, asof=0
07/25/19 13:18:50.654788 Enter setup_saveset_attrs: Save Paths: [/etc/hosts]
175318:save: Identifed a save for the backup with PID-195374 on the client 'client-bk'. Updating the total number of steps from 7 to 6.
174920:save: Step (3 of 6) for PID-195374: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set '/etc/hosts'.
07/25/19 13:18:50.654968 User's total groups = 1, max groups set in environment or calculated = 512 and max groups buffer size = 10914
The only thing I can see are those authentication errors; however, since the last message is "RPC Authentication: Client successfully obtained AUTH_LGTO credentials", I presume they should not be a problem.
EDIT: I have disabled nsrauth strong authentication and the errors are gone now, but the backup hangs at the same spot. I will leave it running until the next day to see whether it fails with an error or just hangs; it has been hanging for more than 2 hours now.
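For anyone following along: nsrauth can be relaxed to oldauth roughly like this, via the "auth methods" attribute of the NSRLA resource (a sketch, not necessarily the exact commands I used; remember to revert it after testing):

nsradmin -p nsrexec -s client-bk
nsradmin> . type: NSRLA
nsradmin> update auth methods: "0.0.0.0/0,oldauth"
nsradmin> quit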
bingo.1
July 25th, 2019 04:00
On the client, stop the services and delete/rename the nsrladb directory. Then start the services and retry the backup.
If this does not solve the issue, it is good to verify in which phase the backup hangs. So let's run a manual backup from the command line first and add some verbosity if required, as sketched below. This should tell you more if it fails.
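On a Linux client the whole sequence might look roughly like this (a sketch, assuming the default init script and install paths; the save command reuses the hostnames from this thread):

/etc/init.d/networker stop
mv /nsr/res/nsrladb /nsr/res/nsrladb.old
/etc/init.d/networker start
save -vvv -s backupserver-bk -c client-bk -b FS /etc/hosts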
DariaKe
July 25th, 2019 07:00
I already did that after the auth change and before testing. It did not help at all.
bingo.1
July 25th, 2019 07:00
You should also clear the GSS-relevant information for the client on the server via
nsradmin -p nsrexec -s <server> (or nsradmin -p nsrexec -s <storage node>)
then
nsradmin> . type: nsr peer information; name: <client>
nsradmin> p <---- for verification
nsradmin> d <---- to delete
nsradmin> q <---- exits the program
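With the names from this thread, a session would look like this (adjust the name to whichever client name the server has stored):

nsradmin -p nsrexec -s backupserver-bk
nsradmin> . type: nsr peer information; name: client-bk
nsradmin> p
nsradmin> d
nsradmin> q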
crazyrov
July 25th, 2019 22:00
Interesting issue.
Also try flushing the DNS cache of the nsrd process on the server, e.g. dbgcommand -p pid_of_nsrd FlushDnsCache
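On Linux the nsrd PID can be looked up first, e.g. (a sketch; pgrep assumed available):

pgrep -x nsrd
dbgcommand -p <nsrd_pid> FlushDnsCache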
Let us know the outcome of these steps.
DariaKe
August 2nd, 2019 03:00
1. Yes, the backup was going to the storage node; however, the peer info was deleted. I deleted all peer info again today on the backup server, the storage nodes and the client.
2. Yes, NetWorker was restarted on the backup server, the storage node and the client. I did that again today and also removed nsrladb from the client. I also used dbgcommand just to be sure. It did not help.
3. In the backup network there is no firewall or anything; there is no connection between the customer and backup networks, and the customer network is firewalled but the whole range is open for all backup clients (we ran backups through this customer network until today and they work OK).
To simplify things, I have removed the storage node from the backup flow. The client is now backed up directly to the backup server, to a newly created pool and device. This device is from Data Domain. The tests are as follows:
1. No server network interface (SNI) in the client configuration and client direct (CD) disabled: a session is created on the server, the All save set is split into filesystems, but the backup fails with:
Unable to create session channel with nsrexecd on host client-bk to execute command 'save -LL -s backupserver ...etc.
This is expected, as the connection from the backup network to the customer network is blocked.
2. No SNI and CD enabled -> a session is created on the client (process visible) but hangs.
3. After adding the SNI server-bk (with CD enabled or disabled) -> a session is created on the client but hangs.
4. To rule out the Data Domain, a new AFTD device was created on the backup server, but the results are the same: session created but hanging.
When trying to run nsradmin -p nsrexec from the server to client-bk, or from the client to server-bk, I am connected right away:
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin>
However, running any command ends in (both ways):
Lost connection to NSR server: Timed out
Rebound to specific service
Timed out
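For completeness, this is the kind of two-way port check I keep running (7937 and 7938 are the default nsrexecd/portmapper ports; nc is just one option, telnet works too):

nc -vz client-bk 7937          <---- from the server
nc -vz backupserver-bk 7937    <---- from the client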
bingo.1
August 2nd, 2019 04:00
So you use a specific backup interface for that client.
Your statement "This is expected as connection from backup to customer is blocked." scares me.
Of course there must be a valid connection between client & server on the NW client ports in both directions. If not, how do you expect the NW server to receive the metadata (CFI information)? And because of that, the backup cannot continue. The client delivers the file index info; the storage node (which also runs on the NW server) sends the media index info. Am I pointing in the right direction?
bingo.1
August 5th, 2019 05:00
I just wanted to point out a general misunderstanding, namely the idea that you can send all NW traffic via the server network interface.
The failure here might be due to a hostname resolution problem, which is often caused by a wrong hosts file. Does this point in the right direction?
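If the hosts file is the culprit, every host must resolve both of its names consistently, e.g. (illustrative addresses, hostnames from this thread):

# backup VLAN
192.168.100.10   backupserver-bk
192.168.100.20   client-bk
# shared/customer network
10.0.0.10        backupserver
10.0.0.20        client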
DariaKe
August 6th, 2019 23:00
We have just figured out what the problem was. On this backup network we usually have servers connected directly to the switch (same as the backup server), so the only configuration needed is on the switch and the client. However, these few servers were a special kind, running through some kind of hypervisor, and although all the known ports were using jumbo frames, one interface on this hypervisor was still set to MTU 1500, and that was blocking the traffic for all of these clients. Setting it to 9000 solved the issue right away. It also explains the symptoms: small packets (pings, DNS lookups, connection setup) fit within 1500 bytes and got through, while the full-size jumbo frames carrying the actual backup data were dropped, so sessions opened and then hung.
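For anyone hitting the same thing, an MTU black hole like this is easy to confirm with non-fragmenting pings (Linux ping syntax; 8972 bytes of payload plus 28 bytes of ICMP/IP headers equals 9000):

ping -M do -s 8972 client-bk   <---- must succeed on a jumbo-frame path
ping -M do -s 1472 client-bk   <---- baseline for a standard 1500 MTU path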
Although your help did not solve the issue, I just want to thank you for your support and for trying to help here; I really do appreciate it. Thanks.