
1 Rookie • 11 Posts


July 25th, 2019 03:00

Backup hanging after client renaming

Hello, we are running NetWorker 9.2.1.5 on all servers/clients. We are moving clients to a backup VLAN (adding an interface with an IP address from the backup LAN range). So far all clients have been renamed without issues. Now we have some clients (same configuration) where the backup just won't run in this network, although it runs fine in the shared network (using a different NIC on both server and client). It seems that these clients have a problem communicating back with the server: the backup is started (a session is created on the client), but then it just hangs with no errors and eventually times out. All nslookups, pings, traceroutes and telnets on the NetWorker ports work just fine. Also, when I run nsradmin -p nsrexec -s clientsbackupvlanname it connects right away, but as soon as I try to print something or get into visual mode it hangs and times out. I tried reinstalling the client software, restarting NetWorker on the backup server, adding a server network interface, etc., but nothing has helped so far. Any ideas?
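
For reference, the connectivity checks that all pass look roughly like this (hostnames are the backup-VLAN names used in this thread; 7937 is the standard nsrexecd service port):

 nslookup client-bk
 ping client-bk
 traceroute client-bk
 telnet client-bk 7937
 nsradmin -p nsrexec -s client-bk     (connects, but any command afterwards hangs)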

1 Rookie • 11 Posts

July 25th, 2019 04:00

nsrladb was deleted while the client was reinstalled (I actually removed the whole /nsr content).

Manual backups are hanging on: 
 save -vvv -s backupserver-bk -c client-bk -b FS /etc/hosts

174411:save: Step (1 of 7) for PID-181806: save has been started on the client 'client'.
175313:save: Step (2 of 7) for PID-181806: Running the backup on the client 'client-bk' for the selected save sets.
70342:save: save: got prototype for /
70342:save: save: got prototype for /
70342:save: save: got prototype for /dev/
70342:save: save: got prototype for /dev/
70342:save: save: got prototype for /proc/sys/fs/
70342:save: save: got prototype for /var/lib/nfs/
175318:save: Identifed a save for the backup with PID-181806 on the client 'client-bk'. Updating the total number of steps from 7 to 6.
174920:save: Step (3 of 6) for PID-181806: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set '/etc/hosts'.

I tried to run the same with -D9 and it gets to the same point:
07/25/19 13:18:50.643235 Bound TCP/IPv4 socket descriptor 11 to port 0
07/25/19 13:18:50.654041 RPC Authentication: Client failed to obtain RPCSEC_GSS credentials: Authentication error; why = Server rejected credential
07/25/19 13:18:50.654081 Could not get a session key for GSS authentication. Perhaps this authentication method is not allowed/supported by both the local and remote remote machines.
07/25/19 13:18:50.654222 Failed to create lgto_auth credential for RPCSEC_GSS: Could not get session key from client for GSS authentication with backupserver-bk: Authentication error; why = Server rejected credential
07/25/19 13:18:50.654275 RPC Authentication: Client successfully obtained AUTH_LGTO credentials
07/25/19 13:18:50.654744 Skipping setup for 'Backup renamed directories' attribute: level=full, asof=0
07/25/19 13:18:50.654788 Enter setup_saveset_attrs: Save Paths: [/etc/hosts]
175318:save: Identifed a save for the backup with PID-195374 on the client 'client-bk'. Updating the total number of steps from 7 to 6.
174920:save: Step (3 of 6) for PID-195374: Contacting the NetWorker server through the nsrd process to obtain a handle to the target media device through the nsrmmd process for the save set '/etc/hosts'.
07/25/19 13:18:50.654968 User's total groups = 1, max groups set in environment or calculated = 512 and max groups buffer size = 10914

 

The only things I can see are those authentication errors; however, since the last message is "RPC Authentication: Client successfully obtained AUTH_LGTO credentials", I presume they should not be a problem.



EDIT: I have disabled nsrauth strong authentication and the errors are gone now, but the backup hangs at the same spot... I will leave it running until the next day and see whether it fails with an error or just hangs; it has been hanging for more than 2 hours now.
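
For completeness, turning off nsrauth in favour of oldauth was done in the client's NSRLA resource; a minimal sketch, assuming the "auth methods" attribute in this release accepts the usual network/method list format:

 nsradmin -p nsrexec
 nsradmin> . type: NSRLA
 nsradmin> update auth methods: "0.0.0.0/0,oldauth"
 nsradmin> quit

followed by a restart of nsrexecd.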

2.4K Posts

July 25th, 2019 04:00

On the client, stop the services and delete/rename the nsrladb directory. Then start the service and retry the backup.

 

If this does not solve the issue, it is good to verify in which phase the backup hangs.

So let's run a manual backup from the command line first and add some verbosity if required.

This should tell you more if it fails.
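
A rough sketch of that sequence on a Linux client (paths assume a default install; use the systemd unit or init script your distribution provides):

 /etc/init.d/networker stop
 mv /nsr/res/nsrladb /nsr/res/nsrladb.old
 /etc/init.d/networker start
 save -vvv -s <backup_server> -c <client> /etc/hosts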

 

1 Rookie • 11 Posts

July 25th, 2019 07:00

Already did that after the auth change and before testing. It did not help at all.

2.4K Posts

July 25th, 2019 07:00

You should also clear the GSS-relevant information for the client on the server via

  nsradmin -p nsrexec -s <server_name>     (or, on a storage node:  nsradmin -p nsrexec -s <storage_node_name>)

  then

  nsradmin> . type: nsr peer information; name: <client_name>

  nsradmin> p       <---- for verification

  nsradmin> d       <---- to delete

  nsradmin> q       <---- exits the program
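
With the hostnames used in this thread that would be, for example, nsradmin -p nsrexec -s backupserver-bk and then . type: nsr peer information; name: client-bk (the same can be done on the client for the server's peer entry).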

4 Operator • 1.3K Posts

July 25th, 2019 22:00

Interesting issue.

  • Is your backup data being sent to a different storage node? If yes, then you would need to clear the peer information for the affected client on the storage node as well.
  • Have you restarted the NetWorker services? If not, try clearing nsrd's DNS cache using dbgcommand, i.e. dbgcommand -p pid_of_nsrd FlushDnsCache (see the sketch after this list).
  • Are your ports for the backup traffic restricted? If yes, what is the port range that is allowed; is it all the ports, i.e. 7937-9936? From what I see in the logs, the communication with nsrd is working fine but it is unable to get to nsrmmd. Running nsrrpcinfo from the client machine would help you see whether it can communicate with all the NSR daemons running on the backup server: nsrrpcinfo -p backup_server (see the sketch after this list).

Let us know the outcome of these steps.
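
A combined sketch of points 2 and 3, assuming a Linux backup server where pgrep is available:

 dbgcommand -p $(pgrep -x nsrd) FlushDnsCache      (on the backup server)
 nsrrpcinfo -p backup_server                       (from the client)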



1 Rookie • 11 Posts

August 2nd, 2019 03:00

1. Yes, the backup was going to a storage node; however, the peer info was deleted. I deleted all peer info today again on the backup server, the storage nodes and the client.

2. Yes, NetWorker was restarted on the backup server, the storage node and the client. Did that again today, and also removed nsrladb from the client. Also used dbgcommand just to be sure. It did not help.

3. In the backup network there is no firewall or anything; there is no connection between the customer and backup networks, and the customer network is firewalled, but the whole range is open for all backup clients (we did run backups through this customer network until today and they work OK).

To make it a bit simpler, I have removed the storage node from the backup flow. The client is now backed up directly to the backup server, to a newly created pool and device. This device is from a Data Domain. The tests are as follows:

1. No server network interface (SNI) in the client configuration and client direct (CD) disabled:
a session is created on the server and the All saveset is split into filesystems, but the backup fails with:
Unable to create session channel with nsrexecd on host client-bk to execute command 'save -LL -s backupserver  ...etc.
This is expected, as the connection from the backup to the customer network is blocked.

2. No SNI and CD enabled -> a session is created on the client (process visible) but hangs.

3. After adding the SNI server-bk (with CD enabled or disabled) -> a session is created on the client but hangs.

4. To rule out the Data Domain, a new AFTD device was created on the backup server, but the results are the same: a session is created but hangs.

When trying to run nsradmin -p nsrexec from the server to client-bk, or from the client to server-bk, I am connected right away:
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin>
However, running any command ends in (both ways):

Lost connection to NSR server: Timed out
Rebound to specific service
Timed out
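
Since the connection opens but stalls on the first real payload, one way to narrow this down is to watch the wire on the backup interface while reproducing the hang; a sketch, with the interface name as a placeholder:

 tcpdump -ni <backup_interface> host client-bk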


2.4K Posts

August 2nd, 2019 04:00

So you use a specific backup interface for that client.

Your statement "This is expected as connection from backup to customer is blocked." scares me.

Of course there must be a valid connection between client & server on the NW client ports in both directions. If not, how do you expect the NW server to receive the metadata (CFI information)? And because of that, the backup cannot continue. The client delivers the client file index info; the storage node (which also runs on the NW server) sends the media index info. Am I pointing in the right direction?

1 Rookie • 11 Posts

August 5th, 2019 03:00

Yes, we have 2 interfaces on each client, storage node and backup server. You cannot connect from a 10.x.x.x network to a 192.x.x.x network or vice versa. But as every server has 2 interfaces, they can of course communicate on both networks, so nothing to be scared about. I am just saying that if you use the client-bk interface and put a server network interface from the 10.x.x.x network into the configuration, it is expected to fail, as cross-network connection does not work. However, if you also add the customer interface into the aliases, it works OK; the backup is then just sent over the customer network interface, and we don't want that. And as I said, all the ports are OK and communicating via telnet (backup to backup, or customer to customer network). This setup works on all other clients, so this is not the right direction. The problem is the connection in the backup network for sure; I just can't figure out what the problem is.
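
For context, the naming scheme in each host's hosts file is along these lines (addresses made up for illustration):

 10.1.1.10       backupserver
 192.168.1.10    backupserver-bk
 10.1.1.20       client
 192.168.1.20    client-bk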

2.4K Posts

August 5th, 2019 05:00

I just wanted to point out the general misunderstanding that you can send all NW traffic via the server network interface.

The reason for the failure here might be a hostname resolution issue, which is often caused by a wrong hosts file. Does this point in the right direction?
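
A quick way to check what the resolver actually returns for each name on any of the Linux boxes involved (getent honours the hosts file and the nsswitch order):

 getent hosts client-bk
 getent hosts backupserver-bk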

 

1 Rookie • 11 Posts

August 6th, 2019 23:00

We have just figured out what the problem was. On this backup network we usually have servers directly connected to the switch (same as the backup server), so the only configuration needed is on the switch and the client. However, these few servers were a special kind, running through some kind of hypervisor, and while all the known ports were using jumbo frames, one interface on this hypervisor was still set to MTU 1500, and this was blocking the traffic for all of these clients. Setting it to 9000 solved the issue right away.
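
For anyone else hitting this, a sketch of how the mismatch shows up and how it can be fixed, assuming Linux and using eth1 as a stand-in for the affected interface:

 ping -M do -s 8972 -c 3 client-bk     (8972 bytes payload + 28 bytes of headers = 9000; this fails on an MTU 1500 hop)
 ip link show eth1                     (check the current MTU)
 ip link set dev eth1 mtu 9000         (enable jumbo frames)

Small packets (pings, telnet handshakes, the initial nsradmin connect) pass an MTU 1500 hop just fine, which is exactly why every basic connectivity test looked OK while anything carrying a full-size payload hung.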

Although your help did not solve the issue, I just want to thank you for your support and for trying to help here; I really do appreciate it. Thanks.

