Avamar cron replication issues between Avamar 6.1.2-47 & 6.1.1-87

Question

Replication (cron job) from an Avamar 6.1.2-47 server to an Avamar 6.1.1-87 server fails with a very weird error message:

2013/10/20-21:01:49 avtar Error <5803>: Error writing 32-byte header to cache file /usr/local/avamar/var/p_bewetbu10.be.recticel.net-begerbu10-AVI_BACKUPS.dat. Possibly out of disk space
2013/10/20-21:01:49 avtar FATAL <5225>: Unable to open hash cache in directory '/usr/local/avamar/var'

A manual job works fine. Anyone? See details below...

2013/10/20-21:01:48 avtar Info <5551>: Command Line: /usr/local/avamar/bin/avtar.bin --flagfile=/usr/local/avamar/etc/usersettings.cfg --password=**************** --server=begerbu10 --vardir=/usr/local/avamar/var --bindir=/usr/local/avamar/bin --id=root --vardir=/usr/local/avamar/var --bindir=/usr/local/avamar/bin --sysdir=/usr/local/avamar/etc -x --replicate --workorderid=8185b01cd5d06f38 --allbackups --retention-type=none,daily,weekly,monthly,yearly --hashcachemax=32 --statistics --informationals=1 --account=/AVI_BACKUPS
2013/10/20-21:01:48 avtar Info <7977>: Starting at 2013-10-20 21:01:48 CEST [avtar Aug 19 2013 02:20:24 6.1.102-47 Linux-x86_64]
2013/10/20-21:01:48 avtar Info <9931>: Secondary flags: --flagfile=/usr/local/avamar/etc/usersettings.cfg --password=**************** --server=begerbu10 --vardir=/usr/local/avamar/var --bindir=/usr/local/avamar/bin --id=root -x --net-throttle=1.5 --server=bewetbu10.be.recticel.net --id=repluser --password=**************** --account=/REPLICATE/BEGERBU10.BE.RECTICEL.NET/AVI_BACKUPS --status=300
2013/10/20-21:01:48 avtar Info <8475>: ADE for multicore architectures enabled (Avamar Deduplication Engine v2.0.0)
2013/10/20-21:01:48 avtar Info <5552>: Connecting to Avamar Server (begerbu10)
2013/10/20-21:01:48 avtar Info <5554>: Connecting to one node in each datacenter
2013/10/20-21:01:48 avtar Info <5552>: Connecting to Avamar Server (bewetbu10.be.recticel.net)
2013/10/20-21:01:48 avtar Info <5554>: Connecting to one node in each datacenter
2013/10/20-21:01:49 avtar Info <5583>: Login User: 'root', Domain: 'default', Account: '/AVI_BACKUPS'
2013/10/20-21:01:49 avtar Info <5580>: Logging in on connection 0 (server 0)
2013/10/20-21:01:49 avtar Info <5582>: Avamar Server login successful
2013/10/20-21:01:49 avtar Info <5583>: Login User: 'repluser', Domain: 'default', Account: '/REPLICATE/BEGERBU10.BE.RECTICEL.NET/AVI_BACKUPS'
2013/10/20-21:01:49 avtar Info <5580>: Logging in on connection 0 (server 1)
2013/10/20-21:01:49 avtar Info <5582>: Avamar Server login successful
2013/10/20-21:01:49 avtar Info <5550>: Successfully logged into Avamar Server [6.1.2-47]
2013/10/20-21:01:49 avtar Info <5295>: Starting replicate at 2013-10-20 21:01:49 CEST as 'dpn' on 'begerbu10' (4 CPUs) [6.1.102-47]
2013/10/20-21:01:49 avtar Info <5949>: Backup file system character encoding is UTF-8.
2013/10/20-21:01:49 avtar Info <5667>: 113 backups found for client 'AVI_BACKUPS'
2013/10/20-21:01:49 avtar Info <7250>: Client 'AVI_BACKUPS' has 87 backups on target (bewetbu10.be.recticel.net)
2013/10/20-21:01:49 avtar Info <5688>: Loading hash cache /usr/local/avamar/var/p_bewetbu10.be.recticel.net-begerbu10-AVI_BACKUPS.dat
2013/10/20-21:01:49 avtar Info <8650>: Opening cache file /usr/local/avamar/var/p_bewetbu10.be.recticel.net-begerbu10-AVI_BACKUPS.dat
2013/10/20-21:01:49 avtar Error <5064>: Cannot open file '/usr/local/avamar/var/p_bewetbu10.be.recticel.net-begerbu10-AVI_BACKUPS.dat'
2013/10/20-21:01:49 avtar Info <5065>: Creating new cache file /usr/local/avamar/var/p_bewetbu10.be.recticel.net-begerbu10-AVI_BACKUPS.dat (1,573,408 bytes)
2013/10/20-21:01:49 avtar Error <5803>: Error writing 32-byte header to cache file /usr/local/avamar/var/p_bewetbu10.be.recticel.net-begerbu10-AVI_BACKUPS.dat. Possibly out of disk space
2013/10/20-21:01:49 avtar FATAL <5225>: Unable to open hash cache in directory '/usr/local/avamar/var'
2013/10/20-21:01:49 avtar Stats <6152>: Hash cache: 65,536 entries, added/updated 0, booted 0
2013/10/20-21:01:49 begin stack dump bp=(nil)
2013/10/20-21:01:49 end stack dump bp=(nil)
2013/10/20-21:01:49 avtar FATAL <5889>: Fatal signal 11 in pid 118358
2013/10/20-21:01:49 Fatal signal 11
2013/10/20-21:01:49 [118358] | 00000000007ea098
2013/10/20-21:01:49 [118358] | 00007fca31c6a6b0
2013/10/20-21:01:49 [118358] | 0000000000740f02
2013/10/20-21:01:49 [118358] | 0000000000748c80

TomLambrechts · Accepted Answer

Workaround Avamar 6.1.2_47 – repl_cron replication issue – Error writing 32-byte header to cache file | Tom Lambrechts

Nayak2010 · Answer

Do you by any chance have multiple replication jobs running at the same time? I'd also suggest opening a support ticket so the issue can be investigated thoroughly.

TomLambrechts · Answer

That’s a very good question… I’ll check… because indeed we have 3 locations, 2 of which replicate to a central location.

So there’s a very good chance that you are right here…

I’ll keep you posted.

Thanks !

TomLambrechts · Answer

No other replication jobs are running at the same time.
We are now also seeing this on other Avamar systems running the same version (6.1.2-47).
The scheduled replication fails on AVI_BACKUPS, 1 client, EM_BACKUPS and MC_BACKUPS... some client replication jobs are OK.

Any ideas, please?

Nayak2010 · Answer

At this point, I'd suggest opening an SR with support.

gfznjhz · Answer

Hi,

I've encountered the same issue in v7.0. I've solved the problem.

The original replication was done with EM, so the p_cache files in the /usr/local/avamar/var directory were owned by the admin user with mode 644 (rw-r--r--). The cron job for replication is scheduled in the dpn user's crontab, so the dpn user cannot write to the p_cache files that already exist.

Just change the owner of p_cache files (chown dpn p_*.dat) and it works fine.
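
To see which cache files are affected before changing anything, a small check along these lines may help. It is only a sketch: the function name list_foreign_caches is made up for illustration, the p_*.dat pattern is taken from the command above, and it assumes GNU stat (for -c '%U') is available on the node.

```shell
#!/bin/sh
# Sketch only: list_foreign_caches is a hypothetical helper, not an Avamar tool.
# Usage: list_foreign_caches DIR EXPECTED_OWNER
# Prints "<owner> <path>" for every p_*.dat file in DIR that is NOT owned
# by EXPECTED_OWNER. Assumes GNU stat.
list_foreign_caches() {
    dir=$1
    want=$2
    for f in "$dir"/p_*.dat; do
        [ -e "$f" ] || continue        # glob matched nothing
        owner=$(stat -c '%U' "$f")     # owner name of the file
        if [ "$owner" != "$want" ]; then
            printf '%s %s\n' "$owner" "$f"
        fi
    done
}
```

For example, list_foreign_caches /usr/local/avamar/var dpn would print one line per cache file that the dpn user does not own. Empty output means there is nothing to chown.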

TomLambrechts · Answer

EMC support let me know that this is a bug in Avamar 6.1.2-47 and that, at the moment, there is no support document or fix.

You need to change the ownership every time you add a client to the repl_cron list.

Whenever you add a new client to the repl_cron list, the p_cache file for that client is owned by the admin account, and hence we get this error when that client is replicated.

As a workaround, we changed the ownership of those files from admin to dpn, and that resolved it.

A complete fix is included in Avamar 7.0 SP1.

TomLambrechts · Answer

example of the error... and the way to solve it:

2013/11/07-20:01:51 avtar Info <5688>: Loading hash cache /usr/local/avamar/var/p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat
2013/11/07-20:01:51 avtar Info <8650>: Opening cache file /usr/local/avamar/var/p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat
2013/11/07-20:01:51 avtar Error <5064>: Cannot open file "/usr/local/avamar/var/p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat"
2013/11/07-20:01:51 avtar Info <5065>: Creating new cache file /usr/local/avamar/var/p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat (1,573,408 bytes)
2013/11/07-20:01:51 avtar Error <5803>: Error writing 32-byte header to cache file /usr/local/avamar/var/p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat. Possibly out of disk space
2013/11/07-20:01:51 avtar FATAL <5225>: Unable to open hash cache in directory '/usr/local/avamar/var'

.....

root@begerbu10:/usr/local/avamar/var/#: ls -l p_*

-rw-rw-rw- 1 dpn admin 1573408 Nov 5 19:02 p_bewetbu10.be.blabla.net-begerbu10-AVI_BACKUPS.dat
-rw-rw-rw- 1 dpn admin 1573408 Nov 5 19:29 p_bewetbu10.be.blabla.net-begerbu10-EM_BACKUPS.dat
-rw-rw-rw- 1 dpn admin 1573408 Nov 5 19:29 p_bewetbu10.be.blabla.net-begerbu10-MC_BACKUPS.dat
-rw-rw-rw- 1 dpn admin 25166368 Nov 5 19:28 p_bewetbu10.be.blabla.net-begerbu10-begerms1.be.blabla.net.dat
-rw-rw-rw- 1 dpn admin 12583456 Oct 18 11:42 p_bewetbu10.be.blabla.net-begerbu10-vmwarepc064_UDSJFteNWuob4sn3tVBcLQ.dat
-rw-r--r-- 1 admin admin 1573408 Nov 7 16:00 p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat
-rw-r--r-- 1 admin admin 1573408 Nov 7 16:38 p_bewetbu11.be.blabla.net-begerbu10-EM_BACKUPS.dat
-rw-r--r-- 1 admin admin 1573408 Nov 7 16:39 p_bewetbu11.be.blabla.net-begerbu10-MC_BACKUPS.dat
-rw-r--r-- 1 admin admin 1573408 Nov 7 16:37 p_bewetbu11.be.blabla.net-begerbu10-begerms1.be.blabla.net.dat

===>>>

root@begerbu10:/usr/local/avamar/var/#: chown dpn p_bewetbu11.be.blabla.net-begerbu10-AVI_BACKUPS.dat
root@begerbu10:/usr/local/avamar/var/#: chown dpn p_bewetbu11.be.blabla.net-begerbu10-EM_BACKUPS.dat
root@begerbu10:/usr/local/avamar/var/#: chown dpn p_bewetbu11.be.blabla.net-begerbu10-MC_BACKUPS.dat
root@begerbu10:/usr/local/avamar/var/#: chown dpn p_bewetbu11.be.blabla.net-begerbu10-begerms1.be.blabla.net.dat

When you add another client to the replication job, you will have to do the same for that client's cache file!
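
The per-file chown commands above could be collapsed into a loop over one replication target. This is a rough sketch, not an official procedure: the function name chown_target_caches is hypothetical, the p_<target>-<source>-<client>.dat naming is inferred only from the listing above, and GNU stat is assumed. It prints the commands by default and changes ownership only when "apply" is passed.

```shell
#!/bin/sh
# Sketch only: chown_target_caches is a hypothetical helper.
# Usage: chown_target_caches DIR OWNER TARGET [apply]
# Handles cache files named p_<TARGET>-*.dat in DIR (naming inferred from
# the ls output in this thread). Without "apply" it is a dry run that only
# prints the chown commands it would execute.
chown_target_caches() {
    dir=$1
    owner=$2
    target=$3
    mode=${4:-dry-run}
    for f in "$dir"/p_"$target"-*.dat; do
        [ -e "$f" ] || continue                      # no matching files
        if [ "$(stat -c '%U' "$f")" = "$owner" ]; then
            continue                                 # already owned by OWNER
        fi
        if [ "$mode" = apply ]; then
            chown "$owner" "$f"                      # needs root in practice
        else
            echo "chown $owner $f"
        fi
    done
}
```

Run as root, chown_target_caches /usr/local/avamar/var dpn bewetbu11.be.blabla.net would list the four chown commands shown above. Adding apply as a fourth argument would execute them.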

jjbladester1 · Answer

I administer two Avamar 7.0 SP 1 grids. Replication is bi-directional and everything on one grid should be replicated to the other grid, which means their capacity utilization should be identical. When I found that one grid had 10% lower capacity utilization, I started investigating replication issues.

Both grids are throwing errors in /usr/local/avamar/var/cron/replicate.log:

2014/05/05-08:11:13 avtar Error <0000>: Unable to chmod hash cache file /usr/local/avamar/var/p_alb-avmr-01.oag.lawnet-nyc-avmr-01-internal-server-name-here.dat

I did a chmod 644 /usr/local/avamar/var/p_*.dat and a chown /usr/local/avamar/var/p_*.dat and kicked off replication on both grids via Enterprise Manager. Replication appears to be running properly now.

Is there a hotfix for Avamar 7.0 SP 1 for this issue?

TomLambrechts · Answer

Same problem as with 6.1.1 & 6.1.2... Manual Replication is showing different behaviour than Scheduled Replication. EMC Support ??

J_H_ · Answer

I would like to throw a different scenario into this.

I am on 7.0.101-61 and have converted to batch-job replication,

but I also have two-way replication, so the two grids should be the same size, and they are not (that comment is what got my attention).

I have looked at my p_*.dat files and I see two kinds of owner and group:

root root

dpn admin

So should I change the root root ones to dpn admin?

ionthegeek · Answer

Just to clarify, there are two issues described in this thread.

The first issue is the avtar FATAL. It is possible for this issue to cause a discrepancy in capacity if avtar exits before replicating the backups for a client. Hotfix 55125 is available to resolve this issue for 6.1.2-47 systems. This issue is also resolved in Avamar version 7.0.1-61.

The second issue is the "Unable to chmod hash cache file" message which -- on its own -- cannot explain any capacity discrepancy. This issue will not prevent backups from being replicated (though it will cause the system to report that replication has failed).

This issue can occur on 7.0 systems if both cron-based and plug-in-based replication are being used. If plug-in-based replication is in use, it is recommended to avoid cron-based replication entirely.

The p_cache*.dat files should be owned by user dpn, group admin and have 664 permissions so that both the dpn user (under which cron-based replication runs) and the admin user (under which manual replications run) can access the caches. The following commands (run as the root user) will correct the ownership and permissions on the cache files:

chown dpn:admin /usr/local/avamar/var/p_cache*.dat

chmod 664 /usr/local/avamar/var/p_cache*.dat
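
To confirm the result afterwards, something like the following could report any cache file that still deviates. This is a sketch under assumptions: check_cache_perms is a hypothetical name, GNU stat is assumed, and the function matches p_*.dat because that is how the cache files are named in the listings earlier in this thread.

```shell
#!/bin/sh
# Sketch only: check_cache_perms is a hypothetical helper.
# Usage: check_cache_perms DIR OWNER GROUP MODE
# Prints "<path>: <owner>:<group> <mode>" for each p_*.dat file in DIR whose
# owner, group or octal mode differs from the expected values.
# Returns 0 if every file matches, 1 otherwise. Assumes GNU stat.
check_cache_perms() {
    dir=$1
    want="$2:$3 $4"
    bad=0
    for f in "$dir"/p_*.dat; do
        [ -e "$f" ] || continue               # glob matched nothing
        actual=$(stat -c '%U:%G %a' "$f")     # e.g. "dpn:admin 664"
        if [ "$actual" != "$want" ]; then
            echo "$f: $actual"
            bad=1
        fi
    done
    return "$bad"
}
```

For example, check_cache_perms /usr/local/avamar/var dpn admin 664 would stay silent and return 0 once every cache file is owned by dpn:admin with mode 664.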

If there is a difference in capacity utilization, I would recommend working with support to confirm that replication is covering all of the intended clients, that it is completing successfully (and not timing out) and have them review the system to see if there are any stale backups hanging around under the MC_DELETED domain.

jjbladester1 · Answer

Ian,

I opened a Sev 2 ticket (SR 62897766) on this issue two days ago. So far, I've worked with "first level" and "second level" technical support engineers who themselves have been working with "engineering". We have performed the chown/chmod operations you mention several times but that is not fixing the problem and replicate.log is still filled with "Unable to chmod hash cache file" errors.

We are *only* using full site-to-site Enterprise Manager (cron-based) replication and have never touched group-based replication in the Avamar Administrator GUI. The L2 tech support person thought the issue could be with the permissions of files in /tmp/replicate/ but he wasn't sure. If the issue is resolved in 7.0.1-61, it must not apply to clients who upgraded from 6.1.1-87 since that is what we did on both of our Avamar grids.

root@avamar-server-1:/tmp/replicate/#: ls -ltrh
total 19M
-rw-r--r-- 1 admin admin 0 2014-05-06 12:10 empty
-rwxr-xr-x 1 dpn admin 9.3M 2014-05-06 12:55 replold.sh
-rwxr-xr-x 1 dpn admin 8.6M 2014-05-06 12:55 replnew.sh
-rw-rw-r-- 1 dpn admin 324K 2014-05-06 12:55 repldiff.sh

We don't have an MC_DELETED domain, but we do have MC_RETIRED. I deleted the stuff in there as it wasn't important and the backups for those retired clients were already expired from both grids. Now that those are gone, I just manually started repl_cron from Enterprise Manager. According to tail -f /usr/local/avamar/var/cron/replicate.log, I just received the following:

2014/05/07-08:33:32 avtar Error <0000>: Unable to chmod hash cache file /usr/local/avamar/var/p_avamar-server-2-avamar-server-1-internal-server-name.domain.com.dat

ionthegeek · Answer

jjbladester1 wrote: "If the issue is resolved in 7.0.1-61, it must not apply to clients who upgraded from 6.1.1-87 since that is what we did on both of our Avamar grids."

Remember, we're talking about two different issues. If you do not see fatal errors in the replicate log, your system is not affected by the first issue which is the one that is fixed in 7.0.1-61.
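
A quick way to tell the two issues apart is to count both kinds of message in the replicate log. The following is only a sketch: classify_replicate_log is a hypothetical name, and the match strings are copied from the log excerpts quoted in this thread.

```shell
#!/bin/sh
# Sketch only: classify_replicate_log is a hypothetical helper.
# Usage: classify_replicate_log LOGFILE
# Counts lines containing "avtar FATAL" (first issue) and lines containing
# "Unable to chmod hash cache file" (second issue), and prints both counts.
classify_replicate_log() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    fatal=$(grep -c 'avtar FATAL' "$log" || true)
    chmod_err=$(grep -c 'Unable to chmod hash cache file' "$log" || true)
    echo "fatal=$fatal chmod_errors=$chmod_err"
}
```

For example, classify_replicate_log /usr/local/avamar/var/cron/replicate.log reporting fatal=0 with a non-zero chmod_errors count would point to the second issue only.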

Replication behaves slightly differently depending on whether it is a scheduled replication job or was started manually through Enterprise Manager / Avamar Administrator. Scheduled replication jobs run as the dpn user (since they are started from dpn's crontab), whereas manual replications run as the admin user (since they are started directly by EM / MCS, which runs as the admin user).

In any case, if the problem you are most interested in resolving is the capacity imbalance, I would encourage you to ask support to focus on this. I won't say it's impossible, but it is vanishingly unlikely that the cache permissions error could be causing a capacity imbalance. Efforts should be focused on finding the root cause of the capacity difference if that is the problem you are trying to solve.

The MC_DELETED domain is not visible through the Avamar Administrator which is why I recommended you ask support to review its contents.
