Unsolved
2 Intern • 132 Posts
February 1st, 2018 07:00
Avamar shutdown taking a "long" time
Working on an issue and part of the resolution is to reboot the Avamar node.
Did all the prerequisite checking of processes etc., and issued the "dpnctl stop" command - 20 minutes later, the GSAN is still in the process of shutting down.
I see the following messages in the gsan.log, and they seem to be repeating:
2018/02/01-15:50:13.53083 {0.0} [manage:2158] samconn::sendmessage retry stripe still accessible excess waiting retry=0 force=0 stripeid=0.0-1 TERMINATEDISPATCHER=192 TO stripeid=0.0-1 targetstripeid=0.0-1 origin=[0.0,158.107.40.35:26000] seq=300219186634 flags=X:N:2229 kind=0 pri=28
2018/02/01-15:50:13.55125 {0.0} [srvm-219193096#srv:1946] samconn::sendmessage retry stripe still accessible excess waiting retry=0 force=0 stripeid=0.0-1 EXIT_DPN=66 TO stripeid=0.0-1 targetstripeid=0.0-1 origin=[0.0,158.107.40.35:26000] seq=300219186631 flags=X:N:2017 kind=0 pri=28
2018/02/01-15:50:19.50959 {0.0} [connbeat:188] servmain::checkconntimeout calling shutdown lastreqtime=3087892784907307 maxconninactive=3600 0x7f3690458230 clientaddr=158.107.40.35:51472 ismaint=1 type=avmaint access=uname=root uid=0 priv=enabled,create,read,backup,access,move,delete,maint,manage,fullmanage,noticketrequired,superuser,ignoreacls,readdir,mclogin,opt1,opt2 avail=modes=00pu
2018/02/01-15:50:19.50963 {0.0} [connbeat:188] servmain::shutdown already killed 158.107.40.35:51472
Not sure exactly what they mean, other than that something is delaying the shutdown.
Wondering if anyone could shed any light on this?
All feedback/comments appreciated - thanks.
Cal C.
ionthegeek
2 Intern • 2K Posts
February 1st, 2018 08:00
"This request has been waiting for a long time."
"Kill this backup session"
It's always a good idea to terminate any running backup or replication jobs before shutting the system down.
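For reference, here's roughly how I'd check for and cancel running activities from the utility node. This is a sketch from memory - the exact mccli flags and output can vary by release, and the activity ID below is a made-up placeholder - so verify against your version's CLI guide. The commands are guarded so the snippet is harmless on a box without the Avamar CLI:

```shell
# Run as admin on the Avamar utility node.
if command -v mccli >/dev/null 2>&1; then
  # List currently running backup/replication activities
  mccli activity show --active=true
  # Cancel one by ID (placeholder value - substitute an ID from the listing above)
  ACTIVITY_ID="9012345678901234"
  mccli activity cancel --id="$ACTIVITY_ID"
else
  echo "mccli not found; commands shown for reference only"
fi
```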
connolly
February 1st, 2018 08:00
I did terminate the only running job before shutting the system down.
The session monitor was empty, and "avmaint sessions" showed nothing whatsoever.
So - does that mean there's some hung/rogue connection somewhere?
DM-Avamar
7 Posts
February 1st, 2018 08:00
If it's an Avamar AVE: I had an issue where a failed array and a firmware problem caused the shutdown to take nearly an hour before it eventually stopped.
ionthegeek
February 1st, 2018 08:00
There may have been a hung backup session. Older releases (pre-7.3.1) are more prone to this type of issue. There were some changes made to connection handling in 7.3.1+ that helped a lot. The hung session may disappear on its own after an hour but if not, you may have to kill the GSAN and roll back. We always recommend taking a checkpoint before shutting down.
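The pre-shutdown sequence we recommend looks roughly like the sketch below. Command names are from memory (check the flags and the checkpoint tag format against your release's admin guide), and everything is guarded so the snippet does nothing on a machine that isn't an Avamar node:

```shell
# Run as admin on the Avamar utility node.
if command -v dpnctl >/dev/null 2>&1; then
  avmaint sessions --ava     # confirm no backup/restore sessions remain
  avmaint checkpoint --ava   # take a fresh checkpoint to roll back to if startup goes badly
  cplist                     # verify the new checkpoint shows up in the list
  dpnctl stop                # then shut the software down
else
  echo "Avamar CLI not found; sequence shown for reference only"
fi
```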
connolly
February 1st, 2018 10:00
Is there any way to verify whether a session is hung or not at the Avamar or Linux level? Also, if one is following the standard procedure and cancelling backups prior to performing the documented Avamar software steps, should there be any kind of "quiesce" time to wait between cancelling the last backup and running the "dpnctl stop" command?
FWIW, the Avamar in question is running v7.5.0-183, so I guess it's "less prone" - but not immune, it would seem. And the software shutdown was scheduled so that we could use the daily Avamar checkpoints that run under the regular maintenance schedule as the "pre-shutdown" checkpoints.
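One thing I suppose I can check at the Linux level in the meantime - purely my own guess, not any documented procedure - is whether anything still holds a socket on the GSAN port that shows up in my log (26000 here):

```shell
# List TCP connections involving the GSAN port from the gsan.log above.
# A client address that "avmaint sessions" no longer reports, but that still
# holds a socket here, would be a candidate for the kind of stale connection
# that connbeat was timing out (note the maxconninactive=3600 in the log).
ss -tn 2>/dev/null | grep ':26000' || echo "no connections on port 26000"
```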
connolly
February 2nd, 2018 07:00
In this case, it is a physical Avamar node, Gen4T.
connolly
February 2nd, 2018 07:00
Update - I ended up rebooting the node, and on GSAN startup, it didn't appreciate how it shut down and we had to roll back. Interesting sidebar to the rollback was that something "goofy" happened in the "getlogs" portion - weird permissions issue, working on it with Support but managed to get past it to get the node up.
The other interesting thing - and I don't know if it's related to the shutdown issue - is that on startup, the event log contained activity "status" entries for replication sessions that had finished almost 2 hours before we even started the Avamar software shutdown process, but with dates that coincided with the MCS startup. FWIW, they were all partial replications - but I don't think I've ever seen that kind of "carryover" before, where an activity event for a session that completed before the Avamar software was shut down shows up in the event log after an MCS restart.