Can't switch back to primary after server reboot.

We had a server reboot, and since then redundancy cannot switch back to the primary server.
Somehow the backup does not see it as ready.


on primary


No issues at the network level.

Hi @progman,
The fastest way to get someone to dig into this issue with you would be to open a support ticket, so that support can look at your logs and help resolve the issue more quickly.

This is the community, right?
I want to give it a try here.
Support is another matter.
I’m not sure we have support for this. The system was unmanaged for years :wink:
Working silently.

Haha - that’s great to hear that it’s been humming away all this time!
It looks like your Primary isn’t ready to take back over, as it’s reporting “Local Inactive”. If you run “show redundancy detail”, see if you can find more information.
I see this in the docs:
More info here: Monitoring Redundancy

Hope that helps!

What information is given when you run “show config-sync”?
If it shows “out of sync” you may need to run “assert-master” or “assert-leader” depending on the version of your broker.
Given that it’s been unmanaged for years I assume it is the former, so if “out of sync” try something like this…
ip-172-31-7-xxx# admin
ip-172-31-7-xxx(admin)# config-sync
ip-172-31-7-xxx(admin/config-sync)# assert-master router
ip-172-31-7-xxx(admin/config-sync)# assert-master message-vpn
ip-172-31-7-xxx(admin/config-sync)# show config-sync

@mstobo
Yes, I did check that.
It seems OK.


@marc.dipasquale
Here are details:

vmrncnqv5ldefkcprimary> show redundancy detail
Configuration Status   : Enabled
Redundancy Status    : Down
 Last Failure Reason  : One or more nodes is offline
 Last Failure Time   : Mar 27 2023 13:47:01 UTC
Operating Mode      : Message Routing Node
Switchover Mechanism   : Hostlist
Auto Revert       : No
Redundancy Mode     : Active/Standby
Active-Standby Role   : Primary
Mate Router Name     : vmrncnqv5ldefkcbackup
 SMF plain-text Port  : 
 SMF compressed Port  : 
 SMF SSL Port      : 
 Mate-Link Connect Via : vmrncnqv5ldefkc1:8741
ADB Link To Mate     : Up
 Last Failure Reason  : Mate Link Restart
 Last Failure Time   : Mar 24 2023 19:32:16 UTC
ADB Hello To Mate    : Up
 Last Failure Reason  : N/A
 Last Failure Time   : 
 Hello Interval (ms)  : 1000
 Hello Timeout (ms)   : 3000
 Avg Hello Latency (ms) : 1
 Max Hello Latency (ms) : 224
                Primary Virtual Router Backup Virtual Router
                ---------------------- ----------------------
Activity Status        Local Inactive     Shutdown
Routing Interface       intf0:1         intf0:1
Routing Interface Status    Up            
VRRP Status          Initialize        
VRRP Priority         -1            
Message Spool Status      AD-Standby        
Priority Reported By Mate:   Active          
 ADB Hello Protocol       Active          
 VRRP              None (-1)        
Activity Status:        Mate Active       
 Operational Status       Not Ready        
  Redundancy Config Status    Enabled         
  Message Spool Status      Ready          
 SMRP Status          Ready          
  Db Build Status        Ready          
  Db Sync Status         Ready          
 Internal Priority       None           
 Internal Activity Status    Mate Active       
 Internal Redundancy State   Pri-NotReady       
Message Spool Status:     Ready          
 Message Spool Config Status  Enabled         
 VRID Config Status       Ready          
 ADB Status           Ready          
  Flash Module Status      Ready          
  Power Module Status      Ready          
 ADB Contents          Ready          
  Local Contents Key       224.251.75.43:49,122   
  Mate Contents Key       224.251.75.43:49,122   
  Schema Match          Yes           
 Disk Status          Ready          
 Disk Contents         Ready          
  Disk Key (Primary)       224.251.75.43:49,122   
  Disk Key (Backup)       224.251.75.43:49,122   
 ADB Datapath Status      Ready          
 Internal Redundancy State   AD-Standby        
 Lock Owner           N/A           
VRID Config Parameter     Local Configuration   Received From Mate
----------------------------- ---------------------- ----------------------
 Primary VRID          224.251.75.43      224.251.75.43
 Backup VRID                       
 AD-Enabled VRID        224.251.75.43      224.251.75.43
 Disk WWN:                        
  Local:       N/A
  Received From Mate: N/A
 * - indicates configuration mismatch between local and mate router

@progman Does “show config-sync” look good from the backup, as well?

You sure it’s not a networking issue? Backup and monitor can see each other, but can’t see primary, and primary can’t see anyone. There’s a ping command available in CLI too, maybe just try that to confirm?

Yes
Looks good there as well

@Aaron
Yes. I can see the traffic with tcpdump.
I can telnet to any of the ports between the nodes.
I checked logs everywhere.
I could not find out why the primary marks all nodes in the group as offline.
From the primary I can reach the mate port 8741 on those nodes.
Likewise, on those nodes I see traffic from the primary on the config sync port 8300.
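
For anyone wanting to repeat the telnet-style check without telnet, here is a minimal Python sketch. The function name `port_open` and the 3-second default timeout are illustrative assumptions, not anything the broker provides. Note that a successful TCP connect only shows the port is reachable; it says nothing about whether the service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

From the primary this could be called as `port_open("vmrncnqv5ldefkcbackup", 8741)` or with port 8300, using the host names and ports mentioned in this thread.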

Even though we’re not showing them out of sync, did you try running the assert-master (or leader) commands on the backup broker, since it has the better state right now?

Found what it was.
In the system logs I found complaints from consul:
2023-03-28T11:48:58.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from “vmrncnqv5ldefkcprimary”: context deadline exceeded
2023-03-28T11:48:59.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from “vmrncnqv5ldefkcprimary”: rpc error getting client: failed to get conn: dial tcp 10.60.25.6:0->127.0.0.1:8300: getsockopt: connection refused
2023-03-28T11:49:00.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from “vmrncnqv5ldefkcprimary”: context deadline exceeded
2023-03-28T11:49:01.458+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from “vmrncnqv5ldefkcprimary”: rpc error getting client: failed to get conn: dial tcp 10.60.25.6:0->127.0.0.1:8300: getsockopt: connection refused
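
The detail worth noticing in these lines is the dial target 127.0.0.1:8300, a loopback address rather than a routable one. A rough Python sketch for flagging that pattern in a log file follows; the regular expression is an assumption derived only from the four lines quoted above, and other consul versions may word the message differently.

```python
import re

# Pattern assumed from the log lines quoted in this thread.
# \W? tolerates straight or curly quotes around the node name.
DIAL_RE = re.compile(
    r"error getting server health from\s+\W?(?P<node>[\w.-]+)\W?:"
    r".*dial tcp (?P<src>[\d.]+):\d+->(?P<dst>[\d.]+):(?P<port>\d+)"
)


def loopback_dials(lines):
    """Return (node, port) pairs where consul dialed a 127.x.x.x address."""
    hits = []
    for line in lines:
        match = DIAL_RE.search(line)
        if match and match.group("dst").startswith("127."):
            hits.append((match.group("node"), int(match.group("port"))))
    return hits
```

Lines that only say "context deadline exceeded" carry no dial target and are skipped.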

Then I looked into the consul agent config on both nodes:
/var/lib/solace/consul.json
On the primary there is somehow a wrong option; on the backup there is no such option at all:
“advertise_addr”: “127.0.0.1”,

I don’t know where it comes from. It will surely prevent consul from getting its cluster up, since it advertises 127.0.0.1 instead of the primary node’s IP.
So I removed that option and killed the agent. It restarted automatically and the consul cluster came up.
Then the Solace primary node became fully OK and redundancy switched back to it.

Now I know why it broke after the reboot :slight_smile:
Years ago someone had already fixed this temporarily, and the fix was lost on reboot.

@mstobo @marc.dipasquale any chance you guys know where that advertise_addr comes from?
And how can I remove it permanently from the configuration?
Maybe we have a version of Solace that had this bug?

@progman sorry I don’t. I think that we would need a support ticket to find that out; it likely gets into the internals of the code.

I forgot to mention that these are running in docker.
So once I restart docker, that config comes back with “advertise_addr”: “127.0.0.1”.
I assume that it is part of the product.
So far I have not found how to set it the right way.

So most likely I will set up a simple cron job that checks that config, removes the bad option if it is there, and then restarts consul.
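
A cron job like that could be sketched roughly as follows in Python. This is an unsupported workaround sketch, not a Solace-provided mechanism. The config path is the one quoted in this thread; the function names and the `pkill` restart command are assumptions (it presumes `pkill` exists inside the container and that the agent respawns after being killed, as it did here).

```python
import ipaddress
import json
import subprocess

CONSUL_CONFIG = "/var/lib/solace/consul.json"  # path quoted in this thread
RESTART_CMD = ["pkill", "-x", "consul"]  # assumption: agent respawns when killed


def strip_loopback_advertise(config):
    """Remove advertise_addr from the dict if it is a loopback address.

    Returns True when the dict was modified.
    """
    value = config.get("advertise_addr")
    if value is None:
        return False
    try:
        is_loopback = ipaddress.ip_address(value).is_loopback
    except ValueError:
        return False  # not a plain IP literal; leave it untouched
    if is_loopback:
        del config["advertise_addr"]
    return is_loopback


def fix_file(path=CONSUL_CONFIG):
    """Rewrite the config without a loopback advertise_addr; True if changed."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    if not strip_loopback_advertise(config):
        return False
    # Overwrite in place so the file keeps its owner and permissions.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
    return True


def run_once(path=CONSUL_CONFIG, restart_cmd=RESTART_CMD):
    """One cron iteration: restart the agent only if the file was changed."""
    if fix_file(path):
        subprocess.run(restart_cmd, check=False)
        return True
    return False
```

A cron entry would call `run_once()` every minute or so. Because the agent is only killed when the file actually changed, a healthy node is left alone.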

Wow, that’s wacky. You definitely should not have to go digging around in the internals of the container. What version of SolOS are you running? “show version” should show the uptime as well.

@progman just checking back in on this thread. I noticed you already provided the SolOS version in reply #6 above, so ignore my last comment.
Did you try upgrading the broker? 9.0.1 is a pretty old version. You might have to upgrade to an intermediate version before going to the latest (10.3.1). If you ran into this problem on a newer broker, I’m sure our engineering team would want to take a look at your issue. 9.0.1 being so old, I don’t think we’ll get much help.