We had a server reboot, and after that redundancy cannot switch back to the primary server.
Somehow the backup does not see it as ready.
on primary
No issues at the network level.
Hi @progman ,
The fastest way to get someone to dig into this issue with you would be to open a support ticket; support can take a look at your logs and help resolve the issue more quickly.
This is the community forum, right? I want to give it a try here first.
Support is another thing; I’m not sure we even have support for this system. It was unmanaged for years,
working silently.
Haha - that’s great to hear that it’s been humming away all this time!
It looks like your Primary isn’t ready to take back over, as it’s reporting “Local Inactive”. If you run “show redundancy detail”,
see if you can find more information.
I see this in the docs:
More info here: Monitoring Redundancy
Hope that helps!
What information is given when you run “show config-sync”?
If it shows “out of sync” you may need to run “assert-master” or “assert-leader”, depending on the version of your broker.
Given that it’s been unmanaged for years, I assume it’s the former, so if it’s out of sync, try something like this…
ip-172-31-7-xxx# admin
ip-172-31-7-xxx(admin)# config-sync
ip-172-31-7-xxx(admin/config-sync)# assert-master router
ip-172-31-7-xxx(admin/config-sync)# assert-master message-vpn
ip-172-31-7-xxx(admin/config-sync)# show config-sync
@marc.dipasquale
Here are details:
vmrncnqv5ldefkcprimary> show redundancy detail
Configuration Status : Enabled
Redundancy Status : Down
Last Failure Reason : One or more nodes is offline
Last Failure Time : Mar 27 2023 13:47:01 UTC
Operating Mode : Message Routing Node
Switchover Mechanism : Hostlist
Auto Revert : No
Redundancy Mode : Active/Standby
Active-Standby Role : Primary
Mate Router Name : vmrncnqv5ldefkcbackup
SMF plain-text Port :
SMF compressed Port :
SMF SSL Port :
Mate-Link Connect Via : vmrncnqv5ldefkc1:8741
ADB Link To Mate : Up
Last Failure Reason : Mate Link Restart
Last Failure Time : Mar 24 2023 19:32:16 UTC
ADB Hello To Mate : Up
Last Failure Reason : N/A
Last Failure Time :
Hello Interval (ms) : 1000
Hello Timeout (ms) : 3000
Avg Hello Latency (ms) : 1
Max Hello Latency (ms) : 224
Primary Virtual Router Backup Virtual Router
---------------------- ----------------------
Activity Status Local Inactive Shutdown
Routing Interface intf0:1 intf0:1
Routing Interface Status Up
VRRP Status Initialize
VRRP Priority -1
Message Spool Status AD-Standby
Priority Reported By Mate: Active
ADB Hello Protocol Active
VRRP None (-1)
Activity Status: Mate Active
Operational Status Not Ready
Redundancy Config Status Enabled
Message Spool Status Ready
SMRP Status Ready
Db Build Status Ready
Db Sync Status Ready
Internal Priority None
Internal Activity Status Mate Active
Internal Redundancy State Pri-NotReady
Message Spool Status: Ready
Message Spool Config Status Enabled
VRID Config Status Ready
ADB Status Ready
Flash Module Status Ready
Power Module Status Ready
ADB Contents Ready
Local Contents Key 224.251.75.43:49,122
Mate Contents Key 224.251.75.43:49,122
Schema Match Yes
Disk Status Ready
Disk Contents Ready
Disk Key (Primary) 224.251.75.43:49,122
Disk Key (Backup) 224.251.75.43:49,122
ADB Datapath Status Ready
Internal Redundancy State AD-Standby
Lock Owner N/A
VRID Config Parameter Local Configuration Received From Mate
----------------------------- ---------------------- ----------------------
Primary VRID 224.251.75.43 224.251.75.43
Backup VRID
AD-Enabled VRID 224.251.75.43 224.251.75.43
Disk WWN:
Local: N/A
Received From Mate: N/A
* - indicates configuration mismatch between local and mate router
@progman Does “show config-sync” look good from the backup, as well?
Are you sure it’s not a networking issue? The backup and monitor can see each other but can’t see the primary, and the primary can’t see anyone. There’s a ping command available in the CLI too; maybe try that to confirm?
@Aaron
Yes, I can see traffic with tcpdump.
I can telnet to any port between them.
I’ve checked logs everywhere but had no luck finding why the primary marked all nodes in the group as offline.
I can reach the mate on port 8741 from the primary.
Likewise, I see traffic from the primary on config-sync port 8300 on those nodes.
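For anyone hitting something similar, the reachability checks above (mate-link port 8741, config-sync port 8300) can be sketched as a small script. This uses bash’s built-in /dev/tcp so no extra tools are needed; the hostname is the mate name from this thread and is an assumption — substitute your own:

```shell
#!/bin/bash
# Sketch: TCP reachability checks, like the tcpdump/telnet tests above.
check_port() {
    # $1 = host, $2 = port; prints whether a TCP connect succeeds
    if timeout 3 bash -c ">/dev/tcp/$1/$2" 2>/dev/null; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
    fi
}

# 8741 = mate-link, 8300 = config-sync (ports from this thread)
for port in 8741 8300; do
    check_port vmrncnqv5ldefkc1 "$port"
done
```

Note this only proves the TCP handshake works, which matches what telnet showed here; it says nothing about what the service advertises once connected.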
Even though we’re not showing them out of sync, did you try running the assert-master (or leader) commands on the backup broker, since it has the better state right now?
Found what it was.
In the system logs I found complaints from consul:
2023-03-28T11:48:58.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": context deadline exceeded
2023-03-28T11:48:59.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": rpc error getting client: failed to get conn: dial tcp 10.60.25.6:0->127.0.0.1:8300: getsockopt: connection refused
2023-03-28T11:49:00.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": context deadline exceeded
2023-03-28T11:49:01.458+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": rpc error getting client: failed to get conn: dial tcp 10.60.25.6:0->127.0.0.1:8300: getsockopt: connection refused
Then I looked into the consul agent config on both nodes:
/var/lib/solace/consul.json
On the primary there is somehow a wrong option; on the backup there is no such option at all:
"advertise_addr": "127.0.0.1",
I don’t know where it comes from, but it will certainly keep consul from getting its cluster up, as it advertises 127.0.0.1 instead of the primary node’s IP.
So I removed that option and killed the agent. It started again automatically and the consul cluster came up.
And then the Solace primary node became fully OK and redundancy switched back to it.
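For reference, a fixed consul.json would either omit advertise_addr entirely (letting the agent derive it from its bind address, which is effectively what removing the line achieved here) or pin it to the node’s own routable IP. A minimal sketch with a placeholder value, not taken from the real file:

```json
{
  "advertise_addr": "<primary-node-ip>"
}
```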
Now I know why it broke after the reboot:
someone, years back, had already fixed this, but only temporarily, so the fix was gone after the reboot.
@mstobo @marc.dipasquale any chance you know where that advertise_addr came from?
And how to remove it permanently from the configuration?
Maybe we have a version of Solace that had this bug?
@progman sorry I don’t. I think that we would need a support ticket to find that out; it likely gets into the internals of the code.
Forgot to mention that these are running in Docker.
So once I restart Docker, the config comes back with "advertise_addr": "127.0.0.1".
I assume that is part of the product.
So far I haven’t found the right way to set it.
So most likely I’ll set up a simple cron job to check the config, remove the bad option if it’s there, and then restart consul.
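A minimal sketch of such a cron job, assuming the config path from this thread; the line-delete approach and the pkill pattern are assumptions, not an official fix — the thread relied on the agent respawning automatically after a kill:

```shell
#!/bin/sh
# Sketch: remove the bad advertise_addr line from consul.json if
# present, then kill the agent so the supervisor respawns it.
CFG=${1:-/var/lib/solace/consul.json}

fix_config() {
    # $1 = path to consul.json; returns 0 only if the bad line was removed
    [ -f "$1" ] || return 1
    grep -q '"advertise_addr": "127.0.0.1"' "$1" 2>/dev/null || return 1
    # Deleting the whole line keeps the JSON valid as long as
    # advertise_addr is not the final key (it had a trailing comma here).
    sed -i '/"advertise_addr": "127.0.0.1"/d' "$1"
}

if fix_config "$CFG"; then
    # Per the thread, the consul agent starts again automatically.
    pkill -f 'consul agent' || true
fi
```

This is only a stopgap until whatever writes the option at container start is identified; run it from cron at whatever interval matches how often the container restarts.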
Wow, that’s wacky. You definitely should not have to go digging around in the internals of the container. What version of SolOS are you running? “show version” should show the uptime as well.
@progman just checking back in on this thread. I noticed you already provided the SolOS version in reply #6 above, so ignore my last comment.
Did you try upgrading the broker? 9.0.1 is a pretty old version, and you might have to upgrade to an intermediate version before going to the latest (10.3.1). If you run into this problem on a newer broker, I’m sure our engineering team would want to take a look at your issue; with 9.0.1 being so old, I don’t think we’ll get much help.