Can't switch back to primary after server reboot.

progman
progman Member Posts: 9

We had a server reboot, and after that redundancy cannot switch back to the primary server.

Somehow the backup does not see it as ready.


on primary



No issues on network level.

Comments

  • marc
    marc Member, Administrator, Moderator, Employee Posts: 963 admin

    Hi @progman,

    The fastest way to get someone to dig into this issue with you would be to open a support ticket and they can take a look at your logs to help resolve the issue quicker.

  • progman
    progman Member Posts: 9

    This is the community forum, right?

    I want to give it a try here.

    Support is another matter.

    I'm not sure we have support for that. The system was unmanaged for years ;-)

    Working silently.

  • marc
    marc Member, Administrator, Moderator, Employee Posts: 963 admin

    Haha - that's great to hear that it's been humming away all this time!

    It looks like your Primary isn't ready to take back over as it's reporting "Local Inactive". If you do a `show redundancy detail` see if you can find more information.

    I see this in the docs:

    More info here: https://docs.solace.com/Features/HA-Redundancy/Monitoring-Appliance-Redundancy.htm


    Hope that helps!

  • mstobo
    mstobo Member, Employee Posts: 26 Solace Employee

    What information is given when you run "show config-sync"?

    If it shows "out of sync" you may need to run "assert-master" or "assert-leader" depending on the version of your broker.

    Given that it's been unmanaged for years I assume it is the former, so if "out of sync" try something like this...

    ip-172-31-7-xxx# admin

    ip-172-31-7-xxx(admin)# config-sync

    ip-172-31-7-xxx(admin/config-sync)# assert-master router

    ip-172-31-7-xxx(admin/config-sync)# assert-master message-vpn <your vpn>

    ip-172-31-7-xxx(admin/config-sync)# show config-sync

  • progman
    progman Member Posts: 9

    @mstobo

    Yes, I did check that.

    It seems OK



  • progman
    progman Member Posts: 9
    edited March 2023 #7

    @marc

    Here are details:


    vmrncnqv5ldefkcprimary> show redundancy detail
    Configuration Status   : Enabled
    Redundancy Status    : Down
     Last Failure Reason  : One or more nodes is offline
     Last Failure Time   : Mar 27 2023 13:47:01 UTC
    Operating Mode      : Message Routing Node
    Switchover Mechanism   : Hostlist
    Auto Revert       : No
    Redundancy Mode     : Active/Standby
    Active-Standby Role   : Primary
    Mate Router Name     : vmrncnqv5ldefkcbackup
     SMF plain-text Port  : 
     SMF compressed Port  : 
     SMF SSL Port      : 
     Mate-Link Connect Via : vmrncnqv5ldefkc1:8741
    ADB Link To Mate     : Up
     Last Failure Reason  : Mate Link Restart
     Last Failure Time   : Mar 24 2023 19:32:16 UTC
    ADB Hello To Mate    : Up
     Last Failure Reason  : N/A
     Last Failure Time   : 
     Hello Interval (ms)  : 1000
     Hello Timeout (ms)   : 3000
     Avg Hello Latency (ms) : 1
     Max Hello Latency (ms) : 224
    
                    Primary Virtual Router Backup Virtual Router
                    ---------------------- ----------------------
    Activity Status        Local Inactive     Shutdown
    Routing Interface       intf0:1         intf0:1
    Routing Interface Status    Up            
    VRRP Status          Initialize        
    VRRP Priority         -1            
    Message Spool Status      AD-Standby        
    
    Priority Reported By Mate:   Active          
     ADB Hello Protocol       Active          
     VRRP              None (-1)        
    
    Activity Status:        Mate Active       
     Operational Status       Not Ready        
      Redundancy Config Status    Enabled         
      Message Spool Status      Ready          
     SMRP Status          Ready          
      Db Build Status        Ready          
      Db Sync Status         Ready          
     Internal Priority       None           
     Internal Activity Status    Mate Active       
     Internal Redundancy State   Pri-NotReady       
    
    Message Spool Status:     Ready          
     Message Spool Config Status  Enabled         
     VRID Config Status       Ready          
     ADB Status           Ready          
      Flash Module Status      Ready          
      Power Module Status      Ready          
     ADB Contents          Ready          
      Local Contents Key       224.251.75.43:49,122   
      Mate Contents Key       224.251.75.43:49,122   
      Schema Match          Yes           
     Disk Status          Ready          
     Disk Contents         Ready          
      Disk Key (Primary)       224.251.75.43:49,122   
      Disk Key (Backup)       224.251.75.43:49,122   
     ADB Datapath Status      Ready          
     Internal Redundancy State   AD-Standby        
     Lock Owner           N/A           
    
    VRID Config Parameter     Local Configuration   Received From Mate
    ----------------------------- ---------------------- ----------------------
     Primary VRID          224.251.75.43      224.251.75.43
     Backup VRID                       
     AD-Enabled VRID        224.251.75.43      224.251.75.43
    
     Disk WWN:                        
      Local:       N/A
      Received From Mate: N/A
     * - indicates configuration mismatch between local and mate router
    


  • mstobo
    mstobo Member, Employee Posts: 26 Solace Employee

    @progman Does "show config-sync" look good from the backup, as well?

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 644 admin

    You sure it's not a networking issue? Backup and monitor can see each other, but can't see primary, and primary can't see anyone. There's a ping command available in CLI too, maybe just try that to confirm?

  • progman
    progman Member Posts: 9

    @Aaron

    Yes. I can see traffic with tcpdump.

    I can telnet to any of the ports between the nodes.

    I checked the logs everywhere.

    I could not find out why the primary marks all nodes in the group as offline.

    From the primary I can reach the mate on port 8741.

    Likewise, I see traffic on those nodes from the primary on config sync port 8300.
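The telnet-style check described above could be scripted roughly as follows. This is only a sketch: `can_connect` is a hypothetical helper name, and the host names and ports you pass in are whatever applies to your own nodes (8741 and 8300 are the ports mentioned in this thread).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `can_connect("vmrncnqv5ldefkcbackup", 8741)` run on the primary would mirror the manual telnet test. A successful connect only shows TCP reachability; it says nothing about whether the service behind the port is healthy.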


  • mstobo
    mstobo Member, Employee Posts: 26 Solace Employee

    Even though we're not showing them out of sync, did you try running the assert-master (or leader) commands on the backup broker, since it has the better state right now?

  • progman
    progman Member Posts: 9

    Found what it was.

    In the system logs I found complaints from consul:

    2023-03-28T11:48:58.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": context deadline exceeded

    2023-03-28T11:48:59.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": rpc error getting client: failed to get conn: dial tcp 10.60.25.6:0->127.0.0.1:8300: getsockopt: connection refused

    2023-03-28T11:49:00.460+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": context deadline exceeded

    2023-03-28T11:49:01.458+00:00 <local5.warning> vmrncnqv5ldefkc1 consul[2091]: consul: error getting server health from "vmrncnqv5ldefkcprimary": rpc error getting client: failed to get conn: dial tcp 10.60.25.6:0->127.0.0.1:8300: getsockopt: connection refused
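Warnings like these can be scanned for dial attempts that target a loopback address, which is the suspicious part of the lines above. A minimal sketch, assuming the log format is exactly as pasted here; `DIAL_RE` and `loopback_dials` are hypothetical names, not part of consul or Solace:

```python
import ipaddress
import re

# Matches the "dial tcp SRC->DST_IP:DST_PORT" part of the consul warnings above.
DIAL_RE = re.compile(
    r'error getting server health from "(?P<node>[^"]+)": '
    r'.*dial tcp (?P<src>[\d.:]+)->(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)'
)


def loopback_dials(lines):
    """Return (node, dst_ip, dst_port) for every dial aimed at a loopback address."""
    hits = []
    for line in lines:
        match = DIAL_RE.search(line)
        if match and ipaddress.ip_address(match["dst_ip"]).is_loopback:
            hits.append((match["node"], match["dst_ip"], int(match["dst_port"])))
    return hits
```

A hit means another node was told to reach the named server at 127.0.0.1, which can only ever point back at the caller itself.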


    Then I looked into the consul agent config on both nodes.

    /var/lib/solace/consul.json

    On the primary there is somehow a wrong option. On the backup there is no such option at all.

    "advertise_addr": "127.0.0.1",


    I don't know where it is coming from. Surely that prevents consul from getting its cluster up, since it advertises 127.0.0.1 instead of the primary node's IP.

    So I removed that option and killed the agent. It started again automatically and the consul cluster came up.

    Then the solace primary node became fully OK and redundancy switched to it.


    Now I know why it broke after the reboot :)

    Someone years back already fixed this temporarily, and the fix was lost on reboot.

  • progman
    progman Member Posts: 9

    @mstobo @marc any chance you guys know where that advertise_addr came from?

    And how to remove it permanently from the configuration?

    Maybe we have a version of solace that had that bug?

  • mstobo
    mstobo Member, Employee Posts: 26 Solace Employee

    @progman sorry I don't. I think that we would need a support ticket to find that out; it likely gets into the internals of the code.

  • progman
    progman Member Posts: 9

    I forgot to mention that those are running in docker.

    So once I restart docker, that config comes back with "advertise_addr": "127.0.0.1".

    I assume that is part of the product.

    So far I have not found how to set it the right way.


    So most likely I will set up a simple cron job to check that config, remove the bad option if it is there, and then restart consul.
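The check-and-remove part of such a cron job might look like the sketch below. It assumes the file is plain JSON with `advertise_addr` as a top-level key, as the snippet quoted earlier suggests; `drop_loopback_advertise` is a hypothetical helper name. Restarting the consul agent is deliberately left out, since the thread only establishes that the agent came back on its own after being killed.

```python
import json
from pathlib import Path


def drop_loopback_advertise(path) -> bool:
    """Remove "advertise_addr": "127.0.0.1" from a JSON config file.

    Returns True if the file was changed, False if nothing needed fixing.
    """
    config_path = Path(path)
    config = json.loads(config_path.read_text())

    # Only touch the file when the bad loopback value is actually present.
    if config.get("advertise_addr") != "127.0.0.1":
        return False

    del config["advertise_addr"]

    # Write to a temporary file first, then swap it in, so a crash
    # mid-write does not leave a truncated config behind.
    tmp_path = config_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(config, indent=2) + "\n")
    tmp_path.replace(config_path)
    return True
```

Calling `drop_loopback_advertise("/var/lib/solace/consul.json")` would report whether a restart of the agent is needed at all. This edits broker internals, so treat it as a stopgap rather than a proper fix.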

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 644 admin
    edited March 2023 #17

    Wow that's wacky. You definitely should not have to go digging around in the internals of the container. What version of SolOS are you running? `show version` should show the uptime as well.

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 644 admin

    @progman just checking back in on this thread. Noticed you already provided the SolOS version on reply #6 above. So ignore my last comment.

    Did you try upgrading the broker? 9.0.1 is a pretty old version. You might have to upgrade to an intermediate version before going to the latest (10.3.1). If you ran into this problem on a newer broker, I'm sure our engineering team would want to take a look at your issue. 9.0.1 being so old, I don't think we'll get much help.