Internal link is not restored after disconnect and reconnect?

Here is my topology:

  • 3 event brokers: 2 on a LAN and 1 in the cloud. The 2 LAN brokers are on separate computers, and one of them runs inside a VirtualBox Ubuntu VM. Both LAN brokers use the Docker image, while the cloud broker uses the AWS Marketplace image.
  • Set up clustering as follows:
  1. Create a cluster with the same name on all 3 brokers via the management console.
  2. Create internal links from both LAN instances to the cloud instance. Create an internal link from the VM LAN broker to the non-VM LAN broker (I could not get joining a cluster to work from the non-VM instance to the VM instance, which is why I am creating the clusters and links manually).
  3. Set up queues with the same name on all brokers and subscribe them to topic test/topic. (On a side note - I am curious why joining a cluster and adding links does not sync queues, settings, etc. across the clustered brokers, since they are linking to the same message VPN? It seems weird that I have to set up identical queues on the other clustered machines. Or maybe I am misunderstanding what is actually happening with clustered brokers - this is entirely likely. :smile:)
  • Send persistent messages via sdkperf from the VM instance - all messages are received at all brokers, as expected, confirmed by using Try Me to subscribe to the local queues as a consumer on each broker.
  • Disconnect the VM from the LAN and send 10 more messages, which are received locally but not at the other 2 instances (as expected). Reconnect the VM machine to the network.
  • The internal link to the cloud instance is restored at both ends after a few seconds, and the messages queued while offline are forwarded to that instance. The link to the other LAN instance, however, is not restored and shows up as a topology issue: the link channel's IP has been reset to 0.0.0.0, and nothing I do restores it. I have tried deleting the link on both sides and re-creating it, but it times out trying to connect. Even after the link is deleted, both machines still report it as a topology issue, so it seems it is not being COMPLETELY deleted (which, I'm guessing, is why re-creating it does not work). I even rebooted the VM instance after reconnecting in case some network connection was being held onto, but the issue persisted.
  • The only solution seems to be to delete the whole cluster on all instances and rebuild it from scratch - not very practical, obviously. Is there something I am missing, or is my setup/machine configuration strange and I've found a bug? Eventually, the two LAN instances will be at different physical locations on different networks, both behind firewalls. My next step is to try that target setup to see whether the issue comes from both brokers being on the same network or from one of them running inside a VM (the target setup will not use a VM).
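
The link state described above can also be inspected over SEMP v2 instead of the management console. Below is a minimal sketch, not a verified procedure: the resource paths are my understanding of the SEMP v2 monitor API and should be checked against your broker version, and the cluster name, node name, base URL, and credentials are all hypothetical placeholders.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def link_diagnostic_paths(cluster_name, remote_node):
    """Build SEMP v2 monitor paths for one DMR link (paths are assumptions)."""
    cluster = quote(cluster_name, safe="")
    node = quote(remote_node, safe="")
    base = f"/SEMP/v2/monitor/dmrClusters/{cluster}"
    return {
        "link": f"{base}/links/{node}",
        "channels": f"{base}/links/{node}/channels",
        "topology_issues": f"{base}/topologyIssues",
    }


def fetch(base_url, path, username, password):
    """GET one SEMP resource with HTTP basic auth and return the parsed JSON."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = Request(base_url + path, headers={"Authorization": f"Basic {token}"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


# Hypothetical names; substitute the real cluster and remote node names.
paths = link_diagnostic_paths("my-cluster", "lan-broker-2")
```

Calling something like `fetch("http://localhost:8080", paths["topology_issues"], "admin", "admin")` on each LAN broker before and after the disconnect would show whether the stale link really is still being reported after it has been deleted.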

Any help or insight would be greatly appreciated! Thank you!

Best Answer

  • TD_asilva
    TD_asilva Member Posts: 13
    #2 Answer ✓

    I have not really solved this issue, but I have found a way forward. I set up a production-representative environment with 2 event brokers at separate sites behind firewalls plus the 1 AWS cloud instance, this time using the process described here: https://docs.solace.com/Configuring-and-Managing/DMR-Examples-Multi-Site.htm. With that process, the test scenario described above worked. The main difference seems to be that those instructions have you create a cluster at each site and create external links between the clusters, whereas I was previously creating a single cluster at one site and then trying to join that cluster from the other sites (which seemed to implicitly create internal links instead of external ones).
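
For anyone following along, here is a rough sketch of what "one cluster per site, external links between them" could look like as SEMP v2 config requests. It only builds the request list and sends nothing. The site names, node names, and addresses are hypothetical, the paths and attribute names (`dmrClusterName`, `remoteNodeName`, `span`, `remoteAddress`) are my understanding of the SEMP v2 config API, and it leaves out authentication, enabling the links, and any per-message-VPN settings that the linked multi-site instructions cover.

```python
from urllib.parse import quote

SEMP_CONFIG = "/SEMP/v2/config"

# Hypothetical sites: each has its own cluster, one broker node, and an address.
SITES = {
    "lan-a": {"cluster": "cluster-lan-a", "node": "broker-lan-a",
              "address": "lan-a.example.com:55555"},
    "lan-b": {"cluster": "cluster-lan-b", "node": "broker-lan-b",
              "address": "lan-b.example.com:55555"},
    "cloud": {"cluster": "cluster-cloud", "node": "broker-cloud",
              "address": "cloud.example.com:55555"},
}


def plan_for_site(site_key, sites=SITES):
    """Return (method, path, body) tuples to run against one site's broker."""
    local = sites[site_key]
    clusters_path = f"{SEMP_CONFIG}/dmrClusters"
    # Each site gets its own cluster rather than joining a shared one.
    plan = [("POST", clusters_path,
             {"dmrClusterName": local["cluster"], "enabled": True})]
    links_path = f"{clusters_path}/{quote(local['cluster'], safe='')}/links"
    for key, remote in sites.items():
        if key == site_key:
            continue
        # span="external" is the key difference from the original setup.
        plan.append(("POST", links_path,
                     {"remoteNodeName": remote["node"], "span": "external"}))
        address_path = f"{links_path}/{quote(remote['node'], safe='')}/remoteAddresses"
        plan.append(("POST", address_path, {"remoteAddress": remote["address"]}))
    return plan


for method, path, body in plan_for_site("lan-a"):
    print(method, path, body)
```

Running `plan_for_site` once per site gives each broker its own cluster plus one external link to each of the other two sites, which mirrors the layout that worked for me.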
