Issue with DMR messages being routed from Local to Remote Broker/VPN

AllenW
AllenW Member Posts: 18 ✭✭✭

Hi All,
Just wanted to check whether anyone else has experienced a similar issue; we have been unable to find the root cause.

A client mentioned an issue with some messages that are published locally in a VPN which DMR-routes to a remote VPN. From what we can see, the DMR bridges are healthy and up, and the cluster queues are empty. To test that DMR is working we used the "Try Me!" tab in the Solace UI: we published to a topic locally and saw the message transported to the remote broker, landing in a queue subscribed to the topic, in less than 2 seconds. The Try Me! publisher had no issue when publishing a direct message; however, persistent messages did not transport.

However, when we tried to publish via SDKPerf and a publishing application, these messages were not DMR'd to the remote broker. We have a temporary queue set up in the local VPN, which received the messages in every one of these tests; they just didn't arrive in the remote VPN's queue.

We tried some combinations of Plain-Text & TLS connections, as well as various transport modes (Direct/Non-Persistent/Persistent); none of these messages arrived in the remote VPN queue.

Plain-Text Persistent:
sdkperf_jms.bat -cip=remotebroker.net:55555 -cu=CLIENT-USERNAME@REMOTE-VPN-01 -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -jcf="JNDI/CF/Mule" -mn=1 -md -mt=persistent


Plain-Text Non-Persistent:
sdkperf_jms.bat -cip=remotebroker.net:55555 -cu=CLIENT-USERNAME@REMOTE-VPN-01 -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -jcf="JNDI/CF/Mule" -mn=1 -md -mt=nonpersistent

Plain-Text Direct:
sdkperf_jms.bat -cip=remotebroker.net:55555 -cu=CLIENT-USERNAME@REMOTE-VPN-01 -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -jcf="JNDI/CF/Mule" -mn=1 -md -mt=direct

TLS Persistent:
sdkperf_jms.bat -cip=smfs://remotebroker.net:55443 -cu=CLIENT-USERNAME@REMOTE-VPN-01 -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -jcf="JNDI/CF/Mule" -mn=1 -md -mt=persistent


TLS Non-Persistent:
sdkperf_jms.bat -cip=smfs://remotebroker.net:55443 -cu=CLIENT-USERNAME@REMOTE-VPN-01 -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -jcf="JNDI/CF/Mule" -mn=1 -md -mt=nonpersistent


TLS Direct:
sdkperf_jms.bat -cip=smfs://remotebroker.net:55443 -cu=CLIENT-USERNAME@REMOTE-VPN-01 -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -jcf="JNDI/CF/Mule" -mn=1 -md -mt=direct

We have already done a few checks but are stuck for answers. We have checked:
* Cluster queues in the local broker do not have any discards.
* Message size is very small, so no issue with scaling restrictions.
* The Clustering / Bridge configs match on both sides.
* There are no restrictions on ACL Profiles for publishing/subscribing to the topics.
* The subscription has no typos / whitespace etc.

Any clue why the DMR transport could be working for the "Try Me!" (needs a name one day) message publishing, while it doesn't work for the Application / SDKPerf publishers? I'm hoping the fact that the Solace publisher works for DMR and SDKPerf doesn't might offer us a clue.
In all tests the messages are arriving at the temp queue in the local VPN.

Thanks!

Answers

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 664 admin

    Hey @AllenW, that sounds like an interesting problem..! I have one quick thought: are you sure the local VPN and/or remote VPN have their max-spool-size configured properly? E.g. that it's not 0 (the default when first creating a new Message VPN), and that the Message VPN's spool isn't fully consumed (i.e. some other queues are filled and taking up all the spool space)..?

    Might explain why Direct is working, but Persistent is not..? Although your temp queues are receiving the message, so that's strange.

    Also, you're using SdkPerf JMS, and the message's Delivery Mode somewhat depends on your connection factory settings. Could you maybe try downloading the JCSMP or C SdkPerf and try the same publishing test?

    I noticed in the SdkPerf commands you posted above that you're publishing into the remote VPN (the -cip= connection string). Shouldn't you be trying to publish into the local VPN and consume on remote?

    Finally: there should be no difference between TLS and non. And the TryMe! publisher webapp is just our JavaScript SMF API, also should be no difference… except as I mentioned, if you're publishing with JMS, and your connection factory is not configured to send non-persistent messages as Direct, then both persistent and non-persistent messages will be sent using Guaranteed transport… and then you wouldn't be able to send a Direct message. (also -mt=direct doesn't work for JMS, only persistent and non-persistent are supported… you'd need to use JCSMP or C or another flavour).

    Finally finally: if you have access to CLI, there are much more verbose stats available for you to look at. Specifically: show message-spool message-vpn blah stats. Lots of ingress/egress discard reasons, promotion stats, etc:

    solace1081b> show message-spool message-vpn default stats
    
    Message VPN:                            default
    ************** Ingress Spool Discard Statistics ***************************
    Spooling Not Ready:                                            0
    Out Of Order Messages:                                         0
    Duplicate Messages:                                            0
    No Eligible Destinations:                                      0
    Spool Over Quota:                                              0
    Queue/Topic-Endpoint Over Quota:                          104737
    Replay-Log Over Quota:                                         0
    Max. Message Usage Exceeded:                                   0
    Max. Message Size Exceeded:                                    0
    Remote Router Spooling Not Supported:                          0
    Spool To ADB Fail:                                             0
    Spool To Disk Fail:                                            0
    Spool File Limit Exceeded:                                     0
    Errored Message:                                               0
    .... etc... 
    

  • AllenW
    AllenW Member Posts: 18 ✭✭✭

    Hi Aaron,

    We had another issue yesterday on the broker where the local VPN resides: it seems we have been victims of SOL-128478. So we wanted to resolve that first, in case a corrupted spool was causing these issues. That was resolved after a spool reset, but this issue is still occurring.

    1. Guaranteed Messaging Stats - VPN Message Spool Stats:

    • Local VPN Messages Queued: 644/310,000 MB (<1%)
    • Remote VPN Messages Queued: 2,176/10,000 MB (22%)

    On top of this, there are no full queues in the remote VPN that could be blocking the flow. I think if that were the case, we would usually see the cluster queue blocked in the local VPN anyway.

    2. Using JCSMP with SDKPerf to bypass the JNDI CF:

    sdkperf_java.bat -cip=localhost:55555 -cu=CLIENT-USERNAME@LOCAL-VPN -cp="password-placeholder" -ptl="t/topic/testing/dmr/routing" -msx=1000000 -mn=6 -md
    

    This worked! Messages arrived in the remote queue. So, I'm guessing this is pointing us at the JNDI CF config perhaps?

    Also yes, sorry: in the previous SDKPerf commands I shared, remotebroker and REMOTE-VPN-01 should have been localhost and the local VPN. This was just an error in my masking of our parameters. My bad! The actual commands I ran have the correct parameters to publish to the topic in the local VPN.

    So based on that, I have looked at Direct Transport for our JNDI CF on the local VPN.

    direct-transport: false
    

    I created a clone of the JNDI CF, but for this one enabled direct transport, and tried again with JMS SDKPerf, and these messages arrived in the remote queue. So based on this, it seems only direct messages are currently arriving in the remote VPN.

    The JNDI CF used by the client is shared by many other publishing applications, so not sure if I can enable this or not. I gather this would mean all messages would then be direct?

    Looking at the clustering settings of both local and remote brokers, "direct messaging only" is set to false. So, I believe this should be allowing us to transport persistent messages too right?

    And here are the message-spool VPN stats: a fair few discards. And I'm pretty sure these stats would be fresh from a message-spool reset yesterday, right?

    Message VPN:                            LOCAL-VPN

    ************** Ingress Spool Discard Statistics ***************************
    Spooling Not Ready: 0
    Out Of Order Messages: 0
    Duplicate Messages: 27
    No Eligible Destinations: 96929
    Spool Over Quota: 0
    Queue/Topic-Endpoint Over Quota: 0
    Replay-Log Over Quota: 0
    Max. Message Usage Exceeded: 0
    Max. Message Size Exceeded: 0
    Remote Router Spooling Not Supported: 0
    Spool To ADB Fail: 0
    Spool To Disk Fail: 0
    Spool File Limit Exceeded: 0
    Errored Message: 0
    Queue Not Found: 0
    Spool Shutdown Discard: 0
    User Profile Deny Guaranteed: 0
    Publisher Not Found: 0
    No Local Delivery Discard: 106963
    TTL Exceeded: 0
    Publish ACL Denied: 0
    Destination Group Error: 0
    Not Compatible With Forwarding Mode: 0
    Low-Priority-Msg Congestion Discard: 0
    Replication Is Standby Discard: 0
    Sync Replication Ineligible Discard: 0
    XA Transaction Not Supported: 0
    Other: 0

    *************** Egress Spool Discard Statistics ***************************
    Messages Deleted: 0
    Messages Expired To Discard: 0
    Messages Expired To DMQ: 0
    Messages Expired To DMQ Failed: 0
    Max Redelivery Exceeded To Discard: 0
    Max Redelivery Exceeded To DMQ: 0
    Max Redelivery Exceeded To DMQ Failed: 0
    TTL Exceeded To Discard: 181

    ********************* Message Processing Statistics ***********************
    Number of Ingress Messages: 166685
    Promoted to Non-Persistent: 14
    Demoted to Direct: 0
    Replicate Promoted: 0
    Async Replicated: 0
    Sync Replicated: 0
    From Replication Mate: 0
    Copied to Replay Log: 0
    Sequenced Topic Matches: 0
    Sequence Number Already Assigned: 0
    Sequence Number Rollover: 0
    Sequence Numbered Messages Discarded: 0
    Transacted Messages Not Sequenced: 0
    Ingress Messages Discarded: 96956
    Messages Spooled to ADB: 69729
    Messages Ingress Selector Examined: 0
    Messages Selector Matched: 0
    Messages Selector Did Not Match: 0
    Messages Egress Selector Examined: 0
    Messages Selector Matched: 0
    Messages Selector Did Not Match: 0
    Egress Messages Discarded: 181
    Number of Egress Messages: 711568
    Redelivered: 482721
    Transport Retransmitted: 0
    Messages Confirmed Delivered: 228847
    Store and Forward: 228847
    Cut-Through: 0
    From Replication Mate: 0
    Request for Redelivery: 0
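
A dump this long is easier to read once the zero counters are filtered out. The following is a minimal sketch in Python, assuming the output keeps the "Name: value" layout and the asterisk-delimited section headers shown above; the function name `nonzero_counters` and the shortened `SAMPLE` text are made up for illustration and are not part of any Solace tooling.

```python
import re

# Section headers look like "******* Ingress Spool Discard Statistics *******".
SECTION_RE = re.compile(r"^\*+\s*(.+?)\s*\*+$")
# Counters look like "Duplicate Messages: 27" (any amount of spacing after the colon).
COUNTER_RE = re.compile(r"^(.+?):\s+(\d+)$")


def nonzero_counters(text):
    """Return (section, counter name, value) for every counter that is not 0."""
    section = None
    found = []
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1)
            continue
        counter = COUNTER_RE.match(line)
        if counter and int(counter.group(2)) != 0:
            found.append((section, counter.group(1), int(counter.group(2))))
    return found


# Shortened excerpt of the stats above, used only to exercise the parser.
SAMPLE = """\
************** Ingress Spool Discard Statistics ***************************
Spooling Not Ready: 0
Duplicate Messages: 27
No Eligible Destinations: 96929
*************** Egress Spool Discard Statistics ***************************
Messages Deleted: 0
TTL Exceeded To Discard: 181
"""

for section, name, value in nonzero_counters(SAMPLE):
    print(f"{section} | {name} = {value}")
```

A list of tuples is returned rather than a dict because some counter names (for example "Messages Selector Matched") appear more than once in the full output.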

    Thanks for your help with this.

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 664 admin

    Hey @AllenW, I really think you should engage with Solace Support to help you resolve this issue, via the usual email channel. Support doesn't monitor Solace Community at all, and those of us here are not as deep on all the product issues and resolutions. You can point them to this thread for context.

    For your #2: SdkPerf JCSMP (and C, and C#) will send messages using Direct as default. If you add the argument -mt=persistent then I'm guessing it will fail. It sounds like you still have some message-spool issue or something weird.

    SdkPerf JMS only supports non-persistent and persistent… and that connection factory setting "Direct transport" only applies to non-persistent messages… you can configure our JMS to send non-persistent messages with either Direct or Guaranteed transport. 95% of the time, most JMS users will leave that disabled, they want Guaranteed transport for non-persistent messages as it provides a higher quality of service, and allows the use of both persistent and non-persistent messages in transactions. I wrote a blog about all this a couple years ago: https://solace.com/blog/direct-messages-and-non-persistent-messages-in-solace/
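
The rule described above (the connection factory's "Direct transport" setting only affects non-persistent messages) can be written down as a small lookup. This is a minimal sketch in Python; the function name `solace_transport` and the string labels are made up for illustration and are not part of any Solace API.

```python
def solace_transport(delivery_mode, direct_transport):
    """Map a JMS delivery mode plus the CF 'Direct transport' flag to a transport.

    Encodes only the rule stated in this thread:
      - persistent messages always use Guaranteed transport
      - non-persistent messages use Direct only when the flag is enabled
    """
    mode = delivery_mode.lower().replace("-", "")
    if mode == "persistent":
        return "guaranteed"
    if mode == "nonpersistent":
        return "direct" if direct_transport else "guaranteed"
    raise ValueError(f"JMS delivery mode must be persistent or non-persistent, got {delivery_mode!r}")


# With direct-transport: false (the shared CF in this thread), both modes are Guaranteed.
for mode in ("persistent", "non-persistent"):
    for flag in (False, True):
        print(f"{mode:15} direct-transport={flag!s:5} -> {solace_transport(mode, flag)}")
```

Read this way, the cloned connection factory with direct transport enabled only changed how non-persistent messages were sent; persistent messages stay on Guaranteed transport either way.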

    "And I'm pretty sure these stats would be fresh from a message-spool reset yesterday right?"

    I don't think so. I think you have to administratively clear the stats for them to reset.

  • AllenW
    AllenW Member Posts: 18 ✭✭✭

    Thanks Aaron, we already have a ticket open for the message-spool issue so I will share this with them, wasn't sure if this was related or not.

    This morning I tested the same JCSMP SDKPerf command and confirmed that with -mt=persistent the messages do not arrive in the remote VPN queue, while with the flag removed it works. So it seems persistent messages are not being DMR'd for some reason.

    Will share some diagnostics with support and see what they come back with.

  • AllenW
    AllenW Member Posts: 18 ✭✭✭

    Thanks for your help with this @Aaron
    We were able to resolve this issue after advice from support.
    For posterity: the cause was a problem with the next message-ids, I'm guessing left over from the previous corrupt message-spool issue and/or the spool reset. From support:

    It now looks like there is a problem with next message-ids.
    

    The DMR bridge rejects the new messages as duplicates because it expects a different message-id.
    

    The fix here was to delete the DMR bridge from both sides. Note that deleting only from the local VPNs didn't resolve this.
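
To make the symptom easier to picture, here is a toy model in Python. It is purely illustrative and is NOT how the broker is implemented: the classes `Sender` and `BridgeReceiver`, and the idea of a single "next expected id" counter, are assumptions made up for this sketch. It only shows how a receiver that remembers old ids would treat messages from a sender whose counter has restarted as duplicates, until the state on both sides is recreated.

```python
class Sender:
    """Hypothetical publisher side: hands out increasing message ids."""

    def __init__(self):
        self.next_id = 1

    def publish(self):
        msg_id = self.next_id
        self.next_id += 1
        return msg_id


class BridgeReceiver:
    """Hypothetical receiving side: rejects ids below the next one it expects."""

    def __init__(self):
        self.next_expected = 1

    def receive(self, msg_id):
        if msg_id < self.next_expected:
            return "duplicate"
        self.next_expected = msg_id + 1
        return "accepted"


sender, receiver = Sender(), BridgeReceiver()
before = [receiver.receive(sender.publish()) for _ in range(3)]

# Only the sending side loses its state (think: spool reset); ids restart at 1.
sender = Sender()
after_reset = [receiver.receive(sender.publish()) for _ in range(2)]

# State recreated on both sides (think: bridge deleted and re-created on both ends).
sender, receiver = Sender(), BridgeReceiver()
after_recreate = [receiver.receive(sender.publish()) for _ in range(2)]

print(before)
print(after_reset)
print(after_recreate)
```

In this toy model, resetting only one side leaves the mismatch in place, which is at least consistent with the observation that deleting the bridge from the local VPNs alone was not enough.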

    Note: This will delete any queued messages in the cluster queue to remote VPN. In our case, there were no messages as all publishers were using persistent and it seems they never arrived in the cluster queue.

    We found this issue also impacted DMR to other regions. So we set up test queues in all remote DMR VPNs to check:

    1. Go to local VPN
    2. Go to bridges > DMR Bridges: Get a list of all the remote VPNs
    3. In all Remote Message VPNs create test queue which is subscribed to the same topic:
      Queue: q/topic/testing/dmr/routing
      Topic: t/topic/testing/dmr/routing
    4. In Local VPN get a Client-Username that can publish from Access Control
    5. Using "Try Me" Publisher, send 1 "direct" message to topic
      then send "persistent". Each queue in remote VPNs should have 2 messages arrive.

    If both direct and persistent messages arrive in the remote VPN queue, no action is required.

    If the persistent message doesn't arrive in the remote VPN queue, we need to do the following actions:

    1. Create CLI script to re-create the DMR bridge, using this template (Or in UI if able):
    enable
    configure
    message-vpn <LOCAL-VPN-NAME>
    dynamic-message-routing
    create dmr-bridge <REMOTE-BROKER-EXTERNAL-LINK>
    remote message-vpn <REMOTE-VPN-NAME>
    exit
    no shutdown
    end

    2. Delete the DMR bridge to affected region in both Local and remote VPN.

    3. Run the CLI script in both regions to restore the DMR bridge

    4. Send another persistent message to validate DMR is now working.

    We had the CLI script ready because we are unable to use the UI connection, and to limit DMR bridge downtime.
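
When several regions are affected, the template from step 1 can be filled in per bridge ahead of time. This is a minimal sketch in Python; the CLI lines are copied verbatim from the template above and have not been independently verified, and the function name `render_bridge_script` and the example values ("LOCAL-VPN", "remote-node-1", "REMOTE-VPN-01", and so on) are hypothetical placeholders.

```python
# CLI template copied from step 1 above; only the three placeholders are substituted.
TEMPLATE = """\
enable
configure
message-vpn {local_vpn}
dynamic-message-routing
create dmr-bridge {remote_link}
remote message-vpn {remote_vpn}
exit
no shutdown
end
"""


def render_bridge_script(local_vpn, remote_link, remote_vpn):
    """Fill in the template for one DMR bridge, rejecting obviously bad values."""
    values = {"local_vpn": local_vpn, "remote_link": remote_link, "remote_vpn": remote_vpn}
    for label, value in values.items():
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty and contain no whitespace: {value!r}")
    return TEMPLATE.format(**values)


# Hypothetical list of affected bridges: (local VPN, remote link, remote VPN).
affected = [
    ("LOCAL-VPN", "remote-node-1", "REMOTE-VPN-01"),
    ("LOCAL-VPN", "remote-node-2", "REMOTE-VPN-02"),
]

for local_vpn, remote_link, remote_vpn in affected:
    print(f"# --- bridge to {remote_vpn} ---")
    print(render_bridge_script(local_vpn, remote_link, remote_vpn))
```

Each rendered script still has to be run on both sides after the bridge has been deleted, as described in steps 2 and 3.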

    Thanks again.

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 664 admin

    Sounds like a tricky one! Glad Support was able to help you resolve it. 👍🏼