Of course it is an “it depends” kind of answer…! It depends on a number of different things, and there are sometimes opposing forces that need to be considered.
Ok, so first off, good stuff checking the ACKs and NACKs coming back from the broker. Some people never catch those events/callbacks, and so could possibly be losing data.
2nd: during an HA failover or temporary network disconnect, no messages will be lost and the API will seamlessly retransmit any messages that were in-transit during the network outage. So that’s one less thing to worry about.
But when your queue is full, or there is a Publish ACL violation, your app can be sent a NACK (aka SessionEvent.RejectedMessageError event). Solace APIs have no retry logic built into them… they are designed to be as lightweight as possible, and leave such “business level” decisions to the app. So, what do you do?
Usually, I would give the options of:
- try to republish to the same place
- publish it to a different place (error handling queue?)
- log it and continue
- throw it away
Honestly, only the 1st one is decent. Usually you get a NACK because (one of) your queue(s) has filled up, and the broker cannot accept the message. Ideally, your publisher should just wait and periodically retry publishing the message, assuming that whatever admin team is managing the broker sees the “queue full” events and goes and fixes the consumer that is responsible. You probably don’t want to try republishing in a tight loop, so implement some back-off, try again after 5 seconds or whatever.
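To make the "wait and periodically retry" idea concrete, here is a minimal sketch in plain Java. It deliberately does not call any Solace API: the `Attempt` interface is a hypothetical stand-in for "publish this message and find out whether the broker ACKed or NACKed it", and the class name, method name, and delay values are all my own assumptions for illustration.

```java
import java.time.Duration;

public class NackRetry {

    /** Hypothetical stand-in for "publish and learn the outcome"; not a Solace API. */
    public interface Attempt {
        /** @return true if the broker ACKed the message, false if it NACKed it */
        boolean publishAndAwaitAck();
    }

    /**
     * Retries the same message with exponential back-off instead of a tight loop.
     *
     * @return the number of attempts used, or -1 if every attempt was NACKed
     *         or the thread was interrupted while waiting
     */
    public static int publishWithBackoff(Attempt attempt, Duration initialDelay,
                                         Duration maxDelay, int maxAttempts) {
        long delayMs = initialDelay.toMillis();
        for (int n = 1; n <= maxAttempts; n++) {
            if (attempt.publishAndAwaitAck()) {
                return n; // ACKed: safe to move on to the next message
            }
            if (n == maxAttempts) {
                break; // out of attempts: no point sleeping again
            }
            try {
                Thread.sleep(delayMs); // back off before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return -1;
            }
            delayMs = Math.min(delayMs * 2, maxDelay.toMillis()); // double, up to the cap
        }
        return -1;
    }
}
```

In a real publisher you would probably start around the 5 seconds mentioned above, and what you do when it returns -1 (alert someone, keep blocking, etc.) is exactly the kind of business decision the API leaves to you.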
The problem with the other 3 options is that if that one message was the only one with an issue, you now have a gap in your data. If the publisher is injecting sequence IDs into the messages then the gap might be detected.
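As a rough illustration of how a consumer might spot such a gap, here is a small sketch. It assumes the publisher stamps each message with a sequence ID that increases by exactly 1; the class and method names are made up for this example and are not part of any Solace API.

```java
import java.util.ArrayList;
import java.util.List;

public class SequenceGapDetector {
    private long nextExpected;

    public SequenceGapDetector(long firstExpected) {
        this.nextExpected = firstExpected;
    }

    /**
     * Records an incoming sequence ID.
     *
     * @return the IDs that were skipped to reach this one (empty if none);
     *         late arrivals and duplicates also return an empty list
     */
    public List<Long> observe(long seq) {
        List<Long> missing = new ArrayList<>();
        if (seq < nextExpected) {
            return missing; // late or duplicate message: nothing newly missing
        }
        for (long s = nextExpected; s < seq; s++) {
            missing.add(s); // every ID between expected and received is a gap
        }
        nextExpected = seq + 1;
        return missing;
    }
}
```

Feeding it 1, 2, 4 would report 3 as missing on the third call. Note that detecting the gap is only half the story: the consumer still has no way to get message 3 back if the publisher logged it and moved on.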
Usually, the standard pattern is to back-pressure (aka stop) the publishers when they hit a full queue or other error, and hope that the problem is transient and dealt with by the broker admins / middleware while the publishers keep retrying.
Now, one other thing to consider… Solace native APIs (C# .NET, JCSMP Java, JMS, etc.) support streaming Guaranteed publishing… the Guaranteed (“AD”) protocol is windowed inside the Solace session, and you can tune how many messages can be “in-flight” between the publisher and the broker, as well as the broker and the consumers. If your AD Window is set to size == 1, then you are essentially doing a blocking publish, waiting for the ACK or NACK to come back to the publishing API before sending the next message to the broker. Usually you don’t get exceptional performance with blocking operations, so setting the AD Window size to be larger (10, 50, up to 255) allows you to publish multiple messages while the ACKs/NACKs return asynchronously. Now, if you have one message in the middle of your stream get NACKed, but others afterwards make it, then you immediately have a hole in your data. The best practice here is to still stop the publisher, keep trying to republish your failed message, and code the consumers to deal with an out-of-order message. At least the out-of-order message won’t be “too far” behind, as it is bound by the size of the AD Publish window.
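One possible way to "code the consumers to deal with an out-of-order message" is a small reorder buffer that holds back later messages until the republished one shows up. This is only a sketch under my own assumptions: messages carry a publisher-assigned sequence ID that increases by 1, payloads are plain strings, and the class name is hypothetical. It does not use any Solace API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ReorderBuffer {
    // Messages that arrived ahead of a missing one, sorted by sequence ID.
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long nextExpected;

    public ReorderBuffer(long firstExpected) {
        this.nextExpected = firstExpected;
    }

    /**
     * Accepts a message and returns every payload that can now be
     * delivered in order (possibly none, possibly several).
     */
    public List<String> accept(long seq, String payload) {
        List<String> ready = new ArrayList<>();
        if (seq < nextExpected) {
            return ready; // already delivered: treat as a duplicate and drop it
        }
        pending.put(seq, payload);
        // Release the longest run of consecutive IDs starting at nextExpected.
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            ready.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
        return ready;
    }

    /** Number of messages currently held back waiting for a gap to fill. */
    public int held() {
        return pending.size();
    }
}
```

Because the publisher stops as soon as it sees the NACK, the number of messages held back should stay roughly within the AD Publish window size, so the buffer does not need to be large. Whether you buffer like this or simply process messages as they arrive and tolerate the reordering depends on what your consumer does with them.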
Phew, that’s enough for now. I hope some other people chime in with their thoughts on handling NACKs…!