Pattern for handling RejectedMessageError events with guaranteed messages?
Suppose I am publishing guaranteed messages. I want to ensure messages are guaranteed in order with at-least-once delivery all the way from the publisher generating the message data to the subscribers consuming that data.
On the publisher side, I can ensure that messages have made it to the appliance by listening for callbacks with event type SessionEvent.Acknowledgement, allowing the publisher to move on and publish the next message.
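For concreteness, here is a minimal sketch of the publisher-side bookkeeping this implies. The `PubEvent` enum and `AckTracker` class are hypothetical stand-ins for illustration only — in the real API the events come from the Solace `SessionEvent` enum, delivered to the session event callback, with the message's correlation key echoed back:

```csharp
// Hypothetical sketch of publisher-side ACK bookkeeping: each sent message's
// correlation key goes into an in-flight set, and the broker's ACK removes it.
// PubEvent mirrors the two session events discussed (Acknowledgement /
// RejectedMessageError); the real values come from the vendor's SessionEvent enum.
using System;
using System.Collections.Concurrent;

enum PubEvent { Acknowledgement, RejectedMessageError }

class AckTracker
{
    private readonly ConcurrentDictionary<object, byte> _inFlight = new();

    public void OnSent(object correlationKey) => _inFlight[correlationKey] = 0;

    // Wire this to the session event callback: the broker echoes the
    // correlation key of the message the event refers to.
    public void OnSessionEvent(PubEvent evt, object correlationKey)
    {
        switch (evt)
        {
            case PubEvent.Acknowledgement:
                _inFlight.TryRemove(correlationKey, out _); // safe to move on
                break;
            case PubEvent.RejectedMessageError:
                // NACK: leave it in-flight so it can be retried or escalated
                Console.WriteLine($"NACK for {correlationKey}");
                break;
        }
    }

    public int InFlightCount => _inFlight.Count;
}
```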
But I'm not sure what a publisher should do if it receives a SessionEvent.RejectedMessageError. Nominally we don't want to move on and publish the next message, because that might break message order. On the other hand, if the failing message can never be published, that would block the entire publishing pipeline.
So I guess this question boils down to the semantics of a SessionEvent.RejectedMessageError event. Should this be interpreted as:
- a transient failure, in which case we might implement some kind of retry mechanism? Or
- a permanent failure for the whole publisher, in which case we should "log and throw", terminate the publisher pipeline and send an alert to an administrator, or
- a permanent failure but only for that message, in which case we can either discard the message, or alert an administrator who can intervene to check whether the message should be discarded, or massaged and then re-attempted?
Or is the answer "it depends", and if so, is there some data supplied in the event that might help decide which approach to take (maybe SessionEventArgs.Info?)
Best Answers
-
Of course it is an "it depends" kind of answer..! It depends on a number of different things, and there are sometimes various opposing forces that need to be considered.
Ok, so first off, good stuff checking the ACKs and NACKs coming back from the broker. Some people never catch those events/callbacks, and so could possibly be losing data.
2nd: during an HA failover or temporary network disconnect, no messages will be lost and the API will seamlessly retransmit any messages that were in-transit during the network outage. So that's one less thing to worry about.
But when your queue is full, or there is a publish ACL violation, your app can be sent a NACK (aka a SessionEvent.RejectedMessageError event). Solace APIs have no retry logic built into them... they are designed to be as lightweight as possible, and leave such "business level" decisions to the app. So, what do you do? Usually, I would give the options of:
- try to republish to the same place
- publish it to a different place (error handling queue?)
- log it and continue
- throw it away
Honestly, only the 1st one is decent. Usually you get a NACK because (one of) your queue(s) has filled up, and the broker cannot accept the message. Ideally, your publisher should just wait and periodically retry publishing the message, assuming that whatever admin team is managing the broker sees the "queue full" events and goes and fixes the consumer that is responsible. You probably don't want to try republishing in a tight loop, so implement some back-off, try again after 5 seconds or whatever.
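That retry-with-back-off loop can be sketched as follows. This is a minimal illustration, not the API itself: `trySend` is a hypothetical stand-in for publishing one message and waiting for its ACK (true) or NACK (false):

```csharp
// Hypothetical sketch: retry a rejected publish with exponential back-off.
// trySend stands in for the real publish + wait-for-ACK/NACK round trip.
using System;
using System.Threading.Tasks;

static class BackoffRetry
{
    // Returns true once the send is ACKed, false if we give up after maxAttempts.
    public static async Task<bool> SendWithBackoffAsync(
        Func<Task<bool>> trySend,              // true = ACK, false = NACK
        int maxAttempts = 5,
        int initialDelayMs = 5000)
    {
        var delay = initialDelayMs;
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            if (await trySend()) return true;  // ACKed: done
            if (attempt == maxAttempts) break; // out of attempts: time to alert an admin
            await Task.Delay(delay);           // back off before retrying
            delay = Math.Min(delay * 2, 30_000); // cap the back-off at 30s
        }
        return false;
    }
}
```

The caller decides what a `false` return means — log and terminate, alert an administrator, etc.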
The problem with the other 3 options is that if that one message was the only one with an issue, you now have a gap in your data. If the publisher is injecting sequence IDs into the messages then the gap might be detected.
Usually, the standard pattern is to back-pressure (aka stop) the publishers when they hit a full queue or other error, and hope that the problem is transient and dealt with by the broker admins / middleware while the publishers keep retrying.
Now, one other thing to consider... Solace native APIs (C# .NET, JCSMP Java, JMS, etc.) support streaming Guaranteed publishing. The Guaranteed ("AD") protocol is windowed inside the Solace session, and you can tune how many messages can be "in-flight" between the publisher and the broker, as well as between the broker and the consumers. If your AD window is set to size == 1, then you are essentially doing a blocking publish, waiting for the ACK or NACK to come back to the publishing API before sending the next message to the broker. Usually you don't get exceptional performance with blocking operations, so setting the AD window size to be larger (10, 50, up to 255) allows you to publish multiple messages while the ACKs/NACKs return asynchronously. Now, if one message in the middle of your stream gets NACKed, but others afterwards make it, then you immediately have a hole in your data. The best practice here is to still stop the publisher, keep trying to republish your failed message, and code the consumers to deal with an out-of-order message. At least the out-of-order message won't be "too far" behind, as it is bounded by the size of the AD publish window.
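To illustrate the windowing trade-off, here is a hypothetical model of an in-flight window using a semaphore. This is not the API's actual internals (the real AD window lives inside the Solace session), just the back-pressure behaviour it produces — a window of 1 degenerates to blocking publish, a larger window lets sends race ahead of their ACKs:

```csharp
// Hypothetical illustration of a publish window: at most `windowSize` messages
// may be unacknowledged at once. windowSize == 1 means blocking publish.
using System;
using System.Threading;
using System.Threading.Tasks;

class WindowedPublisher
{
    private readonly SemaphoreSlim _window;

    public WindowedPublisher(int windowSize) => _window = new SemaphoreSlim(windowSize);

    // send() should complete when the broker ACKs (or NACKs) the message.
    public async Task PublishAsync(Func<Task> send)
    {
        await _window.WaitAsync();   // stalls the publisher when the window is full
        _ = AwaitAck(send);          // let the ACK come back asynchronously
    }

    private async Task AwaitAck(Func<Task> send)
    {
        try { await send(); }
        finally { _window.Release(); } // ACK/NACK frees a window slot
    }
}
```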
Phew, that's enough for now. I hope some other people chime in with their thoughts on handling NACKs..!
-
Hey, I made a video about this! https://youtu.be/5wXv2QqVK3U
Answers
-
Thanks for providing such an extensive opinion on this. I will leave the question unanswered for now in the hope of attracting more opinions... or perhaps it should be turned into a discussion instead?
A "retry for a while with back-off" solution does seem to be a good first step no matter what. Part of your answer speaks to a different question I have on the forum about wrapping the send and callback mechanics into an
async
method by setting aTaskCompletionSource
as theCorrelationKey
on the outgoing message header. Doing this does serialize the sends, but it does so in a non-thread-blocking way. That eliminates the ordering problem without tying up a client thread, since you can respond to a nack'd message before sending any further messages, but as you described it's not great for throughput.I am personally using
System.Threading.Tasks.Dataflow
, so there's a "natural" high performance solution, which is to have the n/acks come back totally independently of the sends (ie, noasync
wrapper around send-ack), and merge them n/acks into the dataflow pipeline using aJoinBlock
. Given your confirmation in my other question that the n/acks always come back in order, that's an ideal use of theJoinBlock
.Of course, that reintroduces the reordering problem if retransmission of nack'd messages is attempted. Hrm. I suppose this is an inherent problem when trying to combine high throughput with guaranteed delivery.
In my particular case, it is probably going to be easiest to err on the side of simplicity. My volume is not high on any one publication, but there might be many publications going out through one session (think many tables changing in one source database). So in my particular case, tying up threads is much more costly than blocking of messages on a single publication. Therefore I can probably get away with serialised async send-ack wrappers, add retry-with-back-off in case of RejectedMessageError events, and block further sends on that topic until the retry succeeds, or terminate the publication pipeline with an alert if too many retries fail. That's how I will write the code initially, at least.

With this solution I would still have an ADWindowSize greater than 1, to allow multiple publishers to send "at the same time", so to speak. Waiting (asynchronously) for acknowledgement before sending the next message (or retrying the current one) would happen per-publisher, but not across the entire session.

But if performance needs to be bumped, I would move to the JoinBlock solution, and then be forced to deal with the ordering issue. I guess I'm kicking that can down the road for now, because it's too damn hard to solve.

(Aside: it looks like editing an already-edited post on the forum causes the edit to be lost right now.)
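As a sketch of that serialised send-ack wrapper: a `TaskCompletionSource` rides along as the correlation key and is completed by the ACK/NACK callback. The `_send` delegate here is a hypothetical stand-in for the real publish call, which would set the key on the outgoing message header:

```csharp
// Sketch of the async send-ack wrapper: the TaskCompletionSource travels as
// the message's correlation key, and the session event callback completes it
// when the ACK or NACK for that specific message arrives.
using System;
using System.Threading.Tasks;

class AsyncSender
{
    // Stand-in for the real publish call; in the actual API the correlation
    // key would be set on the outgoing message header.
    private readonly Action<string, object> _send;

    public AsyncSender(Action<string, object> send) => _send = send;

    public Task SendAsync(string payload)
    {
        var tcs = new TaskCompletionSource();
        _send(payload, tcs);   // tcs rides along as the CorrelationKey
        return tcs.Task;       // completes on ACK, faults on NACK
    }

    // Wire this to the session event callback: the broker echoes the key back.
    public static void OnSessionEvent(bool acked, object correlationKey)
    {
        var tcs = (TaskCompletionSource)correlationKey;
        if (acked) tcs.SetResult();
        else tcs.SetException(new Exception("message rejected (NACK)"));
    }
}
```

Awaiting `SendAsync` before issuing the next send gives the serialised, order-preserving behaviour without blocking a thread.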
-
I suppose another possibility here would be to use the async send-ack wrapper, but use the Send[] API and await Task.WhenAll on the array of sent messages. This achieves "batched" parallelism, but also helps with the reordering problem, because you can employ logic like: "if any message in the batch is rejected, pause and retry the whole batch". Is that better or worse? It depends. If consumers are idempotent, then using this logic they are guaranteed to get all of the messages in order eventually, even if they initially get some out of order. That might be preferable to writing consumer re-sequencing logic in some cases. After all, this is guaranteed at-least-once delivery, not guaranteed exactly-once delivery.
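A sketch of that batched approach, with `sendBatch` as a hypothetical stand-in for the Send[] call that yields one ACK/NACK task per message:

```csharp
// Sketch of batched publishing with whole-batch retry: send an array, await
// all the per-message ACK tasks, and if any message was NACKed, pause and
// resend the entire batch, relying on idempotent consumers to absorb the
// resulting duplicates.
using System;
using System.Linq;
using System.Threading.Tasks;

static class BatchPublisher
{
    public static async Task PublishBatchAsync(
        string[] batch,
        Func<string[], Task[]> sendBatch,  // one ACK/NACK task per message
        int maxAttempts = 3)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                await Task.WhenAll(sendBatch(batch)); // all ACKed: done
                return;
            }
            catch when (attempt < maxAttempts)
            {
                // at least one NACK: back off, then retry the whole batch
                await Task.Delay(100 * attempt);
            }
        }
        throw new Exception($"batch still rejected after {maxAttempts} attempts");
    }
}
```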
-
Awesome vid for a really interesting topic!