Pattern for handling RejectedMessageError events with guaranteed messages?

allmhhuran Member Posts: 41 ✭✭✭

Suppose I am publishing guaranteed messages. I want to ensure messages are guaranteed-in-order with at-least-once-delivery all the way from the publisher which is generating the message data, to the subscribers consuming that data.
On the publisher side, I can ensure that messages have made it to the appliance by listening to the callbacks with an event type SessionEvent.Acknowledgement, allowing the publisher to move on and publish the next message.
But I'm not sure what a publisher should do if it receives a SessionEvent.RejectedMessageError. Nominally we don't want to move on with publishing the next message, because that might imply breaking message order. On the other hand, if the failing message can never be published then that would of course block the entire publishing pipeline.
So I guess this question boils down to the semantics of a SessionEvent.RejectedMessageError event. Should this be interpreted as

  1. a transient failure, in which case we might implement some kind of retry mechanism, or
  2. a permanent failure for the whole publisher, in which case we should "log and throw", terminate the publisher pipeline and send an alert to an administrator, or
  3. a permanent failure but only for that message, in which case we can either discard the message, or alert an administrator who can intervene to decide whether the message should be discarded, or massaged and then re-attempted?

Or is the answer "it depends", and if so, is there some data supplied in the event that might help to decide which approach to take (maybe SessionEventArgs.Info)?

Best Answers

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 508 admin
    edited January 2021 #3 Answer ✓

Answers

  • allmhhuran
    allmhhuran Member Posts: 41 ✭✭✭

    Thanks for providing such an extensive opinion on this. I will leave the question unanswered for now in the hope of attracting more opinions... or perhaps it should be turned into a discussion instead?

    A "retry for a while with back-off" solution does seem to be a good first step no matter what. Part of your answer speaks to a different question I have on the forum about wrapping the send and callback mechanics into an async method by setting a TaskCompletionSource as the CorrelationKey on the outgoing message header. Doing this does serialize the sends, but it does so in a non-thread-blocking way. That eliminates the ordering problem without tying up a client thread, since you can respond to a nack'd message before sending any further messages, but as you described it's not great for throughput.
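
    To make the wrapper idea concrete, here is a minimal sketch. It does not use the real messaging API: FakeSession, its Send method and its callback are hypothetical stand-ins that only mimic "send now, get the ack/nack later along with the correlation key". The part that matters is SendAsync, which hands a TaskCompletionSource to the send as the correlation key and lets the callback complete it.

    ```csharp
    using System;
    using System.Threading.Tasks;

    // Hypothetical stand-in for a messaging session (NOT the Solace API): Send returns
    // immediately and the outcome arrives later through a callback with the correlation key.
    public sealed class FakeSession
    {
        private readonly Action<object, bool> _onOutcome; // (correlationKey, acknowledged)

        public FakeSession(Action<object, bool> onOutcome) => _onOutcome = onOutcome;

        public void Send(string payload, object correlationKey)
        {
            // Simulated broker: answers on another thread and rejects empty payloads.
            Task.Run(() => _onOutcome(correlationKey, payload.Length > 0));
        }
    }

    public static class SendAckDemo
    {
        // The returned task completes when the ack/nack arrives; true means acknowledged.
        public static Task<bool> SendAsync(FakeSession session, string payload)
        {
            var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
            session.Send(payload, tcs); // the TaskCompletionSource travels as the correlation key
            return tcs.Task;
        }

        public static async Task Main()
        {
            // The callback recovers the TaskCompletionSource from the correlation key.
            var session = new FakeSession((key, acknowledged) =>
                ((TaskCompletionSource<bool>)key).TrySetResult(acknowledged));

            foreach (var payload in new[] { "first", "", "third" })
            {
                // Awaiting each send serializes the sends without blocking a thread.
                bool acknowledged = await SendAsync(session, payload);
                Console.WriteLine($"'{payload}': {(acknowledged ? "ack" : "nack")}");
            }
        }
    }
    ```

    In a real publisher the callback would presumably be the session event handler, switching on the event type to decide between completing the task as acknowledged or rejected.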

    I am personally using System.Threading.Tasks.Dataflow, so there's a "natural" high performance solution, which is to have the n/acks come back totally independently of the sends (i.e., no async wrapper around send-ack), and merge the n/acks into the dataflow pipeline using a JoinBlock. Given your confirmation in my other question that the n/acks always come back in order, that's an ideal use of the JoinBlock.
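
    Roughly what I mean, as a sketch only: the message names and the hard-coded outcomes are made up, and it relies on the assumption above that outcomes arrive in the same order as the sends. JoinBlock pairs the n-th item posted to Target1 with the n-th item posted to Target2.

    ```csharp
    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    public static class JoinDemo
    {
        public static async Task Main()
        {
            // Pairs each sent message with its outcome, by arrival order on each target.
            var join = new JoinBlock<string, bool>();

            var report = new ActionBlock<Tuple<string, bool>>(pair =>
                Console.WriteLine($"{pair.Item1}: {(pair.Item2 ? "ack" : "nack")}"));

            join.LinkTo(report, new DataflowLinkOptions { PropagateCompletion = true });

            // Send side: messages are posted as they go out, without waiting for outcomes.
            foreach (var message in new[] { "m1", "m2", "m3" })
                join.Target1.Post(message);

            // Callback side: outcomes are posted as they come back (made-up values here).
            foreach (var acknowledged in new[] { true, false, true })
                join.Target2.Post(acknowledged);

            join.Complete();
            await report.Completion;
        }
    }
    ```

    If the ordering assumption ever failed, the pairs would silently mismatch, so this only works when the n/ack order really is guaranteed.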

    Of course, that reintroduces the reordering problem if retransmission of nack'd messages is attempted. Hrm. I suppose this is an inherent problem when trying to combine high throughput with guaranteed delivery.
    In my particular case, it is probably going to be easiest to err on the side of simplicity. My volume is not high on any one publication, but there might be many publications going out through one session (think many tables changing in one source database). So in my particular case, tying up threads is much more costly than blocking messages on a single publication. Therefore I can probably get away with serialised async send-ack wrappers plus retry-with-back-off in case of RejectedMessageError events: block further sends on that topic until the retry succeeds, or terminate the publication pipeline with an alert if too many retries fail. I'm thinking that's how I will write the code initially at least.
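
    A rough sketch of that plan, under some assumptions: sendAsync is a hypothetical delegate standing in for the async send-ack wrapper (true = acknowledged, false = rejected), and the attempt limit and delays are arbitrary numbers, not recommendations.

    ```csharp
    using System;
    using System.Threading.Tasks;

    public static class RetryDemo
    {
        // Sends one message, retrying rejected attempts with exponential back-off.
        // Returns true once acknowledged, false when maxAttempts is exhausted.
        public static async Task<bool> SendWithRetryAsync(
            Func<string, Task<bool>> sendAsync, string payload,
            int maxAttempts, TimeSpan initialDelay)
        {
            var delay = initialDelay;
            for (int attempt = 1; attempt <= maxAttempts; attempt++)
            {
                if (await sendAsync(payload)) return true;

                if (attempt < maxAttempts)
                {
                    await Task.Delay(delay);                     // waits without blocking a thread
                    delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait each time
                }
            }
            return false;
        }

        public static async Task Main()
        {
            int calls = 0;
            // Hypothetical sender: rejects the first two attempts, then accepts everything.
            Func<string, Task<bool>> flaky = _ => Task.FromResult(++calls > 2);

            foreach (var payload in new[] { "row-1", "row-2" })
            {
                bool acknowledged = await SendWithRetryAsync(
                    flaky, payload, 5, TimeSpan.FromMilliseconds(50));

                if (!acknowledged)
                {
                    // Stop here so later messages cannot overtake the failed one.
                    Console.WriteLine($"giving up on {payload}; alert an administrator");
                    break;
                }
                Console.WriteLine($"{payload} acknowledged after {calls} send call(s) in total");
            }
        }
    }
    ```

    Because the loop does not move to the next payload until the current one is acknowledged, order is preserved per publication at the cost of throughput.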

    With this solution I would still have an ADWindowSize greater than 1, to allow multiple publishers to send "at the same time", so to speak. Waiting (asynchronously) for acknowledgement before sending the next message (or retrying the current one) would be happening per-publisher, but not across the entire session.

    But if performance needs to be bumped, I would move to the JoinBlock solution, and then be forced to deal with the ordering issue. I guess I'm kicking that can down the road for now, because it's too damn hard to solve :smiley:

    (Aside: It looks like edits to an already-edited post on the forum are being lost right now)

  • allmhhuran
    allmhhuran Member Posts: 41 ✭✭✭

    I suppose another possibility here would be to use the async send-ack wrapper, but use the Send[] API, and await Task.WhenAll on the array of sent messages. This achieves "batched" parallelism, but also helps with the reordering problem, because you can employ logic like: "if any message in the batch is rejected, pause and retry the whole batch". Is that better or worse? It depends :smiley: If consumers are idempotent, then with this logic they are guaranteed to get all of the messages in order eventually, even if they initially get some out of order. That might be preferable to writing consumer re-sequencing logic in some cases. After all, this is guaranteed at-least-once delivery, not guaranteed exactly-once delivery.
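
    Sketching the "retry the whole batch" logic, with the same caveats as before: sendAsync is a hypothetical delegate standing in for the async send-ack wrapper, it sends messages one by one rather than through the array send, and the attempt limit and pause are arbitrary.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class BatchDemo
    {
        // Sends every message in the batch, then awaits all outcomes together.
        // If any message is rejected, pauses and resends the whole batch in order.
        public static async Task<bool> SendBatchAsync(
            Func<string, Task<bool>> sendAsync, string[] batch,
            int maxAttempts, TimeSpan pause)
        {
            for (int attempt = 1; attempt <= maxAttempts; attempt++)
            {
                // Select starts the sends in batch order; WhenAll awaits all outcomes.
                bool[] outcomes = await Task.WhenAll(batch.Select(sendAsync));
                if (outcomes.All(acknowledged => acknowledged)) return true;

                if (attempt < maxAttempts) await Task.Delay(pause);
            }
            return false;
        }

        public static async Task Main()
        {
            // Hypothetical sender: rejects "b" the first time it sees it, accepts otherwise.
            var rejectOnce = new HashSet<string> { "b" };
            Func<string, Task<bool>> sender = payload =>
            {
                bool acknowledged = !rejectOnce.Remove(payload);
                Console.WriteLine($"send {payload}: {(acknowledged ? "ack" : "nack")}");
                return Task.FromResult(acknowledged);
            };

            bool delivered = await SendBatchAsync(
                sender, new[] { "a", "b", "c" }, 3, TimeSpan.FromMilliseconds(50));
            Console.WriteLine($"batch delivered: {delivered}");
        }
    }
    ```

    Note that "a" and "c" go out twice in this run, which is exactly why the consumers need to be idempotent for this approach to be safe.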

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 508 admin
    edited January 2021 #7 Answer ✓
  • allmhhuran
    allmhhuran Member Posts: 41 ✭✭✭

    Awesome vid for a really interesting topic!