Exponential backoff missing: should be part of Solace functionality

Robert
Robert Member Posts: 58 ✭✭

I wanted to bring up a topic that I miss in Solace: exponential backoff.

Exponential backoff is the practice of increasing the interval between retries instead of retrying at a constant rate. It matters because a target system may be temporarily unavailable or a resource temporarily overloaded, and widening the interval avoids spamming the target endpoint. Retrying endlessly without exponential backoff seldom makes sense.
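
To make the idea concrete, here is a minimal sketch of how the retry intervals could be computed. The function name `backoff_delays` and the values for `base`, `factor`, and `cap` are illustrative assumptions, not settings of Solace or any other broker.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, jitter=False, rng=None):
    """Yield the wait in seconds before each retry attempt.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The delay grows geometrically but never exceeds the cap.
        delay = min(cap, base * factor ** attempt)
        # Optional jitter: pick a random wait in [0, delay] so that many
        # clients do not all retry at the same moment.
        yield rng.uniform(0, delay) if jitter else delay


print(list(backoff_delays(attempts=8)))
```

With these defaults the schedule is 1, 2, 4, 8, 16 and 32 seconds, after which every further retry waits the capped 60 seconds.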

What is Solace's plan for this?

Handle message failures | Cloud Pub/Sub | Google Cloud (how Google implemented it)

I got concerned when I saw this post, which suggests that Solace is not working on this feature:

An Open Source Approach to Delaying Message Redelivery Attempts - Solace

I see this as a must-have feature for a broker, so it should not be necessary to implement it yourself. It is a great post, but the solution is limited to one language, Java (using Java stream). The solution should be generic, and making it part of the broker would solve that problem.

Many thanks for any views on this topic.

Comments

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 508 admin

    Hi @Robert. Thanks for the post/query. Yeah, there are a couple of different scenarios where exponential back-off timers would work well. I implemented one for reconnections in my JCSMP app... after the initial automated reconnection attempts fail, my app would switch strategies and attempt connections with larger intervals between them... this was mostly to stop the app logs from generating too many entries.

    You could envision a Guaranteed publisher app also wanting to implement a back-off strategy: when trying to send to a queue that is full, it probably doesn't help to continually publish to it in a tight loop.
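
    A publisher-side back-off of this kind could look roughly like the sketch below. It is not tied to any Solace API: `send` stands for whatever function the app uses to publish, and `QueueFullError` is a hypothetical exception standing in for however the client reports a rejected message.

```python
import time


class QueueFullError(Exception):
    """Hypothetical error raised by send() when the queue rejects a message."""


def publish_with_backoff(send, message, base=0.5, factor=2.0, cap=30.0,
                         max_attempts=5, sleep=time.sleep):
    """Call send(message), waiting longer after each rejected attempt.

    Returns the number of attempts used and re-raises the error once
    max_attempts have failed. All defaults are illustrative.
    """
    delay = base
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return attempt
        except QueueFullError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)  # wait instead of publishing in a tight loop
            delay = min(cap, delay * factor)
```

    Passing `sleep` as a parameter keeps the function easy to test, since a test can record the requested waits instead of actually sleeping.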

    The Google article you linked to is for subscribers though, where the broker itself would back-off before attempting redelivery. To quote from the page:

    If [Google] Pub/Sub attempts to deliver a message but the subscriber can't acknowledge it, Pub/Sub will retry sending the message. By default, Pub/Sub will try resending the message immediately. However, the conditions that prevented message acknowledgement may not have had time to change when using immediate redelivery, resulting in the message again not being acknowledged, and the message being continuously redelivered.

    There are some things here that are vague to me... the broker attempts to deliver a message but the subscriber can't acknowledge it. What type of acknowledgement is this? A transport ACK, where the subscriber says it has received the message, or an application ACK, which means it is done with it? How long can an ACK be outstanding? In Solace, consumers can ACK messages minutes or hours later if they really want to, as long as the Flow the message was received on stays intact. IIRC, Google will just redeliver a message if it doesn't get an ACK in a specified time. This can of course lead to duplicate message processing.

    In Solace, there is (currently) no way to NACK a single message, so no way to put a single message back on the queue. The Flow would have to be closed (e.g. unbind from the queue) and then all unACKnowledged messages would get redelivered to the next available consumer. I don't think a back-off delay would make sense unless you could NACK individual messages. (or it just times out like in Pub/Sub, which Solace doesn't do).

    What about FIFO ordering? If you put a message back on the queue, would messages behind it be delivered first now? Or would ordering be maintained, which means everything behind it gets delayed too?

    @tkunnumpurath's implementation in his blog is one way to implement custom handling, build it however you want. But it's in the client space, not done by the broker, so the message being put back on the queue is a copy of the original. And since it gets put on the back, then it might be delayed further behind other messages if the queue is overloaded.

    Anyhow, I agree that this could be a useful feature, but I'd want to ask about how exactly you might see this working. Maybe we can get some Solace Product Managers on this thread to comment too.

  • Robert
    Robert Member Posts: 58 ✭✭

    @Aaron, the handling in Google is as follows, but it depends on the integration method.

    For push (similar to RDP), it relies on the response code of the REST endpoint, the same as Solace.

    So it pushes the messages to the endpoint, and when an error comes back the broker knows it should do a NACK. If no response arrives within a certain period (defined by the acknowledgement deadline, default 10 seconds), it assumes the message was not successfully consumed. That is when retry with exponential backoff becomes active.

    So I assume Solace must do the same, but I am not sure where in RDP the time is defined that decides a delivery was unsuccessful. As you stated, this can certainly result in duplicate messages: if the endpoint is still running and finishes processing the message after, e.g., 10 seconds, Google records a NACK and sends it again even though it was likely processed successfully.
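
    The push behaviour described above can be sketched as two small functions. This is a simplified model, not Google's or Solace's actual implementation: treating any 2xx status as success, and the `min_backoff_s` and `max_backoff_s` values, are assumptions made for the example.

```python
def delivery_outcome(status_code, elapsed_s, ack_deadline_s=10.0):
    """Classify one push attempt as 'ack' or 'nack'.

    status_code is None when the endpoint sent no response at all.
    Assumption for this sketch: any 2xx status counts as success.
    """
    if status_code is None or elapsed_s > ack_deadline_s:
        # No answer within the deadline: treated as not consumed, even if
        # the endpoint is still busy processing the message.
        return "nack"
    return "ack" if 200 <= status_code < 300 else "nack"


def next_retry_delay(failed_attempts, min_backoff_s=10.0, max_backoff_s=600.0):
    """Seconds to wait before redelivery after N consecutive nacks.

    The backoff bounds are illustrative values, not broker defaults.
    """
    return min(max_backoff_s, min_backoff_s * 2 ** (failed_attempts - 1))


# An endpoint that answers 200 only after 12 seconds is still a nack, so the
# message is redelivered although it was probably processed.
print(delivery_outcome(200, elapsed_s=12.0))
```

    The last line shows the duplicate case: the late but successful response is classified as a NACK, so the consumer has to tolerate seeing the same message twice.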

    So I think we really have to distinguish between push and an active listener, which, as you mentioned, in Solace uses a Flow.

    As we normally use RDP, I was not aware of the Flow handling and of not being able to NACK or ACK single messages. I agree; I had assumed that existed, and then the same would make sense. So I have to look into that more deeply in Solace to understand it better.

    Ordering is certainly a challenge, but it is not specific to exponential backoff. It would already be an issue on the first retry. So if order is needed you would put things on hold, and you would likely not use much retry logic, as you need to act immediately to get failed messages handled.

    I am personally not a fan of ordering on the broker, as it is often just one piece in a chain of processes, and keeping order across multiple components is hard. In most cases you can handle order on the consumer side with data sent along with the message (e.g. a timestamp that tells which message is older or newer, which is good enough for most syncs, or a sequence number).
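
    As a rough illustration of consumer-side ordering, the sketch below applies an update only if its sequence number (or timestamp) is newer than the last one applied for the same key. The class name `LatestWins` and the `key`/`seq`/`value` fields are assumptions for the example, not part of any broker API.

```python
class LatestWins:
    """Keep only the newest value per key, regardless of arrival order."""

    def __init__(self):
        self._last_seq = {}  # key -> highest sequence number applied so far
        self.state = {}      # key -> value of the newest message seen

    def apply(self, key, seq, value):
        """Apply the update if it is newer; return False for stale or duplicate messages."""
        if seq <= self._last_seq.get(key, float("-inf")):
            return False
        self._last_seq[key] = seq
        self.state[key] = value
        return True


consumer = LatestWins()
consumer.apply("order-1", 1, "created")
consumer.apply("order-1", 3, "shipped")
consumer.apply("order-1", 2, "paid")   # a late redelivery, ignored
print(consumer.state)
```

    This only fits cases where the newest state is all that matters; if every intermediate message must be processed in order, the consumer would need to buffer out-of-order messages instead of dropping them.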

    I hope we get a workshop around these topics to discuss with the product team.

    Again, many thanks for your feedback and views. It really helps.