Slow subscriber causing solace spool quota blow up

Rajesh · November 2020

Hi Solace experts - I am new to the Solace world and still getting to know things. We have Solace set up in our application platform and it has been running fine for a few weeks. As the number of client applications has grown, we have started to see some slow subscribers in the SolAdmin tool. The delivery mode is set to "DIRECT" and the client apps subscribe to topics like t/env/loc/app/pub and t/env/loc/app/ssn.
In the ideal scenario, when a client application is closed, it unsubscribes from all of its topics and its connections disappear from SolAdmin; all good in this case. However, on some machines we see a slow subscriber: when the apps are set to auto-shutdown at night (say 11:00 p.m.), not all of them close gracefully, and one of the client apps appears under the "slow subscribers" column in SolAdmin.
This caused the message queue to build up overnight and exceed the spool quota, which caused a much bigger problem the next morning for other consumer clients, who couldn't consume data from Solace. Everything had to be restarted to get back to normal.
I'm sure that not all client apps can be expected to unsubscribe and close their connections gracefully.
So could you please tell me how to handle clients that cannot consume messages for whatever reason, and how I can stop Solace from building up the queue and exceeding the spool quota? Ideally, I would like to identify such slow consumers and disconnect them from the topic channel in Solace itself (better still if Solace can disconnect them automatically) and not queue up messages for them, to avoid the spool quota blow-up. Sorry for the long description, but please advise...

Thanks,
Rajesh

TomF · November 2020

Hi @Rajesh, welcome to the wonderful world of Solace!

TL;DR: I don't think you want to persist messages, but you are. Make sure your subscribers are set up to receive DIRECT messages and that they aren't reading from queues or Topic Endpoints.

I think the most important thing to get clear here is the interaction of DIRECT messages, PERSISTENT messages, topics and queues. So my first question is: are you using JMS?
There is some confusing terminology to get to grips with. A DIRECT message is stored in memory - it doesn't survive a broker restart. Note I said DIRECT message - not subscriber. From your description it seems like you don't want subscribers that have closed getting current messages later when they re-connect - in other words you don't want the messages to be queued (persisted). This is what DIRECT messages are meant for. If messages are being queued and quota exceeded, then it's clear your incoming messages are being attracted to a persistent endpoint somewhere - this is what is causing the problem.

So, are your applications getting messages from a topic or a queue? This is where the distinction between JMS and everything else becomes important. JMS has the idea of a durable subscriber. This creates a special type of persistent endpoint called a Topic Endpoint. When the subscriber is off line, the Topic Endpoint will persist messages on that topic. If you aren't using JMS, then you must have a queue endpoint somewhere that is subscribed to your topics.

If you don't use a JMS durable subscriber or a queue, you will get exactly the behaviour you want with DIRECT messages. PubSub+ keeps track of all subscribers, and when it detects a slow subscriber it flags it. If PubSub+ starts to run out of the memory allocated to a client (see Message Delivery Resources), it will start to discard messages for that particular client. We're nice and set a flag for you in subsequent messages (see Message Discard Notification). If things get worse and PubSub+ starts to run out of memory because there are many slow subscribers, it will start to disconnect the slow subscribers (see Egress Buffer Management).

In summary: I don't think you want to persist messages, but you are. Make sure your subscribers are set up to receive DIRECT messages and that they aren't reading from queues or Topic Endpoints.

Rajesh · November 2020

Hi Tom - thanks for the description, and sorry for using confusing terms; I'm still getting used to these.
I checked with my admin and he confirmed that we are not using JMS. What we are using is the "Direct" mode of messaging with non-durable queues, and the client applications are subscribing to topics.
The problem we are facing is this: the client applications run fine during the day, but at the 11:00 pm shutdown, each client application calls Unsubscribe() for all the topics it has subscribed to and then kills the exe.
During this unsubscribe attempt, one of the client instances appears as a slow subscriber in SolAdmin and its connection appears to be open, even though the client app logs show that the Unsubscribe() call completed, the connection was closed and the exe was killed. The connection remains open the next day until we disconnect it manually from SolAdmin.
In the above case, the message spool quota was breached due to the egress discards.
My main concern here is: when clientApp.exe is killed, why do the connections remain open on the Solace side?
How do I ensure that nothing remains on Solace when my exe is closed? We are using SolClient.Messaging.dll to call the Unsubscribe() method in a .Net client application.
How can we configure Solace so that if any client process shows up as a slow subscriber, Solace disconnects that client automatically?
Thanks in advance
Rajesh

TomF · November 2020

@Rajesh don't apologise for our industry's confusing terminology!
Thanks for confirming all of this, I now have a better understanding of what you're seeing. I'm surprised the connection to your closed app is remaining open for so long: there are keepalives at both the TCP level and in the .Net API which should detect the client has disconnected. There is clearly something going on at the network level since that's how PubSub+ detects a slow subscriber (it looks at the network connection and sees congestion to the client.)

So, to answer your question, PubSub+ does not automatically disconnect a slow subscriber unless there are many slow subscribers and it starts to run out of memory. All is not lost, though. PubSub+ has a management and monitoring API called SEMP which we can use to detect this condition and perform the disconnection.

As an example, to return all the clients flagged as slow subscribers in the message VPN, use the following SEMP URI:
"http://<pubsub+ ip:port>/SEMP/v2/monitor/msgVpns/<MsgVpn>/clients?where=slowSubscriber==true"
So for instance for the broker running on my Macbook in Docker:
curl -X GET -u admin:<password> "http://localhost:8080/SEMP/v2/monitor/msgVpns/default/clients?where=slowSubscriber==true"

With this list you can then tell the SEMP API to disconnect them:
"http://<pubsub+ ip:port>/SEMP/v2/action/msgVpns/<MsgVpn>/clients/<client name>/disconnect"
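
The two SEMP calls above could be combined into a small script that an admin runs on a schedule. What follows is only a rough sketch in Python, not an official Solace tool. The function names are made up for illustration, and it assumes that the monitor response is JSON with a "data" list whose entries carry a "clientName" field, and that the disconnect action is invoked with a PUT and an empty JSON body. It also ignores paging of the monitor results. Check these assumptions against the SEMP v2 reference for your broker version before relying on it.

```python
import base64
import json
import urllib.parse
import urllib.request


def monitor_url(base, vpn):
    """URL that lists the clients currently flagged as slow subscribers."""
    vpn_q = urllib.parse.quote(vpn, safe="")
    return f"{base}/SEMP/v2/monitor/msgVpns/{vpn_q}/clients?where=slowSubscriber==true"


def disconnect_url(base, vpn, client_name):
    """URL of the disconnect action for one client.

    Client names can contain characters such as '/' and '#', so the name
    is percent-encoded before it is placed in the path.
    """
    vpn_q = urllib.parse.quote(vpn, safe="")
    name_q = urllib.parse.quote(client_name, safe="")
    return f"{base}/SEMP/v2/action/msgVpns/{vpn_q}/clients/{name_q}/disconnect"


def slow_client_names(response_body):
    """Extract client names from a monitor response (JSON text).

    Assumption: the response has the shape {"data": [{"clientName": ...}, ...]}.
    """
    payload = json.loads(response_body)
    return [client["clientName"] for client in payload.get("data", [])]


def _call(url, user, password, method="GET", body=None):
    """Send one HTTP request with basic auth and return the response text."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, data=body, method=method)
    req.add_header("Authorization", f"Basic {token}")
    if body is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode()


def disconnect_slow_subscribers(base, vpn, user, password):
    """Disconnect every client the broker currently flags as slow."""
    names = slow_client_names(_call(monitor_url(base, vpn), user, password))
    for name in names:
        # Assumption: the disconnect action is a PUT with an empty JSON body.
        _call(disconnect_url(base, vpn, name), user, password, "PUT", b"{}")
    return names
```

For the Docker broker in the example above, the call would look like disconnect_slow_subscribers("http://localhost:8080", "default", "admin", "<password>"), which returns the names of the clients it tried to disconnect.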

This will fix the immediate problem of forcing the disconnection of these rogue clients. Next, we should concentrate on finding out what's happening to cause the problem. I suspect that something is causing the client operating system to not close the TCP connection properly. What we would need to do is perform a packet capture that starts just before 11pm for say 30 minutes, identify which client connection has the problem, and then we can see what's happening at the network level.

Rajesh · November 2020

Hi Tom - thanks for being generous... I see that our admin is using the SEMP tool or similar one to identify the slow clients and disconnect them manually. As you said, we'd want to get to the bottom of this problem.. I did some more analysis in the last 2 days and found that we are closing the connections in the following manner..
flow.Dispose(); // dispose the flow object - sends the UNBIND request to the Solace appliance (I see CLIENT_UNBIND in the Solace event log)
session.Unsubscribe(topic, true); // unsubscribe from a topic and wait for confirmation - I do not see anything in the Solace event log for this
session.Dispose(); // in the Solace event log, I see CLIENT_DISCONNECT and CLIENT_CLOSE_FLOW events after this is called from the client application
context.Instance.Cleanup();

The problem block is below, where the session.Dispose() doesn't get called after the flow.Dispose() which is called right before this block of code..
lock (mPublishConsumers) // Dictionary of <messageType, ITopic>
{
    foreach (var consumerTopic in mPublishConsumers.Values)
    {
        mSession.Unsubscribe(consumerTopic, true);
    }
    mPublishConsumers.Clear();
    canCloseToken.Cancel();
}
mSession.Dispose();
mSession = null;

So I was wondering whether the lock(mPublishConsumers) went into a long wait and never proceeded to session.Dispose(), but I cannot yet see how that could happen from the code.
What if I run session.Unsubscribe() on a separate thread, keep the main thread waiting with a timeout of, say, 10 seconds, and then continue with session.Dispose() on the main thread? This would ensure that session.Dispose() is called, but will it cause any problem on the other thread that is executing session.Unsubscribe()? I am not worried about missing data packets sent on topics that I may not manage to unsubscribe from at this point. I just want to know whether this could cause the connection to be kept open on the Solace appliance.
My goal is to ensure that the client session is closed and disconnected from the Solace appliance.
I am working on getting the packet capture as an option (it takes time).
Thanks again for your comments..

TomF · November 2020

Hi @Rajesh, it's good news you are thinking about disposing of your objects properly.

However, there's still some confusion here. If you are using a flow that means you are either creating a temporary queue, or connecting (binding) to an existing queue. I don't know how your subscription is being called but I suspect it is being called with this flow's endpoint as an argument. This endpoint would then persist messages flowing on this topic, which explains why your persistent spool is filling up.

If you call session.Dispose() the session will be closed; perhaps you could call session.Disconnect() first. If you do this you will get errors from your other threads when they try to access the session, but since these other threads appear to be locked, this probably isn't a concern.

So, I think the approach should be:
1. Try session.Disconnect() and session.Dispose() from your main thread, as a trial, to see whether disconnecting the session properly resolves the problem;
2. Review your use of flows. I don't think you need any flows at all. Have you seen the SolaceSamples .Net TopicSubscriber? If you have a look at that, you can see you create a private void method to receive the messages and pass it to CreateSession(). Then you call Subscribe() on the session. Messages start flowing in to that method; there are no flow objects anywhere.
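
The "bounded unsubscribe, then always dispose" idea from this exchange can be sketched in a language-neutral way. The Python below is only an illustration of the threading pattern, not Solace API code: the session object and its unsubscribe() and dispose() methods are hypothetical stand-ins for the .Net Unsubscribe() and Dispose() calls. As noted above, a worker that is still running when the session is disposed may see errors, which this sketch simply tolerates.

```python
import threading


def shutdown(session, topics, timeout_s=10.0):
    """Try to unsubscribe within timeout_s, then always dispose the session.

    `session` is a hypothetical stand-in that offers unsubscribe(topic)
    and dispose(). Returns True if the unsubscribe worker had finished
    before the timeout, False if it was still running.
    """
    def unsubscribe_all():
        for topic in topics:
            session.unsubscribe(topic)

    # Daemon thread: a stuck unsubscribe must not keep the process alive.
    worker = threading.Thread(target=unsubscribe_all, daemon=True)
    worker.start()
    try:
        worker.join(timeout_s)
        return not worker.is_alive()
    finally:
        # Runs on every path, so the session is closed even if the
        # unsubscribe calls never return.
        session.dispose()
```

The point of the try/finally is that closing the session no longer depends on the unsubscribe loop finishing, which is the failure suspected in the code posted earlier.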

Rajesh · December 2020

Thanks for the inputs @TomF and sorry for late update.. you're right, we are using a temporary queue and queuing up messages on that endpoint. Apparently, we also checked with the solace tech support and they too advised that just calling session.Dispose() is sufficient to close all the connections on client side(no need to explicitly do flow.Dispose() or session.disconnect()). We've made the change and will check the behavior in our Live environment and see if this solves the problem... will confirm it in any case and then we can close this thread.

Thanks.

TomF · December 2020

Hi @Rajesh, ahh, now things are beginning to make sense!

When you disconnect from a temporary queue, after 1 minute the queue is deleted. All the messages stored on that queue are implicitly deleted too. That's why a flow or session that fails to disconnect is such a problem: the temporary queue stays alive and builds up messages while your application is not consuming them.

It might be worth having a look at the quota configured on your temporary queues. It may be possible to reduce this, so that a runaway temporary queue doesn't consume too much spool space.
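
To get a feel for what a sensible quota might be, a back-of-the-envelope calculation can help. The Python below is a rough sketch with made-up numbers. The message rate, message size and quota are hypothetical, not measurements from this thread, and it ignores any per-message overhead the broker adds on the spool.

```python
def spooled_mb(hours, msgs_per_sec, avg_msg_bytes):
    """Approximate payload (in MB) queued for a consumer that reads nothing."""
    total_bytes = hours * 3600 * msgs_per_sec * avg_msg_bytes
    return total_bytes / (1024 * 1024)


def minutes_until_full(quota_mb, msgs_per_sec, avg_msg_bytes):
    """Approximate time until an unread queue reaches its quota."""
    bytes_per_minute = msgs_per_sec * avg_msg_bytes * 60
    return quota_mb * 1024 * 1024 / bytes_per_minute


# Hypothetical load: 1000 msgs/s of 1 KiB each, unread for an 8 hour night.
overnight = spooled_mb(8, 1000, 1024)
# Hypothetical quota of 100 MB on the temporary queue at the same rate.
fill_time = minutes_until_full(100, 1000, 1024)
```

With these illustrative inputs, an unread queue would collect roughly 28 GB overnight, while a 100 MB quota would cap it after under two minutes.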
