Troubleshooting frequent / high amounts of fragmentation on an event broker
Hi all,
Trying to get a better understanding of fragmentation and the potential culprits behind the high fragmentation we're seeing on a particular event broker. Defragmentation runs daily on a schedule (not on a threshold), but within an hour of a successful defrag run the estimated fragmentation is back up to 99%.
Estimated Fragmentation: 97%
Estimated Recoverable Space: 10 MB
Last Result:
Completed On: May 9 2024 00:02:41 UTC ## About 1.5hrs ago
Completion %: 100%
Exit Condition: Success
We are seeing this frequently, although it's worth noting that only 10 MB of recoverable space suggests this is not an enormous number of messages.
Could this be because the total number of messages spooled is rather low, so it doesn't take many fragmented messages to produce a high estimated fragmentation value?
                                          ADB      Disk     Total
Current Persistent Store Usage (MB)    0.0240    0.3082    0.3322
Number of Messages Currently Spooled        9       194       203
Either way, we would like to find the culprit behind the fragmentation, whether it's one particular client or a particular process causing it so frequently.
The spool files that a Solace PubSub+ event broker uses to store Guaranteed messages may become "fragmented" over time when consumers frequently go offline and do not reconnect. Fragmentation can occur because the small number of messages awaiting delivery to those offline consumers are retained, which prevents the removal of the larger number of already-consumed messages that sit in the same spool files.
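For illustration (made-up numbers, not broker internals): if spool files are written in chunks of a few MB each, say 4 MB, a consumer that leaves just 25 messages behind, one per file across 25 different spool files, can keep roughly 100 MB of spool files on disk even though those messages themselves might total only a few kilobytes.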
Is there anything in particular to look for in logs that would lead to fragmentation?
Looking at the event broker, I'm unable to find any specific log events related to fragmentation besides when defragmentation starts or stops.
Best Answer
Hi @AllenW. So yeah, it's because your total number of messages is quite low; there's not much space to free up. Typically, I wouldn't be looking at the fragmentation amount until you start to creep towards your disk limit (not the message-spool limit, but the actual underlying disk). You can see these details in
show message-spool detail
in the CLI. There won't be any logs to point out which publisher or queue is leading to fragmentation.
The biggest cause of fragmentation is when you have Guaranteed consumers of differing rates, and it's the occasional ones that are slow to consume their messages. Essentially (as of today, may change in future?), messages in the spool are written to disk in chunks… several MB in size. As messages are read and ACKed from their queues, individual messages inside these spool files are marked for deletion. But the file as a whole can only be deleted once all messages inside it have been consumed. So if you have a low-volume consumer that leaves its messages on the broker for a long time, then some of these spool files might have 99% of their messages consumed, but perhaps 1% still stuck on a queue somewhere, and hence the broker can't delete the file. This is what we've usually ended up calling the "sparse message-spool problem": the total number of messages spooled is low, but disk utilization is relatively high. This is where defrag helps a lot.
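To make that concrete, here's a tiny toy model (my own sketch, not broker code; all numbers are invented) of how a few stuck messages can pin many spool files:

# Toy model of the "sparse message-spool" effect described above.
# All numbers are illustrative assumptions, not broker internals.
FILE_SIZE_MB = 4        # assumed size of each spool-file chunk
NUM_FILES = 100         # spool files written since the last defrag
STUCK_FILES = 90        # files that each still hold one unconsumed message

# A spool file can only be deleted once every message in it is consumed,
# so each file with even one stuck message stays on disk in full.
reclaimable_files = NUM_FILES - STUCK_FILES
pinned_disk_mb = STUCK_FILES * FILE_SIZE_MB

print(f"Unconsumed messages: {STUCK_FILES}")
print(f"Spool files that can actually be deleted: {reclaimable_files}")
print(f"Disk held by undeletable spool files: {pinned_disk_mb} MB")
print(f"Share of spool files that cannot be reclaimed: {STUCK_FILES / NUM_FILES:.0%}")

So 90 leftover messages, perhaps a few KB of payload in total, keep 360 MB of spool files on disk until defrag repacks them.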
A useful CLI command might be
show queue * sort-by-messages-spooled
. And then check those apps that are leaving messages on their queue. There's also a way to check the date of the oldest message on a queue, but you can't use wildcards for it; you need to run the command separately for each queue, which is fine for occasional investigations but not good for monitoring:
show queue foo message-vpn bar messages oldest detail count 1
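If you do want something to run on a schedule, here's a rough sketch of my own (not an official Solace tool) that pulls per-queue spool numbers over the SEMP v2 monitor API instead of the CLI. The queue collection endpoint is standard SEMP v2, but treat the attribute names (spooledMsgCount, msgSpoolUsage) as assumptions and check the SEMP v2 reference for your broker version:

# Sketch: list queues by spool usage via the SEMP v2 monitor API.
# Attribute names marked "assumed" should be verified against your
# broker version's SEMP v2 reference before relying on them.
import requests

BROKER = "http://localhost:8080"   # SEMP host:port (example)
VPN = "bar"                        # message VPN name (example)
AUTH = ("admin", "admin")          # read-only credentials are enough

url = f"{BROKER}/SEMP/v2/monitor/msgVpns/{VPN}/queues"
resp = requests.get(url, auth=AUTH, params={"count": 100})
resp.raise_for_status()
# For VPNs with more than 100 queues, follow meta.paging.nextPageUri.

rows = []
for q in resp.json().get("data", []):
    rows.append((
        q.get("queueName"),
        q.get("spooledMsgCount"),   # assumed name: messages currently spooled
        q.get("msgSpoolUsage"),     # assumed name: spool usage for the queue
    ))

# Largest spool usage first -- these are the queues most likely to be
# pinning old spool files and driving up the fragmentation estimate.
rows.sort(key=lambda r: (r[2] or 0), reverse=True)
for name, msgs, usage in rows[:20]:
    print(f"{name:40} msgs={msgs} spoolUsage={usage}")

You could wire the output into whatever monitoring you already have and alert on queues whose spooled counts never go down between runs.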
Anyhow… dunno if this post helps or not..? If you're a customer, maybe ping Solace Support to ask for more detailed help?