Troubleshooting frequent / high amounts of fragmentation on an event broker

AllenW Member Posts: 18

Hi all,

I'm trying to get a better understanding of fragmentation and what might be causing high amounts of it on a particular event broker. Defragmentation runs daily on a schedule, not on a threshold. But within an hour of a successful defrag run, the estimated fragmentation is back up to 99%.

Estimated Fragmentation:       97%
Estimated Recoverable Space:   10 MB
Last Result:
  Completed On:                May 9 2024 00:02:41 UTC ## About 1.5hrs ago
  Completion %:                100%
  Exit Condition:              Success

We are seeing this frequently, although it's worth noting that only 10 MB of recoverable space doesn't seem like an enormous number of messages.

Could this be because the total number of messages spooled is rather low, so it doesn't take many fragmented messages to cause a high estimated fragmentation value?

                                               ADB         Disk        Total
Current Persistent Store Usage (MB)         0.0240       0.3082       0.3322
Number of Messages Currently Spooled             9          194          203

Either way, we would like to find the culprit of the fragmentation, whether it is one particular client or a particular process causing it so frequently.

The spool files that a Solace PubSub+ event broker uses to store Guaranteed messages may become "fragmented" over time when consumers frequently go offline and do not reconnect. Fragmentation can occur because the small number of messages awaiting delivery to those offline consumers are maintained, which prevents the larger number of messages that have been consumed that are also on the spool file from being removed.

Is there anything in particular to look for in logs that would lead to fragmentation?

Looking at the event reference (https://docs.solace.com/Admin-Ref/Solace-PubSub-Event-Reference/event_ref_boiler.html), I'm unable to find any specific log events related to fragmentation other than when defragmentation starts or stops.

Best Answer

  • Aaron
    Aaron Member, Administrator, Moderator, Employee Posts: 664 admin
    #2 Answer ✓

    Hi @AllenW. So yeah, it's because your total number of messages is quite low, so there's not much space to free up. Typically, I wouldn't look at the fragmentation figure until you start to creep towards your disk limit (not the message-spool limit, but the actual underlying disk). You can see these details with show message-spool detail in the CLI.
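To put rough numbers on this: the exact formula behind Estimated Fragmentation isn't given in this thread, so the sketch below simply assumes it is approximately recoverable space divided by recoverable plus live spooled data. Under that assumption (and it is only an assumption), the 10 MB and 0.3322 MB figures from the question land in the high 90s, while the same 10 MB of dead space on a much fuller spool would barely register.

```python
def estimated_fragmentation(recoverable_mb: float, live_mb: float) -> float:
    """Assumed model, NOT the broker's documented formula:
    the fraction of spool-file space that is dead (recoverable)."""
    total = recoverable_mb + live_mb
    return recoverable_mb / total if total > 0 else 0.0


# Figures from the question: 10 MB recoverable, 0.3322 MB currently spooled.
small_spool = estimated_fragmentation(10.0, 0.3322)

# Same 10 MB of dead space on a hypothetical broker holding 10 GB of live data.
large_spool = estimated_fragmentation(10.0, 10_000.0)

print(f"small spool: {small_spool:.1%}")
print(f"large spool: {large_spool:.1%}")
```

The point is only that a percentage says little when the denominator is tiny, which is why the absolute disk usage is the more useful number to watch.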

    There won't be any logs to point out which publisher or queue is leading to fragmentation.

    The biggest cause of fragmentation is having guaranteed consumers of differing rates, where the occasional ones are slow to consume their messages. Essentially (as of today; this may change in future), messages in the spool are written to disk in chunks several MB in size. As messages are read and ACKed from their queues, individual messages inside these spool files are marked for deletion, but the file as a whole can only be deleted once all messages inside it have been consumed. So if you have a low-volume consumer that leaves its messages on the broker for a long time, some of these spool files might have 99% of their messages consumed, but the remaining 1% are still stuck on a queue somewhere, and hence the broker can't delete the file. This is what we've usually ended up calling the "sparse message-spool problem": the total number of messages spooled is low, but disk utilization is relatively high. This is where defrag helps a lot.
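The effect described above can be illustrated with a toy model. This is a sketch, not the broker's real on-disk format: it assumes fixed-size spool files counted in messages rather than MB, and a random fraction of messages left unconsumed by slow consumers. All names and parameters (simulate_spool, msgs_per_file, slow_fraction) are made up for illustration.

```python
import random


def simulate_spool(num_messages=100_000, msgs_per_file=1_000,
                   slow_fraction=0.002, seed=1):
    """Toy model: messages are appended to fixed-size spool files, and a
    file can be deleted only when every message in it has been consumed."""
    rng = random.Random(seed)
    # True means the message is still sitting on a slow consumer's queue.
    pending = [rng.random() < slow_fraction for _ in range(num_messages)]
    # Chop the message stream into spool files in arrival order.
    files = [pending[i:i + msgs_per_file]
             for i in range(0, num_messages, msgs_per_file)]
    # A file is "pinned" if at least one of its messages is unconsumed.
    pinned = [f for f in files if any(f)]
    live = sum(sum(f) for f in pinned)   # messages still needed
    held = sum(len(f) for f in pinned)   # messages kept on disk
    return {
        "files_total": len(files),
        "files_pinned": len(pinned),
        "messages_live": live,
        "messages_held_on_disk": held,
        "dead_fraction_in_pinned_files": 1 - live / held if held else 0.0,
    }


if __name__ == "__main__":
    for name, value in simulate_spool().items():
        print(f"{name}: {value}")
```

With a small slow_fraction, most files typically end up pinned by a handful of leftover messages each, so nearly all of the space they hold is dead. That is the kind of spool where defrag recovers a lot.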

    A useful CLI command might be show queue * sort-by-messages-spooled, and then check the apps that are leaving messages on their queues. There's also a way to check the date of the oldest message on a queue, but you can't use wildcards for this: you need to run the command separately for each queue, which is fine for occasional investigations but not good for monitoring:

    show queue foo message-vpn bar messages oldest detail count 1
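If you do want to run that per-queue check across a handful of queues, one low-tech option is to generate the command lines and paste them into a CLI session. A minimal sketch: the command text is exactly the one above, while the function name, the VPN name and the queue names are made up for illustration. How you obtain the queue list or feed the commands to the broker is not covered here.

```python
def build_oldest_message_commands(message_vpn, queue_names):
    """Return one 'oldest message' CLI command per queue.

    The command template is copied from the post above; the queue and
    VPN names passed in are the caller's own (hypothetical here).
    """
    template = "show queue {queue} message-vpn {vpn} messages oldest detail count 1"
    return [template.format(queue=q, vpn=message_vpn) for q in queue_names]


if __name__ == "__main__":
    # Hypothetical queues, e.g. the top few from the sorted queue listing.
    for command in build_oldest_message_commands("bar", ["foo", "orders", "audit"]):
        print(command)
```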
    

    Anyhow… dunno if this post helps or not..? If you're a customer, maybe ping Solace Support to ask for more detailed help?
