Questions on Config-Sync between mates in HA Group and Replication
Hi,
It appears that when config-sync is enabled between HA-group and replication mates, it provisions additional Message VPNs and queues. I have a few questions about this:
- Since it provisions the additional VPN and queues, does that mean config-sync between mates in an HA group and a replication group is asynchronous?
- If it is asynchronous, how durable is it when one of the nodes/sites crashes and fails over to the other node/site? Should we expect message loss?
- If it is synchronous, does it acknowledge only once the message is persisted on the other node/site?
- Is it storage-layer replication or network-layer replication?
Thanks,
Raghu
Best Answer
Yes, if we lose the mate we declare redundancy as down but we carry on accepting messages. Since we can't talk to the backup, message sync stops. When the backup comes back on-line we re-synchronise.
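The behaviour described above can be sketched as a toy simulation. This is illustrative Python, not Solace internals; the class and method names are my own invention. The primary keeps accepting messages while the mate is unreachable, remembers the unreplicated backlog, and replays it when the mate link is restored:

```python
# Toy sketch (not Solace code): primary keeps accepting while the mate
# is down, then re-synchronises the backlog when the mate returns.

class Broker:
    def __init__(self, name):
        self.name = name
        self.spool = []          # simulated message spool

    def persist(self, msg):
        self.spool.append(msg)

class Primary(Broker):
    def __init__(self, name, mate):
        super().__init__(name)
        self.mate = mate
        self.mate_up = True
        self.unsynced = []       # messages the mate has not yet seen

    def publish(self, msg):
        self.persist(msg)        # always accept and persist locally
        if self.mate_up:
            self.mate.persist(msg)      # normal synchronous replication
        else:
            self.unsynced.append(msg)   # redundancy down: track the delta

    def mate_link_lost(self):
        self.mate_up = False     # redundancy declared down

    def mate_link_restored(self):
        for msg in self.unsynced:       # replay the backlog to the mate
            self.mate.persist(msg)
        self.unsynced.clear()
        self.mate_up = True

backup = Broker("backup")
primary = Primary("primary", backup)

primary.publish("m1")            # replicated normally
primary.mate_link_lost()
primary.publish("m2")            # still accepted, not replicated yet
primary.mate_link_restored()     # backlog replayed, spools converge
```

The key point the sketch captures: losing the mate degrades redundancy, not message acceptance, and the delta is reconciled on reconnect.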
Back to the question of your failure scenario with the secondary not having enough space: that's an interesting one, and I haven't tried it. Did you check the redundancy status before attempting to fail over? I'm willing to bet you would have seen that redundancy was down. The reason was probably "AD Not Ready", meaning there was a problem with the guaranteed messaging sub-system on the backup. We would have generated asynchronous alerts telling you this had happened - both the backup having a problem and redundancy going down - hence your question on which alerts to monitor, I'm guessing.
To summarise: Solace prioritises consistency over availability, in CAP theorem terms.
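Put another way, the takeover decision in the scenario discussed above reduces to a simple predicate. This is a hedged sketch of the idea, not actual broker logic; the function name and parameters are illustrative:

```python
# Illustrative only: a consistency-first failover check.
def can_take_activity(standby_synced: bool, standby_ad_ready: bool) -> bool:
    """A standby may become active only if its guaranteed-messaging
    sub-system is healthy AND its spool is in sync with the primary.
    Taking over while out of sync could silently lose messages, so a
    consistency-first system refuses, sacrificing availability."""
    return standby_synced and standby_ad_ready

# The secondary in the disk-space scenario: healthy enough to run,
# but its spool fell behind the primary's, so it must not take over.
can_take_activity(standby_synced=False, standby_ad_ready=True)
```

This is exactly the trade-off in the summary: a standby that cannot guarantee a consistent spool stays standby.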
5 Answers
@raghu the synchronisation across a mate link has to be synchronous, otherwise you end up with windows during which message loss can occur. This is why we recommend you keep the brokers close to each other, since performance can be adversely affected by long mate-link round-trip times. We only acknowledge the producer once the message has been persisted on both the primary and the backup.
We do not use storage-layer replication, which is a poor fit for this problem since the backup broker would have to load the entire storage volume on failover. We replicate on a per-message basis.
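The acknowledgement rule described above can be shown as a minimal Python sketch. It is illustrative, not Solace internals; the names are invented for the example. The producer's ack is released only once both nodes have persisted the message:

```python
# Toy sketch (not Solace code): producer ack withheld until the message
# is persisted on BOTH nodes. Per-message replication means the backup's
# spool is always a usable copy - no volume reload needed on failover.

class Node:
    def __init__(self, name):
        self.name = name
        self.spool = []

    def persist(self, msg) -> bool:
        self.spool.append(msg)   # simulated durable write
        return True

def publish(primary: Node, backup: Node, msg) -> bool:
    """Return True (the producer ack) only once both nodes hold msg."""
    ok_primary = primary.persist(msg)
    ok_backup = backup.persist(msg)   # costs one mate-link round trip
    return ok_primary and ok_backup   # ack gated on both persists

primary, backup = Node("primary"), Node("backup")
acked = publish(primary, backup, "order-42")
```

The extra mate-link round trip on every publish is why broker proximity matters for producer latency.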
Hi @TomF
Thanks for your insights. During my test of the failover behaviour, I simulated a disk-space issue on the secondary while the primary was active, so that under a slow-consumer scenario the secondary's spool could not grow along with the primary's. The primary was able to continue processing independently, and when I then failed the primary on purpose, the secondary did not take over activity, presumably because it was already out of sync with the primary.
Could you please share your insights on this situation?
Which is honoured here: durability or availability?
What would the possible outcome be if the same issue happened because of network degradation between primary and secondary?
What would be the role of the monitoring node in this situation? Does it detect which node is current and help decide which node becomes active?
Thanks