Questions on Config-Sync between mates in HA Group and Replication
Hi,
It appears that when config-sync is enabled between HA-group and replication mates, it provisions additional Message VPNs and queues. I have a few questions about this:
- Since it provisions the additional VPN and queues, does that mean config-sync between mates in an HA group and a replication group is asynchronous?
- If it is asynchronous, how durable is it when one of the nodes/sites crashes and fails over to the other node/site? Should we expect message loss?
- If it is synchronous, does it acknowledge only once the message is persisted on the other node/site?
- Is it storage-layer replication or network-layer replication?
Thanks,
Raghu
Best Answer
Yes, if we lose the mate we declare redundancy as down but we carry on accepting messages. Since we can't talk to the backup, message sync stops. When the backup comes back on-line we re-synchronise.
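The behaviour described above can be sketched as a toy simulation. This is illustrative Python, not Solace internals; the class and method names are my own invention. The primary keeps accepting messages while the mate is unreachable, remembers the unreplicated backlog, and replays it when the mate link is restored:

```python
# Toy sketch (not Solace code): primary keeps accepting while the mate
# is down, then re-synchronises the backlog when the mate returns.

class Broker:
    def __init__(self, name):
        self.name = name
        self.spool = []          # simulated message spool

    def persist(self, msg):
        self.spool.append(msg)

class Primary(Broker):
    def __init__(self, name, mate):
        super().__init__(name)
        self.mate = mate
        self.mate_up = True
        self.unsynced = []       # messages the mate has not yet seen

    def publish(self, msg):
        self.persist(msg)        # always accept and persist locally
        if self.mate_up:
            self.mate.persist(msg)      # normal synchronous replication
        else:
            self.unsynced.append(msg)   # redundancy down: track the delta

    def mate_link_lost(self):
        self.mate_up = False     # redundancy declared down

    def mate_link_restored(self):
        for msg in self.unsynced:       # replay the backlog to the mate
            self.mate.persist(msg)
        self.unsynced.clear()
        self.mate_up = True

backup = Broker("backup")
primary = Primary("primary", backup)

primary.publish("m1")            # replicated normally
primary.mate_link_lost()
primary.publish("m2")            # still accepted, not replicated yet
primary.mate_link_restored()     # backlog replayed, spools converge
```

The key point the sketch captures: losing the mate degrades redundancy, not message acceptance, and the delta is reconciled on reconnect.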
Back to the question of your failure scenario with the secondary not having enough space: that's an interesting one, and I haven't tried it. Did you check the redundancy status before attempting to fail over? I'm willing to bet you would have seen that redundancy was down. The reason was probably "AD Not Ready", meaning there was a problem with the guaranteed messaging sub-system on the backup. We would have generated asynchronous alerts telling you this had happened - both the backup having a problem and redundancy going down - hence your question on which alerts to monitor, I'm guessing.
To summarise: Solace prioritises consistency over availability, in CAP theorem terms.
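Put another way, the takeover decision in the scenario discussed above reduces to a simple predicate. This is a hedged sketch of the idea, not actual broker logic; the function name and parameters are illustrative:

```python
# Illustrative only: a consistency-first failover check.
def can_take_activity(standby_synced: bool, standby_ad_ready: bool) -> bool:
    """A standby may become active only if its guaranteed-messaging
    sub-system is healthy AND its spool is in sync with the primary.
    Taking over while out of sync could silently lose messages, so a
    consistency-first system refuses, sacrificing availability."""
    return standby_synced and standby_ad_ready

# The secondary in the disk-space scenario: healthy enough to run,
# but its spool fell behind the primary's, so it must not take over.
can_take_activity(standby_synced=False, standby_ad_ready=True)
```

This is exactly the trade-off in the summary: a standby that cannot guarantee a consistent spool stays standby.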
5 Answers
@raghu the synchronisation across a mate link has to be synchronous, otherwise you end up with windows during which message loss can occur. This is why we recommend you keep the brokers close to each other, since performance can be adversely affected by long mate-link round-trip times. We only acknowledge the producer once the message has been persisted on both the primary and the backup.
We do not use storage-layer replication, which is a poor fit for this problem since the backup broker would have to load the entire storage volume on failover. We replicate on a per-message basis.
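The acknowledgement rule described above can be shown as a minimal Python sketch. It is illustrative, not Solace internals; the names are invented for the example. The producer's ack is released only once both nodes have persisted the message:

```python
# Toy sketch (not Solace code): producer ack withheld until the message
# is persisted on BOTH nodes. Per-message replication means the backup's
# spool is always a usable copy - no volume reload needed on failover.

class Node:
    def __init__(self, name):
        self.name = name
        self.spool = []

    def persist(self, msg) -> bool:
        self.spool.append(msg)   # simulated durable write
        return True

def publish(primary: Node, backup: Node, msg) -> bool:
    """Return True (the producer ack) only once both nodes hold msg."""
    ok_primary = primary.persist(msg)
    ok_backup = backup.persist(msg)   # costs one mate-link round trip
    return ok_primary and ok_backup   # ack gated on both persists

primary, backup = Node("primary"), Node("backup")
acked = publish(primary, backup, "order-42")
```

The extra mate-link round trip on every publish is why broker proximity matters for producer latency.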
Hi @TomF
Thanks for your insights. During my test of the failover behaviour, I simulated a disk-space issue on the secondary while the primary was active, so that under a slow-consumer scenario the secondary's spool could not grow along with the primary's. The primary was able to continue processing independently, and when I then failed the primary on purpose, the secondary did not take over activity, presumably because it was already out of sync with the primary.
Could you please share your insights on this situation?
Which is honoured here: durability or availability?
What would the possible outcome be if the same issue happened because of network degradation between primary and secondary?
What would be the role of the monitoring node in this situation? Does it detect which node is current and help decide which node becomes active?
Thanks