Python client reconnection failure

sergkr21
sergkr21 Member Posts: 2
edited October 2022 in General Discussions #1

Hi,

I use the Python package solace-pubsubplus v1.2.0 to connect to Solace. My client has the following reconnection settings: reconnect retries = 20 and reconnect retry wait = 3000 ms. Here is my code for that:

messaging_service = MessagingService.builder().from_properties(broker_props) \
    .with_reconnection_retry_strategy(RetryStrategy.parametrized_retry(20, 3000)) \
    .build()

But every time Solace performs a scheduled HA failover, my client fails to reconnect to the server. Below are the error messages from my error handler (it just prints the error reason and message). Even though reconnection is configured, I don't see any reconnection attempts on the client side after the failure. Any ideas how to resolve this issue?

2022-03-23 08:00:05,423 ERROR [error_handlers.py:on_reconnecting():48] - Error on_reconnecting
2022-03-23 08:00:05,424 ERROR [error_handlers.py:on_reconnecting():49] - Error cause: {'caller_description': 'From service event callback', 'return_code': 'Ok', 'sub_code': 'SOLCLIENT_SUBCODE_COMMUNICATION_ERROR', 'error_info_sub_code': 14, 'error_info_contents': 'TCP: Could not read from socket 13, error = Connection reset by peer (104)'}
2022-03-23 08:00:05,424 ERROR [error_handlers.py:on_reconnecting():50] - Message: TCP: Could not read from socket 13, error = Connection reset by peer (104)
2022-03-23 08:00:09,699 ERROR [error_handlers.py:on_reconnecting():48] - Error on_reconnecting
2022-03-23 08:00:09,699 ERROR [error_handlers.py:on_reconnecting():49] - Error cause: {'caller_description': 'From service event callback', 'return_code': 'Ok', 'sub_code': 'SOLCLIENT_SUBCODE_COMMUNICATION_ERROR', 'error_info_sub_code': 14, 'error_info_contents': 'TCP: Could not read from socket 14, error = Connection reset by peer (104)'}
2022-03-23 08:00:09,699 ERROR [error_handlers.py:on_reconnecting():50] - Message: TCP: Could not read from socket 14, error = Connection reset by peer (104)
2022-03-23 08:00:10,082 ERROR [error_handlers.py:on_service_interrupted():53] - Error on_service_interrupted
2022-03-23 08:00:10,082 ERROR [error_handlers.py:on_service_interrupted():54] - Error cause: {'caller_description': 'From service event callback', 'return_code': 'Unknown (503)', 'sub_code': 'SOLCLIENT_SUBCODE_SERVICE_UNAVAILABLE', 'error_info_sub_code': 115, 'error_info_contents': 'Service Unavailable'}
2022-03-23 08:00:10,082 ERROR [error_handlers.py:on_service_interrupted():55] - Message: Service Unavailable
2022-03-23 08:00:13,091 ERROR [error_handlers.py:on_service_interrupted():53] - Error on_service_interrupted
2022-03-23 08:00:13,091 ERROR [error_handlers.py:on_service_interrupted():54] - Error cause: {'caller_description': 'From service event callback', 'return_code': 'Unknown (503)', 'sub_code': 'SOLCLIENT_SUBCODE_SERVICE_UNAVAILABLE', 'error_info_sub_code': 115, 'error_info_contents': 'Service Unavailable'}
2022-03-23 08:00:13,091 ERROR [error_handlers.py:on_service_interrupted():55] - Message: Service Unavailable
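
As a rough check on the timing, the sketch below compares the gap between the first on_reconnecting event and the first on_service_interrupted event in the log above with the window one might expect from parametrized_retry(20, 3000). The 60-second figure is an assumption (20 retries, each waiting the full 3000 ms); the log alone does not show how the API actually spent that time.

```python
from datetime import datetime

# Timestamps copied from the log above (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"
first_reconnecting = datetime.strptime("2022-03-23 08:00:05,423", FMT)
first_interrupted = datetime.strptime("2022-03-23 08:00:10,082", FMT)

# Time between "reconnecting" and "service interrupted".
elapsed_s = (first_interrupted - first_reconnecting).total_seconds()

# ASSUMPTION: 20 retries, each preceded by a full 3000 ms wait.
configured_window_s = 20 * 3000 / 1000

print(f"client gave up after {elapsed_s:.3f} s")
print(f"configured retry window (assumed): {configured_window_s:.0f} s")
```

If that assumption holds, the client stopped far earlier than the configured settings would suggest, which matches the observation that no retries were visible.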


Thanks,

Sergiy

Comments

  • AlexanderJHall
    AlexanderJHall Member Posts: 6
    edited October 2022 #2

    Hi,


    I have a similar issue: every time we flip to HA, our Python client gets disconnected and fails to reconnect. We attempt to reconnect every 3 seconds for 5 minutes using RetryStrategy.parametrized_retry, but to no avail.


    The errors are:

    2022-10-15 07:00:04.882 INFO:
    on_reconnecting. Error cause: {'caller_description': 'From service event callback', 'return_code': 'Ok', 'sub_code': 'SOLCLIENT_SUBCODE_COMMUNICATION_ERROR', 'error_info_sub_code': 14, 'error_info_contents': 'TCP: Could not read from socket 15, error = Connection reset by peer (104)'} Message: TCP: Could not read from socket 15, error = Connection reset by peer (104)
    2022-10-15 07:00:04.908 INFO:
    on_reconnecting. Error cause: {'caller_description': 'From service event callback', 'return_code': 'Ok', 'sub_code': 'SOLCLIENT_SUBCODE_COMMUNICATION_ERROR', 'error_info_sub_code': 14, 'error_info_contents': 'TCP: Could not read from socket 13, error = Connection reset by peer (104)'} Message: TCP: Could not read from socket 13, error = Connection reset by peer (104)
    RETURNING losrttp {t1}losrttp|1665176523219|1665666481128|Peq1IUXQXpyyGxx9ItM4wnQe4uU= None
    2022-10-15 07:05:08,409 [WARNING] solace.messaging.receiver: [_message_receiver.py:270] [[SERVICE: 0x7fb905cdab30] [RECEIVER: 0x7fb9064264a8]] Unable to Receive message. Messaging service disconnected.
    2022-10-15 07:05:08.409 INFO:
    on_service_interrupted. Error cause: {'caller_description': 'From service event callback', 'return_code': 'Unknown (401)', 'sub_code': 'SOLCLIENT_SUBCODE_LOGIN_FAILURE', 'error_info_sub_code': 19, 'error_info_contents': 'Unauthorized'} Message: Unauthorized
    2022-10-15 07:05:08.409 INFO: SolaceQueueReaderSource-id(140432642918048)thread(140434170173248) raised exception Unable to Receive message. Messaging service disconnected.
    2022-10-15 07:05:08.409 INFO: _complete_run raised exception. Exception was Unable to Receive message. Messaging service disconnected.
    

    Traceback (most recent call last):

    snip
    snip
     File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/solace/messaging/receiver/_impl/_message_receiver.py", line 272, in _is_message_service_connected
       raise IllegalStateError(UNABLE_TO_RECEIVE_MESSAGE_MESSAGE_SERVICE_NOT_CONNECTED)
    solace.messaging.errors.pubsubplus_client_error.IllegalStateError: Unable to Receive message. Messaging service disconnected.
    2022-10-15 07:05:08.710 INFO:
    on_service_interrupted. Error cause: {'caller_description': 'From service event callback', 'return_code': 'Unknown (401)', 'sub_code': 'SOLCLIENT_SUBCODE_LOGIN_FAILURE', 'error_info_sub_code': 19, 'error_info_contents': 'Unauthorized'} Message: Unauthorized
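
A small sketch for reading the log above: it measures the time between the first on_reconnecting event and the first on_service_interrupted event, and pulls the sub_code out of the final error cause. The timestamps and the dictionary are copied from the log; the interpretation is only a reading of these few lines, not a diagnosis.

```python
import ast
from datetime import datetime

# Timestamps copied from the log above (dot before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S.%f"
first_reconnecting = datetime.strptime("2022-10-15 07:00:04.882", FMT)
service_interrupted = datetime.strptime("2022-10-15 07:05:08.409", FMT)
elapsed_s = (service_interrupted - first_reconnecting).total_seconds()

# Error cause of the on_service_interrupted event, copied from the log.
final_cause = ast.literal_eval(
    "{'caller_description': 'From service event callback', "
    "'return_code': 'Unknown (401)', "
    "'sub_code': 'SOLCLIENT_SUBCODE_LOGIN_FAILURE', "
    "'error_info_sub_code': 19, 'error_info_contents': 'Unauthorized'}"
)

print(f"{elapsed_s:.3f} s between on_reconnecting and on_service_interrupted")
print("final sub_code:", final_cause["sub_code"])
```

The gap is a little over five minutes, which is in line with the configured retry window, and the session ends on a login failure (401) rather than on the communication error seen at the start. That may be worth including in a support ticket.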
     
    


    The code to build the connection is below.


    from typing import Dict, Optional

    from solace.messaging.config.retry_strategy import RetryStrategy
    from solace.messaging.messaging_service import MessagingService

    # 5 minutes should be enough to ride out a badly timed reboot of the Solace server.
    _KEEP_TRYING_TO_CONNECT_FOR_NO_MORE_THAN_MINUTES = 5
    _INTERVAL_BETWEEN_CONNECTION_ATTEMPTS_MS = 3000
    # Retrying every 3000 ms for 5 minutes gives 5 * 60 / (3000 / 1000) = 100 attempts.
    _MAX_CONNECTION_ATTEMPTS = int(
        _KEEP_TRYING_TO_CONNECT_FOR_NO_MORE_THAN_MINUTES * 60 / (_INTERVAL_BETWEEN_CONNECTION_ATTEMPTS_MS / 1000)
    )

    _DEFAULT_RECONNECT_STRATEGY = RetryStrategy.parametrized_retry(_MAX_CONNECTION_ATTEMPTS,
                                                                   _INTERVAL_BETWEEN_CONNECTION_ATTEMPTS_MS)
    _DEFAULT_CONNECT_STRATEGY = RetryStrategy.parametrized_retry(_MAX_CONNECTION_ATTEMPTS,
                                                                 _INTERVAL_BETWEEN_CONNECTION_ATTEMPTS_MS)


    class SolaceMessagingService:

        # ServiceEventHandler is not shown in this snippet.
        def __init__(self,
                     broker_props: Dict,
                     service_event_handler=ServiceEventHandler(),
                     connection_retry_strategy: Optional['RetryStrategy'] = None,
                     reconnection_retry_strategy: Optional['RetryStrategy'] = None):

            # Note: the reconnection strategy could also be configured using the broker properties object
            self.messaging_service = MessagingService.builder().from_properties(broker_props) \
                .with_reconnection_retry_strategy(reconnection_retry_strategy or _DEFAULT_RECONNECT_STRATEGY) \
                .with_connection_retry_strategy(connection_retry_strategy or _DEFAULT_CONNECT_STRATEGY) \
                .build()
    


    It would be wonderful to hear back with any suggestions!

  • sergkr21
    sergkr21 Member Posts: 2
    edited October 2022 #3

    Hi,

    In my case I resolved this issue by using reconnection parameters 20 and 3000 together with the default connection strategy:

    messaging_service = MessagingService.builder().from_properties(broker_props).with_reconnection_retry_strategy(RetryStrategy.parametrized_retry(20, 3000)).build()
    
    

    But this is what is recommended in the best practice guide:


    4.4. High Availability Failover and Reconnect Retries

    4.4 Reconnect duration should be set to at least 300 seconds when designing applications for High Availability (HA) support.

    All APIs

    When using High Availability (HA) redundant Solace router setup, a failover from one Solace router to its mate will typically occur within 30 seconds. However, an application should attempt to reconnect for at least five minutes. Below is an example of setting the reconnect duration to five minutes using the following session property values:

    • Connect retries = 1
    • Reconnect retries = 5
    • Reconnect retry wait = 3000ms
    • Connect retries per host = 20
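
To see how these values can add up to five minutes, here is a rough sketch of the arithmetic. It assumes a single host and that each reconnect retry makes up to "connect retries per host" attempts with the retry wait between them; that reading is consistent with the guide's numbers, but it is an assumption, not a statement of how the API schedules attempts. The function name and parameters are illustrative only.

```python
def reconnect_window_seconds(reconnect_retries: int,
                             retries_per_host: int,
                             retry_wait_ms: int,
                             hosts: int = 1) -> float:
    """Approximate total time spent reconnecting (hypothetical helper).

    ASSUMPTION: total attempts = reconnect_retries * retries_per_host * hosts,
    and every attempt costs one full retry wait.
    """
    attempts = reconnect_retries * retries_per_host * hosts
    return attempts * retry_wait_ms / 1000


# Values from the guide: 5 reconnect retries, 20 retries per host, 3000 ms wait.
guide_window = reconnect_window_seconds(5, 20, 3000)
print(f"guide settings: about {guide_window:.0f} s")

# For comparison: 20 retries at 3000 ms, ASSUMING one attempt per retry.
short_window = reconnect_window_seconds(20, 1, 3000)
print(f"20 x 3000 ms: about {short_window:.0f} s")
```

Under these assumptions the guide's values give 300 seconds, while 20 retries at 3000 ms alone would cover about 60 seconds.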


    You can find all these properties here:

    solace.messaging.config.solace_properties.transport_layer_properties

    and configure them as part of broker_props.
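
As a sketch, the guide's values could be added to broker_props like this. The string keys are what I believe the constants in transport_layer_properties resolve to, and the mapping from the guide's property names to those constants is also an assumption, so please check both against the module. In real code it is safer to import the constants than to type the strings.

```python
# ASSUMED string values of the constants in
# solace.messaging.config.solace_properties.transport_layer_properties.
# Verify them against the module before relying on this.
reconnect_props = {
    # Connect retries = 1 (assumed constant: CONNECTION_RETRIES)
    "solace.messaging.transport.connection-retries": 1,
    # Reconnect retries = 5 (assumed constant: RECONNECTION_ATTEMPTS)
    "solace.messaging.transport.reconnection-attempts": 5,
    # Reconnect retry wait = 3000 ms (assumed constant: RECONNECTION_ATTEMPTS_WAIT_INTERVAL)
    "solace.messaging.transport.reconnection-attempts-wait-interval": 3000,
    # Connect retries per host = 20 (assumed constant: CONNECTION_RETRIES_PER_HOST)
    "solace.messaging.transport.connection.retries-per-host": 20,
}

# Placeholder for the existing connection settings (host, VPN, credentials).
broker_props = {}
broker_props.update(reconnect_props)

print(broker_props)
```

The resulting broker_props would then be passed to MessagingService.builder().from_properties(broker_props) as in the snippets above.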

  • Tamimi
    Tamimi Member, Administrator, Employee Posts: 531 admin

    Hey @sergkr21 and @AlexanderJHall, thanks for raising this concern. We have a lot of fixes coming up in v1.4.0 of the API, so stay tuned. Some of the issues we saw were around hanging and deadlocks, which might be different from your interruption issues, so can you please raise a ticket with support containing the details of your issue and send it to support@solace.com so we can track it for the next release if it was not already fixed. Thanks for raising the issues you face!