4.10.1 Message Broker Failure Management

Several Smile CDR modules act as consumers of message brokers. For example, the Channel Import module consumes messages from a broker.

During normal operation, a consumer ingests messages from one or more brokers and performs whatever internal processing is necessary. However, if something goes wrong during that processing, messages may be dropped or silently lost.

To mitigate this issue, Smile CDR lets any module that requires an intake channel also specify a retry channel, a failed channel, and several other settings that control retry behaviour:

module.module_name.config.channel.retry.name                          =my-retry-channel
module.module_name.config.channel.retry.delay_milliseconds            =5000
module.module_name.config.channel.retry.maximum_delay_milliseconds    =6000
module.module_name.config.channel.retry.maximum_attempts              =3
module.module_name.config.channel.retry.strategy                      =CONSTANT
module.module_name.config.channel.retry.retriable_exceptions          =ca.uhn.fhir.rest.server.exceptions.InvalidRequestException
module.module_name.config.channel.failed.name                         =my-failure-channel

Setting these properties in a module causes Smile CDR to automatically wrap all message handling in a retry mechanism that follows the rules defined by the properties. All of the settings above must be set (no blank, null, or zero values) for the retry mechanism to be enabled. Below is an outline of how the retry mechanism works.
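The enablement rule can be expressed as a small check over a `java.util.Properties` object. This is a minimal sketch, not Smile CDR's implementation: the class `RetryConfigCheck` and the method `isRetryEnabled` are hypothetical, and treating an unparseable numeric value as unset is an assumption the documentation does not state.

```java
import java.util.List;
import java.util.Properties;

// Hypothetical helper illustrating the documented rule: the retry mechanism is
// enabled only when every setting is present and non-blank and the numeric
// settings are non-zero. This is not Smile CDR's own implementation.
class RetryConfigCheck {

    // Setting names from the property listing, without the module prefix.
    private static final List<String> TEXT_KEYS = List.of(
            "channel.retry.name",
            "channel.retry.strategy",
            "channel.retry.retriable_exceptions",
            "channel.failed.name");

    private static final List<String> NUMERIC_KEYS = List.of(
            "channel.retry.delay_milliseconds",
            "channel.retry.maximum_delay_milliseconds",
            "channel.retry.maximum_attempts");

    static boolean isRetryEnabled(Properties props, String moduleName) {
        String prefix = "module." + moduleName + ".config.";
        for (String key : TEXT_KEYS) {
            String value = props.getProperty(prefix + key);
            if (value == null || value.isBlank()) {
                return false; // missing or blank
            }
        }
        for (String key : NUMERIC_KEYS) {
            String value = props.getProperty(prefix + key);
            if (value == null || value.isBlank()) {
                return false; // missing or blank
            }
            try {
                if (Long.parseLong(value.trim()) == 0) {
                    return false; // zero is treated as "not set"
                }
            } catch (NumberFormatException e) {
                return false; // assumption: an unparseable number counts as not set
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String p = "module.module_name.config.";
        Properties props = new Properties();
        props.setProperty(p + "channel.retry.name", "my-retry-channel");
        props.setProperty(p + "channel.retry.delay_milliseconds", "5000");
        props.setProperty(p + "channel.retry.maximum_delay_milliseconds", "6000");
        props.setProperty(p + "channel.retry.maximum_attempts", "3");
        props.setProperty(p + "channel.retry.strategy", "CONSTANT");
        props.setProperty(p + "channel.retry.retriable_exceptions",
                "ca.uhn.fhir.rest.server.exceptions.InvalidRequestException");
        props.setProperty(p + "channel.failed.name", "my-failure-channel");
        System.out.println(isRetryEnabled(props, "module_name")); // true

        props.setProperty(p + "channel.retry.maximum_attempts", "0");
        System.out.println(isRetryEnabled(props, "module_name")); // false
    }
}
```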

Channel Retry Mechanism

  1. The channel consumer attempts to process the message on the channel.
  2. If processing fails with an exception of a type listed in channel.retry.retriable_exceptions, the message handler sets headers on the message recording the retry count, the timestamp of the first failure, and the timestamp of the last failure. The message is then published to the channel defined in channel.retry.name.
  3. The retry handler consumes messages from the retry channel. For each message, it calculates when the next processing attempt should occur, based on several factors, and then sleeps until it is time to retry processing.
    1. channel.retry.delay_milliseconds is the minimum number of milliseconds between attempts.
    2. channel.retry.strategy determines whether the backoff is constant or exponential. With a delay of 5000 ms, a constant backoff retries every 5 seconds, whereas an exponential backoff retries after 5 seconds, then 10, then 20, and so on.
    3. channel.retry.maximum_delay_milliseconds sets an upper bound on the delay when exponential backoff is used.
  4. If the message has reached the maximum number of attempts set by channel.retry.maximum_attempts, the message handler publishes it to the failed channel defined in channel.failed.name and adds the unhandled exception to the message headers. Smile CDR does not consume messages from this channel; they should be consumed by an external reporting system, or potentially by a dead-letter consumer.
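The retry rules above can be sketched as two pure functions: one computing the delay before a retry, and one deciding where a failed message goes next. This is an illustrative sketch under stated assumptions, not Smile CDR's implementation. The names `RetryPolicySketch`, `delayMillis`, `isRetriable`, `route`, and `Outcome` are hypothetical. The sketch assumes that maximum_attempts counts every processing attempt including the first, that subclasses of a listed exception are also retriable, and that the exponential delay doubles on each retry (matching the 5, 10, 20 second example). The section does not say what happens to a failure that is not retriable, so the sketch simply reports it as `NOT_RETRIABLE`. JDK exceptions stand in for the FHIR exception named in the example configuration.

```java
import java.util.Set;

// Hypothetical sketch of the retry rules described above. The class, enum and
// method names are illustrative only and are not part of Smile CDR's API.
class RetryPolicySketch {

    enum Strategy { CONSTANT, EXPONENTIAL }

    enum Outcome { RETRY_CHANNEL, FAILED_CHANNEL, NOT_RETRIABLE }

    // Delay before retry number `attempt` (1-based). CONSTANT always waits
    // baseDelayMs; EXPONENTIAL doubles the delay for each earlier retry and
    // caps it at maxDelayMs.
    static long delayMillis(int attempt, long baseDelayMs, long maxDelayMs, Strategy strategy) {
        if (strategy == Strategy.CONSTANT) {
            return baseDelayMs;
        }
        long delay = baseDelayMs;
        // Stop doubling as soon as the cap is reached.
        for (int i = 1; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    // True if the exception's class, or (an assumption) any superclass, is
    // named in the retriable_exceptions setting.
    static boolean isRetriable(Throwable failure, Set<String> retriableClassNames) {
        for (Class<?> c = failure.getClass(); c != null; c = c.getSuperclass()) {
            if (retriableClassNames.contains(c.getName())) {
                return true;
            }
        }
        return false;
    }

    // Decides where a message goes after a failed attempt. attemptsSoFar
    // counts every processing attempt including the one that just failed
    // (an assumption about how maximum_attempts is counted).
    static Outcome route(Throwable failure, int attemptsSoFar, int maxAttempts,
                         Set<String> retriableClassNames) {
        if (!isRetriable(failure, retriableClassNames)) {
            // The documentation does not describe this case.
            return Outcome.NOT_RETRIABLE;
        }
        return attemptsSoFar >= maxAttempts ? Outcome.FAILED_CHANNEL : Outcome.RETRY_CHANNEL;
    }

    public static void main(String[] args) {
        // Delays for the first four retries with a 5000 ms base delay.
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("retry " + attempt
                    + ": constant=" + delayMillis(attempt, 5000, 60000, Strategy.CONSTANT)
                    + " exponential=" + delayMillis(attempt, 5000, 60000, Strategy.EXPONENTIAL));
        }
        Set<String> retriable = Set.of("java.lang.IllegalArgumentException");
        System.out.println(route(new IllegalArgumentException("bad"), 1, 3, retriable)); // RETRY_CHANNEL
        System.out.println(route(new NumberFormatException("bad"), 3, 3, retriable));    // FAILED_CHANNEL
        System.out.println(route(new IllegalStateException("bad"), 1, 3, retriable));    // NOT_RETRIABLE
    }
}
```

Keeping the delay calculation and the routing decision free of broker I/O makes both easy to unit-test; a handler built around them would still need to read and update the retry headers, sleep for the computed delay, and publish to the channel named by channel.retry.name or channel.failed.name.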