Batch And Scheduled Jobs
Smile CDR uses several internal mechanisms for executing background tasks. These are:
- Batch Jobs, which execute long-running data processing tasks.
- Clustered Scheduled Jobs, which execute recurring tasks that are cluster-aware.
- Local Scheduled Jobs, which execute recurring tasks that are not cluster-aware.
Batch Jobs are jobs that can process large amounts of data in a distributed way, taking advantage of all available processing power in a cluster to do so.
Batch job processing in Smile CDR leverages an internal framework called Batch2. The Batch2 framework uses a combination of processing channels and database tables in order to distribute work across the cluster.
Batch jobs are divided into a series of steps. Each step has a distinct identifier within the Batch Job definition and performs a specific function with a defined set of inputs and outputs. The first step accepts a set of job parameters as input and produces zero or more work chunks as a result. Subsequent steps accept these work chunks as input, and may produce subsequent work chunks as output, except for the final step which does not. Some jobs have a special kind of final step called a reduction step that prepares a report or aggregates data across all output from the previous step.
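The step model above can be sketched as a simple generator pipeline. This is an illustration of the concept only, not the real Batch2 API; the step names and data shapes are invented for the example.

```python
# Simplified sketch of the Batch2 step model: the first step turns job
# parameters into work chunks, a middle step transforms chunks, and the
# final step consumes chunks without emitting any.

written = []  # records what the final step "writes"

def first_step(job_parameters):
    """Accepts job parameters and yields zero or more work chunks."""
    count = job_parameters["count"]
    size = job_parameters["chunk_size"]
    for i in range(0, count, size):
        yield {"ids": list(range(i, min(i + size, count)))}

def middle_step(chunk):
    """Accepts a work chunk and yields transformed work chunks."""
    yield {"resources": [f"Patient/{rid}" for rid in chunk["ids"]]}

def final_step(chunk):
    """Accepts a work chunk but emits no further chunks (side effects only)."""
    written.append(len(chunk["resources"]))

# Run the pipeline serially; in Smile CDR, each step's chunks would instead
# be distributed across worker processes in the cluster.
for c1 in first_step({"count": 5, "chunk_size": 2}):
    for c2 in middle_step(c1):
        final_step(c2)
```

A reduction step would differ from `final_step` above in that it sees the output of every chunk from the previous step and aggregates it into a single report.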
Using the FHIR Bulk Export Batch Job as an example:
- The first step, called fetch-resources, accepts the bulk export parameters and performs a series of FHIR searches for resource IDs to export. Each time it accumulates 1000 resource IDs (this threshold is configurable), it emits a work chunk to be processed by the next step.
- The second step, called expand-resources, loads the actual resource contents from the database using the resource IDs found in the individual work chunks. This step then outputs work chunks containing the complete resource bodies associated with those resource IDs.
- The third step, called write-to-binaries, creates Binary resources containing the resource bodies found in the input work chunks.
- The fourth step, which is a reduction step, creates a report containing the Binary resource IDs associated with the bulk export. These IDs are provided back to the client who initiated the export job.
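The accumulation behavior of the first step can be sketched as follows. The function below is illustrative only, not Smile CDR's actual implementation; it shows how IDs are buffered and flushed as chunks whenever the configurable threshold is reached.

```python
# Sketch of how fetch-resources might accumulate resource IDs and emit a
# work chunk each time the threshold is reached, flushing any remainder
# as a final, smaller chunk.

def chunk_ids(resource_ids, threshold=1000):
    """Yield lists of at most `threshold` IDs from the input sequence."""
    buffer = []
    for rid in resource_ids:
        buffer.append(rid)
        if len(buffer) == threshold:
            yield buffer
            buffer = []
    if buffer:  # emit the partial chunk left at the end, if any
        yield buffer

# 2500 matching IDs with the default threshold produce chunks of
# 1000, 1000, and 500 IDs.
chunks = list(chunk_ids(range(2500), threshold=1000))
```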
For each work chunk emitted by a job step, two things happen:
- A Work Chunk entry is persisted in the database. Each work chunk is assigned a unique UUID and is stored with all of the data that was emitted by the step.
- A notification message is placed on a Message Broker channel. Depending on the specific job type, this message may be sent immediately, or it may be held back until all work chunks for the current step have completed processing successfully and then sent.
The Message Broker channel used to send and receive work chunk notifications is named batch2-work-notification. Hyphen (-) characters may be replaced with period (.) characters in the name, depending on the Replace Hyphens with Periods setting. Work chunk notification messages contain the UUID associated with the work chunk, but do not contain the associated data. Workers receive these notifications, load the associated data from the database, and then begin processing it.
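This persist-then-notify pattern can be sketched with in-memory stand-ins for the database and the broker channel. The names below are invented for the example; only the shape of the interaction mirrors what the document describes.

```python
# Sketch of the persist-then-notify pattern: chunk data is stored under a
# UUID, only the UUID travels over the message channel, and the worker
# loads the data back from the "database" before processing it.
import queue
import uuid

database = {}            # stands in for the work chunk table
channel = queue.Queue()  # stands in for the broker channel

def emit_work_chunk(data):
    chunk_id = str(uuid.uuid4())
    database[chunk_id] = data  # 1. persist the chunk with its data
    channel.put(chunk_id)      # 2. notify with the UUID only
    return chunk_id

def worker():
    chunk_id = channel.get()   # receive the notification
    data = database[chunk_id]  # load the associated data by UUID
    return chunk_id, data

emitted_id = emit_work_chunk({"ids": [1, 2, 3]})
received_id, received_data = worker()
```

Keeping the payload out of the notification message keeps messages small and lets any worker in the cluster pick up the chunk.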
Assuming that an external message broker has been configured, the use of a message channel allows the server to distribute processing across all Smile CDR processes within the same node.
The following diagram shows the individual steps in the Bulk Export Batch Job. Note that the database and Kafka channels are shown multiple times in order to clearly show the flow of data, but they all refer to the same database table and message channel, respectively.
Optimizing Batch Job Performance
If you are designing a Smile CDR installation that will handle large amounts of data, it is important to consider the following:
- For high performance messaging (including work chunk notification), the use of an external Apache Kafka Message Broker is highly recommended.
- If Kafka is being used, the batch2-work-notification topic should be configured with a Partition Count that is large enough to allow many work chunks to be processed concurrently. For small servers a low partition count such as 4 may be sufficient, but for large clusters with many processes and lots of data a partition count into the hundreds may be needed. Remember that every partition can be processed in parallel so as you add partitions you need to account for additional CPU and RAM needs in the server cluster (either by increasing resources available to existing processes, or by adding additional processes).
- If you want to maximize the throughput of Batch Jobs, the Batch Job Executor: Maximum Thread Count setting should be increased to a level that allows one consumer thread for every partition across the cluster. This setting is applied per process, meaning that a value of 10 causes 10 Kafka consumer threads to be created for batch work notification on each Smile CDR process in the same node. For example, if you have 80 partitions and expect to scale your Smile CDR cluster up to 10 processes, then a Maximum Thread Count value of 8 ensures that a total of 80 threads are available to handle the 80 partitions.
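The sizing arithmetic above generalizes to a simple formula. The helper below is a back-of-the-envelope sketch, not a Smile CDR utility: to give every partition its own consumer thread across the cluster, set the per-process thread count to the partition count divided by the process count, rounded up.

```python
# Per-process Maximum Thread Count needed so that the whole cluster has
# at least one consumer thread per Kafka partition.
import math

def threads_per_process(partition_count, process_count):
    return math.ceil(partition_count / process_count)

threads_per_process(80, 10)  # -> 8, matching the example above
```

Rounding up matters when the counts do not divide evenly; for example, 100 partitions across 8 processes needs 13 threads per process, not 12.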
Clustered Scheduled Jobs
Smile CDR employs a system called Quartz to provide cluster-aware scheduling for recurring jobs. Clustered jobs are scheduled on a set frequency, and will execute on only a single process within the cluster for each occurrence of the scheduling frequency.
Clustered Scheduled Jobs are typically used for maintenance. For example:
- The PurgeExpiredFilesJob job executes once per hour, and deletes any Bulk Export files that are older than the configured Bulk Export file retention.
- The StaleSearchDeletingSvcImpl job executes once per minute, and deletes any cached search results that are older than the Expire Search Results After Minutes threshold.
- The JobMaintenanceScheduledJob job calculates and updates the progress of in-flight and completed Batch Jobs.
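The single-executor guarantee can be sketched with a shared claim: each process tries to claim the occurrence, and only the winner runs the job. This is an in-memory stand-in for illustration only; Quartz itself coordinates through database locks rather than the in-process lock used here.

```python
# Sketch of cluster-aware firing: several "processes" race to claim the
# same scheduled occurrence, and exactly one of them executes the job.
import threading

claimed = set()
claim_lock = threading.Lock()

def try_run_clustered_job(occurrence_id, process_name, executions):
    with claim_lock:
        if occurrence_id in claimed:
            return False           # another process already claimed it
        claimed.add(occurrence_id)
    executions.append(process_name)  # only the claiming process executes
    return True

executions = []
threads = [
    threading.Thread(
        target=try_run_clustered_job,
        args=("2024-01-01T01:00", f"process-{i}", executions),
    )
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one process ran the job for this occurrence.
```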
Local Scheduled Jobs
The Quartz scheduler is also used to schedule non-cluster-aware jobs. These jobs execute on every process within the cluster at the same frequency.
These jobs are typically used to expire internal memory caches and advance processing in internal maintenance jobs.
The Scheduler Thread Count setting is used to control the number of threads that are used to process scheduled jobs.
If this value is set to the default of 4, then 4 threads will be available on each Smile CDR process for executing Clustered Scheduled Jobs, and an additional 4 threads will be available for executing Local Scheduled Jobs.
Because most scheduled jobs are fast and relatively lightweight, it is uncommon to need to modify this setting.