13.24 Batch and Scheduled Jobs 13.26 Bulk Resource Modification

13.25.1Tokenization and Field-Level Encryption

Some organizations, jurisdictions, or projects have policies requiring that specific fields within resources be obfuscated when at rest in the database.

For example, consider the following simple Patient resource:

{
	"resourceType": "Patient",
	"id": 123,
	"name": [ {
		"family": "Simpson",
		"given": [ "Homer" ]
	} ],
	"birthDate": "1956-05-12",
	"gender": "male"
}

In the example above, the complete resource text will be stored in a text field within the database. In addition, for FHIR search indexing, specific strings are extracted from the resource and also stored in dedicated database tables designed to support searching.

Of course, database-level encryption and HTTPS transport security should generally already be used to protect this data, but sometimes additional security is needed for specific fields. Smile CDR Tokenization causes specific elements within the FHIR resource body to be extracted and replaced with "tokens", which are opaque strings serving as placeholders for these elements.

By using the tokenization capability, these strings can be replaced by opaque placeholder strings, which reduces the ability for someone to re-identify the data if they have access to the database.

These tokenized strings can take any form and do not need to match the format of the original datatype. For this reason, they are stored in an extension and the original value is removed. The following example shows a Patient.birthDate element with its value replaced by a tokenized string:

{
	"_birthDate": {
		"extension": [
			{
				"url": "https://smilecdr.com/fhir/ns/StructureDefinition/resource-tokenized-value",
				"valueCode": "cce82123-748d-4597-b52b-9200646ab788"
			}
		]
	}
}

13.25.2Tokenization Algorithm

Smile CDR relies on a user-supplied algorithm for tokenization, and does not provide the tokenization capability (i.e. the actual algorithm used to convert between a plaintext string and a token) directly. This is because tokenization should be external to Smile CDR for separation of concerns.

The tokenization algorithm:

Must be deterministic, meaning that repeated calls for the same input must produce the same output.
May be reversible, in which case the tokenized string will be stored in the database but the data can be detokenized before returning the containing FHIR resource to a client.

13.25.2.1Tokenization Provider

The tokenization algorithm is supplied via an implementation of the ITokenizationProvider interface. This interface provides two methods, one for converting a string into a token and the other for performing the reverse.

You can use the General Purpose Interceptor Demo Project as a starting point for creating your own tokenization provider.

13.25.2.2Example Tokenization Provider

The following example shows a tokenization provider:

/**
 * A simple tokenization provider which tokenizes a few PHI fields within the Patient resource. This spec uses
 * the (completely insecure!) ROT-13 algorithm for tokenization and is intended for demonstration
 * purposes only.
 */
public class ExampleTokenizationProvider implements ITokenizationProvider {

   /**
    * This method is called in order to tokenize one or more source strings. The system tries
    * to provide batches of strings for tokenization so that if an external tokenization
    * service is used, and it supports batching, this capability can be leveraged.
    *
    * @param theRequestDetails The request associated with the tokenization. This object
    *                          contains one or more strings for tokenization.
    * @param theRequests       The requests, which include the specific rule as well as the object being tokenized
    */
   @Override
   public TokenizationResults tokenize(RequestDetails theRequestDetails, TokenizationRequests theRequests) {
      TokenizationResults retVal = new TokenizationResults();
      for (TokenizationRequest request : theRequests) {
         String source = request.getObjectAsString();
         String token = rot13(source);
         retVal.addResult(request, token);
      }
      return retVal;
   }

   /**
    * This method is called in order to convert one or more tokenized strings back into their
    * original source value. This method must return exactly the same value as was originally
    * provided for tokenization. Method is only called if one or more of the configured
    * tokenization rules declare support for de-tokenization.
    */
   @Override
   public DetokenizationResults detokenize(RequestDetails theRequestDetails, DetokenizationRequests theRequests) {
      DetokenizationResults retVal = new DetokenizationResults();
      for (DetokenizationRequest request : theRequests) {
         String token = request.getToken();
         String source = rot13(token);
         retVal.addResult(request, source);
      }
      return retVal;
   }

   /**
    * Implementation of ROT13 obfuscation, based on a solution found
    * here: https://stackoverflow.com/questions/8981296/rot-13-function-in-java
    * This is not intended to be a suitable production tokenization algorithm,
    * it is simply provided as an easy way to demonstrate the concept!
    */
   public static String rot13(String theInput) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < theInput.length(); i++) {
         char c = theInput.charAt(i);
         if (c >= 'a' && c <= 'm') c += 13;
         else if (c >= 'A' && c <= 'M') c += 13;
         else if (c >= 'n' && c <= 'z') c -= 13;
         else if (c >= 'N' && c <= 'Z') c -= 13;
         else if (c >= '0' && c <= '4') c += 5;
         else if (c >= '5' && c <= '9') c -= 5;
         sb.append(c);
      }
      return sb.toString();
   }
}

13.25.3Tokenization Rules

The configured tokenization rules define the set of FHIR data elements which will be tokenized. Essentially, they are a collection of FHIRPath Expressions which will be extracted from resources being stored in the repository and replaced with equivalent tokens.

The rules are configured using the Tokenization Rules Text or Tokenization Rules File settings. The value is a JSON document using the TokenizationRules model.

Each rule must contain a FHIRPath expression, beginning with a resource type. For example, the expression Patient.name.family instructs the module that when storing a Patient resource, each repetition of the Patient.name element must have the family name extracted and replaced with a token.

If a given expression corresponds to a search parameter which is active on the server, that search parameter must also be declared in the rule. See Searching and Tokenization below.

The following example shows a rules collection with several active rules for the Patient resource.

{
	"rules" : [ {
		"description" : "Rule for a path including a search parameter",
		"path" : "Patient.identifier",
		"searchParameter" : "identifier",
		"searchValueNormalization" : "IDENTIFIER",
		"status" : "ACTIVE"
	}, {
		"description" : "Another rule for a path including a search parameter", 
		"path" : "Patient.name.family",
		"searchParameter" : "family",
		"searchValueNormalization" : "STRING",
		"status" : "ACTIVE"
	}, {
		"description" : "Rule for a path with no associated search parameter",
		"path" : "Patient.maritalStatus",
		"status" : "ACTIVE"
	} ]
}

13.25.4Searching and Tokenization

When an element in a resource is tokenized and that element is also used as a search parameter expression, declaring the search parameter as a part of the Tokenization Rule causes the search index to also be tokenized.

For example this means that if you have chosen to tokenize the Patient.name.family element (which is used to support the family Search Parameter) the tokenized string will be indexed instead of the original value. Suppose the configured tokenization algorithm tokenizes the value "Smith" with the token "ABCDEFG". When performing a search using this parameter, the value being searched for will also be tokenized in order to ensure that values can still be found.

To make this work, Smile CDR automatically creates internal SearchParameter resources with the same name as the original SearchParameter but with the suffix -tokenized. Therefore, if a FHIR client performs a search for Patient?family=smith, the search term will be automatically tokenized and the search will be treated as Patient?family-tokenized=ABCDEFG.

13.25.4.1Search Normalization

If you need to support searching on a tokenized value, you may need to declare a normalization rule in order for the search to behave in the way a client would expect. Several normalization modes are available:

NONE – Use this value if no normalization is needed (this is the default)
STRING – The search index string is normalized according to the FHIR string normalization rules (case-insensitive, diacritics are removed, etc.). This mode should only be used if the expression refers to a primitive string, and the search parameter is of type string. For example, the expression Patient.name.family.
IDENTIFIER – This mode should be selected if the expression points to an element of type Identifier and the search parameter is of type token. For example, the expression Patient.identifier.
CODEABLECONCEPT – This mode should be selected if the expression points to an element of type CodeableConcept and the search parameter is of type token. For example, the expression Observation.code.

13.25.4.2Limitations

Note the following limitations on searches which use tokenized values:

The feature works by creating internal search parameters using a name suffixed with -tokenized. For example, if you are tokenizing the Patient.identifier element, an internal search parameter called identifier-tokenized will be created by the system in order to support searches by identifier. These internal parameters may not be used directly by FHIR clients.
When searching using a SearchParameter of type string:
- The search always searches for a "normalized exact match". This means that the search is case-insensitive and ignores accents and other diacritic marks, but must match on all characters (i.e. no left match).
When searching using a SearchParameter of type token:
- No modifiers (e.g. `:in') may be used. If the element being indexed includes a system and a value, both must be specified in the search URL.
- It is only possible to search for identifiers which have both a system and a value. This means that it is not possible to search for an identifier which has only a system (Identifier.system) or a value (Identifier.value).

13.25.5Tokenization and Validation

When using either Repository Validation or Endpoint Validation with this feature, resources are validated prior to tokenization.

This means that tokenization will not cause validation failures if a mandatory data element is then removed and tokenized. However, if a non-reversible tokenization algorithm is chosen it could mean that the resource will no longer meet the same requirements when it is returned.

13.25.6Other Limitations

The FHIR PATCH operation is not currently supported when Tokenization is enabled.

13.25.7Updating Tokenization Rules

If it is necessary to change your tokenization rules, you may make changes to your rules file and restart your FHIR storage module at any time. However, this may cause issues if you have existing data stored in your repository which was tokenized under the previous rules.

Smile CDR provides the Update Tokenization operation ($sdh.update-tokenization) that can be used to apply the new rules to any existing data in the repository.

13.25.7.1Adding New Tokenization Rules

When a new tokenization rule is added, any resources that are added or modified will be tokenized using the new rules. Any existing data will still be available and will be returned with the original non-tokenized values. New data will be detokenized on access and will be returned with the original non-tokenized values as well (assuming the tokenization algorithm is reversible and the rule is configured to actually detokenize).

If the new rule includes a search parameter, searches using this parameter will only find resources added or modified after the rule was added.

The Update Tokenization operation can be used to apply the new rules to existing data, after which all data will be tokenized, and search parameters will be able to find all applicable resources.

13.25.7.2Removing Tokenization Rules

If a tokenization rule is no longer needed, it should generally not be immediately removed if any data is already stored with tokenized data. Once a rule is removed, existing tokenized values will be returned without being de-tokenized.

In order to remove a rule, you should use the following steps:

Change the rule status value from ACTIVE to QUIESCE, and restart the FHIR Storage module. This means that the rule will no longer be used for tokenization, but existing tokenized data will still be detokenized.
Invoke the Update Tokenization operation for the affected resources.
Wait for the Update Tokenization operation to complete. Important: this means that the background task has completed, not just that the initial HTTP request has returned successfully.
The rule can now be removed from the rules definition or changed to DISABLED status. Old rules with a status of QUIESCE do not strictly need to be removed or disabled, but they do add a tiny amount of processing for relevant resource reads so it is good to remove them eventually.

13.25.8Update Tokenization Operation ($sdh.update-tokenization)

Initiating an Update Tokenization job requires the FHIR_OP_UPDATE_TOKENIZATION permission.

The Update Tokenization operation can be used to apply the new rules to any existing data in the repository. This operation is invoked by executing an HTTP POST request to the $sdh.update-tokenization operation.

This operation follows the FHIR Asynchronous Interaction Request Pattern. This means that the operation will return immediately, and will return a response containing a URL to poll for the status of the operation. It can also be monitored using the Batch Job Endpoint or through the Web Admin Console.

13.25.8.1Input Parameters

The Update Tokenization operation accepts a Parameters resource as input. All Common Bulk Modification Request Parameters are supported.

13.25.8.2Example Request

The following example shows an HTTP POST request to the $sdh.update-tokenization operation:

POST /$sdh.update-tokenization
Prefer: respond-async
Content-Type: application/fhir+json

{
	"resourceType": "Parameters",
	"parameter": [
		{
			"name": "url",
			"valueString": "Patient?"
		}
	]
}

13.25.8.3Example Response

The response will contain a URL to poll for the status of the operation:

202 Accepted
Content-Location: http://example.org/$sdh.update-tokenization-status?_jobId=33b280a7-1715-4d38-a7d4-6f5d1c705551
Content-Type: application/fhir+json;charset=utf-8

{
	"resourceType":"OperationOutcome",
	"issue":[
		{
			"severity":"information",
			"code":"informational",
			"diagnostics":"$sdh.update-tokenization job has been accepted. Poll for status at the following URL: http://localhost:8000/$hapi.fhir.bulk-patch.poll-for-status?_jobId=33b280a7-1715-4d38-a7d4-6f5d1c705551"
		}
	]
}

13.24 Batch and Scheduled Jobs 13.26 Bulk Resource Modification