11.24.1 Tokenization and Field-Level Encryption

Some organizations, jurisdictions, or projects have policies requiring that specific fields within resources be obfuscated when at rest in the database.

For example, consider the following simple Patient resource:

{
	"resourceType": "Patient",
	"id": 123,
	"name": [ {
		"family": "Simpson",
		"given": [ "Homer" ]
	} ],
	"birthDate": "1956-05-12",
	"gender": "male"
}

In the example above, the complete resource text will be stored in a text field within the database. In addition, for FHIR search indexing, specific strings are extracted from the resource and also stored in dedicated database tables designed to support searching.

Of course, database-level encryption and HTTPS transport security should generally already be used to protect this data, but sometimes specific fields require additional protection. Smile CDR Tokenization extracts specific elements from the FHIR resource body and replaces them with "tokens": opaque strings that serve as placeholders for the original values.

Replacing sensitive values with opaque tokens reduces the ability of anyone with access to the database to re-identify the data.

These tokenized strings can take any form and do not need to match the format of the original datatype. For this reason, they are stored in an extension and the original value is removed. The following example shows a Patient.birthDate element with its value replaced by a tokenized string:

{
	"_birthDate": {
		"extension": [
			{
				"url": "https://smilecdr.com/fhir/ns/StructureDefinition/resource-tokenized-value",
				"valueCode": "cce82123-748d-4597-b52b-9200646ab788"
			}
		]
	}
}

11.24.2 Tokenization Algorithm

Smile CDR relies on a user-supplied algorithm for tokenization; it does not directly provide the tokenization capability (i.e. the actual algorithm used to convert between a plaintext string and a token). Keeping tokenization external to Smile CDR preserves separation of concerns.

The tokenization algorithm:

  • Must be deterministic, meaning that repeated calls for the same input must produce the same output.
  • May be reversible, in which case the tokenized string will be stored in the database but the data can be detokenized before returning the containing FHIR resource to a client.
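To illustrate these two properties, the following sketch implements a deterministic, reversible token vault backed by in-memory maps. The class and method names here are hypothetical and are not part of the Smile CDR API; a production implementation would typically delegate to an external tokenization service or a persistent vault.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a deterministic, reversible token vault.
 * Repeated calls with the same input return the same token, and every
 * issued token can be mapped back to its original source value.
 */
public class InMemoryTokenVault {

   private final Map<String, String> sourceToToken = new ConcurrentHashMap<>();
   private final Map<String, String> tokenToSource = new ConcurrentHashMap<>();

   /** Deterministic: the same source string always yields the same token. */
   public String tokenize(String theSource) {
      return sourceToToken.computeIfAbsent(theSource, source -> {
         String token = UUID.randomUUID().toString();
         tokenToSource.put(token, source);
         return token;
      });
   }

   /** Reversible: a token maps back to exactly the original source value. */
   public String detokenize(String theToken) {
      return tokenToSource.get(theToken);
   }
}
```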

11.24.2.1 Tokenization Provider

The tokenization algorithm is supplied via an implementation of the ITokenizationProvider interface. This interface provides two methods, one for converting a string into a token and the other for performing the reverse.

You can use the General Purpose Interceptor Demo Project as a starting point for creating your own tokenization provider.

11.24.2.2 Example Tokenization Provider

The following example shows a tokenization provider:

/**
 * A simple tokenization provider which tokenizes a few PHI fields within the Patient resource. This provider uses
 * the (completely insecure!) ROT13 algorithm for tokenization and is intended for demonstration
 * purposes only.
 */
public class ExampleTokenizationProvider implements ITokenizationProvider {

   /**
    * This method is called in order to tokenize one or more source strings. The system tries
    * to provide batches of strings for tokenization so that if an external tokenization
    * service is used, and it supports batching, this capability can be leveraged.
    *
    * @param theRequestDetails The request associated with the tokenization. This object
    *                          contains one or more strings for tokenization.
    * @param theRequests       The requests, which include the specific rule as well as the object being tokenized
    */
   @Override
   public TokenizationResults tokenize(RequestDetails theRequestDetails, TokenizationRequests theRequests) {
      TokenizationResults retVal = new TokenizationResults();
      for (TokenizationRequest request : theRequests) {
         String source = request.getObjectAsString();
         String token = rot13(source);
         retVal.addResult(request, token);
      }
      return retVal;
   }

   /**
    * This method is called in order to convert one or more tokenized strings back into their
    * original source value. This method must return exactly the same value as was originally
    * provided for tokenization. Method is only called if one or more of the configured
    * tokenization rules declare support for de-tokenization.
    */
   @Override
   public DetokenizationResults detokenize(RequestDetails theRequestDetails, DetokenizationRequests theRequests) {
      DetokenizationResults retVal = new DetokenizationResults();
      for (DetokenizationRequest request : theRequests) {
         String token = request.getToken();
         String source = rot13(token);
         retVal.addResult(request, source);
      }
      return retVal;
   }

   /**
    * Implementation of ROT13 obfuscation, based on a solution found
    * here: https://stackoverflow.com/questions/8981296/rot-13-function-in-java
    * This is not intended to be a suitable production tokenization algorithm,
    * it is simply provided as an easy way to demonstrate the concept!
    */
   public static String rot13(String theInput) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < theInput.length(); i++) {
         char c = theInput.charAt(i);
         if (c >= 'a' && c <= 'm') c += 13;
         else if (c >= 'A' && c <= 'M') c += 13;
         else if (c >= 'n' && c <= 'z') c -= 13;
         else if (c >= 'N' && c <= 'Z') c -= 13;
         else if (c >= '0' && c <= '4') c += 5;
         else if (c >= '5' && c <= '9') c -= 5;
         sb.append(c);
      }
      return sb.toString();
   }
}

11.24.3 Tokenization Rules

The configured tokenization rules define the set of FHIR data elements which will be tokenized. Essentially, they are a collection of FHIRPath Expressions identifying the elements that will be extracted from resources being stored in the repository and replaced with equivalent tokens.

The rules are configured using the Tokenization Rules Text or Tokenization Rules File settings. The value is a JSON document using the TokenizationRules model.

Each rule must contain a FHIRPath expression, beginning with a resource type. For example, the expression Patient.name.family instructs the module that when storing a Patient resource, each repetition of the Patient.name element must have the family name extracted and replaced with a token.

If a given expression corresponds to a search parameter which is active on the server, that search parameter must also be declared in the rule. See Searching and Tokenization below.

The following example shows a rules collection with several active rules for the Patient resource.

{
	"rules" : [ {
		"description" : "Rule for a path including a search parameter",
		"path" : "Patient.identifier",
		"searchParameter" : "identifier",
		"searchValueNormalization" : "IDENTIFIER"
	}, {
		"description" : "Another rule for a path including a search parameter", 
		"path" : "Patient.name.family",
		"searchParameter" : "family",
		"searchValueNormalization" : "STRING"
	}, {
		"description" : "Rule for a path with no associated search parameter",
		"path" : "Patient.maritalStatus"
	} ]
}

11.24.4 Searching and Tokenization

When an element in a resource is tokenized and that element is also used as a search parameter expression, declaring the search parameter as a part of the Tokenization Rule causes the search index to also be tokenized.

For example, this means that if you have chosen to tokenize the Patient.name.family element (which backs the family Search Parameter), the tokenized string will be indexed instead of the original value. Suppose the configured tokenization algorithm tokenizes the value "Smith" as the token "ABCDEFG". When performing a search using this parameter, the value being searched for is also tokenized so that indexed values can still be found.

To make this work, Smile CDR automatically creates internal SearchParameter resources with the same name as the original SearchParameter but with the suffix -tokenized. Therefore, if a FHIR client performs a search for Patient?family=smith, the search term will be automatically tokenized and the search will be treated as Patient?family-tokenized=ABCDEFG.
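This rewrite can be illustrated with the ROT13 obfuscation from the example provider above: because the tokenizer is deterministic, tokenizing the normalized search term yields exactly the string stored in the tokenized index, so an equality match succeeds. The surrounding class here is illustrative only, not part of the Smile CDR API.

```java
/**
 * Illustration of search-term rewriting with a deterministic tokenizer.
 * The index entry and the rewritten query term are produced by the same
 * tokenization function, so they compare equal.
 */
public class SearchTokenizationDemo {

   // Same ROT13 obfuscation as the example tokenization provider above
   public static String rot13(String theInput) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < theInput.length(); i++) {
         char c = theInput.charAt(i);
         if (c >= 'a' && c <= 'm') c += 13;
         else if (c >= 'A' && c <= 'M') c += 13;
         else if (c >= 'n' && c <= 'z') c -= 13;
         else if (c >= 'N' && c <= 'Z') c -= 13;
         else if (c >= '0' && c <= '4') c += 5;
         else if (c >= '5' && c <= '9') c -= 5;
         sb.append(c);
      }
      return sb.toString();
   }

   public static void main(String[] args) {
      // Index side: the stored family name, STRING-normalized then tokenized
      String indexEntry = rot13("smith");

      // Query side: Patient?family=Smith -> normalize the term, then tokenize it
      String searchTerm = rot13("Smith".toLowerCase());

      // The server rewrites the query as family-tokenized=<token>;
      // determinism guarantees the two strings match
      System.out.println(indexEntry.equals(searchTerm)); // prints "true"
   }
}
```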

11.24.4.1 Search Normalization

If you need to support searching on a tokenized value, you may need to declare a normalization rule in order for the search to behave in the way a client would expect. Several normalization modes are available:

  • NONE – Use this value if no normalization is needed (this is the default)
  • STRING – The search index string is normalized according to the FHIR string normalization rules (case-insensitive, diacritics are removed, etc.). This mode should only be used if the expression refers to a primitive string, and the search parameter is of type string. For example, the expression Patient.name.family.
  • IDENTIFIER – This mode should be selected if the expression points to an element of type Identifier and the search parameter is of type token. For example, the expression Patient.identifier.
  • CODEABLECONCEPT – This mode should be selected if the expression points to an element of type CodeableConcept and the search parameter is of type token. For example, the expression Observation.code.
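The STRING mode corresponds to the usual FHIR string-search normalization (case folding, diacritic removal). The following is a minimal sketch of such a normalization in Java, offered as an illustration of the behavior rather than Smile CDR's actual implementation:

```java
import java.text.Normalizer;

/** Sketch of FHIR-style string normalization: strip diacritics, lowercase. */
public class StringNormalization {

   public static String normalize(String theInput) {
      // Decompose accented characters into base character + combining mark,
      // then strip the combining marks and lowercase the result.
      String decomposed = Normalizer.normalize(theInput, Normalizer.Form.NFD);
      return decomposed.replaceAll("\\p{M}", "").toLowerCase();
   }

   public static void main(String[] args) {
      System.out.println(normalize("Müller")); // prints "muller"
      System.out.println(normalize("RENÉE"));  // prints "renee"
   }
}
```

Because normalization is applied before tokenization, "Müller", "MULLER", and "muller" all tokenize to the same index value, which is what makes case- and accent-insensitive matching possible on tokenized strings.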

11.24.4.2 Limitations

Note the following limitations on searches which use tokenized values:

  • When searching using a SearchParameter of type string, the search always searches for a "normalized exact match". This means that the search is case-insensitive and ignores accents and other diacritic marks, but must match on all characters (i.e. no left match).
  • When searching using a SearchParameter of type token, no modifiers (e.g. :in) may be used. If the element being indexed includes a system and a value, both must be specified in the search URL. If the element being indexed includes only a code (e.g. Patient.gender), then the search URL must include only the code as well.

11.24.5 Tokenization and Validation

When using either Repository Validation or Endpoint Validation with this feature, resources are validated prior to tokenization.

This means that tokenization will not cause validation failures when a mandatory data element is removed and replaced with a token. However, if a non-reversible tokenization algorithm is chosen, the resource returned to clients may no longer satisfy the same validation requirements.

11.24.6 Other Limitations

The FHIR PATCH operation is not currently supported when Tokenization is enabled.