Some organizations, jurisdictions, or projects have policies requiring that specific fields within resources be obfuscated when at rest in the database.
For example, consider the following simple Patient resource:
{
  "resourceType": "Patient",
  "id": "123",
  "name": [ {
    "family": "Simpson",
    "given": [ "Homer" ]
  } ],
  "birthDate": "1956-05-12",
  "gender": "male"
}
In the example above, the complete resource text will be stored in a text field within the database. In addition, for FHIR search indexing, specific strings are extracted from the resource and also stored in dedicated database tables designed to support searching.
Of course, database-level encryption and HTTPS transport security should generally already be used to protect this data, but sometimes additional security is needed for specific fields. Smile CDR Tokenization extracts specific elements from the FHIR resource body and replaces them with "tokens": opaque strings serving as placeholders for these elements.
Replacing sensitive values with opaque placeholder strings reduces the ability of someone with access to the database to re-identify the data.
These tokenized strings can take any form and do not need to match the format of the original datatype. For this reason, they are stored in an extension and the original value is removed. The following example shows a Patient.birthDate element with its value replaced by a tokenized string:
{
  "_birthDate": {
    "extension": [ {
      "url": "https://smilecdr.com/fhir/ns/StructureDefinition/resource-tokenized-value",
      "valueCode": "cce82123-748d-4597-b52b-9200646ab788"
    } ]
  }
}
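The same extension mechanism applies to other elements. For example, if the Patient.name.family element from the earlier example were tokenized, the stored resource might look like the following sketch (the token value is purely illustrative, and the given name is left untouched because no rule covers it):
{
  "resourceType": "Patient",
  "id": "123",
  "name": [ {
    "_family": {
      "extension": [ {
        "url": "https://smilecdr.com/fhir/ns/StructureDefinition/resource-tokenized-value",
        "valueCode": "8d3a2f41-90bb-4c1e-a1f3-2f6d8e2c7b19"
      } ]
    },
    "given": [ "Homer" ]
  } ]
}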
Smile CDR relies on a user-supplied algorithm for tokenization, and does not provide the tokenization capability (i.e. the actual algorithm used to convert between a plaintext string and a token) directly. This is because tokenization should be external to Smile CDR for separation of concerns.
The tokenization algorithm:
The tokenization algorithm is supplied via an implementation of the ITokenizationProvider interface. This interface provides two methods, one for converting a string into a token and the other for performing the reverse.
You can use the General Purpose Interceptor Demo Project as a starting point for creating your own tokenization provider.
The following example shows a tokenization provider:
/**
 * A simple tokenization provider which tokenizes a few PHI fields within the Patient resource. This provider uses
 * the (completely insecure!) ROT-13 algorithm for tokenization and is intended for demonstration
 * purposes only.
 */
public class ExampleTokenizationProvider implements ITokenizationProvider {

    /**
     * This method is called in order to tokenize one or more source strings. The system tries
     * to provide batches of strings for tokenization so that if an external tokenization
     * service is used, and it supports batching, this capability can be leveraged.
     *
     * @param theRequestDetails The details of the FHIR request which triggered the tokenization.
     * @param theRequests       The individual tokenization requests, each of which includes the
     *                          specific rule as well as the object being tokenized.
     */
    @Override
    public TokenizationResults tokenize(RequestDetails theRequestDetails, TokenizationRequests theRequests) {
        TokenizationResults retVal = new TokenizationResults();
        for (TokenizationRequest request : theRequests) {
            String source = request.getObjectAsString();
            String token = rot13(source);
            retVal.addResult(request, token);
        }
        return retVal;
    }

    /**
     * This method is called in order to convert one or more tokenized strings back into their
     * original source values. This method must return exactly the same value as was originally
     * provided for tokenization. It is only called if one or more of the configured
     * tokenization rules declare support for de-tokenization.
     */
    @Override
    public DetokenizationResults detokenize(RequestDetails theRequestDetails, DetokenizationRequests theRequests) {
        DetokenizationResults retVal = new DetokenizationResults();
        for (DetokenizationRequest request : theRequests) {
            String token = request.getToken();
            String source = rot13(token);
            retVal.addResult(request, source);
        }
        return retVal;
    }

    /**
     * Implementation of ROT-13 obfuscation for letters (with a similar ROT-5 rotation for
     * digits), based on a solution found here:
     * https://stackoverflow.com/questions/8981296/rot-13-function-in-java
     * This is not intended to be a suitable production tokenization algorithm;
     * it is simply provided as an easy way to demonstrate the concept!
     */
    public static String rot13(String theInput) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < theInput.length(); i++) {
            char c = theInput.charAt(i);
            if (c >= 'a' && c <= 'm') c += 13;
            else if (c >= 'A' && c <= 'M') c += 13;
            else if (c >= 'n' && c <= 'z') c -= 13;
            else if (c >= 'N' && c <= 'Z') c -= 13;
            else if (c >= '0' && c <= '4') c += 5;
            else if (c >= '5' && c <= '9') c -= 5;
            sb.append(c);
        }
        return sb.toString();
    }
}
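Because ROT-13 on letters and the +/-5 rotation on digits are both self-inverse, this example provider automatically meets the requirement that detokenize return exactly the original source value. The following throwaway snippet (not part of the Smile CDR API) demonstrates the round trip:
public class Rot13RoundTripDemo {
    public static void main(String[] args) {
        String source = "Simpson 1956-05-12";
        // Tokenize, then detokenize using the same self-inverse function
        String token = ExampleTokenizationProvider.rot13(source);  // "Fvzcfba 6401-50-67"
        String restored = ExampleTokenizationProvider.rot13(token);
        // Applying the transform twice restores the input exactly
        System.out.println(restored.equals(source));               // prints "true"
    }
}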
The configured tokenization rules define the set of FHIR data elements which will be tokenized. Essentially, they are a collection of FHIRPath Expressions identifying the elements which will be extracted from resources being stored in the repository and replaced with equivalent tokens.
The rules are configured using the Tokenization Rules Text or Tokenization Rules File settings. The value is a JSON document using the TokenizationRules model.
Each rule must contain a FHIRPath expression, beginning with a resource type. For example, the expression Patient.name.family instructs the module that when storing a Patient resource, each repetition of the Patient.name element must have the family name extracted and replaced with a token.
If a given expression corresponds to a search parameter which is active on the server, that search parameter must also be declared in the rule. See Searching and Tokenization below.
The following example shows a rules collection with several active rules for the Patient resource.
{
  "rules" : [ {
    "description" : "Rule for a path including a search parameter",
    "path" : "Patient.identifier",
    "searchParameter" : "identifier",
    "searchValueNormalization" : "IDENTIFIER",
    "status" : "ACTIVE"
  }, {
    "description" : "Another rule for a path including a search parameter",
    "path" : "Patient.name.family",
    "searchParameter" : "family",
    "searchValueNormalization" : "STRING",
    "status" : "ACTIVE"
  }, {
    "description" : "Rule for a path with no associated search parameter",
    "path" : "Patient.maritalStatus",
    "status" : "ACTIVE"
  } ]
}
When an element in a resource is tokenized and that element is also used in a search parameter expression, declaring the search parameter as part of the Tokenization Rule causes the search index to be tokenized as well.
For example, if you have chosen to tokenize the Patient.name.family element (which is used to support the family Search Parameter), the tokenized string will be indexed instead of the original value. Suppose the configured tokenization algorithm tokenizes the value "Smith" as the token "ABCDEFG". When performing a search using this parameter, the value being searched for will also be tokenized in order to ensure that values can still be found.
To make this work, Smile CDR automatically creates internal SearchParameter resources with the same name as the original SearchParameter but with the suffix -tokenized. Therefore, if a FHIR client performs a search for Patient?family=smith, the search term will be automatically tokenized and the search will be treated as Patient?family-tokenized=ABCDEFG.
If you need to support searching on a tokenized value, you may need to declare a normalization rule in order for the search to behave in the way a client would expect. Several normalization modes are available:
- STRING – for string elements such as Patient.name.family.
- IDENTIFIER – for identifier elements such as Patient.identifier.
- TOKEN – for coded elements such as Observation.code.
Note the following limitations on searches which use tokenized values:
- Searches are performed against internal search parameters with the suffix -tokenized. For example, if you are tokenizing the Patient.identifier element, an internal search parameter called identifier-tokenized will be created by the system in order to support searches by identifier. These internal parameters may not be used directly by FHIR clients.
- Searches on tokenized identifiers cannot match on only a system (Identifier.system) or a value (Identifier.value); the complete system and value pair must be supplied.
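For example, assuming the identifier rule shown earlier is active, the first search below can be tokenized and matched, while the second cannot (URLs are illustrative):
GET /Patient?identifier=http://example.org/mrn|12345   (complete system and value: supported)
GET /Patient?identifier=12345                          (value only: not supported)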
When using either Repository Validation or Endpoint Validation with this feature, resources are validated prior to tokenization.
This means that tokenization will not cause validation failures if a mandatory data element is then removed and tokenized. However, if a non-reversible tokenization algorithm is chosen, the resource may no longer meet the same requirements when it is returned.
The FHIR PATCH operation is not currently supported when Tokenization is enabled.
If it is necessary to change your tokenization rules, you may make changes to your rules file and restart your FHIR storage module at any time. However, this may cause issues if you have existing data stored in your repository which was tokenized under the previous rules.
Smile CDR provides the Update Tokenization operation ($sdh.update-tokenization), which can be used to apply the new rules to any existing data in the repository.
When a new tokenization rule is added, any resources that are added or modified will be tokenized using the new rules. Existing data remains available and is returned with its original non-tokenized values, since it was never tokenized. New data is detokenized on access and is likewise returned with the original non-tokenized values (assuming the tokenization algorithm is reversible and the rule is configured to actually detokenize).
If the new rule includes a search parameter, searches using this parameter will only find resources added or modified after the rule was added.
The Update Tokenization operation can be used to apply the new rules to existing data, after which all data will be tokenized, and search parameters will be able to find all applicable resources.
If a tokenization rule is no longer needed, it should generally not be removed immediately if any tokenized data is already stored. Once a rule is removed, existing tokenized values will be returned without being de-tokenized.
In order to remove a rule, you should use the following steps:
1. Change the rule status from ACTIVE to QUIESCE, and restart the FHIR Storage module. This means that the rule will no longer be used for tokenization, but existing tokenized data will still be detokenized.
2. Run the Update Tokenization operation so that existing resources are updated under the revised rules, then remove the rule or change it to the DISABLED status. Old rules with a status of QUIESCE do not strictly need to be removed or disabled, but they do add a tiny amount of processing for relevant resource reads, so it is good to remove them eventually.
The Update Tokenization operation can be used to apply the new rules to any existing data in the repository. This operation is invoked by executing an HTTP POST request to the $sdh.update-tokenization operation.
This operation follows the FHIR Asynchronous Interaction Request Pattern. This means that the operation returns immediately, with a response containing a URL to poll for the status of the operation. It can also be monitored using the Batch Job Endpoint or through the Web Admin Console.
The Update Tokenization operation accepts a Parameters resource as input, with the following parameters:
- url – A list of resource search URLs to be examined and updated according to the current tokenization rules. These URLs are FHIR search URLs and follow the same format as Reindex URLs.
The following example shows an HTTP POST request to the $sdh.update-tokenization operation:
POST /$sdh.update-tokenization
Prefer: respond-async
Content-Type: application/fhir+json

{
  "resourceType": "Parameters",
  "parameter": [ {
    "name": "url",
    "valueString": "Patient?"
  } ]
}
The response will contain a URL to poll for the status of the operation:
202 Accepted
Content-Location: http://example.org/$sdh.update-tokenization-status?_jobId=MY-INSTANCE-ID
Content-Type: application/fhir+json;charset=utf-8

{
  "resourceType": "OperationOutcome",
  "issue": [ {
    "severity": "information",
    "code": "informational",
    "diagnostics": "$sdh.update-tokenization job has been accepted"
  } ]
}
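Following the FHIR Asynchronous Interaction Request Pattern, the client can then poll the returned Content-Location URL to check on the job. A sketch of the polling request, reusing the example URL above (the exact completion response body is implementation-specific and not shown here):
GET /$sdh.update-tokenization-status?_jobId=MY-INSTANCE-ID
Per the asynchronous pattern, the server responds with 202 Accepted while the job is still running, and with 200 OK once it has completed.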