Character encoding is one of the most arcane subjects a technologist will ever encounter, yet it underpins literally every piece of text we see on our computers. This article will focus on HL7v2 and the Redox API, and how they treat different encodings.
First things first: you should have some background on encoding fundamentals before reading this article. I recommend this article by Kunststube as a comprehensive introduction. If you feel comfortable with what makes ASCII different from UTF-8, then continue reading.
Redox API always uses UTF-8
If you read the introductory article, it should be clear why UTF-8 has taken over as the de facto standard for the web: it's space-efficient and can represent the entire Unicode character space.
When you send something to the Redox API, we expect it to be encoded as UTF-8. Similarly, when you receive something from us, it will be encoded using UTF-8.
If you send 我爱你❤️ into the API, it will show up that way in the Redox dashboard. If you try to send it to an EHR, weird stuff can start to happen.
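A quick way to see the bytes that actually travel over the wire (Python here purely for illustration):

```python
# Five Unicode code points (the heart is two: U+2764 plus a variation
# selector), but considerably more bytes once encoded as UTF-8.
payload = "我爱你❤️"
encoded = payload.encode("utf-8")
print(len(payload), len(encoded))  # 5 code points, 15 bytes
```

Each CJK character takes three bytes in UTF-8, which is why the byte count triples.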
Many products do not handle Unicode well
Many EHR vendors were founded (and developed much of their codebase) before Unicode was conceived, much less widely supported. If you remember switching encodings to get webpages to display correctly in the late 90s, that's because many pieces of software from that era support only legacy single-byte character sets. If you've ever heard of Windows-1252, that's the extra 128 characters you get by using the eighth bit on top of 7-bit ASCII.
If you have a desperate need to put Unicode into an EHR, make sure you check with your Redox install team first. In some cases, only certain fields (like names) will support Unicode, and the way we actually update them can vary depending on how closely the vendor has read the HL7 specs.
HL7v2 support for Unicode is a whole science unto itself
Most of the details of using encodings other than ASCII in HL7v2 are in Chapter 2 of the HL7v2 spec.
The process goes something like this:
- The default is 7-bit ASCII
- Delimiters must be in 7-bit ASCII (including carriage returns)
- If MSH-18 is populated the first repetition denotes the default encoding for the message
- If the first repetition is blank, the default remains 7-bit ASCII
- Additional repetitions indicate other coding schemes that may be used in the message
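The rules above can be sketched in a few lines of Python. This is a hypothetical helper for illustration, not Redox's actual parser, and it assumes `|` and `~` are the delimiters in use (they are the HL7 defaults):

```python
def default_encoding(message: bytes) -> str:
    """Apply the MSH-18 rules: the first repetition names the default
    charset for the message; blank or absent means 7-bit ASCII."""
    # Delimiters are guaranteed to be 7-bit ASCII, so splitting the raw
    # bytes is safe even before we know the message's encoding.
    msh_segment = message.split(b"\r", 1)[0]
    fields = msh_segment.split(b"|")
    # "MSH" itself is index 0 and MSH-1 is the field separator,
    # so MSH-18 lands at index 17.
    if len(fields) <= 17 or not fields[17]:
        return "ASCII"
    first_repetition = fields[17].split(b"~")[0]
    return first_repetition.decode("ascii") if first_repetition else "ASCII"
```

Notice that the function works on raw bytes: you have to read MSH-18 before you can safely decode anything else in the message.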
So doing UTF-8 is as simple as putting “UNICODE UTF-8” in MSH-18, right?
Not quite. As with most corners of HL7v2, MSH-18 is generally overlooked by implementers.
To complicate matters, those multiple encodings (and EHRs that vary encoding by field) are accommodated using escape sequences. So you can mix Unicode, IR87, and more in a single field, and the parser is supposed to be able to process it.
How Redox handles this
Our HL7 parser is a pretty nifty piece of tech. We parse all messages out into JSON. Since we can do this step without having to worry about the encoding, we push the actual processing of each field down the pipeline if needed, and apply individual translations per field as described above.
In practice, we haven't run into many situations where content other than 7-bit ASCII is sent. In some rare cases, though, a handful of tiny symbols can wreak havoc. The degree symbol ° and exponent symbols like ² and ³ have a nasty habit of showing up in units. The degree symbol, for example, is represented as B0 (176) in windows-1252, and F8 (248) in Code Page 437.
Interestingly enough, the Unicode code point for ° is U+00B0. In UTF-8, that becomes the two-byte sequence C2 B0, so if you interpreted the message as windows-1252, you'd get Â°; in 437, ┬░; and the correct ° symbol in UTF-8. Conversely, if the message was actually in windows-1252, you would most likely get some kind of error, because B0 by itself is not a valid UTF-8 sequence. Yikes.
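You can reproduce all of this in a few lines of Python:

```python
degree_utf8 = "°".encode("utf-8")    # b'\xc2\xb0'
print(degree_utf8.decode("cp1252"))  # Â° - wrong
print(degree_utf8.decode("cp437"))   # ┬░ - wrong
print(degree_utf8.decode("utf-8"))   # °  - correct

try:
    # The single windows-1252 byte for ° is not valid UTF-8 on its own.
    b"\xb0".decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # invalid start byte
```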
Advice for designing applications
At this point, you can start to see how if we know what symbol was supposed to be sent, we can work our way backwards using a programmer’s calculator.
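For example, if Â° shows up where a degree sign should be, re-encoding with the suspected wrong charset and then decoding as UTF-8 recovers the original. This is the classic mojibake-reversal trick:

```python
# UTF-8 bytes that were mistakenly decoded as windows-1252...
mangled = "Â°"
# ...can be restored by reversing the mistake.
fixed = mangled.encode("cp1252").decode("utf-8")
print(fixed)  # °
```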
Working backwards is a lot of work, though, so if you’re designing an application that integrates with Redox, keep these things in mind.
- Be able to send/receive UTF-8 when talking to Redox API
- Make sure the guts of your application (database, external services, etc.) can handle that UTF-8
- Use your eyeballs. As I mentioned above, if someone sending HL7 is not following the rules, bad encodings can be nearly impossible to spot automatically.
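One cheap safeguard along those lines: flag anything outside 7-bit ASCII before it heads toward an EHR, so a human can eyeball it. A minimal sketch (the helper name is ours, not part of any Redox library):

```python
def flag_non_ascii(text: str):
    """Return (position, char, code point) for every character outside
    7-bit ASCII, so suspicious content can be reviewed before it
    reaches an EHR that may not handle it."""
    return [(i, ch, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if ord(ch) > 0x7F]

print(flag_non_ascii("98.6 °F"))  # [(5, '°', 'U+00B0')]
```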
We're keeping our eyeballs peeled too. Good luck and 慢走 (take care).