Why interoperability is hard: The plurality of concerns

December 5, 2016
Dharma Indurthy Staff Cloud Engineer

At Redox, we have embarked on tackling a huge problem: interoperability. It’s not an atomic problem; it’s divisible into a number of more elementary, hard problems. We embarked on tackling interoperability because it’s challenging and so valuable to solve. Both aspects contribute to making Redox such an exciting place to work.

But what makes a problem hard? It’s common knowledge that interoperability is hard, but less common is a deep understanding of what makes it so. I submit that much of what makes interoperability and its divisible parts so hard is the plurality of necessary concerns in the space we work in. The concerns this post touches on are being functional, secure, available, scalable, performant, usable, and maintainable.

Meeting all of those concerns for any given problem is very difficult. Additionally, these concerns are coupled, such that designing for one might contribute to or detract from another. There is often an additional human problem alongside a technical solution, so implicit in engineering something robust is becoming its advocate. To give you a window into what this is like for Redox, let’s consider a fairly elementary problem and deep dive into our effort to resolve that problem in the midst of some of these concerns.

Problem: Connecting a Health System to Redox

Being Functional

A functional solution to this problem is not hard. Restricting our discussion to the HL7v2 realm, which is the standard for most of the healthcare integration going on right now, it turns out that the technology at health systems has evolved toward building interface engines that send data over Transmission Control Protocol (TCP). TCP is a fine protocol for sending information and guarantees fidelity in transmitting a payload from a source to a target, and it underlies most of the data transfer you might be familiar with, e.g. the back-and-forth between a web browser and whatever web servers it is browsing. So, were our only concern being functional, we would design something like so:

In the above diagram, we (Redox) expose an endpoint for our partner to send to, that is, an IP address and a port, and the Health System would target that with its interface engine. We would distribute that traffic across the instances that host our application. If we suppose that the app is scalable across hosts (and RedoxEngine is), then the number of application servers would be variable and the load balancer would need to be built to auto-detect changes and reconfigure. If Redox needs to send to a health system, i.e. if an application sent us a payload intended for Medical Center, RedoxEngine would send back to the health system interface engine on whatever IP and port they expose for their inbound interface.
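To make the "auto-detect changes and reconfigure" requirement a little more concrete, here is a minimal Python sketch of a round-robin balancer whose backend list can be swapped out at any time. It is an illustration only, not how RedoxEngine is built: the class name, the `reconfigure` and `pick` methods, and the addresses are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoundRobinBalancer:
    """Toy balancer: rotates through whichever backends are currently registered."""

    backends: List[str] = field(default_factory=list)
    _next: int = 0

    def reconfigure(self, discovered: List[str]) -> None:
        # Replace the backend list whenever application hosts come or go.
        self.backends = sorted(set(discovered))
        self._next = 0

    def pick(self) -> str:
        # Choose the next backend in rotation for an inbound connection.
        if not self.backends:
            raise RuntimeError("no application servers available")
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend


lb = RoundRobinBalancer()
lb.reconfigure(["10.0.1.10:9000", "10.0.1.11:9000"])
first_three = [lb.pick() for _ in range(3)]

# A third (hypothetical) host comes up; the next reconfigure picks it up.
lb.reconfigure(["10.0.1.10:9000", "10.0.1.11:9000", "10.0.1.12:9000"])
```

In a real deployment the list passed to `reconfigure` would come from some discovery mechanism rather than being typed in by hand.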

This would be so easy. This scheme is so simple that it is incidentally pretty usable and maintainable too. But the most obvious issue is that it is not secure. There is no native encryption around TCP traffic, so anyone that intercepted the traffic would obtain a treasure trove of PHI. Not good.

Being Functional and Secure

To add a layer of security, Redox sets up virtual private networks (VPNs) with health systems. This allows us to share data privately, almost as if Redox were installed on premises with the health system. The virtual in VPN means that privacy is established based on encryption, so despite the data traveling over the internet, only networks with the right keys can decrypt it. So now, we end up with a configuration like this:

Each side of the connection, i.e. the Medical Center and Redox, now has an appliance called a VPN gateway. Each side uses a tool to define local network IPs that are exposed, remote network IPs that are allowed, and the keys used to encrypt and decrypt the data. This constitutes a network tunnel. The Medical Center sets network rules to route traffic intended for Redox to their VPN gateway. Likewise, in order to send back to a health system from Redox, network rules in Redox must route traffic intended for Medical Center back to our VPN gateway. The gateways then relay this to the remote network over the encrypted tunnel.
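As a rough illustration of the three ingredients of a tunnel definition (exposed local IPs, allowed remote IPs, and keys), here is a small Python sketch. The `TunnelConfig` class, its field names, and the addresses are hypothetical and do not reflect any particular VPN product; the key is represented only by a reference, on the assumption that the secret itself lives somewhere safer.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TunnelConfig:
    """Hypothetical shape of one side's tunnel definition."""

    peer_public_ip: str         # public IP of the remote VPN gateway
    local_networks: List[str]   # local ranges exposed to the peer
    remote_networks: List[str]  # remote ranges allowed to reach us
    key_ref: str                # pointer to the key material, never the key itself

    def permits(self, src_ip: str, dst_ip: str) -> bool:
        """True if traffic from a remote src_ip to a local dst_ip fits the tunnel."""
        src = ipaddress.ip_address(src_ip)
        dst = ipaddress.ip_address(dst_ip)
        from_allowed = any(src in ipaddress.ip_network(n) for n in self.remote_networks)
        to_exposed = any(dst in ipaddress.ip_network(n) for n in self.local_networks)
        return from_allowed and to_exposed


tunnel = TunnelConfig(
    peer_public_ip="203.0.113.10",       # documentation-only address
    local_networks=["10.20.0.0/24"],
    remote_networks=["172.16.5.0/28"],
    key_ref="secrets/medical-center",    # made-up reference
)

allowed = tunnel.permits("172.16.5.4", "10.20.0.15")
rejected = tunnel.permits("192.168.1.1", "10.20.0.15")
```

Here `allowed` is true because both addresses fall inside the declared ranges, while `rejected` is false because the source is not in an allowed remote range.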

Unfortunately, this adds a lot of complexity and detracts from usability. At least on the Redox side, we need to scale to hundreds or thousands of VPNs, which is a lot of configurations to maintain. We also use firewall tools to translate IPs to avoid conflicts if both the health system and Redox use the same IP ranges. We use those same tools to ensure that each health system sends only to its allowed ports. Building these out in an organized way while ensuring no health system’s configuration conflicts with any other is challenging. The health system must maintain its own apparatus for their side of the VPN. Also, intrinsic to adding any new necessary appliance is exposing yourself to another point of failure.
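One way to picture the "no configuration conflicts with any other" requirement is a check that runs over every pair of health system configurations. The sketch below is an assumption-laden example, not our actual tooling: the config layout, the health system names, and the rule that two health systems may not share a port are all invented for illustration.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical per-health-system settings: a translated IP range and allowed ports.
configs: Dict[str, dict] = {
    "medical-center-a": {"translated_range": "10.100.1.0/24", "ports": {6001, 6002}},
    "medical-center-b": {"translated_range": "10.100.2.0/24", "ports": {6003}},
    "medical-center-c": {"translated_range": "10.100.1.128/25", "ports": {6002}},
}


def find_conflicts(configs: Dict[str, dict]) -> List[Tuple[str, str, str]]:
    """Compare every pair of configurations and report overlaps."""
    conflicts = []
    for name_a, name_b in combinations(sorted(configs), 2):
        a, b = configs[name_a], configs[name_b]
        net_a = ipaddress.ip_network(a["translated_range"])
        net_b = ipaddress.ip_network(b["translated_range"])
        if net_a.overlaps(net_b):
            conflicts.append((name_a, name_b, "overlapping IP ranges"))
        shared = a["ports"] & b["ports"]
        if shared:
            conflicts.append((name_a, name_b, f"shared ports {sorted(shared)}"))
    return conflicts


conflicts = find_conflicts(configs)
```

A pairwise check like this grows quadratically with the number of configurations, which is tolerable for hundreds or thousands of VPNs but is one reason keeping configurations organized matters.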

Being Functional, Secure, and Available

In order to be secure and available, we need redundancy. When you operate in the cloud, you cannot count on any un-managed appliance. Because we are constrained to HIPAA-compliant tools, we self-manage a lot of our infrastructure, and are thus obligated to manage infrastructure failure. Below is an example of what we need to do to obtain high availability.

In this case, we have a secondary load balancer and a secondary VPN gateway. This provides us a ready-to-go backup in case our primary instance fails. Not shown, however, is the business logic that would drive this. We need to account for different failure modes and implement logic to quickly detect and resolve failure.

Were we to lose our primary load balancer, we would need to start directing traffic to the secondary. That means that we need to:

  1. Detect failure of the primary load balancer
  2. Reconfigure the Redox VPN to route inbound traffic to the secondary load balancer

Presumably, the secondary load balancer has kept pace with the primary in terms of having an up-to-date configuration, so that is all we would need to do.
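The two steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the routing table is a plain dictionary, the health check is a callable passed in by the caller, and the names `fail_over_load_balancer`, `lb-primary`, and `lb-secondary` are hypothetical.

```python
from typing import Callable, Dict


def fail_over_load_balancer(
    routes: Dict[str, str],
    primary: str,
    secondary: str,
    is_healthy: Callable[[str], bool],
) -> bool:
    """Point inbound VPN traffic at the secondary if the primary is down.

    Returns True when a failover was performed.
    """
    # Step 1: detect failure of the primary load balancer.
    if is_healthy(primary):
        return False
    # Guard (an addition for this sketch): do not fail over onto a dead secondary.
    if not is_healthy(secondary):
        raise RuntimeError("both load balancers are unhealthy")
    # Step 2: route inbound traffic from the VPN to the secondary.
    routes["inbound"] = secondary
    return True


routes = {"inbound": "lb-primary"}
down = {"lb-primary"}  # simulate a failed primary
performed = fail_over_load_balancer(
    routes, "lb-primary", "lb-secondary", lambda lb: lb not in down
)
```

After this runs, `routes["inbound"]` points at the secondary. The sketch leans on the same presumption as the text: the secondary already carries an up-to-date configuration.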

A more complicated scenario is if we lost our primary VPN gateway. In that case, we would need to do the following:

  1. Detect failure of the primary VPN gateway
  2. Reassign the public IP of the VPN gateway to the secondary. Since the gateways broker the tunnel over the internet, their respective public IPs are what is used to begin establishing the tunnel. AWS allows provisioning a public IP (an Elastic IP, or EIP) that can be reassigned to other appliances.
  3. Obtain an up-to-date list of configurations. Ideally we update configuration from one source of truth; for Redox, this is stored in Amazon’s Simple Storage Service (S3).
  4. Update all configurations and firewall settings to reassign the internal IP of the old primary to the internal IP of the new primary.
  5. Update the network routing rules to send traffic outbound from Redox to the Medical Center through the VPN.

This turns out to be a doable if complicated maneuver. Upon completion, which occurs within seconds of detecting failure, all VPN tunnels served by the failed VPN gateway are restored. Implicit in making maneuvers like this successful is having robust monitoring and alerting solutions, accounting for failure modes within the failover process, communicating effectively with affected parties, etc.
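To show how the five steps fit together, here is a Python sketch that runs them against an in-memory stand-in for the network state. It makes no real cloud calls; `NetworkState`, `fail_over_vpn_gateway`, the `load_configs` callable (standing in for reading the source of truth), and every name and address are assumptions made for the example, not our production code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NetworkState:
    """In-memory stand-in for the state a real failover would touch."""

    public_ip_owner: str                  # gateway holding the reassignable public IP
    gateway_internal_ips: Dict[str, str]  # gateway name -> internal IP
    tunnel_configs: List[Dict[str, str]] = field(default_factory=list)
    outbound_next_hop: str = ""           # gateway that outbound traffic routes through


def fail_over_vpn_gateway(
    state: NetworkState,
    primary: str,
    secondary: str,
    is_healthy: Callable[[str], bool],
    load_configs: Callable[[], List[Dict[str, str]]],
) -> bool:
    # 1. Detect failure of the primary VPN gateway.
    if is_healthy(primary):
        return False
    # 2. Reassign the public IP to the secondary.
    state.public_ip_owner = secondary
    # 3. Obtain an up-to-date list of configurations from the source of truth.
    configs = load_configs()
    # 4. Swap the old primary's internal IP for the new primary's everywhere.
    old_ip = state.gateway_internal_ips[primary]
    new_ip = state.gateway_internal_ips[secondary]
    state.tunnel_configs = [
        {key: (new_ip if value == old_ip else value) for key, value in cfg.items()}
        for cfg in configs
    ]
    # 5. Route outbound traffic through the new gateway.
    state.outbound_next_hop = secondary
    return True


state = NetworkState(
    public_ip_owner="vpn-primary",
    gateway_internal_ips={"vpn-primary": "10.0.0.5", "vpn-secondary": "10.0.0.6"},
    outbound_next_hop="vpn-primary",
)
stored = [{"name": "medical-center", "gateway_ip": "10.0.0.5", "peer_ip": "203.0.113.10"}]
done = fail_over_vpn_gateway(
    state, "vpn-primary", "vpn-secondary",
    is_healthy=lambda gw: gw != "vpn-primary",  # simulate a failed primary
    load_configs=lambda: stored,
)
```

The sketch leaves out exactly the parts the paragraph above calls out as essential: monitoring, alerting, and handling a failure partway through the sequence.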

Our Plurality of Concerns

This example has yet to address the concerns of scalability and performance. To that end, we had to build a separate thorough apparatus to test load across our VPN appliances so we have an idea of what a gateway could handle. Also, as our solution becomes more sophisticated, we suffer costs in complexity. What was once usable and easy to maintain now has multiple layers of configuration per health system, along with rigorous monitoring and alerting requirements. So we have to go back and make sure to organize our tools to be as maintainable as possible. And finally, we have to be an advocate of our solution, exposing it to scrutiny both internally and to the organizations we work with.

This is the work we do at Redox. The good news is that we are thinking about the plurality of our concerns, and we do that as upfront in the design as possible. Meeting all of these expectations is both daunting and thrilling. It’s the nature of our space, and it’s why we do what we do.