Distributed cache for highly available local representations of remote bounded context data

October 13, 2021
Brandon Palmer, Principal Software Engineer, Architect

The need for a Distributed Cache

As our products have evolved from monolithic services to microservices, sharing data between bounded contexts has become harder. With a monolith, all code could access the same centralized data stores. We’ve broken those monoliths into independently developed and maintained microservices, each with its own data in independent datastores which can’t be directly accessed by other services. To give services read-only access to data from remote services, Event-Driven Architectures (EDA) have emerged, but they also have limitations and have led to inconsistent implementations across an organization. A solution to this problem is the implementation of a Distributed Cache.

Options which haven’t solved the problem for us

We’ve worked through numerous ways to get read-only data from one service to another (the origin service is still the source of truth for the data). The following descriptions reference two services:

- The core service, which owns the data and is its source of truth.
- The edge service, which needs a read-only view of that data.

A common use-case is that the core service contains data about users and the edge service needs that user data for making decisions or labeling assets.

The edge service performs REST-based fetches for data from the core service

With this architecture, the core service exposes internal endpoints that the edge service calls to fetch data. There are several challenges here: the edge service can only operate when the core service is up and responsive, every lookup adds latency and load on the core service, and a core service outage cascades directly to the edge service.
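As a rough illustration of this pattern (not our actual code), the edge service’s fetch path in Node.js might look like the following; the endpoint and host names are assumptions:

// Edge service: fetch user data from the core service on demand.
// Endpoint and host are hypothetical, for illustration only (Node 18+ global fetch).
async function getUser(userId) {
  const res = await fetch(`http://core-service.internal/internal/users/${userId}`);
  if (!res.ok) {
    // If the core service is unavailable, the edge service cannot proceed.
    throw new Error(`core service returned ${res.status}`);
  }
  return res.json();
}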

The core service publishes all events to a Kafka compacted topic which the edge service listens to (EDA)

With this architecture, the core service publishes all data changes to an event platform like Kafka with compacted topics. This requires that all existing data be back-loaded into the topic so that it holds a complete picture of the core service’s data. The edge service listens to the topics and builds its own view of the data (in memory or on disk).
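A hedged sketch of this consumer pattern using the kafkajs client; the topic, group, and broker names are assumptions:

// Edge service: rebuild a local view of core data from a compacted topic.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'edge-service', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'edge-service-users' });
const users = new Map(); // in-memory view, keyed by user id

async function run() {
  await consumer.connect();
  // fromBeginning: true replays the whole compacted topic at boot.
  await consumer.subscribe({ topics: ['core.users.compacted'], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const key = message.key.toString();
      if (message.value === null) {
        users.delete(key); // tombstone: the record was deleted upstream
      } else {
        users.set(key, JSON.parse(message.value.toString()));
      }
    },
  });
}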

The core service publishes change events to a pub-sub bus which the edge service listens to (EDA) and then performs fetches to the core service to get the full data element.

This architecture is similar to the second except that the edge service receives only a notification that some data element has changed and then performs a REST fetch (like the first architecture) to retrieve the full element. This combines the drawbacks of the first two architectures, but it eases the burden on the edge service of loading the whole topic at boot time. It works best when the edge service keeps a persistent view of the data, such as in a database.
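Sketched with the same assumed names as the earlier examples, the change-notification flow might look like this; the saveUserToDatabase helper is hypothetical:

// Edge service: on a change notification, fetch the full record from the core service.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'edge-service', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'edge-service-user-changes' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topics: ['core.users.changes'] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const userId = message.key.toString();
      const res = await fetch(`http://core-service.internal/internal/users/${userId}`);
      if (res.ok) {
        // saveUserToDatabase is a hypothetical persistence helper (e.g. a DB upsert).
        await saveUserToDatabase(userId, await res.json());
      }
    },
  });
}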

The core service publishes data to a central cache and the edge service performs fetches for every object it needs in real time

With this last architecture, the core service pushes all data to a central KV store like ElastiCache, and the edge service requests data from that store whenever it’s needed for operations.
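A hedged sketch of both sides using the ioredis client; the key scheme and host names are assumptions:

const Redis = require('ioredis');
const redis = new Redis('redis://central-cache:6379');

// Core service side: push every record into the central KV store.
async function publishUser(user) {
  await redis.set(`user:${user.id}`, JSON.stringify(user));
}

// Edge service side: fetch at request time; if the cache is unavailable, this throws.
async function getUser(userId) {
  const raw = await redis.get(`user:${userId}`);
  return raw ? JSON.parse(raw) : null;
}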

General Challenges

With all of these architectures, we run into several general problems: each consuming team implements its own fetch and caching logic, which drives the inconsistent implementations mentioned earlier; the edge service’s ability to operate is tied to the availability of the core service or a central store at request time; and each new consumer need tends to push work back onto the core service to re-expose or retransmit its data.

The Distributed Cache (distCache) Solution

With distCache, our goal is for edge containers to build up an in-memory cache of all the data they need from the core service, and for that cache to be updated automatically any time the core service changes the data. Data needed by the edge containers is stored in a central cache, but because the edge containers also cache locally, they can continue to function if the central cache is unavailable. We’ve built in the flexibility to create new projections of the core data in the central cache depending on consumption needs (generally built around a unique KV set).

We’ve written a distCache library to abstract all set and get operations with built-in caching behind the scenes:

set:

await dcPub.set('thisKey1', 'this value 1');

get:

console.log(await dcClient.get('thisKey1'));

When the edge service needs data for a key from the core service, the distCache library first checks its local in-memory cache for that key and otherwise fetches it from a central ElastiCache (EC) Redis KV store. The KV sets stored in EC are built based on the lookup needs of the edge containers. The distCache library periodically scans its in-memory cache and evicts the oldest keys whenever any of the eviction settings (cacheMaxTTL, cacheMaxBytes, cacheMaxItems) is violated; this allows the edge service to specify what is appropriate to keep cached. When we configure the library’s connection to EC, we intend for it to target the read replicas to decrease downtime failure risks.
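As a hedged sketch of how a consumer might wire this up: the three eviction settings are named above, but the client constructor, the other option names, and all the values are assumptions for illustration:

// Hypothetical distCache client setup; cacheMaxTTL, cacheMaxBytes, and
// cacheMaxItems come from this post, everything else is illustrative.
const dcClient = new DistCacheClient({
  redisUrl: 'redis://ec-read-replica:6379', // point reads at the EC read replicas
  cacheMaxTTL: 15 * 60 * 1000,              // evict local entries older than 15 minutes
  cacheMaxBytes: 64 * 1024 * 1024,          // cap the local cache at 64 MiB
  cacheMaxItems: 100000,                    // and at 100k keys
});

const value = await dcClient.get('thisKey1'); // local cache first, then EC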

Data is written into EC by cache producer services. These services consume from compacted Kafka topics specific to the model in the core service. The Kafka topics will generally have all columns from a table but only certain columns will be interesting to specific edge consumer containers. This allows us to generate KV namespaces that match the query needs of the edge service. When we have a new edge consumer data need, we can create a new cache producer to generate a new set of data in the KV. Because all data for the full model has been stored in the Kafka compacted topic, the new cache producer can read from the whole topic and publish all the data to the EC without the core service needing to retransmit all the data. This helps to decouple the edge consumption needs from the core’s production of data.
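A hedged sketch of a cache producer that projects a couple of columns into a new KV namespace; the topic, namespace, and column names are assumptions:

// Cache producer: read the whole compacted topic and build one KV projection.
const { Kafka } = require('kafkajs');
const Redis = require('ioredis');

const kafka = new Kafka({ clientId: 'user-cache-producer', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'cache-producer-user-by-email' });
const redis = new Redis('redis://ec-primary:6379');

async function run() {
  await consumer.connect();
  // Reading from the beginning lets a brand-new projection backfill itself
  // without the core service retransmitting anything.
  await consumer.subscribe({ topics: ['core.users.compacted'], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (message.value === null) return; // a tombstone could instead delete the projected key
      const user = JSON.parse(message.value.toString());
      // Project only the columns this edge consumer cares about, keyed for its lookups.
      await redis.set(`user:byEmail:${user.email}`, JSON.stringify({ id: user.id, name: user.name }));
    },
  });
}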

From the core service’s perspective, it just needs to write all change events to the Kafka compacted topic (including one initial backload the first time it begins publishing). The Kafka topics are set up with only a single partition so we can guarantee FIFO ordering of all events. This limits us to a single consumer per consumer group (CG), but the volume on these topics is generally very low. If we spread the data over multiple partitions, we could run into problems wherever we need strict FIFO ordering. Publishing to Kafka topics has the added benefit of providing a place to receive events for any service that is only interested in the event notification stream from the core service’s data.
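On the publishing side, a hedged sketch with kafkajs (topic name assumed as before); with compaction, Kafka keeps the latest value per key, and a null value acts as a tombstone for deletes:

// Core service: publish every change event to the compacted topic.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'core-service', brokers: ['kafka:9092'] });
const producer = kafka.producer(); // call await producer.connect() once at startup

// Publish the latest state of a record; compaction retains the newest value per key.
async function publishUserChange(user) {
  await producer.send({
    topic: 'core.users.compacted',
    messages: [{ key: String(user.id), value: JSON.stringify(user) }],
  });
}

// Publish a tombstone so compaction eventually drops the deleted record.
async function publishUserDelete(userId) {
  await producer.send({
    topic: 'core.users.compacted',
    messages: [{ key: String(userId), value: null }],
  });
}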

Availability, Scalability, and Performance

Some important requirements for this implementation are to provide high availability, scalability, and performance for our platform: edge containers keep serving cached keys from memory even when EC is unavailable, EC reads target the read replicas, in-memory hits avoid a network round trip entirely, and new consumption needs scale out by adding cache producers rather than adding load to the core service.

Live Use

Redox has been looking at how to solve this problem for quite some time, and we’re beginning to replace our existing blend of solutions with the distCache library. We have several models in our core monolith which are needed by all the microservices that surround it. All events from these models are being published to the compacted topics, and we’ve begun building the cache producers and moving the consumers over to the library. We are using a shared MSK cluster but a dedicated EC cluster for the data persistence.


Join us!

Redox is constantly hiring and we have a ton of open positions all across the company. Come check out our open positions and help fix healthcare.
