The need for a Distributed Cache
As our products have evolved from monolithic services to microservices, sharing data between bounded contexts has become harder. In a monolith, all code could access the same centralized data stores. We’ve since broken those monoliths into independently developed and maintained microservices, each with its own datastore that other services can’t access directly. Event Driven Architectures emerged as a way to give services read-only access to remote data, but they have their own limitations and have led to inconsistent implementations across an organization. One solution to this problem is a Distributed Cache.
Options which haven’t solved the problem for us
We’ve worked through numerous ways to get read-only data from one service to another (the origin service remains the source of truth for the data). The following descriptions reference two services:
- the core service which is the owner and source of data
- the edge service which is interested in the data in the core service
A common use-case is that the core service contains data about users and the edge service needs to have that user data for making decisions or labeling assets.
The edge service performs REST based fetches for data from the core service
With this architecture, the core service exposes internal endpoints for fetching data by the edge service. There are several challenges here:
- More API endpoints need to be developed, secured and maintained: The core service will likely need additional endpoints, and if it has no REST endpoints at all, those must be built from scratch. Once developed, these endpoints require ongoing maintenance. If the core service also has customer-facing endpoints, the internal calling services will likely need different authentication and authorization.
- Polling: If the edge service calls the core service for every data need, that can cause considerable load and scaling depends on the core deployment. If the edge service performs polls and then caches the results, we need to fine-tune the balance between overwhelming the core service with queries vs the freshness of data.
- Dependency chaining, coupling, and latency: When one service queries another, any issue with the target service becomes an issue for the caller. Every service needs to be available at all times, and the problem compounds when multiple services are linked in series. Every remote call also adds the network transit time plus the query response time to the caller’s latency, and as we chain more services, those latencies add up.
- Locality: Customer facing endpoints are designed for the customer consumption needs, not the needs of other internal services. When the edge services use the existing REST endpoints, they must conform to the shape of the objects returned by the core service and then manipulate them to their local needs.
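The polling trade-off described above can be made concrete with a small sketch. It assumes a hypothetical `fetchFromCore` function standing in for the REST call (kept synchronous here for brevity) and a hypothetical `ttlMs` setting; neither is part of any real service. A longer TTL means fewer calls to the core service but staler data.

```javascript
// Hypothetical sketch: wrap a fetch to the core service in a TTL cache.
// `fetchFromCore`, `ttlMs`, and `now` are illustrative names, not a real API.
function createPollingCache(fetchFromCore, ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, fetchedAt }
  let coreCalls = 0;

  return {
    get(key) {
      const hit = entries.get(key);
      // Serve from the cache while the entry is younger than the TTL.
      if (hit && now() - hit.fetchedAt < ttlMs) return hit.value;
      // Otherwise pay for a call to the core service.
      coreCalls += 1;
      const value = fetchFromCore(key);
      entries.set(key, { value, fetchedAt: now() });
      return value;
    },
    coreCallCount: () => coreCalls,
  };
}

// Usage: simulate one lookup per second for 60 seconds with a fake clock.
let clock = 0;
const cache = createPollingCache((id) => ({ id, name: 'user-' + id }), 10000, () => clock);
for (let s = 0; s < 60; s++) {
  clock = s * 1000;
  cache.get('42');
}
console.log(cache.coreCallCount()); // prints: 6
```

With a 10 second TTL the core service sees 6 calls instead of 60, at the cost of data that can be up to 10 seconds old.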
The core service publishes all events to a Kafka compacted topic which the edge service listens to (EDA)
With this architecture, the core service publishes all data changes to an event platform like Kafka with compacted topics. This requires back-loading all existing data into the topic so that it holds a complete picture of the core service’s data. The edge service listens to the topics and builds its own view of the data (in memory or on disk). This approach has its own challenges:
- Knowing when we’ve received all the events on the topic: When a new edge service boots up, it can’t easily tell whether it has consumed every event on the topic or whether consumption is simply delayed. We explored sending time-stamped nonce events and having the edge service wait until it had consumed up to that nonce event, but that became far too complicated.
- Too much to cache: If the edge service caches the data in-memory, it could quickly overwhelm the container’s memory allocation.
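As a rough illustration of how an edge service builds its own view from a compacted topic, the sketch below folds an ordered list of keyed records into a `Map`. The `{ key, value }` record shape and the function names are assumptions made for the example, and a plain array stands in for the topic; a `null` value models a Kafka tombstone (a delete).

```javascript
// Hypothetical sketch: fold an ordered stream of keyed records into a local view.
// The record shape { key, value } is an assumption; null models a tombstone.
function applyRecord(view, record) {
  if (record.value === null) {
    view.delete(record.key); // tombstone: the entity was deleted upstream
  } else {
    view.set(record.key, record.value); // the last write for a key wins
  }
  return view;
}

function buildView(records) {
  return records.reduce(applyRecord, new Map());
}

const topic = [
  { key: 'user:1', value: { name: 'Ada' } },
  { key: 'user:2', value: { name: 'Grace' } },
  { key: 'user:1', value: { name: 'Ada L.' } }, // update
  { key: 'user:2', value: null }, // delete
];
const view = buildView(topic);
console.log(view.size, view.get('user:1').name); // prints: 1 Ada L.
```

The sketch sidesteps both challenges listed above: an array has a known end while a live topic does not, and the `Map` holds every key, which is exactly what can exhaust a container’s memory.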
The core service publishes change events to a pub-sub bus which the edge service listens to (EDA) and then performs fetches to the core service to get the full data element.
This architecture is similar to the first, except that the edge service receives an event whenever a data element changes and then performs a REST fetch (as in the first architecture) to retrieve it. It combines the drawbacks of the first and second architectures but spares the edge service from loading the whole topic at boot time. It works best when the edge service keeps a persistent view of the data, such as in a database.
The core service publishes data to a central cache and the edge service performs fetches for every object it needs in real-time
With this last architecture, the core service pushes all data to a central KV store like ElastiCache, and the edge service requests data as it needs it for operations.
- The central store becomes a single point of failure (SPOF): If the central store is ever delayed or down, all edge consumers instantly stop functioning. When a large number of containers depend on one datasource, any issue there can have a major impact.
With all of these architectures, we run into several general problems:
- Most of these solutions require fairly complex code: With architectures where it’s required to add more REST endpoints to the core service, that becomes more code to support and maintain. Dealing with polling and service chaining requires complicated circuit-breaker implementation for dealing with various failure scenarios. Developing the logic to detect the end-of-log for Kafka is very complex and requires synchronization between the core and edge services.
- Containers don’t store data locally: With many of these architectures, the edge containers don’t cache data. This means that any outage of the upstream systems causes immediate degradation.
- Inconsistency: Because of the pros and cons of all the architectures, we’ve seen multiple teams implement solutions which are tailored for their specific need. This becomes a maintenance problem when the core and edge services are owned by different teams. It also becomes an issue when multiple edge services, likely owned by different teams, all need to get the same data from the core service.
The Distributed Cache (distCache) Solution
With distCache, our goal is for edge containers to build up an in-memory cache of all data needed from the core service and for that cache to be automatically updated any time the core service changes the data. Data needed by the edge containers is stored in a central cache but because of edge container caching, if the central cache is unavailable, the containers can continue to function. We’ve built in the flexibility to create new projections of the core data in the central cache depending on the consumption needs (generally built around building a unique KV set).
We’ve written a distCache library to abstract all set and get operations with built-in caching behind the scenes:
await dcPub.set('thisKey1', 'this value 1');
When the edge service needs data for a key from the core service, the distCache library first checks its local in-memory cache for that key and, on a miss, fetches it from a central ElastiCache (EC) Redis KV store. The KV sets stored in EC are built around the lookup needs of the edge containers. The distCache library periodically scans its in-memory cache and evicts the oldest keys that violate any of the cache limits (cacheMaxTTL, cacheMaxBytes, cacheMaxItems); this lets the edge service specify what is appropriate to keep cached. When we configure the library’s connection to EC, we intend for it to target the read-replicas to reduce the risk of downtime.
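The lookup and eviction behavior described above might look roughly like the following. This is a minimal sketch and not the distCache implementation: the class name, the `fetchFromCentral` callback (standing in for a read from the central KV store), and the `evict` method are hypothetical. Only the `cacheMaxTTL` and `cacheMaxItems` names come from the text, their units are assumed to be milliseconds and item count, and `cacheMaxBytes` is left out for brevity.

```javascript
// Hypothetical sketch of a read-through cache; not the actual distCache code.
// `fetchFromCentral` stands in for a lookup against the central KV store.
class LocalReadThroughCache {
  constructor({ fetchFromCentral, cacheMaxTTL, cacheMaxItems, now = Date.now }) {
    this.fetchFromCentral = fetchFromCentral;
    this.cacheMaxTTL = cacheMaxTTL; // assumed to be milliseconds
    this.cacheMaxItems = cacheMaxItems;
    this.now = now;
    this.items = new Map(); // insertion order doubles as age order
  }

  async get(key) {
    const hit = this.items.get(key);
    if (hit) return hit.value; // local hit: no network call
    const value = await this.fetchFromCentral(key);
    if (value !== undefined) this.items.set(key, { value, storedAt: this.now() });
    return value;
  }

  // Intended to run periodically (for example from setInterval).
  evict() {
    for (const [key, item] of this.items) {
      const expired = this.now() - item.storedAt >= this.cacheMaxTTL;
      const overCount = this.items.size > this.cacheMaxItems;
      if (!expired && !overCount) break; // all remaining entries are newer
      this.items.delete(key);
    }
  }
}

async function demo() {
  let clock = 0;
  let centralReads = 0;
  const central = new Map([['thisKey1', 'this value 1']]);
  const cache = new LocalReadThroughCache({
    fetchFromCentral: async (key) => { centralReads += 1; return central.get(key); },
    cacheMaxTTL: 1000,
    cacheMaxItems: 100,
    now: () => clock,
  });
  await cache.get('thisKey1'); // miss: goes to the central store
  await cache.get('thisKey1'); // hit: served from memory
  clock = 1500;
  cache.evict(); // the entry is now older than cacheMaxTTL
  await cache.get('thisKey1'); // miss again
  return centralReads;
}
demo().then((reads) => console.log(reads)); // prints: 2
```

Three lookups cost only two reads against the central store; the second read happens only because the entry aged out.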
Data is written into EC by cache producer services. These services consume from compacted Kafka topics specific to the model in the core service. The Kafka topics will generally have all columns from a table but only certain columns will be interesting to specific edge consumer containers. This allows us to generate KV namespaces that match the query needs of the edge service. When we have a new edge consumer data need, we can create a new cache producer to generate a new set of data in the KV. Because all data for the full model has been stored in the Kafka compacted topic, the new cache producer can read from the whole topic and publish all the data to the EC without the core service needing to retransmit all the data. This helps to decouple the edge consumption needs from the core’s production of data.
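To illustrate the cache producer idea, the sketch below projects full-row records into namespaced KV entries that hold only the columns a consumer needs. Everything here is an assumption made for the example: the `createCacheProducer` name, the `namespace:key` key format, the `{ key, value }` record shape, and the use of a `Map` in place of EC and an array in place of the Kafka topic.

```javascript
// Hypothetical sketch of a cache producer; names and record shape are assumptions.
// `kv` stands in for the central KV store (a Map here instead of Redis).
function createCacheProducer({ namespace, columns, kv }) {
  return function handle(record) {
    const kvKey = `${namespace}:${record.key}`;
    if (record.value === null) {
      kv.delete(kvKey); // tombstone: remove the projected entry
      return;
    }
    // Keep only the columns this group of consumers cares about.
    const projected = {};
    for (const column of columns) projected[column] = record.value[column];
    kv.set(kvKey, JSON.stringify(projected));
  };
}

const userTopic = [
  { key: '1', value: { name: 'Ada', email: 'ada@example.com', role: 'admin' } },
  { key: '2', value: { name: 'Grace', email: 'grace@example.com', role: 'viewer' } },
  { key: '2', value: null },
];
const store = new Map();
userTopic.forEach(createCacheProducer({ namespace: 'user-label', columns: ['name'], kv: store }));
// A producer added later replays the same topic to build a different projection.
userTopic.forEach(createCacheProducer({ namespace: 'user-authz', columns: ['role'], kv: store }));
console.log(store.get('user-label:1'), store.get('user-authz:1'));
// prints: {"name":"Ada"} {"role":"admin"}
```

The second producer needs nothing from the core service: it builds its projection entirely from records already on the topic.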
From the core service’s perspective, it just needs to write all change events to the Kafka compacted topic (including one initial backload the first time it begins publishing). The Kafka topics are set up with only a single partition so we can guarantee FIFO ordering of all events. This limits us to a single active consumer per consumer group (CG), but the volume of these topics is generally very low. Kafka only guarantees ordering within a partition, so spreading the data over multiple partitions could cause problems wherever we need strict FIFO ordering. Publishing to Kafka topics has the added benefit of giving any service that is only interested in the core service’s event notification stream a place to receive those events.
Availability, Scalability, and Performance
This implementation needs to provide high availability, scalability, and performance for our platform:
- Availability: Because almost all of our containers will rely on this infrastructure for multiple data elements, we need to ensure we can handle multiple types of failure and maintenance without disruption:
  - If core producers are down, new data is simply not written to Kafka until the containers recover.
  - If Kafka is down, the cache producer containers remain idle until it recovers and then write any changes from the Kafka topic to EC.
  - If EC is down, the edge containers will continue to function either from the EC read-replica or from their own local cache. Upon EC recovery, change notifications will be sent to the edge containers using Redis pubsub.
  - If necessary, we can shard or replicate our KV data onto multiple EC clusters.
- Scalability: Again because of the number of containers we expect to be connected to this infrastructure, we need to ensure it can handle the high volume:
  - Containers cache data locally in memory and only fetch from EC when cached data is missing or its TTL has expired.
  - EC can scale with up to 5 read-replicas on sufficiently large EC instances. Due to the speed of Redis, it would take a considerable number of containers in a thundering-herd, all trying to pre-cache all data, to overwhelm properly scaled EC clusters.
  - Kafka has been proven to scale out very widely, but with the expected data volume that shouldn’t be needed.
- Performance: Because the data is cached locally in the memory of the operating container, fetch times are in the nanosecond-to-microsecond range. External fetches take anywhere from milliseconds to hundreds of milliseconds. With high volume and multiple potential lookups, these latencies can quickly add up.
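The EC outage behavior in the availability list can be sketched as follows. This is an illustration under stated assumptions rather than the library’s actual logic: `createResilientCache`, `fetchFromCentral`, and `markChanged` are hypothetical names, and `markChanged` stands in for whatever handler runs when a change notification arrives over Redis pubsub.

```javascript
// Hypothetical sketch: keep serving local data during a central-store outage.
// `fetchFromCentral` and `markChanged` are illustrative names, not a real API.
function createResilientCache(fetchFromCentral) {
  const local = new Map(); // key -> { value, stale }

  return {
    async get(key) {
      const hit = local.get(key);
      if (hit && !hit.stale) return hit.value; // no network call needed
      try {
        const value = await fetchFromCentral(key);
        local.set(key, { value, stale: false });
        return value;
      } catch (err) {
        // Central store unreachable: fall back to the last known value, if any.
        if (hit) return hit.value;
        throw err;
      }
    },
    // Called when a change notification arrives (for example over pub/sub).
    markChanged(key) {
      const hit = local.get(key);
      if (hit) hit.stale = true;
    },
  };
}

async function outageDemo() {
  const central = new Map([['user:1', 'v1']]);
  let centralUp = true;
  const cache = createResilientCache(async (key) => {
    if (!centralUp) throw new Error('central store unavailable');
    return central.get(key);
  });
  const seen = [];
  seen.push(await cache.get('user:1')); // fetched from central and cached
  centralUp = false;
  seen.push(await cache.get('user:1')); // outage: served from the local cache
  centralUp = true;
  central.set('user:1', 'v2'); // the change lands after recovery
  cache.markChanged('user:1'); // notification tells the container to refresh
  seen.push(await cache.get('user:1')); // refreshed from central
  return seen;
}
outageDemo().then((seen) => console.log(seen.join(','))); // prints: v1,v1,v2
```

Marking an entry stale instead of deleting it is a deliberate choice in this sketch: if the refresh fails, the container still has the last known value to serve.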
Redox has been looking at how to solve this problem for quite some time and we’re beginning to replace our existing blend of solutions with the distCache library. We have several models in our core monolith which are needed by all the microservices which surround it. All events from these models are being published to the compacted topics and we’ve begun building the cache producers and moving the consumers over to using the library. We are using a shared MSK cluster but a dedicated EC cluster for the data persistence.