Engineering

Redox and Amazon RDS: The Art of a Non-Event

Posted August 20, 2017
By Dharma Indurthy

At Redox, we use AWS for all our infrastructure needs, and logging into the AWS console often reminds me of walking into a toy store as a kid—there are so many cool things to play with. Unfortunately, not all services make sense for Redox. It’s unclear to me, for example, how we might leverage Alexa, although Redox has a lot of visionaries, so I can’t promise that we won’t use it one day.

Another limiting factor is HIPAA compliance. Not all Amazon services are compliant, so some toys are off limits for Redox’s core needs. For example, we’d love to integrate AWS Lambda, Amazon’s solution for serverless architecture, into more of our core services, but for compliance reasons, we cannot*. Fortunately, Amazon has worked to expand what services qualify as HIPAA compliant, so things are always changing, but in the meantime, some toys are tantalizingly off the table. (*Updated—as of 8/31, AWS Lambda is HIPAA Compliant).

On November 29, 2016, at ReInvent—Amazon’s annual AWS conference—Amazon announced Postgresql RDS as HIPAA compliant. RDS is Amazon’s managed database service, and until then, it had been out of reach for us (we use postgres as our store for persistent data). This announcement at ReInvent was one of the most exciting reveals for us, and we quickly started planning a migration.

It’s a recurring theme in the projects I’m part of that the way forward requires a dramatic change that hopefully no one will notice. At Redox, at least as far as infrastructure goes, we aim to make big events for us non-events for our customers, and that goal held true here: migrate our core database to a completely new system and do so in a way that no one would notice.


Why RDS?

Self-managing a database offers a lot of challenges, especially if you’re intent on building a highly-available service. Of course, there is an upfront cost in architecting and deploying a reliable database system. We had already succeeded in that endeavor—we were using a highly-available configuration of a master and slave that already met HIPAA requirements. We had built orchestration for automated failover. After all that, one might ask, why migrate?

Well, any self-managed service takes upkeep. There are series of tasks like OS patching, minor version upgrades, security monitoring, etc., that occupies time and effort. We also used a configuration in which our failover database also served as a read replica. Because this one appliance served multiple duties, some Redox services could be impacted by our failover/replica being impaired. Of course, we could make this better, but deploying a third replica instance and dividing the roles was a project. RDS has this problem solved out of the box.

We also favored Amazon’s failover solution. Orchestrating a reliable and effective failover is challenging, and getting it wrong could result in an extended interruption to consumers of your service or something worse like data corruption or a split brain. In architecting our own self-managed failover, we found it very challenging to succeed both at recovering fast and with fidelity, and often we pursued the latter at some cost to the former.

All told, managing our database cluster occupied space on our project list that cost us time, resources and the latitude to invest more in our core services.

Embarking

We began with research. AWS offers comprehensive documentation on how RDS works and how to architect it. However, their documentation on how to migrate was wanting. Naively, before I read much, I imagined we could use postgres native replication to build out the RDS cluster, and then during our next scheduled maintenance window, promote the cluster and point the application to it. We’d performed a handful of database migrations like this for our self-managed databases. However, this approach was notably absent from AWS documentation. It soon became apparent that postgresql RDS instances could not be used as replicas for a self-managed database, nor can you create self-managed replicas of RDS instances. This is due to security reasons for postgres specifically, whereas these options are available for MySQL.

Fortunately, AWS offers its Data Migration Service (DMS) which supports replication from a postgres source to a postgres target. Reading over the prerequisites and limitations, I decided that that DMS could work for us except for one big problem: we were using Postgresql 9.3 and the lowest version DMS supports as a source is 9.4. So this meant that we would need to upgrade our database first. This effectively lengthened our project to migrate from a one-quarter project to a two-quarter project due to our scheduled maintenance limitations.

Additionally, we found few resources that described in detail the maneuver we were attempting.  Some leveraged external tools like Bucardo to execute the change. Some were detailed but failed efforts. Clearly, this wasn’t going to be as easy as I thought.


Upgrade Postgresql

Upgrading postgres incurred its own set of challenges. Postgresql replication across different major releases is not possible, so we effectively had two non-trivial database migrations to perform: 1) to an upgraded Postgres self-managed cluster and 2) to Postgres RDS. Strictly speaking, we could have performed an upgrade in place instead of a migration. However, at the time, the Redox database had grown to several hundred gigabytes, and we were constrained by a maximum scheduled outage of 2 hours. Upgrading to another major release runs conversions on the postgres data filesystems, which in our case took several hours, so we pursued a migration plan.

So, how does one perform a database migration when there is no native replication to rely on?  It became pretty clear that we’d need to perform some kind of a base copy to an upgraded postgres instance/cluster and then copy new and updated data into the replica to get it up to date before we cut over.  Fortunately, postgres offers some tools to perform a dump of a database and a restore which can work across versions. You can even do this with no intermediate flat file by piping the output of the former to the latter. With some experimenting, we settled on using this:

 

pg_dump -h [SOURCE IP/DNS] -U [SOURCE DB USER] -Fc [SOURCE DB NAME] | 
sudo -u [TARGET DB USER] pg_restore -v -h [TARGET DB IP/DNS] --dbname=
[TARGET DB NAME]

 

We found it necessary to use the custom pg_dump format, i.e. the -Fc argument, to successfully copy across major releases.

So, we had a plan: perform the pg_dump – pg_restore to produce an upgraded database, and then during the downtime, 1) perform the incremental copy to get it up to date, and 2) repoint the application. We knew the incremental copy would be our most vulnerable piece since it was something we would build ourselves.

We chose a target version of 9.5, which was nearly the latest version at the time. In particular, this version enabled the ON CONFLICT clause for INSERTs, which essentially functions as an UPSERT action. This turned out to be crucial.

Our postgres database is composed of a set of mostly static data like organization settings and source and destination settings, and dynamic data like messages and transmissions.  The former accounts for the majority of our tables, but the latter accounts for the majority of the data on disk. It’s quite likely that data in your relational database would also conform to these general categories. For most of our data, both static and dynamic, we had reliable timestamp fields that indicated when the corresponding record was created or last updated. For most of our tables, especially our largest tables, one or both fields were indexed. This made it possible for us to programmatically copy incremental changes.

We knew that performing an incremental copy might result in copying new rows and updated rows. Deciding whether a row was new or updated was a non-trivial problem. The UPSERT function saved us. We could do something like this for a given table in bash:

upstatement="on conflict ($PRIMARYKEY) do update set "

for colind in "${!COLUMNS[@]}"; do

column="${COLUMNS[colind]}"

upstatement="$upstatement\"$column\"=excluded.\"$column\""

if [[ "$(($colind + 1))" -lt "${#COLUMNS[@]}" ]]; then

upstatement="$upstatement, "

fi

done

echo "insert into public.\"$table\" (select * from \"temp$table\") 
$upstatement" | tee /dev/tty | sudo -u postgres psql $TARGETCONNECTION

This effectively meant that if we encountered a collision in primary key during an INSERT (that is, we were inserting a primary key that already existed), we would apply the upstatement, which would would copy values from the insert record into the record we collided with. The tee allowed the UPSERT to log to STDIO along with piping to the target.


Testing

This was not a conceptually complex maneuver, but databases are inherently complex creatures and there are always gotchas. Also, we needed timings to be assured so that we could complete the maneuver within our allotted outage time. Fortunately, we have a test environment that is production-like both in terms of the infrastructure composition and inbound and outbound traffic. We use a test-harness that simulates our customer traffic with a dial we can use for scaling up or down (to be detailed in a forthcoming blog post). The non-prod database itself is composed of a similar quantity of data as production, so we had something realistic to work with. I largely credit the success of this whole endeavor to the faithfulness of our test environment.

The Gotchas

One gotcha we discovered was copying records did not update the next id for serial primary keys. So creating a new message or transmission on the upgraded database once we cut over to it would assign an id that collided with an existing message. We had to update tables with serial primary keys as a result using something like this in bash:

if [[ "$PRIMARYKEYTYPE" == "integer" ]]; then

echo "SELECT setval(pg_get_serial_sequence('$TABLE', '$PRIMARYKEY'), 
coalesce(max($PRIMARYKEY'),0) + 1, false) FROM public.\"$TABLE\"" | 
tee /dev/tty | sudo -u postgres psql $TARGETCONNECTION

fi

The above command applies to a given $TABLE that exists on the target database public schema.  For each table with an integer ID, we set the serial sequence number for the primary key to the maximum ID in the table plus 1. One assumption here is that a primary key in our database is serial if and only if it is of an integer type. This was true of our database at the time of our testing.

Another gotcha was that not every table had the updated field indexed, so the incremental copy took prohibitively long for our large tables. Fortunately, we could rely on the created field, but this increased the likelihood that the copy would encounter collisions. Not a real problem, since we leveraged the UPSERT. Once we were using the right fields, the copy succeeded within 20 minutes!

Finally, during testing, we hit errors when application database migrations had changed data structures on our primary database, like adding a column to an existing table. The database was a moving target, not just in terms of how much content we had but also in structure. Production activity of course added to it, and developers routinely added columns or tables. Freezing changes to database structures was considered, but a freeze would delay other projects, and making the freeze both narrow yet effective was challenging. Instead, we used a strategy to synchronize migrations between the current and upgraded cluster before performing any and every incremental copy. This succeeded in our testing and gave us confidence in performing the maneuver without restricting development.

Because this was so foundational a change, we formalized and ran through a series of regression tests to ensure the engine performed correctly after the upgrade maneuver. We also needed to validate the 9.5 database performance, which we determined to be equal to or better than 9.3.

We successfully executed this maneuver on April 8th, 2017 with minimal issues. A final gotcha was that parallel to this project, application changes had created new tables with fields that used the bigint data type for serial primary keys. I mentioned an assumption above that serial ids would only be integer, so we did not update the next id for these fields. A freeze would have avoided this issue. We discovered this only after the migration to production, but we quickly applied a database migration to perform the id update. Production operation of the engine was unaffected due to the quick response.


Migrating to RDS

Given we had an unconventional upgrade to perform, one might ask whether we should have used the same approach to migrate to RDS directly. We could have just created an 9.5 RDS cluster and applied the same pg_dump/pg_restore + custom incremental copy to that. We chose not to do this mainly because we wanted to take the time to experiment with RDS. The RDS failover mechanism was new to us, as was using its database recovery tools. We wanted to spend the time to get to know RDS before we cut over to it. Also, given that DMS was Amazon’s recommended migration tool, we tacitly relied on that guidance.

During the aforementioned research phase, we had decided that DMS could work for us. At the time, there was no cloudformation support, so this was purely a manual effort (this has now changed). We setup DMS according to the documentation and created a replica using our test environment primary database. We immediately discovered a issue: our JSON fields were getting truncated. Obviously, that was no good for us. We reached out to Amazon support to understand what was going on, and received this reply:

Thank you for the detailed description about the case, your understanding is correct on JSON as a data type in general is supported and JSONB is not supported. Unfortunately, any data which is bigger than 512 is getting truncated during the migration. It’s a known bug and I am already working with the DMS service team to find a workaround and fix for the issue and service team is working to get a workaround with high priority. Unfortunately, I don’t have an ETA at this point in time, on when the issue could be fixed.

Amazon is notorious for never providing timelines—kudos to you if you’ve ever experienced something different. We did not obtain an acceptable workaround for this issue.

Fortunately, our previous migration to an upgraded Postgres cluster provided us the tools to complete a maneuver to RDS ourselves. We ended up using the same methodology to perform a dump/restore and incremental copy to cutover. The timings for this operation were very similar to our results for the postgres upgrade—unsurprising since we sized our RDS cluster identically to our self-managed database cluster.

So, the majority of our time was spent getting used to RDS. We performed failovers by performing reboots, which worked out nicely. The application failover was almost seamless, with no evidence of data loss to a split brain. A restore of a production-sized database was completed within 30 minutes, which was a dramatic improvement over the capabilities of our self-managed database. We also operated for a few weeks in non-production to see if any instabilities or the maintenance window caused noticeable disruptions to our operation. We did not observe any interruptions during this time.

RDS did pose some changes on the way Redox operates—we lost some metrics we relied on, like load average, so we adapted our monitoring to use CPU.  We had stored daily database backups and logs in S3, but RDS only permitted 30 days of backups and a couple weeks of logs. In order to maintain our level of archiving, which was deemed a requirement for compliance reasons, we created a backup manager, a self-managed instance which would obtain a base backup from our RDS cluster replica on a daily basis and save that to S3, along with copying over any new postgres logs to S3. This way, we lost nothing in the transition except point-in-time restore to database images that were more than 30 days old.

We successfully cutover to RDS on July 8th, 2017. Along with some other maintenance we applied during the outage, we completed 15 minutes ahead of schedule and there were no notable issues. Over the following days, we did experience some transient DNS resolution failures when resolving the RDS endpoint. We rely on Amazon DNS, so the cause was opaque to us. This caused a few database disconnections, but the app recovered without interruption.

 

Takeaways

We succeeded in this project for three reasons:

  • We researched thoroughly
  • We had previously done the work to build a production like test-environment
  • We spent the requisite time to test

Given the foundational nature of a database, a migration project like this is one that should begin with anxiety but end in confidence. We were only able to obtain that confidence as a result of testing and research. So far, RDS has afforded us exemplary performance and incurred no interruption of Redox services.

 

And I’m proud to say that so far, no one can tell the difference.