Weather from WeatherOps

Our Recent Migration to the Cloud Leads to Several Questions

Written by Daphne Thompson | Feb 23, 2017 3:58:12 PM

WDT recently migrated its Weather Mass Notification System (WMNS) from a traditional datacenter to the Amazon Web Services (AWS) cloud. One of the biggest challenges in this migration was moving nearly 300GB of databases. Since this was our first ‘live’ migration to AWS, several questions arose along the way.

How do we split individual databases off a server hosting many, and keep them in sync?

In a traditional datacenter configuration, one physical database server might host multiple databases for multiple projects. In the microservice paradigm of the cloud, however, it is more common to split projects apart, dedicating resources to a single project. This required us to move the WMNS databases from two physical servers onto three managed instances.

We investigated Amazon’s Database Migration Service (DMS) but found it far too slow to handle the throughput the WMNS databases see, even on a ‘slow’ day. Instead, we settled on creating an Amazon Elastic Compute Cloud (EC2) MySQL intermediate slave to replicate only the desired databases from each physical server, then replicating from the intermediate slave to Amazon’s Relational Database Service (RDS). In this setup, any write to the datacenter servers was replicated to RDS by way of the EC2 intermediate. We considered a replication lag to RDS of less than a few minutes sufficient to minimize downtime while we switched from the datacenter to the AWS system.
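This kind of selective replication can be sketched as a my.cnf fragment on the intermediate slave. The database names and server ID below are hypothetical placeholders, not our actual configuration:

```ini
# my.cnf on the EC2 intermediate slave (hypothetical names and IDs)
[mysqld]
server-id         = 101          # unique ID within the replication topology
log-bin           = mysql-bin    # the intermediate must write its own binlogs
log-slave-updates = 1            # re-log events received from the datacenter
                                 # master, so RDS/Aurora can replicate onward
replicate-do-db   = wmns_alerts  # replicate only the WMNS databases,
replicate-do-db   = wmns_users   # ignoring other projects on the same server
```

With filters like these in place, the intermediate is pointed at the datacenter master with CHANGE MASTER TO and START SLAVE, and RDS in turn replicates from the intermediate (on RDS this is configured through Amazon's mysql.rds_set_external_master stored procedure).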

To mimic the source database servers, we ran MySQL for the intermediate on an i2.2xlarge EC2 instance. This was the smallest instance size we found that could reliably keep live replication running.

What kind of instance size can keep up with the traffic?

We chose Amazon RDS to keep management of our databases simple. Preliminary testing of RDS MySQL against our Staging environment looked sufficient, but we learned it was not fast enough for our Production dataset. RDS MySQL could not keep up with replication, even with provisioned Input/Output Operations Per Second (IOPS) and large instance types – both of which come at a high cost. Faced with the risk of not being able to use RDS at all, we tried Aurora, Amazon’s alternative to proprietary high-speed databases.

The initial replication to Aurora was promising. The transfer of over 300GB of data completed in under two days, and the cluster stayed in sync with end-to-end replication lag of less than one second. To size the Aurora instances properly, we tried to mimic our datacenter servers’ capacity and settled on r3.2xlarge instances (one writer, one replica). Once the transfer (from datacenter to EC2 intermediate to Aurora cluster) was complete and all databases were up to date, we observed that the WMNS databases see nearly 2,000 select queries per second in total during the winter.
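One way to watch the end-to-end lag during a migration like this is to poll each replica in the chain; the standard MySQL status output (which MySQL-compatible RDS and Aurora replicas also expose) reports lag per hop:

```sql
-- Run on the EC2 intermediate and on the replicating Aurora cluster.
-- Seconds_Behind_Master approximates the replication lag at that hop;
-- summing the hops gives the end-to-end lag from the datacenter master.
SHOW SLAVE STATUS\G
-- Healthy output includes:
--   Slave_IO_Running:      Yes
--   Slave_SQL_Running:     Yes
--   Seconds_Behind_Master: 0
```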


How will RDS manage during severe weather?

We made the WMNS cutover to AWS in late January, when weather patterns were calm and traffic was relatively slow. We noticed, however, that the database engine would occasionally crash, bringing WMNS down for a minute while Aurora failed over to a new instance. This was not the test of automatic failover we had in mind. Thankfully, our database administrators and AWS Support identified the queries responsible, and the database instances were patched to prevent the crash from happening again.

A frontal passage through Oklahoma City on the night of February 19 gave us our first good test of database throughput.

While this line of storms rolled through the Oklahoma City metro, our databases were serving nearly 4,000 selects per second. During this time, our writer instance (blue line) saw average CPU load, but the reader instance (purple line) ran at nearly 100% CPU utilization, showing that we should have added an extra read replica to help balance the read load.

Once the lightning storms left the OKC area around 2:00 AM, database CPU load settled back down to more normal levels. During the event, WMNS delivered over 1.5 million push messages, 6,000 emails, and 5,000 text messages. WDT is always striving to do the best job we can for you. Have questions about how we can help? Contact us today for more information.