One benefit of using Amazon RDS is that you gain access to a large array of performance statistics provided by the Amazon CloudWatch service without having to do any extra work. We were eager to start viewing these statistics to see how the new platform was doing.
Checking out our stats
The CloudWatch command line tools are a breeze to download and install locally–you don’t have to run them on your EC2 instances. In just a few minutes we were able to view the minimum, maximum and average CPU utilization, available storage space, and number of database connections for any period over any time frame since the launch of our RDS database instance.
By running the following command, we immediately noticed that our CPU utilization was maxing out in the mid-to-high 90s every hour:
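The original command isn't reproduced here, but it would have looked roughly like this sketch using the legacy `mon-get-stats` tool; `mydb` and the time range are placeholders for our actual instance identifier and window:

```shell
# Sketch only: "mydb" and the times are placeholders, not our real values.
# Pulls the per-hour maximum CPU utilization for one RDS instance.
mon-get-stats CPUUtilization \
  --namespace "AWS/RDS" \
  --dimensions "DBInstanceIdentifier=mydb" \
  --statistics Maximum \
  --period 3600 \
  --start-time 2009-11-01T00:00:00 \
  --end-time 2009-11-02T00:00:00
```

With `--period 3600`, each datapoint returned is the maximum over one hour, which is how the hourly spikes showed up.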
The numbers seemed too high. Something was amiss.
Tracking down the cause
We knew we had several options for investigating the problem, but since we were new to RDS we decided to use this as an opportunity to improve our familiarity with some of the core features of the system. We developed the following plan of attack:
- Launch a clone instance of our production RDS database
- Turn on slow query tracking for the clone database
- Launch a new, standalone EC2 instance running the same code as our website, but pointed at our clone database
- Exercise the standalone instance and then check the MySQL slow query logs to see if we had any bad queries
This is why we love AWS: with a few simple steps we can produce a brand new environment that looks and acts just like our production environment at all levels, with the same data. If a query is only taking a long time when there is a lot of data, we’ll find it. And we won’t be tampering with our real site in the process–it will happily hum away, utterly oblivious to our performance experiments.
Attack of the clones
Launching the clone database instance was effortless. As with our initial migration to RDS, we decided to use the ruby client:
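In essence, cloning boils down to snapshotting production and restoring a new instance from that snapshot. As a sketch using the RDS command line tools (the Ruby client exposes the same CreateDBSnapshot and RestoreDBInstanceFromDBSnapshot operations), with placeholder names and flag spellings that may vary by toolkit version:

```shell
# Placeholder names throughout ("production-db", "prod-snap", "clone-db").
# Snapshot the production instance, then restore a new instance from it.
rds-create-db-snapshot production-db \
  --db-snapshot-identifier prod-snap
rds-restore-db-instance-from-db-snapshot clone-db \
  --db-snapshot-identifier prod-snap
```

The restored clone comes up with the same data as production at the moment of the snapshot, which is exactly what we wanted for realistic query timings.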
Turning on slow query tracking took a little more effort. This required creating a new DB parameter group for our clone database instance. Our production instance uses the default parameter group, so we didn’t need to copy over any custom settings. We just set up our new group to flag any queries executing for more than 1 second on any table with more than 100 rows:
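In sketch form, again with the RDS command line tools, placeholder names, and approximated flag syntax (`long_query_time` is in seconds; `min_examined_row_limit` is the MySQL knob closest to "more than 100 rows"):

```shell
# Placeholder group name; the parameter group family depends on your
# MySQL version.
rds-create-db-parameter-group clone-params \
  --db-parameter-group-family mysql5.1 \
  --description "slow query logging for the clone"

# Flag queries running longer than 1 second that examine 100+ rows.
rds-modify-db-parameter-group clone-params \
  --parameters "name=slow_query_log, value=ON, method=immediate" \
  --parameters "name=long_query_time, value=1, method=immediate" \
  --parameters "name=min_examined_row_limit, value=100, method=immediate"
```

Note that RDS for MySQL writes the slow query log to a table (`mysql.slow_log`) rather than to a file, which is why it can be checked with a plain SQL query.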
With slow query tracking turned on, we launched and configured our EC2 instance and manually walked through a few user workflows to exercise the most commonly used parts of our site. Then we checked the slow query log on our clone database instance:
select * from mysql.slow_log;
Bingo. The result set included one entry.
We fired up a mysql command line client and ran EXPLAIN on the bad query, which showed that we were doing a full table scan on a table with several hundred thousand rows.
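The real table and column aren't named above, so here is a hypothetical reconstruction of the diagnosis and the fix; `events` and `account_id` are invented for illustration, and `$CLONE_ENDPOINT` stands in for the clone instance's hostname:

```shell
# Hypothetical names: "events" and "account_id" stand in for the real
# table and column. EXPLAIN showed type: ALL (a full table scan) across
# several hundred thousand rows.
mysql -h "$CLONE_ENDPOINT" -u admin -p -e \
  "EXPLAIN SELECT * FROM events WHERE account_id = 42\G"

# The fix: add an index on the column the query filters by.
mysql -h "$CLONE_ENDPOINT" -u admin -p -e \
  "ALTER TABLE events ADD INDEX index_events_on_account_id (account_id);"
```

After the index exists, the same EXPLAIN should show an index lookup (`type: ref`) touching only the matching rows instead of the whole table.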
After adding the appropriate index, we kept a close eye on our maximum hourly CPU utilization in our production environment. It immediately dropped to the low 80s after the fix and has been holding steady at or below that level since.
Catching index problems in production is definitely not our first choice. We’re reviewing our approach for identifying needed indexes at development time and how we can improve on it. If you have any suggestions, let us know!
—Written by Jeff