Save Time and Money on AWS by Running Performance Tests

By
Brian Tarbox
January 27, 2023

I recently asked a fellow AWS Hero why anyone would use EC2 instances when there were so many alternatives such as Lambda and containers. He told me that he usually assumes going the EC2 route is cheaper. I’ve been increasingly suspicious that far too many people take the “default” (dare I say legacy?) path when building up their infrastructure thereby sending themselves down a more expensive (and often architecturally inferior) path.

Alternatives aside, the EC2 configurations I'm finding in the field are frequently the “default” m2.2xlarge instances. As of this writing, EC2 supports over 400 different instance types, with vCPU counts ranging from one to 256, disk size ranging from 75 to 30,000 GiB, memory ranging from 0.5 to 1536 GiB and prices ranging from $0.002 to $32.7 per hour. The "default" EC2 instance is often the expedient path to get something up and running, but there is probably a better option for your wallet. You will probably be wasting time and money until you find a more optimized configuration for your workload.

The type of workload might be the biggest influence on the cost-effectiveness of one design over another. Event-driven, stateless compute jobs - like those found in data pipelines - are often good candidates for Lambdas. Self-hosted relational databases are long-lived and stateful. They should probably be hosted on several EC2 instances managed as pets (versus cattle). In between these extremes are many other workload profiles that can be served by services like ECS/EKS (with or without Fargate), Batch, Autoscaling Groups and Step Functions.

Prompt: A 1945 photo of a developer who is stressed out because they don’t know which AWS service to use.

Which AWS Service for my Workload?

One key to selecting an AWS service is understanding the data independence of your workload. In a traditional database all the data is interrelated, in the sense that a query can depend on the entire set of previous transactions on the data. Processing an IoT sensor value, by contrast, might not depend on any other data. You would not run a database on a spot instance, but you might use spots (or Lambdas) for IoT fleet processing.

Once the model of computation is understood, it’s helpful to understand the limiting factors. The Phoenix Project talks about limiting factors in organizations. It makes the point that optimizing anything other than the limiting factor is a waste of time. If you speed up any process other than the slowest process in the system, the speed of the total system will not improve.

Fortunately, evaluating the bottleneck between CPU, memory or I/O is much simpler than navigating human dynamics. If you are supporting a Java program that requires 128 megabytes of memory and your instance type has 10 gigabytes of memory, you are wasting resources and money. If that same program never uses more than 500 megabytes of disk and it has a 100 gigabyte EBS disk, you are also wasting money. The importance of this is highlighted by the items in a Trusted Advisor report. Many of the recommendations involve under-utilization of resources (it’s worth noting that memory utilization is not one of the items covered by Trusted Advisor; for that you need to add custom metrics to your instance/application).
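To put numbers on the waste in that example, here is a minimal sketch that compares what the program uses with what is provisioned. The function name and dictionary layout are illustrative assumptions, and 1 gigabyte is taken as 1,000 megabytes for simplicity.

```python
def utilization(used: dict, provisioned: dict) -> dict:
    """Return the fraction of each provisioned resource that is actually used."""
    return {name: used[name] / provisioned[name] for name in provisioned}


# Numbers from the example above: 128 MB of memory on a 10 GB instance,
# and 500 MB of disk on a 100 GB EBS volume (all values in megabytes).
used_mb = {"memory": 128, "disk": 500}
provisioned_mb = {"memory": 10_000, "disk": 100_000}

for name, fraction in utilization(used_mb, provisioned_mb).items():
    print(f"{name}: {fraction:.1%} used, {1 - fraction:.1%} idle")
```

In both cases roughly 99% of what is being paid for sits idle.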

Here is my hypothesis as to why people often default to choosing EC2 instances when another solution would be a better fit for their application: most teams have no idea what the limiting factor of their computation is.

To find this out, you might first look to analyze telemetry from your production instances. But, production traffic patterns might not allow you to isolate the limiting factor. Also, it's probably very complicated and risky to test different AWS services by switching them in and out of production.

This is where performance testing can help. It offers the flexibility to experiment with different instances, compute models and workload patterns - and it might not be as hard as you think to set up. Performance testing can be a useful tool to find opportunities to save time and money.

Prompt: a black and white photo of a developer who is very stressed out about choosing which aws service to use.

Performance Testing Using Spot Fleets

Creating and executing a performance test suite is not a small undertaking, but there are ways to make it surprisingly manageable. An often overlooked approach is to use Spot Fleets. Spot Fleets were designed to help avoid the issue of running a large number of Spot instances on a particular instance type and then losing them all if the auction price went up. Each spot instance type is a completely independent auction, so spikes in the price of m2.2xlarge have no effect on the price of m2.3xlarge or c2.2xlarge. So, a Spot Fleet specifies a number of different instance types that can support an application.

Credit: What is the Difference Between Spot Fleet vs Spot Instances

You can specify the instance types to use by listing actual instance types or by specifying the attributes your instances should have. However, since this experiment is based on the premise that you want to discover the levels of CPU, memory, disk, etc you should not use attribute specification. You should, however, select a diverse set of instance types: with fairly wide ranges of CPU, disk and memory options in order to see how your application behaves.
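As a rough sketch of what listing instance types explicitly might look like, the snippet below builds a Spot Fleet request with one launch specification per instance type. The helper name, the AMI ID, the role ARN and the three instance types are placeholders chosen for illustration, not values from this article. The dictionary keys follow the `SpotFleetRequestConfig` structure accepted by boto3's `request_spot_fleet`.

```python
def build_spot_fleet_config(instance_types, ami_id, fleet_role_arn,
                            max_price_per_hour, target_capacity):
    """Build a Spot Fleet request with one launch specification per instance type."""
    return {
        "IamFleetRole": fleet_role_arn,
        "AllocationStrategy": "diversified",  # spread capacity across the listed types
        "TargetCapacity": target_capacity,
        "SpotPrice": str(max_price_per_hour),  # the maximum you will pay, not the price charged
        "LaunchSpecifications": [
            {"ImageId": ami_id, "InstanceType": instance_type}
            for instance_type in instance_types
        ],
    }


# Placeholder values: substitute your own AMI, IAM role and instance types.
config = build_spot_fleet_config(
    instance_types=["c5.xlarge", "m5.xlarge", "r5.xlarge"],  # compute-, general- and memory-oriented
    ami_id="ami-EXAMPLE",
    fleet_role_arn="arn:aws:iam::123456789012:role/example-spot-fleet-role",
    max_price_per_hour=0.95,
    target_capacity=3,
)

# With boto3 installed and credentials configured, the request could be submitted with:
#   boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=config)
print([spec["InstanceType"] for spec in config["LaunchSpecifications"]])
```

Mixing compute-, general-purpose and memory-oriented types gives the experiment a spread of CPU-to-memory ratios to compare.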

The cool thing about Spot Fleets is that you can place a high dollar bid in order to increase the chances of a longer reservation. Recall that with Spots you do not pay the bid price; you pay the spot price. So, for example, if you bid $0.95/hour on an instance but the actual auction price for the instance is $0.15/hour, you only pay $0.15/hour. With a high bid value and a diverse Spot Fleet you can create a high probability that your application will not be interrupted.
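The bid-versus-price rule can be sketched in a few lines. This is a simplified model with an invented hourly price series, assuming the instance is interrupted the first time the spot price exceeds the bid. The function name is hypothetical.

```python
def run_cost(bid, hourly_spot_prices):
    """Return (hours_run, total_charged): each hour is billed at the spot price,
    and the run stops at the first hour whose spot price exceeds the bid."""
    total = 0.0
    hours_run = 0
    for price in hourly_spot_prices:
        if price > bid:
            break  # the spot price rose above the bid: the instance is interrupted
        total += price  # charged the spot price, never the bid
        hours_run += 1
    return hours_run, round(total, 4)


# Invented spot prices (USD per hour) for four consecutive hours.
prices = [0.15, 0.18, 0.40, 0.16]

print(run_cost(bid=0.95, hourly_spot_prices=prices))  # high bid: runs all four hours
print(run_cost(bid=0.20, hourly_spot_prices=prices))  # low bid: interrupted in the third hour
```

The high bid does not raise the bill for the hours that both runs share. It only keeps the instance alive through the price spike.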

Run the application for a period of time and analyze the performance of the application on the various instance types. At this point you should have the crucial information: what is your application’s limiting factor? This will let you pick an appropriate EC2 instance type that meets your needs at the minimum price. You can also use the same technique to appropriately size your containers if you choose to use ECS or EKS.
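One way to analyze the results is sketched below: record throughput and peak utilization for each instance type, read off the limiting factor, and pick the cheapest type that meets your target. The class, the field names and every number are invented for illustration. Real values would come from your own test runs.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    instance_type: str
    hourly_price: float  # USD per hour
    throughput: float    # requests per second sustained during the test
    peak_cpu: float      # peak utilization as a fraction from 0 to 1
    peak_memory: float
    peak_io: float


def limiting_factor(result):
    """Return the resource with the highest peak utilization in a test run."""
    peaks = {"cpu": result.peak_cpu, "memory": result.peak_memory, "io": result.peak_io}
    return max(peaks, key=peaks.get)


def cheapest_meeting_target(results, required_throughput):
    """Return the lowest-priced result that reaches the required throughput, or None."""
    candidates = [r for r in results if r.throughput >= required_throughput]
    return min(candidates, key=lambda r: r.hourly_price) if candidates else None


# Hypothetical results for three made-up instance types.
results = [
    RunResult("type-a", 0.10, 400, peak_cpu=0.97, peak_memory=0.30, peak_io=0.20),
    RunResult("type-b", 0.20, 820, peak_cpu=0.95, peak_memory=0.18, peak_io=0.35),
    RunResult("type-c", 0.25, 410, peak_cpu=0.96, peak_memory=0.08, peak_io=0.21),
]

for r in results:
    print(r.instance_type, "is limited by", limiting_factor(r))

best = cheapest_meeting_target(results, required_throughput=800)
print("cheapest type meeting the target:", best.instance_type)
```

In this made-up data, the extra memory on "type-c" buys almost nothing because CPU is the limiting factor, which is exactly the kind of waste the test is meant to expose.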

For Lambda workloads, there is the Lambda Power Tuning tool. This tool is a Step Function that runs a Lambda with various memory settings to determine the relationship between resources and performance.

Lambda prices are quoted in gigabyte-seconds, which means cost = SecondsRunning * MemoryAllocated (duration is actually billed in 1 ms increments). You control both vCPU and memory by selecting a memory size for your Lambda. As you increase the amount of memory the vCPU increases, as shown in this table (from this Luc van Donkersgoed article).

Increasing the memory/cpu allocated to a function will reduce its execution time, up to the limiting factor.
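The gigabyte-second formula can be worked through with a small sketch. The rate below is derived from the Lambda row of the pricing table in this article ($0.06 per hour at 1 GB); check current AWS pricing before relying on it. The durations are invented to illustrate the effect, the function name is hypothetical, and the per-request charge is ignored.

```python
PRICE_PER_GB_HOUR = 0.06  # assumption: taken from this article's pricing table
PRICE_PER_GB_SECOND = PRICE_PER_GB_HOUR / 3600


def lambda_compute_cost(duration_ms, memory_mb, invocations=1):
    """Compute cost in USD: seconds running * gigabytes allocated * price per GB-second."""
    seconds = duration_ms / 1000
    gigabytes = memory_mb / 1024
    return seconds * gigabytes * PRICE_PER_GB_SECOND * invocations


# Invented measurements: doubling memory halves the duration until some other
# limiting factor takes over, after which more memory only adds cost.
measurements = [(512, 800), (1024, 400), (2048, 380)]  # (memory in MB, duration in ms)

for memory_mb, duration_ms in measurements:
    cost = lambda_compute_cost(duration_ms, memory_mb, invocations=1_000_000)
    print(f"{memory_mb} MB, {duration_ms} ms: ${cost:.2f} per million invocations")
```

In this example the 1024 MB setting costs the same as 512 MB while finishing in half the time, and 2048 MB costs nearly twice as much for a negligible speedup.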

Notes on Pricing

EC2 pricing depends on instance type and reservation type (on demand, reserved, spot), and is billed in one-second increments. The following table shows some comparisons but can actually be misleading. The assumption of EC2 and containers is that they are generally running all the time (subject to autoscaling). So, in 24 hours they will likely be running (and billed for) 24 hours. Lambdas, on the other hand, only run as needed. So, in 24 hours a Lambda might run for only a few hours or minutes. The Lambda that drives my Alexa skill (which gets about one million invocations a year) actually only ran for a total of 45 minutes in the last three months.

| | Per Hour 1vCPU/1Gig mem | Per Hour 4vCPU/8Gig mem |
| ----------------------- | ----------------------- | -------------------- |
| Lambda | $0.06000 | $0.48 |
| Fargate | $0.04493 | $0.20 |
| Fargate Spot | $0.01348 | $0.06 |
| EC2 | $0.01160 | $0.10 |
| EC2 Spot | varies | varies |

Fargate pricing depends on the requested vCPU, OS, architecture and storage, each of which can be independently controlled. Prices are calculated per second with a one minute minimum.
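To see why the hourly comparison can mislead, the sketch below uses the 1 vCPU / 1 GB column of the table above to estimate how many hours per day a Lambda can be busy before an always-on alternative becomes cheaper. It is a simplification: it ignores Lambda request charges, storage and data transfer, and the function names are hypothetical.

```python
# Hourly prices in USD for 1 vCPU / 1 GB, copied from the table above.
HOURLY_PRICE = {
    "Lambda": 0.06,
    "Fargate": 0.04493,
    "Fargate Spot": 0.01348,
    "EC2": 0.0116,
}


def daily_cost_always_on(service):
    """Cost of keeping one instance or task of the given service running for 24 hours."""
    return HOURLY_PRICE[service] * 24


def daily_cost_lambda(busy_hours):
    """Cost of Lambda compute that is actually executing for busy_hours per day."""
    return HOURLY_PRICE["Lambda"] * busy_hours


def break_even_busy_hours(service):
    """Busy hours per day at which Lambda costs the same as the always-on service."""
    return daily_cost_always_on(service) / HOURLY_PRICE["Lambda"]


for service in ("Fargate", "Fargate Spot", "EC2"):
    hours = break_even_busy_hours(service)
    print(f"Lambda is cheaper than always-on {service} below {hours:.2f} busy hours per day")
```

A workload that executes for only minutes a day sits far below any of these break-even points, even though Lambda has the highest hourly rate in the table.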

Conclusion

If you have a significant AWS footprint, some of your components may have been launched on EC2 instances by default. It might be a good time to revisit them: understanding your application's actual resource needs will help you make better decisions about the architecture. Running your applications through a suite of performance tests might just make the task of finding the right AWS service less stressful and more joyful.

Prompt: extremely happy software developers playing with unicorns and kittens

Generated images by Midjourney
