During one of our first large-scale implementations we ran into a non-obvious networking issue involving ECS, ECR, and the NAT Gateway that led to an unnecessary spike in cost. I’ll walk through how we discovered the issue and how you can avoid making the same mistake!
Initial setup and an unexpected bill
To get started, we set up what has now become our standard container architecture in AWS.
We had our service running and development was well underway. However, we began to notice that our AWS bill was higher than we were expecting. After digging in, we found an unusually high amount of data being processed by our AWS managed NAT Gateway. Below is a snippet of our bill after the first week.
We were burning roughly $7 a day on network traffic alone, and that was just for one service! If we didn’t track down the cause of this spike, we would be spending hundreds of dollars a month on network traffic.
We needed to track down where the data egress was coming from. We leveraged CloudWatch Logs Insights against our VPC Flow Logs to determine the root cause. VPC Flow Logs capture all network traffic within our VPC, and CloudWatch Logs Insights let us run ad-hoc queries against that data set.
First, we ran a query to see which sources were sending the most data. Based on the IP range in the dstAddr field, we were able to confirm that our ECS tasks were indeed the source of the large data transfers.
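A query along these lines surfaces the top flows by volume. This is a sketch, not the exact query we ran: the `10.0.0.0/16` CIDR is a placeholder for your VPC’s private range, and the field names assume the default flow-log format delivered to CloudWatch Logs.

```
# Top flows by total bytes; dstAddr values inside the VPC's
# private range (placeholder CIDR) point at the ECS tasks
# on the receiving end of the transfer.
filter isIpv4InSubnet(dstAddr, "10.0.0.0/16")
| stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10
```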
Next we needed to determine where that data was going. We ran another query, this time checking to see if we were downloading any large data sets.
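That second query might look something like the following sketch. The 1 MB threshold is an arbitrary cutoff to filter out small flows and focus on bulk transfers.

```
# Flows larger than ~1 MB, grouped by both endpoints, to see
# which remote addresses account for the bulk of the traffic.
filter bytes > 1000000
| stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10
```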