Infrastructure cost reduction (AWS Cloud)

Overview

My role

Lead DevOps Engineer (planning, team management, and implementation)

Project description

The goal of the project was to reduce overall infrastructure costs, primarily within AWS Cloud as well as across supporting services, while minimizing any negative impact on application performance and on development and operational processes. The entire initiative was driven by a small DevOps team under my leadership.

Initial project state

The application was built using a microservice architecture and hosted on Kubernetes. In addition to the production environment, two non-production environments were maintained to ensure effective quality control. The infrastructure was fully provisioned in AWS, with the majority of costs generated by workloads running on Amazon Elastic Kubernetes Service (EKS) and Amazon Relational Database Service (RDS). Logging and monitoring were provided through the Datadog and Sentry SaaS solutions.

Cost analysis and optimization plan

The analysis covered several different aspects, as outlined below:

  • Review of development and operational processes: We verified whether all tools and services were actively used by the teams, assessed their usefulness, and evaluated utilization levels. This was achieved by creating a comprehensive inventory of services in use and conducting interviews with the development and operations teams.
  • Review of infrastructure architecture: The primary focus was on how the infrastructure supported the application, identifying duplicated components, unnecessary elements, or areas where more cost-effective alternatives could be introduced. Based on existing documentation and discussions with architects, we gained a clear understanding of the architectural requirements.
  • Utilization analysis: This involved analyzing how AWS resources were utilized, primarily using AWS utilization reports and Amazon CloudWatch metrics (a minimal example of this kind of query follows this list).
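
To give a concrete picture of the utilization analysis, the sketch below pulls two weeks of CPU utilization for a single RDS instance from Amazon CloudWatch using boto3. It is a minimal illustration rather than our actual tooling; the instance identifier and time window are placeholders.

    # Sketch: query two weeks of CPU utilization for one RDS instance.
    # The instance identifier and time window are illustrative placeholders.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=14)

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-main-db"}],
        StartTime=start,
        EndTime=end,
        Period=3600,                      # hourly data points
        Statistics=["Average", "Maximum"],
    )

    datapoints = response["Datapoints"]
    if datapoints:
        avg = sum(d["Average"] for d in datapoints) / len(datapoints)
        peak = max(d["Maximum"] for d in datapoints)
        print(f"14-day CPU utilization: average {avg:.1f}%, peak {peak:.1f}%")

Comparable queries, repeated across instances and metrics, fed the right-sizing decisions described in the next section.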

While these areas overlapped, the initial analysis resulted in a first draft of proposed changes, including their impact on costs, processes, and the estimated effort required. Based on this, priorities were defined, focusing first on "low-hanging fruit" and initiatives with the highest potential cost reduction.

Optimization initiatives and implementation

Based on the analysis, the following actions were taken:

  • One of the non-production environments was identified as legacy and rarely used for development purposes. The full application stack was not deployed there, which made its removal reasonable given the limited effort required. The task involved manually removing AWS resources (as this part of the infrastructure was not managed via Infrastructure as Code) and removing deployments for this environment from CircleCI pipelines.
  • Removal of MySQL RDS instances for legacy processes: Some legacy processes relied on over-provisioned RDS instances. Although these processes could not be fully decommissioned, they were migrated to the main production database cluster. The task involved migrating databases using AWS Database Migration Service, updating legacy application configurations, performing backups, and manually removing the unnecessary RDS instances.
  • Scaling down RDS and Redis clusters: Based on Amazon CloudWatch metrics, the size of RDS and Redis (Amazon ElastiCache) instances was reduced, and high availability was disabled for non-production environments (a sketch of the corresponding API calls follows this list).
  • Application workload optimization: All application workloads were orchestrated by Kubernetes, but in an inefficient manner. The optimization started with an analysis of pod utilization, resource requests and limits, and Horizontal Pod Autoscaler settings (see the sketch after this list). As the deployment pipelines had already been rebuilt, adjusting these parameters did not require significant effort. After this initial phase, and based on the new cluster metrics, the Kubernetes Cluster Autoscaler was introduced. The cluster was then configured with a mix of reserved and spot instances, with the understanding that non-production environments could tolerate lower stability.
  • Logging and monitoring cost reduction: Datadog was used as the primary logging and monitoring solution and provided significant convenience and functionality, but it was also a major cost driver. After initial cost-reduction attempts, a decision was made to move to a more cost-efficient, though less feature-rich, solution. Elastic Cloud was introduced as the central logging platform, with increased reliance on AWS CloudWatch for infrastructure monitoring. Additionally, tuning the Sentry configuration helped recover some of the observability capabilities that were lost after moving away from Datadog (an illustrative sampling configuration follows this list).
  • Other actions: The use of Kubernetes Jobs instead of continuously running workers further reduced resource consumption. This approach was applied both to application workloads and to selected supporting processes. Additionally, KEDA was implemented to provide more efficient and responsive scaling of workloads based on event-driven triggers (an example ScaledObject is sketched after this list).
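
The sketches below illustrate a few of these changes. For the RDS and ElastiCache right-sizing, the work came down to smaller instance classes and disabled high availability on non-production clusters; the boto3 sketch shows the shape of those calls, with identifiers and target instance types as placeholders rather than the values actually used.

    # Sketch: scale down a non-production RDS instance and Redis replication group.
    # Identifiers and target instance/node types are illustrative placeholders.
    import boto3

    rds = boto3.client("rds")
    elasticache = boto3.client("elasticache")

    # Smaller instance class, no Multi-AZ standby for a non-production database.
    rds.modify_db_instance(
        DBInstanceIdentifier="staging-db",
        DBInstanceClass="db.t3.medium",
        MultiAZ=False,
        ApplyImmediately=False,   # apply during the next maintenance window
    )

    # Smaller node type, no automatic failover for a non-production Redis cluster.
    elasticache.modify_replication_group(
        ReplicationGroupId="staging-redis",
        CacheNodeType="cache.t3.small",
        AutomaticFailoverEnabled=False,
        ApplyImmediately=False,
    )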
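
The pod-level part of the workload optimization can be approximated with the Kubernetes Python client and the metrics.k8s.io API: compare what each pod actually consumes with what it requests, then adjust requests, limits, and Horizontal Pod Autoscaler targets accordingly. The sketch below assumes a metrics server is installed; the namespace is a placeholder, and only the common millicore and whole-core request formats are handled.

    # Sketch: compare actual pod CPU usage (metrics.k8s.io) with CPU requests.
    # Assumes a metrics server is installed; the namespace is a placeholder.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    custom = client.CustomObjectsApi()

    namespace = "production"

    usage = custom.list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods"
    )
    cpu_usage = {
        item["metadata"]["name"]: sum(
            int(c["usage"]["cpu"].rstrip("n")) / 1e6   # nanocores -> millicores
            for c in item["containers"]
        )
        for item in usage["items"]
    }

    for pod in core.list_namespaced_pod(namespace).items:
        requested = 0
        for c in pod.spec.containers:
            req = (c.resources.requests or {}).get("cpu", "0")
            # Handles "500m" and whole-core values such as "1"; other formats omitted.
            requested += int(req.rstrip("m")) if req.endswith("m") else int(req) * 1000
        used = cpu_usage.get(pod.metadata.name, 0.0)
        print(f"{pod.metadata.name}: requested {requested}m, using {used:.0f}m")

Pods that consistently used only a small fraction of their requests were the first candidates for lower requests and tighter autoscaling targets.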
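
On the Sentry side, one relevant SDK-level knob is sampling: enabling performance tracing to recover some of what Datadog used to provide, while keeping ingestion volume, and therefore cost, under control. The snippet below is purely illustrative of that kind of configuration; the DSN, sample rates, and transaction name are placeholders, not the values used in the project.

    # Sketch: tune Sentry SDK sampling so high-volume, low-value events
    # do not drive ingestion costs. DSN, rates, and names are placeholders.
    import sentry_sdk

    def traces_sampler(sampling_context):
        # Drop traces for health checks entirely, sample everything else lightly.
        name = sampling_context.get("transaction_context", {}).get("name", "")
        if name.endswith("/healthz"):
            return 0.0
        return 0.05

    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        sample_rate=0.5,              # keep 50% of error events
        traces_sampler=traces_sampler,
    )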
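
KEDA itself is configured declaratively through custom resources; the sketch below registers a ScaledObject via the Kubernetes API so that a hypothetical queue worker scales on SQS queue depth and can scale to zero when idle. Names, namespace, queue URL, and thresholds are placeholders, and KEDA must already be installed and authorized to read the queue.

    # Sketch: a KEDA ScaledObject that scales a worker Deployment on SQS queue
    # depth. Names, queue URL, and thresholds are illustrative placeholders;
    # KEDA must already be installed and able to read the queue.
    from kubernetes import client, config

    config.load_kube_config()
    custom = client.CustomObjectsApi()

    scaled_object = {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": "report-worker-scaler", "namespace": "production"},
        "spec": {
            "scaleTargetRef": {"name": "report-worker"},
            "minReplicaCount": 0,      # scale to zero when the queue is empty
            "maxReplicaCount": 10,
            "triggers": [
                {
                    "type": "aws-sqs-queue",
                    "metadata": {
                        "queueURL": "https://sqs.eu-west-1.amazonaws.com/123456789012/reports",
                        "queueLength": "20",
                        "awsRegion": "eu-west-1",
                    },
                }
            ],
        },
    }

    custom.create_namespaced_custom_object(
        "keda.sh", "v1alpha1", "production", "scaledobjects", scaled_object
    )

KEDA also provides an analogous ScaledJob resource, which fits the Job-based workers mentioned above.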

Outcome

The project resulted in a 50% reduction in overall infrastructure and operational costs, primarily across AWS services and observability tooling, while maintaining application performance and development velocity. Cost savings were achieved without introducing disruptions to production workloads or negatively impacting delivery processes.