Building a Production-Ready EKS Infrastructure for SaaS on AWS: Part 1 - The VPC Foundation

This is the first article in our comprehensive series on building a complete, production-ready infrastructure for running a SaaS application on AWS. By the end of this series, you'll have a fully functional, secure and scalable Amazon Elastic Kubernetes Service (EKS) cluster along with several essential components and tools.

In this initial article, we'll focus on laying the foundation - constructing a robust Virtual Private Cloud (VPC) that will host our entire infrastructure. We'll leverage Terraform to provision a VPC with best-in-class security and scalability features.

Understanding the "why" behind the infrastructure choices is just as important as the "how", so throughout this guide, I'll explain the reasoning and benefits of the various architectural decisions.

So let's start with the foundation: the Virtual Private Cloud (VPC).

Architecture Overview

Our VPC architecture follows AWS best practices with:

  • Multi-AZ deployment across 3 availability zones
  • Public, private, and intra (isolated) subnets
  • NAT Gateways for outbound internet access
  • VPC Endpoints for secure AWS service access
  • Network flow logs for security monitoring

Understanding the Infrastructure Code

Let's break down each component of our VPC configuration and understand why we're implementing it this way.

1. Base Configuration and Variables

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  project_name = "${var.project_name}-${var.environment}"
  azs          = slice(data.aws_availability_zones.available.names, 0, 3)
  tags = {
    project_name = var.project_name
    environment  = var.environment
    admin        = data.aws_caller_identity.current.arn
  }
}

Why?

  • We dynamically fetch available AZs instead of hardcoding them
  • We use exactly 3 AZs for optimal balance between high availability and cost
  • Using slice ensures we don't accidentally deploy to more AZs than needed

How it helps:

  • Prevents deployment failures if specific AZs are unavailable
  • Ensures consistent high availability across regions
  • Controls costs by limiting the number of NAT gateways and other per-AZ resources
  • Establishes standardized tagging for resource management
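For context, here's a minimal sketch of the variable definitions this configuration assumes. The names match the references above, but the exact shape of vpc_params is my assumption, so adapt it to your own layout:

# variables.tf - hypothetical definitions matching the references in this article
variable "project_name" {
  type = string
}

variable "environment" {
  type = string
}

variable "vpc_params" {
  type = object({
    vpc_cidr               = string
    private_bits           = number # "newbits" passed to cidrsubnet() per tier
    public_bits            = number
    intra_bits             = number
    enable_flow_log        = bool
    enable_nat_gateway     = bool
    single_nat_gateway     = bool
    one_nat_gateway_per_az = bool
  })
}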

2. VPC Configuration

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1.2"
  
  name            = local.project_name
  cidr            = var.vpc_params.vpc_cidr
  azs             = local.azs
  tags            = local.tags
  enable_flow_log = var.vpc_params.enable_flow_log
  # ... other configurations ...
}

Key features:

  • Uses the official AWS VPC Terraform module
  • Passes a common tags map to the module, which applies it to every resource the module creates inside the VPC
  • Enables VPC Flow Logs, which capture information about all IP traffic going in and out of your VPC's network interfaces
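If you do enable flow logs, the module can also provision the CloudWatch Logs destination and the IAM role it needs. A minimal sketch using the module's documented inputs (verify the argument names against your pinned module version):

module "vpc" {
  # ... other configurations ...
  enable_flow_log                      = var.vpc_params.enable_flow_log
  # Let the module create the CloudWatch log group and IAM role for the logs
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role  = true
  # Aggregate records over 60 seconds instead of the default 10 minutes
  flow_log_max_aggregation_interval    = 60
}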

Cost Warning ⚠️

Flow logs can become costly in high-traffic environments, since you pay for log ingestion and storage, so keep them disabled in development environments.

3. Subnet Design

module "vpc" {
  # ... other configurations ...
  private_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_params.vpc_cidr, var.vpc_params.private_bits, k)]
  public_subnets  = [for k, v in local.azs : cidrsubnet(var.vpc_params.vpc_cidr, var.vpc_params.public_bits, k + 48)]
  intra_subnets   = [for k, v in local.azs : cidrsubnet(var.vpc_params.vpc_cidr, var.vpc_params.intra_bits, k + 64)]
  
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }
  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }
}

Why Three Subnet Tiers?

  1. Public Subnets:

    • Used only for internet-facing load balancers
    • Why? Minimizes attack surface by limiting internet-accessible resources
  2. Private Subnets:

    • Hosts your EKS worker nodes and application pods
    • Why? Provides internet access through NAT while protecting from inbound traffic
  3. Intra Subnets:

    • Hosts the ENIs through which the EKS control plane communicates with your VPC; these subnets never need internet access. The control plane itself runs in an AWS-managed VPC, so this placement is only for network communication, not for hosting the control plane, as some might assume
    • Why? Maximum isolation for the Kubernetes control plane's network interfaces

Why Use cidrsubnet?

  • Automatically calculates non-overlapping CIDR ranges
  • Makes it easy to adjust subnet sizes using variables
  • Ensures consistent IP allocation across environments
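As a concrete example, assume vpc_cidr = "10.0.0.0/16" with private_bits = 4, public_bits = 8, and intra_bits = 8 (illustrative values, not from the original code). You can check the resulting ranges in terraform console:

> cidrsubnet("10.0.0.0/16", 4, 0)   # private subnet, AZ 0
"10.0.0.0/20"
> cidrsubnet("10.0.0.0/16", 4, 2)   # private subnet, AZ 2 (ends at 10.0.47.255)
"10.0.32.0/20"
> cidrsubnet("10.0.0.0/16", 8, 48)  # public subnet, AZ 0
"10.0.48.0/24"
> cidrsubnet("10.0.0.0/16", 8, 64)  # intra subnet, AZ 0
"10.0.64.0/24"

With these values the three /20 private subnets occupy netnums 0-2 (addresses up to 10.0.47.255), so offsetting the public subnets to netnum 48 and the intra subnets to netnum 64 guarantees the tiers never overlap.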

What Are These Tags?

private_subnet_tags = {
  "kubernetes.io/role/internal-elb" = 1
}
public_subnet_tags = {
  "kubernetes.io/role/elb" = 1
}

Why Do We Need These Tags?

These tags are consumed by the AWS Load Balancer Controller (AWS LBC) in EKS (we'll revisit it later in the series). They act as subnet discovery markers, telling the controller which subnets to place internal and internet-facing load balancers in.

4. NAT Gateway Configuration

module "vpc" {
  # ... other configurations ...
  enable_nat_gateway     = var.vpc_params.enable_nat_gateway
  single_nat_gateway     = var.vpc_params.single_nat_gateway
  one_nat_gateway_per_az = var.vpc_params.one_nat_gateway_per_az
}

Why Configurable NAT Gateways?

  • Development environments: Use single_nat_gateway = true to reduce costs
  • Production environments: Use one_nat_gateway_per_az = true for high availability
  • Why variable? Allows environment-specific optimization of cost vs. reliability
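For example, hypothetical per-environment tfvars fragments might look like this (these keys live inside vpc_params under the variable shape sketched earlier):

# dev: one shared NAT gateway to minimize cost
enable_nat_gateway     = true
single_nat_gateway     = true
one_nat_gateway_per_az = false

# prod: one NAT gateway per AZ for high availability
enable_nat_gateway     = true
single_nat_gateway     = false
one_nat_gateway_per_az = true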

5. VPC Endpoints for AWS Services

module "endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.1.1"

  endpoints = {
    s3 = {
      service         = "s3"
      service_type    = "Gateway"
      route_table_ids = flatten([module.vpc.intra_route_table_ids, module.vpc.private_route_table_ids])
    },
    dynamodb = {
      service         = "dynamodb"
      service_type    = "Gateway"
    },
    sts = {
      service             = "sts"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    },
    ecr_api = {
      service             = "ecr.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    },
    ecr_dkr = {
      service             = "ecr.dkr"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
  }
}

Why VPC Endpoints?

  1. Security:

    • Traffic to AWS services never leaves the AWS network
    • Removes the need for internet access for AWS API calls
  2. Performance:

    • Direct connection to AWS services
    • Lower latency than internet-based API calls
  3. Cost:

    • Reduces NAT gateway traffic
    • Can lower data transfer costs

Cost Warning ⚠️

Interface endpoints incur hourly costs along with additional data processing charges. In contrast, gateway endpoints do not use AWS PrivateLink and have no additional charges for usage.

Why These Specific Endpoints?

  • S3: Required for application assets; ECR also stores container image layers in S3, so private image pulls depend on it
  • ECR: Needed for pulling container images without internet access
  • STS: Required for IAM authentication, e.g. IAM Roles for Service Accounts (IRSA)
  • DynamoDB: Often used as a NoSQL database for serverless applications

6. Security Groups and Policies

data "aws_iam_policy_document" "endpoint_policy" {
  statement {
    effect    = "Deny"
    actions   = ["*"]
    resources = ["*"]
    condition {
      test     = "StringNotEquals"
      variable = "aws:SourceVpc"
      values = [module.vpc.vpc_id]
    }
  }
}

Why This Policy?

  • Explicitly denies any request that doesn't originate from our VPC
  • Prevents any potential cross-VPC access
  • Note: a custom endpoint policy replaces the default full-access policy, so this Deny needs a companion Allow statement and must be attached to each endpoint, as sketched below
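Here's a hedged sketch of that wiring, reusing the module from step 5. The companion Allow statement, the per-endpoint policy key, and the security group inputs are my additions based on the module's documented interface; verify them against your pinned version:

data "aws_iam_policy_document" "endpoint_policy" {
  # Companion Allow: a custom endpoint policy replaces AWS's default
  # full-access policy, so without this statement every request is denied
  statement {
    effect    = "Allow"
    actions   = ["*"]
    resources = ["*"]
    principals {
      type        = "*"
      identifiers = ["*"]
    }
  }

  # ... the Deny statement shown above ...
}

module "endpoints" {
  # ... other configurations ...

  endpoints = {
    sts = {
      service             = "sts"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
      # Attach the VPC-restricted policy to the endpoint
      policy              = data.aws_iam_policy_document.endpoint_policy.json
    }
    # ... attach the same policy to the other endpoints ...
  }

  # Shared security group for the interface endpoints: HTTPS from the VPC only
  create_security_group      = true
  security_group_name_prefix = "${local.project_name}-vpc-endpoints-"
  security_group_rules = {
    ingress_https = {
      description = "HTTPS from VPC CIDR"
      cidr_blocks = [module.vpc.vpc_cidr_block]
    }
  }
}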

What's Next?

In the next article of this series, we'll build upon this VPC foundation to:

  1. Provision an EKS cluster
  2. Install essential cluster components and tools
  3. Set up cluster autoscaling

Our VPC configuration isn't just a network; it's the security foundation for our entire Kubernetes infrastructure. Each design decision supports security, scalability, or operational efficiency (often all three), giving us the robust foundation we need for a production-grade EKS cluster.

Stay tuned for Part 2, where we'll dive into EKS cluster configuration and explain how it integrates with the VPC we've just built!

Remember to adjust the CIDR ranges and NAT gateway configuration based on your specific requirements and budget considerations. The provided code can be easily customized through variables while maintaining the secure architecture patterns.

GitHub Repository for the Project