Centralized EKS monitoring across multiple AWS accounts

Complex systems require extensive monitoring and observability. Systems as complex as Kubernetes clusters have so many moving parts that sometimes it's a task and a half just to configure their monitoring properly. Today I'm going to talk in depth about cross-account observability for multiple EKS clusters, explore various implementation options, outline the pros and cons of each approach, and explain one of them in close detail. Whether you’re an aspiring engineer seeking best-practice advice, a seasoned professional ready to disagree with everything, or a manager looking for ways to optimize costs -- this article might be just right for you.

Cluster observability #

What is good observability? Good observability answers questions like:

- Is the cluster itself healthy?
- Is the application running on top of it healthy?
- Are users able to use the service without issues?

As you can see, apart from the health of the cluster itself, good observability monitors the health of the application and user-facing metrics as well. So far, we can highlight two types of data in our cluster: cluster operational metrics and application metrics.

EKS Cluster Monitoring

But is that all? Didn't we forget something?

How are we going to find out whether we have issues with our services if our monitoring is down? Exactly -- we can't. So we have to introduce meta-monitoring metrics: the data about the health of the monitoring system itself.
And now we have three types of metrics.

EKS Cluster Monitoring Monitoring

Observability fundamentals #

The bare minimum set of data we can scrape from our cluster comes from the Prometheus Node Exporter (or VictoriaMetrics vmagent). This is the most technical data -- CPU, memory, network latencies, temperature, you name it. It doesn't get more technical than this.
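
If you already manage the cluster with Terraform, deploying the node exporter can live in the same codebase. A minimal sketch using the prometheus-community Helm chart (the release name and namespace are placeholders, and it assumes the helm provider is already pointed at the cluster):

```hcl
# Deploys the Prometheus Node Exporter as a DaemonSet on every node.
resource "helm_release" "node_exporter" {
  name             = "node-exporter"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "prometheus-node-exporter"
  namespace        = "monitoring"
  create_namespace = true
}
```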

K8s Node

But should we stop here? Technical data about our system shows only our part of the setup -- the health of the underlying system. It's extremely helpful when something goes very wrong with our service, but it doesn't show the full picture. To see that, we also need visibility into the customer-facing side of our service. We need application metrics: request latency, number of responses, application errors. The good thing is that exposing a metrics endpoint within our application may be enough -- Prometheus will scrape this endpoint by itself without additional daemons.
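
How exactly Prometheus discovers that endpoint depends on your setup; with the Prometheus Operator (e.g., kube-prometheus-stack), a ServiceMonitor is enough. A rough sketch, assuming the operator is installed and the application's Service carries the label app: my-app and a port named metrics (all placeholders):

```hcl
# Tells the Prometheus Operator to scrape the application's /metrics endpoint.
resource "kubernetes_manifest" "app_servicemonitor" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "ServiceMonitor"
    metadata = {
      name      = "my-app"
      namespace = "default"
      labels    = { release = "prometheus" } # must match your Prometheus instance's selector
    }
    spec = {
      selector  = { matchLabels = { app = "my-app" } }
      endpoints = [{ port = "metrics", path = "/metrics", interval = "30s" }]
    }
  }
}
```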

K8s Nodes In A Cluster

Something's missing, right? For example, where can we see application errors? Not just the error codes or counts, but actual error messages?

Sure, kubectl logs my-pod is cool and all, but a production-ready app is rarely a single pod (at least I hope so, for most of us).

So we have to think about log collection as well. This adds yet another pod to each node we have: a log collector agent.

Cluster With Logging Agent

Is the picture complete now?

Not quite. As mentioned previously, to check whether our monitoring setup is even alive, it would be nice to have some sort of external monitoring -- the monitoring of the monitoring. Monitoringception. Since the sole purpose of this external service is to answer the question of whether the monitoring is alive, it can be a very simple setup: a daemon that queries the monitoring endpoint. Because this daemon must remain alive even when the whole system is down, it has to run outside the perimeter of the application workload -- for example, on an EC2 instance outside the EKS cluster. To achieve maximum independence, this service can even be set up with an entirely different cloud provider or on a dedicated VPS in a datacenter on another continent (or in your home lab).

It can even be done as a dead man's switch: an alert in the monitoring system that should always be in the "failing" state and should turn green only if the monitoring system is down. Don't like a constant influx of alerts? Set up a notification for when data hasn't been received for X minutes -- that should do the trick as well.[1]
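
A minimal sketch of such a dead man's switch as a Prometheus rule (assuming the Prometheus Operator CRDs are present; the names and labels are placeholders): an alert that always fires, so the moment your notification channel goes quiet, you know the monitoring pipeline is broken somewhere.

```hcl
# An always-firing alert: silence on the receiving end means the monitoring is down.
resource "kubernetes_manifest" "dead_mans_switch" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "dead-mans-switch"
      namespace = "monitoring"
      labels    = { release = "prometheus" }
    }
    spec = {
      groups = [{
        name = "meta-monitoring"
        rules = [{
          alert       = "DeadMansSwitch"
          expr        = "vector(1)"
          labels      = { severity = "none" }
          annotations = { summary = "Heartbeat alert for the monitoring stack; it should always be firing" }
        }]
      }]
    }
  }
}
```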

Full EKS Cluster

Cross-account observability #

Now let's crank it up a notch.

This is what a service architecture might look like. But in modern development, it rarely stays this way. Mature businesses typically divide their infrastructure by environment -- at the very least, development and production, often with staging, UAT, and others on top.

So we spin up both environments in the same cluster, but in different namespaces, right?

This is a bad practice, as resources are still shared between both applications, and one can affect the other in ways we wouldn't want.

The other option is to spin up another environment in the same AWS account, but on a different cluster. Is this a good option? Most likely, yes. Buuuut the Well-Architected Framework and other guidelines advise against sharing a single account between several environments. Full environment separation is better for both account and application security.

So now we have two AWS accounts.

It was fairly simple to organize observability within a single account -- the networks were contained, enclosed in a single perimeter, and available directly from one endpoint to another. Having several accounts introduces the complexity of cross-account networking and finally brings us to the actual topic of this article.

Cross-account networking options #

As with every task in cloud operations, this topic can be approached in several ways:

  1. VPC Peering
  2. AWS Transit Gateway
  3. AWS NLBs with VPC Endpoints and VPC Endpoint Services
  4. AWS VPN / Direct Connect
  5. Public internet + Authorization + TLS

Let's discuss each of them briefly.

Note

For the examples below, the cost of each implementation will include only additional charges (i.e., networking and service charges). The price of the monitoring services (Prometheus + storage, Grafana, etc.) will not be referenced.

Note

A table summarizing all the options with detailed price breakdown and implementation complexity comparisons can be found in the annex.

VPC Peering #

This option assumes creating multiple VPC Peering connections from the root account to the branch accounts. It is quite cheap and may work well for small organizations with one or two accounts. It may also work if you are only just planning your infrastructure, because it requires up-front CIDR planning: VPC CIDRs cannot overlap with this option. On top of that, you will have to configure route tables manually for each account. On the bright side: same-AZ traffic is free, and the peering connection itself is free as well.

Pros

- The peering connection itself is free, and same-AZ traffic is free as well
- Conceptually simple: no extra infrastructure to run

Cons

- VPC CIDRs cannot overlap, so it requires up-front CIDR planning
- Route tables have to be configured manually in every account
- Scales poorly beyond a handful of accounts
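
For illustration, a rough Terraform sketch of a single peering connection between the root and one product account (the provider alias, account ID, VPC and route table references, and the CIDR are all placeholders):

```hcl
# Requester side: root (observability) account.
resource "aws_vpc_peering_connection" "to_product" {
  vpc_id        = var.root_vpc_id          # placeholder
  peer_vpc_id   = "vpc-0productxxxxxxxxx"  # product account VPC
  peer_owner_id = "111111111111"           # product account ID
}

# Accepter side: applied with a provider alias for the product account.
resource "aws_vpc_peering_connection_accepter" "from_root" {
  provider                  = aws.product
  vpc_peering_connection_id = aws_vpc_peering_connection.to_product.id
  auto_accept               = true
}

# Each side still needs routes to the other VPC's (non-overlapping!) CIDR.
resource "aws_route" "root_to_product" {
  route_table_id            = var.root_private_route_table_id # placeholder
  destination_cidr_block    = "10.1.0.0/16"
  vpc_peering_connection_id = aws_vpc_peering_connection.to_product.id
}
```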

AWS Transit Gateway #

This setup looks more like an enterprise-scale solution: better convenience, higher price. It allows for centralized management of the observability networking setup but requires extensive route planning. It is, however, very scalable: a Transit Gateway can be attached to a large number of accounts. It can also be shared across organization accounts via Resource Access Manager (one more point for house Enterprise).

Pros

- Centralized management of all cross-account networking
- Very scalable: one Transit Gateway can serve a large number of attached accounts
- Can be shared across the organization via Resource Access Manager

Cons

- The priciest of the fully managed options: attachment and data processing charges add up
- Requires extensive route planning
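
A rough sketch of the core pieces on the root account side (the organization ARN and provider alias are placeholders; the route tables pointing at the Transit Gateway are omitted):

```hcl
# Root account: the Transit Gateway, shared with the organization via RAM.
resource "aws_ec2_transit_gateway" "observability" {
  description = "Cross-account observability TGW"
}

resource "aws_ram_resource_share" "tgw" {
  name                      = "observability-tgw"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.observability.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

resource "aws_ram_principal_association" "org" {
  principal          = "arn:aws:organizations::123456789012:organization/o-xxxxxxxxxx" # placeholder
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

# Each product account then attaches its VPC (and adds routes towards the TGW).
resource "aws_ec2_transit_gateway_vpc_attachment" "product" {
  provider           = aws.product # assumed provider alias
  transit_gateway_id = aws_ec2_transit_gateway.observability.id
  vpc_id             = var.vpc_id
  subnet_ids         = var.private_subnet_ids
}
```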

AWS NLBs + VPC Endpoints and VPC Endpoint Services #

The third option, almost the golden mean, involves exposing services via Network Load Balancers and Endpoint Services and connecting to them via VPC Endpoints. It is cheaper than the previous option but more expensive than the first: NLBs and VPC Endpoints cost money simply to run. On the other hand, it doesn't limit the number of accounts that can be connected, doesn't require route table configuration, keeps traffic off the public internet, and with one-to-many connections you only need to configure a small number of endpoints and endpoint services.

Pros

- No limit on the number of connected accounts: one endpoint service can serve many endpoints
- No route table configuration, and overlapping CIDRs are not a problem
- Traffic stays inside the AWS network, off the public internet
- Connections are only established after explicit approval

Cons

- NLBs and VPC endpoints cost money just to run, plus per-GB data processing
- One more pair of resources (endpoint service + endpoint) to manage per connection direction

AWS VPN / Direct Connect #

This option is overkill unless you have a dedicated NOC department, because it has all the fun: full network connectivity, encrypted traffic, interface configuration, BGP configuration, and many associated charges for traffic and connection speed.

Pros

- Full network connectivity between accounts
- Traffic is encrypted in transit (VPN) or flows over a dedicated link (Direct Connect)

Cons

- The most complex setup: interfaces, BGP, tunnels
- Many associated charges for traffic and connection speed
- Overkill without a dedicated NOC/network team
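
For completeness, a bare-bones sketch of one Site-to-Site VPN leg (the remote IP, ASN, and VPC reference are placeholders; Direct Connect is a different beast entirely and not shown):

```hcl
resource "aws_customer_gateway" "remote" {
  bgp_asn    = 65000
  ip_address = "203.0.113.10" # placeholder public IP of the remote side
  type       = "ipsec.1"
}

resource "aws_vpn_gateway" "this" {
  vpc_id = var.vpc_id
}

resource "aws_vpn_connection" "this" {
  vpn_gateway_id      = aws_vpn_gateway.this.id
  customer_gateway_id = aws_customer_gateway.remote.id
  type                = "ipsec.1"
  static_routes_only  = false # BGP over the tunnels
}
```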

Public Internet + Authorization + TLS #

And the final option: the very basic "just shove it onto the public internet and slap a login page on top of it" approach. Super simple to set up! Just spin up an ALB in front of the Prometheus instance, season it with TLS certificates, and expose it to the public internet. Super unsafe! Works for hobby and experimental projects. Please don't use it in a production environment (unless you are absolutely certain -- but still, please don't).

Pros

- The simplest possible setup: an ALB, TLS certificates, and some form of authorization
- No cross-account networking to manage at all

Cons

- Monitoring traffic is exposed to the public internet
- Unsuitable for production workloads and for SOC 1/SOC 2/ISO 27001 compliance
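
If you do go down this road anyway, the Terraform skeleton is roughly the following (the subnet, security group, and certificate references are placeholders); authorization still has to be handled separately, e.g., with ALB OIDC authentication or a reverse proxy in front of Prometheus.

```hcl
# A public ALB terminating TLS in front of Prometheus. Please don't.
resource "aws_lb" "prometheus_public" {
  name               = "prometheus-public"
  load_balancer_type = "application"
  internal           = false
  subnets            = var.public_subnet_ids       # placeholder
  security_groups    = [aws_security_group.alb.id] # placeholder
}

resource "aws_lb_target_group" "prometheus" {
  name        = "prometheus"
  port        = 9090
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.prometheus_public.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.acm_certificate_arn # placeholder

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.prometheus.arn
  }
}
```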


As you can see, there are several options to consider and choose from. All of them have their pros and cons. For my scenario, I ended up choosing to set up NLBs with VPC Endpoints.

Why?

The first reason is that it's relatively easy to implement. To set up a connection between accounts, we need only an endpoint service and endpoints in the other accounts (plural). They connect in a one-to-many relationship: one endpoint service can handle several connecting endpoints.

The second reason is that I had overlapping CIDRs in each account, so VPC Peering was immediately off the table.

The third reason is that it has moderately less configuration overhead. I won't have to configure client connections or route tables. Proper routing between multiple accounts is a very precise art at which I most certainly suck.

And the last reason is that it's still quite secure. The traffic from services doesn't leave the AWS network, doesn't traverse the public internet, and connections are only allowed after approval (it can be configured to auto-approve, though, which may be convenient -- but healthy paranoia is my constant companion).

Chosen cross-account solution #

The whole system is actually not that overcrowded.

For the sake of this article, let's assume we have three products: Bulba, Char, and Squi. Each product has two environments: normal and sparkling. We also have a root account (let's call it Ketch) for organization management and, for simplicity, observability as well. This gives us seven AWS accounts and seven EKS clusters. Why seven and not six? Well, the root account also runs an EKS cluster as an observability backend.

Each product account (Bulba, Char, and Squi) includes the following elements:

- an EKS cluster running the product workload together with the monitoring agents (node exporter, log and trace collectors)
- a Prometheus instance exposed through an internal NLB
- a VPC endpoint service in front of that NLB for the root account to connect to
- VPC endpoints towards the root account's Loki and Tempo endpoint services

The root account (Ketch) also contains several elements:

- an EKS cluster running the observability backend (Prometheus, Grafana, Loki, Tempo)
- internal NLBs for Loki and Tempo, each fronted by its own VPC endpoint service
- a VPC endpoint towards each product account's Prometheus endpoint service
- S3 buckets for Loki and Tempo storage, plus the IAM roles and EKS Pod Identity associations to access them

That's quite a lot of components to keep track of. Luckily, we have IaC to ease the configuration and deployment of these components.

To understand why so many components are needed (particularly the endpoint service-endpoint pairs), we need to look at the traffic flow model.

Prometheus cross-account connections

Zooming in a bit, we can see the exact details of how the connection is configured.

Prometheus connection setup

Since our goal is to keep metrics data secure and prevent it from traversing public subnets or the public internet, we create an endpoint service. This service acts as an open receiver on one account (the product account) for the connector on the root account. This creates a 1-to-1 connection that is both secure and governed, since requests to connect to the endpoint service must be approved.
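
With acceptance_required = true (as in the annex code), each connection request can be approved in the console, or via Terraform if the endpoint ID is passed over from the root account. A sketch of that acceptance (the endpoint service reference matches the product-account code in Annex C; the endpoint ID is a placeholder):

```hcl
# Product account: accept the connection request coming from the root account's endpoint.
resource "aws_vpc_endpoint_connection_accepter" "prometheus_from_root" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.nlb_endpoint_services["prometheus"].id
  vpc_endpoint_id         = "vpce-0123456789abcdef0" # root-account endpoint ID (placeholder)
}
```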

The setup for metrics differs from the setup for logs and traces.

Prometheus uses a pull model: the backend queries the endpoints for data, effectively pulling it from them. Loki and Tempo, however, use a push model: the log and trace collection pods deployed in each cluster send (push) the data to the centralized backend.

In this case, the endpoint service/endpoint pair is reversed and simpler: only one endpoint service per backend service (two in total) is created in the root account, and all product accounts create endpoints and request connections to those endpoint services.
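
To make the push model concrete: in each product account, the log collector simply points at the Loki VPC endpoint's DNS name. A sketch using the grafana/promtail Helm chart (the endpoint DNS name and port are placeholders matching the example security group; in practice the name can be read from aws_vpc_endpoint.loki.dns_entry):

```hcl
# Product account: ship logs to the root account's Loki through the VPC endpoint.
resource "helm_release" "promtail" {
  name       = "promtail"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "promtail"
  namespace  = "monitoring"

  values = [<<-EOT
    config:
      clients:
        - url: http://vpce-xxxx-yyyy.vpce-svc-qwertyuiop.us-west-2.vpce.amazonaws.com:8080/loki/api/v1/push
  EOT
  ]
}
```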

And this is how the accounts look in the end:

Product account

Root account

Tip

For consistency, you might want to use a metrics-gathering service with a push model (e.g., the aforementioned VictoriaMetrics vmagent). This way, endpoint services are only created in the root account, and product accounts only have endpoints.

Operational considerations #

As mentioned previously, the solution I chose in this article is not the only correct one out there. For this setup, I had specific requirements that needed to be fulfilled, as well as particular implementation caveats to consider.

My example is definitely not the cheapest option. The cheapest would be using VPC Peering. But unfortunately, my existing setup -- with the same CIDRs in EKS clusters and the possibility to extend beyond those six (+1) AWS accounts -- made this option unavailable.

The described setup is also located in a single region -- in cross-region data transfer scenarios, costs can increase drastically and very quickly. For each additional region, an additional VPC endpoint/service will have to be created.

There's also always the possibility to just expose the backend ports to the public internet (with proper authorization, of course!) and not bother with endpoint configuration at all. But even with TLS-encrypted traffic, this is a rather unsafe option and will absolutely not help you pass any SOC 1/SOC 2/ISO 27001 certification.

Closing thoughts #

This was a very interesting challenge to tackle and implement -- I had an absolute blast setting up the POC and confirming that it works. I was excited by the variety of options I could choose from and the differences between the tools available in these scenarios.

And I think it's beautiful. It not only shows the complexity of AWS services (which can sometimes be a downside), but also that there's always more than one solution to each problem. Every engineer will approach a challenge differently -- which, in my humble opinion, means our jobs are secure for the observable future.

Thank you for reading, and see you in the next one!

Annex #

A. Summarizing table #

| Option | CIDRs can overlap | Scalability | Implementation complexity | Price |
| --- | --- | --- | --- | --- |
| VPC Peering | X | ●○○ | ●●○○ | $ |
| AWS Transit Gateway | ✓ | ●●● | ●●●○ | $$$ |
| AWS NLBs + VPC Endpoints + Services | ✓ | ●●○ | ●●○○ | $$ |
| AWS VPN / Direct Connect | ✓ | ●●● | ●●●● | $$$-$$$$ |
| Public internet + Authorization + TLS | ✓ | ●●● | ●○○○ | $$ |

B. Price breakdown #

The variety and complexity of the outlined options deserve more in-depth research than the scope of this article allows -- I agree with that. To make your life easier and to shed some light on the cryptic $ signs in the table above, I decided to create an example price breakdown and give you some concrete numbers, based on which you can at least roughly figure out whether an option is suitable for you or not.

To keep the numbers comparable, we will assume the following prerequisites:

Option 1: VPC Peering

Total Monthly Cost: $528.6

Option 2: AWS Transit Gateway

Total Monthly Cost: $634.6

Option 3: AWS NLBs + VPC Endpoints + Services

Total Monthly Cost: $590.4

Option 4a: Site-to-Site VPN

Total Monthly Cost: $633.6

Option 4b: AWS Direct Connect

Total Monthly Cost: $670.6[3]

Option 5: Public Internet + TLS + Auth

Total Monthly Cost: $563.6

Pro tip

For pricing calculations, I used the following AWS resources:
- Amazon VPC pricing
- Elastic Load Balancing pricing
- Amazon EC2 On-Demand Pricing
- Amazon EBS pricing
- AWS Pricing Calculator

As well as this handy AWS EC2 instance and price comparison tool:
- Amazon EC2 Instance Type Comparison

C. IaC snippets #

The code presented in this section does not represent a fully working infrastructure. It highlights only the most relevant parts of the PoC implementation. It has not been tested and serves solely as a reference. You can copy it, but without additional initial setup (at minimum a VPC and an EKS cluster) and further adjustments, it will not work. Use at your own risk!

Product account

```hcl
# variables.tf
variable "eks_nlb_endpoint_services" {
  description = "EKS NLB endpoint services configuration"
  type = map(object({
    nlb_arn            = list(string)
    allowed_principals = list(string)
  }))
}

variable "loki_service_name" {
  description = "VPC Endpoint Service name for Loki"
  type = string
}

variable "tempo_service_name" {
  description = "VPC Endpoint Service name for Tempo"
  type = string
}

variable "vpc_id" {
  type        = string
}

variable "private_subnet_ids" {
  type        = list(string)
}

variable "name_prefix" {
  type        = string
}
```

```hcl
# main.tf
resource "aws_vpc_endpoint_service" "nlb_endpoint_services" {
  for_each = var.eks_nlb_endpoint_services

  acceptance_required        = true
  allowed_principals         = each.value.allowed_principals
  network_load_balancer_arns = each.value.nlb_arn
}

resource "aws_security_group" "observability" {
  name        = "${var.name_prefix}-observability"
  description = "Allow traffic from observability resources"
  vpc_id      = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "loki" {
  security_group_id = aws_security_group.observability.id
  description       = "Allow traffic from observability resources: Loki"
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 8080
  to_port           = 8080
  ip_protocol       = "tcp"
}

resource "aws_vpc_security_group_ingress_rule" "tempo" {
  security_group_id = aws_security_group.observability.id
  description       = "Allow traffic from observability resources: Tempo"
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 4317
  to_port           = 4318
  ip_protocol       = "tcp"
}

resource "aws_vpc_endpoint" "loki" {
  vpc_id              = var.vpc_id
  service_name        = var.loki_service_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}

resource "aws_vpc_endpoint" "tempo" {
  vpc_id              = var.vpc_id
  service_name        = var.tempo_service_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}
```

```hcl
# terraform.tfvars
eks_nlb_endpoint_services = {
  prometheus = {
    # .tfvars files cannot reference data sources, so use the literal account ID
    nlb_arn = [
      "arn:aws:elasticloadbalancing:us-west-2:111111111111:loadbalancer/net/xxxx-yyyy/zzzz"
    ]
    allowed_principals = ["arn:aws:iam::123456789012:root"]
  }
}
loki_service_name  = "com.amazonaws.vpce.us-west-2.vpce-svc-qwertyuiop"
tempo_service_name = "com.amazonaws.vpce.us-west-2.vpce-svc-asdfghjklz"
vpc_id             = "vpc-xxxxx"
private_subnet_ids = ["subnet-xxxxx", "subnet-yyyyy"]
name_prefix        = "char"
```

Root account

```hcl
# variables.tf
variable "account_id" {
  type        = string
}

variable "name_prefix" {
  type        = string
}

variable "vpc_id" {
  type        = string
}

variable "private_subnet_ids" {
  type        = list(string)
}

variable "allowed_principals" {
  description = "Allowed AWS account principals for VPC endpoint services"
  type        = list(string)
}

variable "loki_nlb_arns" {
  type        = list(string)
}

variable "tempo_nlb_arns" {
  type        = list(string)
}

variable "char_normal_prometheus_vpcesvc_name" {
  description = "VPC Endpoint Service name for Prometheus; normal account"
  type        = string
}

variable "char_sparkling_prometheus_vpcesvc_name" {
  description = "VPC Endpoint Service name for Prometheus; sparkling account"
  type        = string
}
```

```hcl
# s3.tf
resource "aws_s3_bucket" "loki_chunks" {
  bucket = "${var.name_prefix}-${var.account_id}-loki-chunks"
}

resource "aws_s3_bucket" "loki_ruler" {
  bucket = "${var.name_prefix}-${var.account_id}-loki-ruler"
}

resource "aws_s3_bucket" "tempo" {
  bucket = "${var.name_prefix}-${var.account_id}-tempo-traces"
}
```

```hcl
# iam.tf
resource "aws_iam_policy" "loki_buckets" {
  name = "${var.name_prefix}-loki-buckets"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "LokiBuckets"
        Effect = "Allow"
        Action = [
          "s3:ListBucket",
          "s3:PutObject",
          "s3:GetObject",
          "s3:DeleteObject"
        ]
        Resource = [
          aws_s3_bucket.loki_chunks.arn,
          "${aws_s3_bucket.loki_chunks.arn}/*",
          aws_s3_bucket.loki_ruler.arn,
          "${aws_s3_bucket.loki_ruler.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_role" "loki_pod_identity" {
  name = "${var.name_prefix}-loki-pod-identity"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowEksAuthToAssumeRoleForPodIdentity"
        Effect = "Allow"
        Principal = {
          Service = "pods.eks.amazonaws.com"
        }
        Action = [
          "sts:AssumeRole",
          "sts:TagSession"
        ]
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "loki_pod_identity" {
  role       = aws_iam_role.loki_pod_identity.name
  policy_arn = aws_iam_policy.loki_buckets.arn
}

resource "aws_iam_policy" "tempo_bucket" {
  name = "${var.name_prefix}-tempo-bucket"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "TempoBucket"
        Effect = "Allow"
        Action = [
          "s3:ListBucket",
          "s3:PutObject",
          "s3:GetObject",
          "s3:DeleteObject",
          "s3:GetObjectTagging",
          "s3:PutObjectTagging"
        ]
        Resource = [
          aws_s3_bucket.tempo.arn,
          "${aws_s3_bucket.tempo.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_role" "tempo_pod_identity" {
  name = "${var.name_prefix}-tempo-pod-identity"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowEksAuthToAssumeRoleForPodIdentity"
        Effect = "Allow"
        Principal = {
          Service = "pods.eks.amazonaws.com"
        }
        Action = [
          "sts:AssumeRole",
          "sts:TagSession"
        ]
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "tempo_pod_identity" {
  role       = aws_iam_role.tempo_pod_identity.name
  policy_arn = aws_iam_policy.tempo_bucket.arn
}

resource "aws_eks_pod_identity_association" "loki" {
  cluster_name    = var.name_prefix
  namespace       = "monitoring"
  service_account = "loki"
  role_arn        = aws_iam_role.loki_pod_identity.arn
}

resource "aws_eks_pod_identity_association" "tempo" {
  cluster_name    = var.name_prefix
  namespace       = "monitoring"
  service_account = "tempo"
  role_arn        = aws_iam_role.tempo_pod_identity.arn
}
```

```hcl
# vpc.tf
resource "aws_vpc_endpoint_service" "loki" {
  acceptance_required        = true
  allowed_principals         = var.allowed_principals
  network_load_balancer_arns = var.loki_nlb_arns
}

resource "aws_vpc_endpoint_service" "tempo" {
  acceptance_required        = true
  allowed_principals         = var.allowed_principals
  network_load_balancer_arns = var.tempo_nlb_arns
}

resource "aws_vpc_endpoint" "char-normal-prometheus" {
  vpc_id              = var.vpc_id
  service_name        = var.char_normal_prometheus_vpcesvc_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}

resource "aws_vpc_endpoint" "char-sparkling-prometheus" {
  vpc_id              = var.vpc_id
  service_name        = var.char_sparkling_prometheus_vpcesvc_name
  vpc_endpoint_type   = "Interface"
  security_group_ids  = [aws_security_group.observability.id]
  subnet_ids          = var.private_subnet_ids
  private_dns_enabled = false
}

resource "aws_security_group" "observability" {
  name        = "observability"
  description = "Allow traffic from observability resources"
  vpc_id      = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "prometheus" {
  security_group_id = aws_security_group.observability.id
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 9090
  to_port           = 9090
  ip_protocol       = "tcp"
}
```

```hcl
# terraform.tfvars.example
account_id                             = "123456789012"
name_prefix                            = "ketch"
vpc_id                                 = "vpc-xxxxx"
private_subnet_ids                     = ["subnet-xxxxx", "subnet-yyyyy"]
# .tfvars files cannot interpolate variables, so use literal values
loki_nlb_arns                          = ["arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/xxxx-yyyy/zzzz"]
tempo_nlb_arns                         = ["arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/xxxx-yyyy/zzzz"]
char_normal_prometheus_vpcesvc_name    = "com.amazonaws.vpce.us-west-2.vpce-svc-xyz1111"
char_sparkling_prometheus_vpcesvc_name = "com.amazonaws.vpce.us-west-2.vpce-svc-xyz2222"
allowed_principals = [
  "arn:aws:iam::111111111111:root",
  "arn:aws:iam::222222222222:root"
]
```



  1. It doesn’t matter all that much which approach you choose to implement for monitoring your monitoring. In light of recent events, some folks even created a downdetector for a downdetector’s downdetector. I mean, it’s hilariously fun, but the key point remains solid: you need to know whether your eyes and ears (infrastructure-wise) are even working. ↩︎

  2. We assume 100% cross-AZ traffic in this example to maximize potential traffic costs and avoid complicating the calculations with percentages of same-AZ versus cross-AZ traffic. ↩︎

  3. Direct Connect may also require a specific partner to enable and perform the physical connection to the AWS network, so expect to add a few hundred (or even a few thousand) dollars on top of the initial bill for setup. ↩︎