
HOW DATADOG HANDLES BILLIONS OF METRICS PER SECOND

Jashan Goyal

Founder & CEO

Infrastructure & Monitoring Expert

Specializing in large-scale systems

15+ years in software engineering

Summary

I've been working with monitoring systems for a while now, and one thing that always caught my attention is how Datadog manages to ingest massive amounts of data without breaking. When you're dealing with thousands of servers sending metrics every few seconds, things can get complicated pretty quickly. This article walks through how they've architected this system to handle billions of metrics per second.

Published January 20, 2025

What Datadog Does

I've been working with monitoring systems for a while now, and one thing that always caught my attention is how Datadog manages to ingest massive amounts of data without breaking. When you're dealing with thousands of servers sending metrics every few seconds, things can get complicated pretty quickly.

Let me walk you through how they've architected this, based on what I've learned from their engineering blogs and my own experience using the platform.

Note: This is a simplified view based on common large-scale monitoring patterns. The actual implementation may vary, and this represents one way such systems can be architected at scale.

Datadog is a monitoring platform that collects metrics, logs, and traces from your infrastructure. The basic idea is simple: instead of having separate tools for different monitoring needs, you use one platform.

In practice, if you're running something like:

  • 2,000 servers across AWS
  • A bunch of containers (let's say 15,000)
  • Multiple services and databases

Each of these components reports metrics—CPU usage, memory, request counts, error rates. When you add it all up, you're looking at a lot of data points being generated every second.

The Cardinality Problem

Here's where things get tricky. A metric isn't just a name and a value. It has tags attached to it.

For example, `http.request.latency` might have tags like:

  • `service: checkout`
  • `region: us-east-1`
  • `environment: production`
  • `status_code: 200`

Every unique combination of these tags creates a separate time series. So if you have 100 services, 5 regions, 3 environments, and 10 different status codes, you're already looking at 15,000 time series just for that one metric.
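
To make that multiplication concrete, here's a quick Python sketch (the tag values are made up to match the counts above) that enumerates the unique tag combinations for a single metric:

```python
from itertools import product

# Hypothetical tag values for one metric, http.request.latency.
# Every unique combination of tag values is a separate time series.
tag_values = {
    "service": [f"service-{i}" for i in range(100)],  # 100 services
    "region": ["us-east-1", "us-west-2", "eu-west-1", "eu-central-1", "ap-south-1"],
    "environment": ["production", "staging", "development"],
    "status_code": ["200", "201", "204", "301", "302", "400", "401", "403", "404", "500"],
}

series = set(product(*tag_values.values()))
print(f"Time series for one metric: {len(series)}")  # 100 * 5 * 3 * 10 = 15,000
```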

This is called high cardinality, and it's one of the main challenges in building monitoring systems at scale.

The Scale Challenge

At a previous job, we had about 2,500 hosts. Each host sent roughly 50 metrics every 10 seconds. That works out to:

Scale Calculation: 2,500 hosts × 50 metrics × 6 reports per minute = 750,000 metrics per minute, or about 12,500 metrics per second.

And that was just infrastructure metrics, not including application-level metrics like request rates or custom business metrics.

If each metric data point is around 100 bytes (metric name, tags, value, timestamp), that's 1.25 MB/sec of raw data. With compression, it drops to maybe 300-400 KB/sec.
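
If you want to play with these numbers yourself, the same back-of-the-envelope math looks like this (the 100-byte point size and the roughly 3x compression ratio are rough assumptions, not measured values):

```python
hosts = 2_500
metrics_per_host = 50
reports_per_minute = 6            # one report every 10 seconds
bytes_per_point = 100             # rough estimate: name + tags + value + timestamp
compression_ratio = 0.3           # assume compression shrinks payloads to ~30%

points_per_minute = hosts * metrics_per_host * reports_per_minute     # 750,000
points_per_second = points_per_minute / 60                            # 12,500
raw_mb_per_second = points_per_second * bytes_per_point / 1_000_000   # 1.25 MB/s
compressed_kb_per_second = raw_mb_per_second * 1_000 * compression_ratio  # ~375 KB/s

print(points_per_second, raw_mb_per_second, compressed_kb_per_second)  # 12500.0 1.25 375.0
```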

Now imagine a company like Datadog with thousands of customers, each potentially sending similar amounts of data. The numbers add up quickly.

The Overall Flow: How Data Moves Through the System

Before diving into details, here's the basic pipeline:

Your infrastructure sends metrics to the Datadog agent → Agent aggregates and forwards to Intake services → Intake pushes to Kafka → Kafka routes to Storage → When you query, data flows back through the Query layer to your dashboard.

How data moves through the Datadog system from agents to storage

Each step is designed to handle massive scale without becoming a bottleneck.

Starting at the Agent

The Datadog agent is a lightweight program (written in Go, typically 50-70 MB) that runs on each host or container. It collects system metrics like CPU and memory, integrates with services like PostgreSQL or Redis, and gathers custom application metrics—basically, it's the data collector sitting on your infrastructure.

Instead of sending every single data point, the agent does preprocessing locally. Let's say your server reports CPU usage every second:

Agent Aggregation Example: instead of forwarding the raw readings 70%, 72%, 75%, 71%, 73%, the agent aggregates them over a 10-15 second window and sends avg: 72.2%, max: 75%, min: 70%. This significantly reduces the amount of data sent over the network.

Datadog agent aggregation process showing how metrics are preprocessed locally

It also batches multiple metrics together and compresses them before sending.
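
Here's a rough Python sketch of that idea. The real agent is written in Go and is far more sophisticated, so treat the window shape, the payload format, and the zlib compression as illustrative assumptions:

```python
import json
import zlib
from statistics import mean

def aggregate_window(samples: list[float]) -> dict:
    """Collapse the raw samples from one flush window into summary stats."""
    return {"avg": round(mean(samples), 1), "max": max(samples), "min": min(samples)}

def build_payload(window: dict[str, list[float]]) -> bytes:
    """Batch every metric's summary into one payload and compress it before sending."""
    batch = [
        {"metric": name, **aggregate_window(samples)}
        for name, samples in window.items()
    ]
    return zlib.compress(json.dumps(batch).encode())

# One 10-15 second window of per-second CPU readings, as in the example above.
window = {"system.cpu.user": [70, 72, 75, 71, 73]}
payload = build_payload(window)
print(json.loads(zlib.decompress(payload)))
# [{'metric': 'system.cpu.user', 'avg': 72.2, 'max': 75, 'min': 70}]
```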

The Intake Layer

Once data leaves the agent, it hits Datadog's intake services. These are designed to be simple and stateless—they authenticate requests, validate the data format, and forward everything downstream.

The stateless part matters because it means they can scale horizontally. If traffic doubles, they just add more intake servers. No coordination needed between them.
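
To show what "simple and stateless" means in practice, here's a minimal sketch of an intake handler. This isn't Datadog's code: the API-key set, the expected payload shape, and the `forward` callable are placeholders for whatever the real service uses. The point is that nothing here depends on local state, so any replica can serve any request.

```python
import json

VALID_API_KEYS = {"key-abc", "key-def"}  # hypothetical; real auth would check a shared store

def handle_intake(raw_body: bytes, api_key: str, forward) -> int:
    """Authenticate, validate, and hand the data downstream. Returns an HTTP-style status code."""
    if api_key not in VALID_API_KEYS:
        return 403
    try:
        points = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400
    if not isinstance(points, list):
        return 400
    for point in points:
        if not isinstance(point, dict) or not {"metric", "value", "timestamp"} <= point.keys():
            return 400
    forward(points)  # e.g. produce to Kafka; see the next section
    return 202

# Tiny usage example: "forward" just prints in this sketch.
body = b'[{"metric": "http.request.latency", "value": 42, "timestamp": 1737370000, "tags": ["service:checkout"]}]'
print(handle_intake(body, "key-abc", forward=print))  # prints the points, then 202
```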

Kafka as a Buffer

After intake, metrics go into Kafka. This serves as a buffer between data coming in and data being stored.

The reason this is useful: if there's a spike in traffic or if the storage layer slows down for some reason, Kafka just holds onto the data. Once things stabilize, the consumers catch up and process everything.

Kafka buffering mechanism during traffic spikes and system hiccups

It's basically decoupling the ingestion rate from the storage rate, which prevents data loss during traffic spikes or system hiccups.
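
As an example of that hand-off, here's roughly what producing to Kafka could look like with the kafka-python client. The broker address, topic name, and JSON encoding are assumptions for the sketch; the broker then absorbs bursts while consumers drain the topic at their own pace:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker for the sketch
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    compression_type="gzip",             # compress batches on the wire
    linger_ms=50,                        # wait briefly so sends get batched
)

def forward(points: list) -> None:
    """Publish validated metric points; consumers read them whenever storage is ready."""
    for point in points:
        producer.send("metrics.intake", value=point)
    producer.flush()
```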

Now, consumer services continuously read from Kafka and need to figure out where each metric should be stored. This is where routing and sharding come into play.

Routing and Sharding

Datadog routes metrics deterministically—the same metric always goes to the same shard. This is done using a hash of the metric name and tags.

So every data point for `http.request.latency` with `service:checkout` and `region:us-east-1` always ends up on the same shard. This keeps each time series together and makes querying more efficient, since you don't have to hit multiple shards for the same series.
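
Here's a minimal sketch of deterministic routing, assuming a simple hash-and-modulo scheme. A production system would likely use something more elaborate (consistent hashing, for example) so that adding shards doesn't reshuffle every series:

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count

def shard_for(metric: str, tags: dict[str, str]) -> int:
    """Hash the metric name plus its sorted tags so the same series always maps to the same shard."""
    key = metric + "|" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

tags = {"service": "checkout", "region": "us-east-1", "environment": "production"}
print(shard_for("http.request.latency", tags))  # same inputs always give the same shard
```

Sorting the tags before hashing matters: it makes the key order-independent, so the same series lands on the same shard no matter what order the tags arrive in.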

Once the routing determines which shard a metric belongs to, it gets written to storage. Let's look at how that storage is organized.

Storage Design

The storage is split into two parts:

  • Metadata storage: keeps track of tags and indexes. When you query for all hosts with `service:checkout`, this is where that lookup happens.
  • Time series storage: holds the actual metric values over time, optimized for storing and retrieving sequences of numerical data efficiently.

Datadog storage architecture showing two-tier design with metadata and time series storage

When you run a query, it first finds the relevant time series using the metadata index, then fetches the actual data points from time series storage. This separation keeps queries fast even with millions of time series.
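
Here's a toy in-memory version of that two-step lookup, just to show the shape of it. Real metadata indexes and time series stores are far more sophisticated (inverted indexes, compressed columnar storage, and so on), so the data structures below are purely illustrative:

```python
# Metadata storage: an inverted index from tag -> series IDs.
tag_index = {
    "service:checkout": {"s1", "s2"},
    "region:us-east-1": {"s1", "s3"},
}

# Time series storage: series ID -> list of (timestamp, value) points.
series_data = {
    "s1": [(1737370000, 120), (1737370010, 135)],
    "s2": [(1737370000, 98)],
    "s3": [(1737370000, 210)],
}

def query(tags: list[str]) -> dict:
    """Step 1: resolve tags to series IDs via metadata. Step 2: fetch points for those series."""
    matching = set.intersection(*(tag_index.get(t, set()) for t in tags))
    return {sid: series_data[sid] for sid in matching}

print(query(["service:checkout", "region:us-east-1"]))
# {'s1': [(1737370000, 120), (1737370010, 135)]}
```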

Query Execution

When you load a dashboard, here's roughly what happens:

  • The query identifies which time series match your filters
  • Data is fetched in parallel from the relevant shards
  • Results are aggregated
  • The chart renders

Query execution flow showing parallel fetching from multiple shards

The parallel fetching is important: instead of querying shards one by one, they're all queried simultaneously. This is why dashboards with dozens of charts can still load in a couple of seconds.
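
The fan-out/fan-in pattern is easy to sketch with a thread pool. Here `fetch_from_shard` is a stand-in for whatever per-shard RPC the real query layer makes:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_from_shard(shard_id: int) -> list[tuple[int, float]]:
    """Placeholder for the per-shard fetch (in reality, a network call to a storage node)."""
    time.sleep(0.2)  # simulate network/storage latency
    return [(1737370000 + shard_id, 42.0)]

def run_query(shard_ids: list[int]) -> list[tuple[int, float]]:
    # Query every relevant shard at the same time instead of one by one,
    # then merge the partial results before rendering the chart.
    with ThreadPoolExecutor(max_workers=len(shard_ids)) as pool:
        partials = list(pool.map(fetch_from_shard, shard_ids))
    return sorted(point for partial in partials for point in partial)

start = time.time()
points = run_query(list(range(8)))
print(f"{len(points)} points in {time.time() - start:.2f}s")  # ~0.2s total, not 8 x 0.2s
```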

Data Retention and Downsampling

One thing to note: Datadog doesn't keep full-resolution data forever. They typically:

  • High-resolution data (10-15 second intervals) is kept for about 24-48 hours
  • Older data is downsampled to 1-minute averages
  • Data older than about a month is further downsampled to hourly averages

This is a practical compromise: you need detailed data for troubleshooting recent issues, but for historical trends, hourly averages are usually sufficient.
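
The downsampling step itself is conceptually simple. Here's a sketch that rolls 10-second points up into 1-minute averages; the bucket boundaries and the choice of averaging (rather than keeping max or percentiles) are assumptions:

```python
from collections import defaultdict
from statistics import mean

def downsample(points: list[tuple[int, float]], bucket_seconds: int = 60) -> list[tuple[int, float]]:
    """Group points into fixed time buckets and keep one average per bucket."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return sorted((bucket, round(mean(values), 2)) for bucket, values in buckets.items())

# Six 10-second points collapse into a single 1-minute point.
raw = [(1737369960 + i * 10, 70 + i) for i in range(6)]
print(downsample(raw))  # [(1737369960, 72.5)]
```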

Handling Failures

A few failure scenarios worth mentioning:

  • Agent loses network connectivity: the agent buffers metrics locally for a while (usually an hour or two) and sends the buffered data once the connection is restored (a rough sketch of this buffer-and-retry idea follows this list).
  • Storage node failure: Kafka holds onto the data until the node recovers, and replication at multiple levels prevents data loss.
  • Regional outage: if there's a major outage in one region, traffic gets routed to other regions. This may add some delay, but it prevents complete data loss.
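
As a rough illustration of the first scenario, here's a bounded buffer-and-retry sketch. The real agent's buffering (capacity, memory vs. disk, retry policy) will differ; `send` is assumed to raise `ConnectionError` while the network is down:

```python
import collections

class BufferedSender:
    """Keep recent payloads while the network is down; flush them once it comes back."""

    def __init__(self, send, max_buffered: int = 10_000):
        self.send = send                                      # callable that raises on network failure
        self.buffer = collections.deque(maxlen=max_buffered)  # oldest payloads drop first when full

    def submit(self, payload: dict) -> None:
        self.buffer.append(payload)
        self.flush()

    def flush(self) -> None:
        while self.buffer:
            payload = self.buffer[0]
            try:
                self.send(payload)
            except ConnectionError:
                return               # still offline; keep buffering and try again on the next submit
            self.buffer.popleft()    # sent successfully, safe to drop
```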

What Makes This Work

The key design decisions that enable this scale:

  • Pushing aggregation to the edge (at the agent level)
  • Keeping services stateless where possible
  • Using Kafka as a buffer layer
  • Sharding data intelligently
  • Separating metadata from time series storage
  • Parallelizing query execution

It's not any single magic bullet, but rather a combination of thoughtful architectural choices that work together.

Some Practical Takeaways

If you're building something similar (or just trying to use Datadog more effectively):

  • Watch your tag cardinality: tags like user IDs or request IDs can explode the number of time series. Stick to tags with bounded cardinality.
  • Agent-side aggregation is your friend: the more you aggregate locally, the less data you send over the network.
  • Recent data matters most: design retention policies accordingly. You rarely need second-by-second granularity from three months ago.
  • Batch and compress: don't send metrics one at a time. Batch them and compress before sending.

Closing Thoughts

Building a monitoring system that can handle billions of metrics per second isn't about having one clever trick. It's about making practical engineering decisions at each layer, from the agent to storage to queries.

Datadog's approach is a good example of scaling through architecture rather than just throwing more hardware at the problem. Though obviously, they do use a lot of hardware too.

If you're dealing with monitoring at scale, there's quite a bit to learn from how they've structured things. The principles apply even if you're not operating at billions of metrics per second – just maybe at millions instead.
