HOW DATADOG HANDLES BILLIONS OF METRICS PER SECOND
Jashan Goyal
Founder & CEO
Infrastructure & Monitoring Expert
Specializing in large-scale systems
15+ years in software engineering
Summary
I've been working with monitoring systems for a while now, and one thing that always caught my attention is how Datadog manages to ingest massive amounts of data without breaking. When you're dealing with thousands of servers sending metrics every few seconds, things can get complicated pretty quickly. This article walks through how they've architected this system to handle billions of metrics per second.
What Datadog Does
Let me walk you through how they've architected this, based on what I've learned from their engineering blogs and my own experience using the platform.
Note: This is a simplified view based on common large-scale monitoring patterns. The actual implementation may vary, and this represents one way such systems can be architected at scale.
Datadog is a monitoring platform that collects metrics, logs, and traces from your infrastructure. The basic idea is simple: instead of having separate tools for different monitoring needs, you use one platform.
In practice, if you're running something like:
- 2,000 servers across AWS
- A bunch of containers (let's say 15,000)
- Multiple services and databases
Each of these components reports metrics—CPU usage, memory, request counts, error rates. When you add it all up, you're looking at a lot of data points being generated every second.
The Cardinality Problem
Here's where things get tricky. A metric isn't just a name and a value. It has tags attached to it.
For example, `http.request.latency` might have tags like:
- `service: checkout`
- `region: us-east-1`
- `environment: production`
- `status_code: 200`
Every unique combination of these tags creates a separate time series. So if you have 100 services, 5 regions, 3 environments, and 10 different status codes, you're already looking at 15,000 time series just for that one metric.
This is called high cardinality, and it's one of the main challenges in building monitoring systems at scale.
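To make that multiplication concrete, here's a small Go sketch (the tag names and value counts are just the illustrative numbers above, not real limits) showing how quickly tag combinations multiply into separate time series:

```go
package main

import "fmt"

func main() {
	// Hypothetical tag dimensions for http.request.latency.
	// Every unique combination of tag values is a separate time series.
	dims := []struct {
		tag    string
		values int
	}{
		{"service", 100},
		{"region", 5},
		{"environment", 3},
		{"status_code", 10},
	}

	series := 1
	for _, d := range dims {
		series *= d.values
		fmt.Printf("after %-12s -> up to %d series\n", d.tag, series)
	}
	// 100 * 5 * 3 * 10 = 15,000 series for a single metric name.
}
```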
The Scale Challenge
At a previous job, we had about 2,500 hosts. Each host sent roughly 50 metrics every 10 seconds. That works out to:
Scale Calculation: 2,500 hosts × 50 metrics × 6 times per minute = 750,000 metrics per minute. Or about 12,500 metrics per second.
And that was just infrastructure metrics, not including application-level metrics like request rates or custom business metrics.
If each metric data point is around 100 bytes (metric name, tags, value, timestamp), that's 1.25 MB/sec of raw data. With compression, it drops to maybe 300-400 KB/sec.
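If you want to plug in your own fleet size, here's the same back-of-envelope math as a tiny Go snippet (the constants are the example numbers above, not universal figures):

```go
package main

import "fmt"

func main() {
	const (
		hosts          = 2500
		metricsPerHost = 50
		reportsPerMin  = 6   // one report every 10 seconds
		bytesPerPoint  = 100 // rough size: name + tags + value + timestamp
	)

	perMinute := hosts * metricsPerHost * reportsPerMin // 750,000
	perSecond := perMinute / 60                         // 12,500
	rawBytesPerSec := perSecond * bytesPerPoint         // ~1.25 MB/s before compression

	fmt.Printf("%d points/min, %d points/sec, %.2f MB/sec raw\n",
		perMinute, perSecond, float64(rawBytesPerSec)/1e6)
}
```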
Now imagine a company like Datadog with thousands of customers, each potentially sending similar amounts of data. The numbers add up quickly.
The Overall Flow: How Data Moves Through the System
Before diving into details, here's the basic pipeline:
Your infrastructure sends metrics to the Datadog agent → Agent aggregates and forwards to Intake services → Intake pushes to Kafka → Kafka routes to Storage → When you query, data flows back through the Query layer to your dashboard.

Each step is designed to handle massive scale without becoming a bottleneck.
Starting at the Agent
The Datadog agent is a lightweight program (written in Go, typically 50-70 MB) that runs on each host or container. It collects system metrics like CPU and memory, integrates with services like PostgreSQL or Redis, and gathers custom application metrics—basically, it's the data collector sitting on your infrastructure.
Instead of sending every single data point, the agent does preprocessing locally. Let's say your server reports CPU usage every second:
Agent Aggregation Example: 70%, 72%, 75%, 71%, 73% → The agent aggregates these over a 10-15 second window and sends: avg: 72.2%, max: 75%, min: 70%. This reduces the amount of data being sent over the network significantly.

It also batches multiple metrics together and compresses them before sending.
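Here's a minimal sketch of that idea in Go, not the agent's actual code: raw samples from one flush window are collapsed into a min/max/avg summary, then batched and gzip-compressed before going over the wire. The metric name and window values are made up.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// summary is what gets shipped upstream instead of every raw sample.
type summary struct {
	Metric string  `json:"metric"`
	Avg    float64 `json:"avg"`
	Min    float64 `json:"min"`
	Max    float64 `json:"max"`
}

// aggregate collapses the raw samples from one 10-15 second flush window.
func aggregate(metric string, samples []float64) summary {
	s := summary{Metric: metric, Min: samples[0], Max: samples[0]}
	var sum float64
	for _, v := range samples {
		sum += v
		if v < s.Min {
			s.Min = v
		}
		if v > s.Max {
			s.Max = v
		}
	}
	s.Avg = sum / float64(len(samples))
	return s
}

func main() {
	window := []float64{70, 72, 75, 71, 73} // one CPU reading per second
	batch := []summary{aggregate("system.cpu.user", window)}

	// Batch and compress before sending a single payload over the network.
	payload, _ := json.Marshal(batch)
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(payload)
	zw.Close()

	fmt.Printf("5 raw points -> %+v, %d bytes compressed\n", batch[0], buf.Len())
}
```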
The Intake Layer
Once data leaves the agent, it hits Datadog's intake services. These are designed to be simple and stateless—they authenticate requests, validate the data format, and forward everything downstream.
The stateless part matters because it means they can scale horizontally. If traffic doubles, they just add more intake servers. No coordination needed between them.
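A stateless intake service can be surprisingly small. The sketch below is my assumption of what such a handler might look like, not Datadog's implementation; the endpoint path, header name, and `forward()` stub are placeholders:

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// intakeHandler checks the API key, sanity-checks the payload, and hands it
// off downstream. No local state means any replica can take any request, so
// scaling out is just adding more instances behind a load balancer.
func intakeHandler(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("DD-API-KEY") == "" { // auth check (simplified)
		http.Error(w, "missing API key", http.StatusUnauthorized)
		return
	}
	body, err := io.ReadAll(io.LimitReader(r.Body, 10<<20)) // cap payload size
	if err != nil || len(body) == 0 {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	forward(body) // push downstream (e.g. onto Kafka) -- stubbed here
	w.WriteHeader(http.StatusAccepted)
}

func forward(payload []byte) {
	// In a real system this would publish to the message queue.
	log.Printf("forwarded %d bytes", len(payload))
}

func main() {
	http.HandleFunc("/api/v1/series", intakeHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```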
Kafka as a Buffer
After intake, metrics go into Kafka. This serves as a buffer between data coming in and data being stored.
The reason this is useful: if there's a spike in traffic or if the storage layer slows down for some reason, Kafka just holds onto the data. Once things stabilize, the consumers catch up and process everything.

It's basically decoupling the ingestion rate from the storage rate, which prevents data loss during traffic spikes or system hiccups.
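Here's roughly what the intake-to-Kafka hop could look like, sketched with the segmentio/kafka-go client; the topic name, broker address, and keying scheme are assumptions for illustration:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Intake side: publish each metric payload onto a topic. If the storage
	// consumers slow down, messages simply accumulate here until they catch up.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "metrics.intake",
		Balancer: &kafka.Hash{}, // same key -> same partition
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(), kafka.Message{
		// Keying by metric identity keeps a series on one partition,
		// which preserves ordering for that series.
		Key:   []byte("http.request.latency|service:checkout|region:us-east-1"),
		Value: []byte(`{"points":[[1700000000, 72.2]]}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Because consumers track their own offsets, a slow storage layer just means consumer lag grows for a while; nothing upstream has to stop.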
Now, consumer services continuously read from Kafka and need to figure out where each metric should be stored. This is where routing and sharding come into play.
Routing and Sharding
Datadog routes metrics deterministically—the same metric always goes to the same shard. This is done using a hash of the metric name and tags.
So metrics for `service:checkout`, `region:us-east-1` always end up on the same shard. This keeps related data together and makes querying more efficient since you don't have to hit multiple shards for the same time series.
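A minimal version of that kind of deterministic routing, assuming a simple hash-mod scheme (real systems often use consistent hashing so that changing the shard count doesn't reshuffle everything):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// shardFor hashes the metric name plus its sorted tags so the same series
// always lands on the same shard. Sorting matters: the same tag set must
// hash identically regardless of the order the tags arrive in.
func shardFor(metric string, tags []string, numShards int) int {
	sorted := append([]string(nil), tags...)
	sort.Strings(sorted)

	h := fnv.New64a()
	h.Write([]byte(metric))
	h.Write([]byte(strings.Join(sorted, ",")))
	return int(h.Sum64() % uint64(numShards))
}

func main() {
	tags := []string{"service:checkout", "region:us-east-1", "env:production"}
	// Same metric + tags -> same shard, every time.
	fmt.Println(shardFor("http.request.latency", tags, 64))
	fmt.Println(shardFor("http.request.latency", tags, 64))
}
```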
Once the routing determines which shard a metric belongs to, it gets written to storage. Let's look at how that storage is organized.
Storage Design
The storage is split into two parts:
- A metadata index that maps metric names and tags to the time series that match them
- The time series storage itself, which holds the actual data points for each series

When you run a query, it first finds the relevant time series using the metadata index, then fetches the actual data points from time series storage. This separation keeps queries fast even with millions of time series.
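To show why the split helps, here's a toy sketch with in-memory maps standing in for the two stores; the structure, not the storage engine, is the point:

```go
package main

import "fmt"

// Point is one timestamped value in a series.
type Point struct {
	Ts    int64
	Value float64
}

// store separates the two concerns: a metadata index that resolves tag
// filters to series IDs, and a second map holding the actual points for
// each series. (Maps stand in for what would really be an inverted index
// and a time series database.)
type store struct {
	index  map[string][]uint64 // tag ("service:checkout") -> series IDs
	series map[uint64][]Point  // series ID -> data points
}

func (s *store) query(tag string, from, to int64) []Point {
	var out []Point
	// Step 1: the metadata index narrows the search to matching series.
	for _, id := range s.index[tag] {
		// Step 2: fetch only those series' points in the time range.
		for _, p := range s.series[id] {
			if p.Ts >= from && p.Ts <= to {
				out = append(out, p)
			}
		}
	}
	return out
}

func main() {
	s := &store{
		index:  map[string][]uint64{"service:checkout": {1}},
		series: map[uint64][]Point{1: {{Ts: 1700000000, Value: 72.2}}},
	}
	fmt.Println(s.query("service:checkout", 1699999000, 1700001000))
}
```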
Query Execution
When you load a dashboard, here's roughly what happens:
- The query identifies which time series match your filters
- Data is fetched in parallel from the relevant shards
- Results are aggregated
- The chart renders

The parallel fetching is important—instead of querying shards one by one, they're all queried simultaneously. This is why dashboards with dozens of charts can still load in a couple seconds.
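The fan-out step might look something like this sketch, where `queryShard` is a stand-in for a network call to one shard:

```go
package main

import (
	"fmt"
	"sync"
)

// queryShard stands in for a network call to one storage shard.
func queryShard(shard int) []float64 {
	return []float64{float64(shard), float64(shard) * 2} // fake partial result
}

func main() {
	const numShards = 8
	results := make([][]float64, numShards)

	// Fan out to every relevant shard at once instead of one by one; total
	// latency is roughly the slowest shard, not the sum of all shards.
	var wg sync.WaitGroup
	for i := 0; i < numShards; i++ {
		wg.Add(1)
		go func(shard int) {
			defer wg.Done()
			results[shard] = queryShard(shard)
		}(i)
	}
	wg.Wait()

	// Aggregate the partial results before rendering the chart.
	var merged []float64
	for _, r := range results {
		merged = append(merged, r...)
	}
	fmt.Println(len(merged), "points merged from", numShards, "shards")
}
```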
Data Retention and Downsampling
One thing to note: Datadog doesn't keep full-resolution data forever. Recent data stays at full resolution, while older data is progressively downsampled into coarser rollups, such as hourly averages.
This is a practical compromise. You need detailed data for recent troubleshooting, but for historical trends, hourly averages are usually sufficient.
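Downsampling itself is conceptually simple: group raw points into fixed buckets and keep one aggregate per bucket. Here's a sketch using hourly averages (the bucket size and choice of average are illustrative, not Datadog's actual rollup policy):

```go
package main

import "fmt"

// Point is a timestamped sample.
type Point struct {
	Ts    int64 // unix seconds
	Value float64
}

// rollup downsamples raw points into fixed buckets (3600s = hourly averages),
// which is how older data can be kept around cheaply once full resolution is
// no longer needed.
func rollup(points []Point, bucketSecs int64) []Point {
	sums := map[int64]float64{}
	counts := map[int64]int{}
	for _, p := range points {
		b := p.Ts - p.Ts%bucketSecs // start of the bucket this point falls in
		sums[b] += p.Value
		counts[b]++
	}
	out := make([]Point, 0, len(sums))
	for b, sum := range sums {
		out = append(out, Point{Ts: b, Value: sum / float64(counts[b])})
	}
	return out
}

func main() {
	raw := []Point{
		{Ts: 1700000000, Value: 70}, {Ts: 1700000900, Value: 74},
		{Ts: 1700003600, Value: 80}, {Ts: 1700004500, Value: 90},
	}
	// Two hourly points (order may vary): averages 72 and 85.
	fmt.Println(rollup(raw, 3600))
}
```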
Handling Failures
A few failure scenarios worth mentioning:
- An intake node dies: because intake is stateless, the load balancer routes traffic to the remaining nodes, and more can be added without any coordination.
- The storage layer slows down or goes offline: Kafka holds the backlog, and consumers catch up once storage recovers.
- A sudden traffic spike: agents are already batching and aggregating at the edge, and Kafka absorbs whatever the intake layer pushes through, so the spike becomes a delay rather than data loss.
What Makes This Work
The key design decisions that enable this scale:
- Pushing aggregation to the edge (at the agent level)
- Keeping services stateless where possible
- Using Kafka as a buffer layer
- Sharding data intelligently
- Separating metadata from time series storage
- Parallelizing query execution
It's not any single magic bullet, but rather a combination of thoughtful architectural choices that work together.
Some Practical Takeaways
If you're building something similar (or just trying to use Datadog more effectively):
- Aggregate at the edge. Sending pre-aggregated summaries instead of raw points cuts network and storage costs dramatically.
- Watch your cardinality. Every extra tag multiplies the number of time series, so be deliberate about which tags you attach.
- Put a buffer between ingestion and storage. A queue like Kafka turns traffic spikes and storage hiccups into delays instead of data loss.
- Separate metadata from data. Finding which series match a filter and fetching the points for those series are different problems with different access patterns.
- Downsample old data. Full resolution matters for recent troubleshooting; hourly rollups are enough for long-term trends.
Closing Thoughts
Building a monitoring system that can handle billions of metrics per second isn't about having one clever trick. It's about making practical engineering decisions at each layer, from the agent to storage to queries.
Datadog's approach is a good example of scaling through architecture rather than just throwing more hardware at the problem. Though obviously, they do use a lot of hardware too.
If you're dealing with monitoring at scale, there's quite a bit to learn from how they've structured things. The principles apply even if you're not operating at billions of metrics per second – just maybe at millions instead.