Data Center Heatmap

At Automattic, our systems team manages over 10,000 physical servers in 30 data centers on 6 continents. As our compute density has grown from 24 CPU threads per rack unit (RU) in 2013 to 128 CPU threads per RU in 2022, the maximum tolerable inlet temperatures have dropped. Older, less powerful servers can operate with air inlet temperatures of up to 42°C (107.6°F), while newer servers trigger CPU throttling at much lower temperatures of 35°C to 37°C (95°F to 98.6°F). Normal data center operating temperatures tend to be between 20°C and 25°C (68°F to 77°F), but cooling failures are quite common (even affecting Google), so we have to watch temperatures carefully.

We are big fans of Prometheus and Grafana and for a few years we have had temperature graphs that look like this.

This graph shows the temperatures of some servers located in our data center in Johannesburg, South Africa, for a week. The colored lines represent individual servers and the bold red line represents the average temperature in the rack.

We get this data from each server's inlet temperature sensor using ipmitool. I thought it would be interesting to see this data visualized a little differently, and Grafana has a Heatmap chart type that made it quite easy.
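
As a rough illustration of that collection step, the sketch below reads one sensor with ipmitool and formats the value as a Prometheus sample. It is not our actual collector. The sensor name "Inlet Temp" is an assumption (names differ between vendors), the function names are hypothetical, and "jnb" is a made-up data center label; only the metric name and the dc and location labels come from the query used in this post.

```python
import subprocess

# Assumption: the BMC exposes the sensor under this name; vendors differ.
SENSOR_NAME = "Inlet Temp"


def parse_sensor_reading(output: str) -> float:
    """Extract the value from a line shaped like 'Inlet Temp       | 23'."""
    for line in output.splitlines():
        name, sep, value = line.partition("|")
        if sep and name.strip() == SENSOR_NAME:
            return float(value.strip())
    raise ValueError(f"sensor {SENSOR_NAME!r} not found in ipmitool output")


def read_inlet_temp() -> float:
    """Run ipmitool on the local host (needs root and a BMC); returns Celsius."""
    result = subprocess.run(
        ["ipmitool", "sensor", "reading", SENSOR_NAME],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sensor_reading(result.stdout)


def to_prometheus_line(temp_c: float, dc: str, location: str) -> str:
    """Format one sample in the Prometheus text exposition format."""
    return f'ipmi_inlet_temp{{dc="{dc}",location="{location}"}} {temp_c}'


if __name__ == "__main__":
    # Hypothetical labels; in practice they would come from asset data.
    print(to_prometheus_line(read_inlet_temp(), dc="jnb", location="101-10"))
```

Keeping the parsing separate from the subprocess call means the parser can be checked against captured output on a machine without a BMC.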

First, we just want to graph the temperature per location for a given data center. In PromQL, that looks like this:

avg by (location) (ipmi_inlet_temp{dc="$DC"})
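
Outside Grafana, the same kind of query can be run against the Prometheus HTTP API. The following is a minimal sketch under some assumptions: the server address is a placeholder, the helper names are hypothetical, and the query is written with PromQL's avg by aggregation. It uses the standard instant-query endpoint, /api/v1/query.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder address; point this at your own Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def build_query(dc: str) -> str:
    """PromQL for the average inlet temperature per location in one data center."""
    return f'avg by (location) (ipmi_inlet_temp{{dc="{dc}"}})'


def build_url(dc: str, base_url: str = PROMETHEUS_URL) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": build_query(dc)})
    return f"{base_url}/api/v1/query?{params}"


def parse_response(payload: dict) -> dict:
    """Map each location label to its averaged temperature in Celsius."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        series["metric"]["location"]: float(series["value"][1])
        for series in payload["data"]["result"]
    }


def fetch_temperatures(dc: str) -> dict:
    """Run the query over HTTP; requires a reachable Prometheus server."""
    with urllib.request.urlopen(build_url(dc), timeout=10) as response:
        return parse_response(json.load(response))
```

In a Grafana dashboard the $DC variable is substituted by Grafana itself; here the data center is passed in as a plain argument instead.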

The location label includes the rack identifier and the position within the rack. For example, a location of 101-10 means Rack 101, RU 10. We store this information in our data center asset management system (which is a colon-separated file), and it is added as a label to all Prometheus metrics.

By choosing the Heatmap (New) chart type and configuring some basic chart options, we get a Grafana chart that displays the same data as our original chart, but in a different and more useful way. We can easily see that the top of the rack is warmer than the bottom, which is expected since the cold air in this setup comes from the floor. We can also see that temperatures have increased slightly over the past week, which is not ideal, but they are not at dangerous levels.
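
To make the location format concrete, here is a small sketch that splits a label such as 101-10 into rack and RU, then orders readings from the top of each rack to the bottom, which is roughly how the heatmap rows read. The function names are hypothetical, and the sketch assumes that higher RU numbers sit higher in the rack, which is the usual convention but not a universal one.

```python
from collections import defaultdict


def parse_location(location: str) -> tuple[str, int]:
    """Split a label like '101-10' into rack '101' and rack unit 10."""
    rack, _, ru = location.rpartition("-")
    if not rack or not ru.isdigit():
        raise ValueError(f"unexpected location format: {location!r}")
    return rack, int(ru)


def heatmap_rows(temps: dict[str, float]) -> dict[str, list[tuple[int, float]]]:
    """Group readings by rack, ordered from the highest RU number to the lowest.

    Assumption: a higher RU number means a higher physical position.
    """
    racks = defaultdict(list)
    for location, temp_c in temps.items():
        rack, ru = parse_location(location)
        racks[rack].append((ru, temp_c))
    return {rack: sorted(rows, reverse=True) for rack, rows in racks.items()}


if __name__ == "__main__":
    # Made-up readings for illustration only.
    readings = {"101-10": 22.0, "101-40": 27.5, "102-05": 21.0}
    for rack, rows in heatmap_rows(readings).items():
        for ru, temp_c in rows:
            print(f"rack {rack} RU {ru:>2}: {temp_c:.1f} C")
```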
