The answer: Not much when it comes to power provisioning.
Hear us out. The demand for compute capacity is growing every day, and successful companies are scaling faster than ever. In response, organizations have typically resorted to over-provisioning -- they buy more data center space and equipment than necessary today in anticipation of IT demands tomorrow. While this strategy is helpful for contingency planning, it presents a downside: zombie servers sit idle in data centers, wasting energy and money.
The same over-provisioning exists with power supplies in a data center, where bigger doesn’t necessarily mean better, particularly as pressure to reduce power consumption increases. In a past life, when we built infrastructure hardware, we had a hunch that most companies were pairing their infrastructure with the wrong power supplies. Anecdotal evidence showed that companies tended to buy power supplies with far more capacity than they ever utilized.
Now we have the analysis to prove it. Using data from a client’s deployment as a proxy, we studied the maximum load of power supplies and their normal and critical output. We then measured actual power consumption and found a huge discrepancy between the load and output at both levels. The answer became clear: when it comes to power, companies are paying for the infrastructure they need, but they’re not using it to its full potential. As a result, they’re losing tens of thousands of dollars in the process.
We informed our client that the best provisioning strategy is one based on your infrastructure's actual load instead of on the power supply's nameplate power value. Furthermore, as anyone running a data center knows, power supplies follow an efficiency curve. The closer they run to their maximum output, the more efficient they are. Our data shows that you can actually be less efficient by not using your power supplies to their fullest extent.
Power provisioning for a given data center is usually based on the assumption that each system operates at 80% of its power supply's rated maximum power, otherwise known as the nameplate value. To scale the supporting power infrastructure within the data center accordingly, substantially higher capital expenditures are needed to buy extra, higher reliability, larger-sized equipment, including transformers, cooling and UPS systems.
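To make the arithmetic concrete, here is a minimal sketch of that nameplate-based approach. The 750 W nameplate rating is a hypothetical example, not a figure from the client's fleet; only the 80% assumption and fleet size come from the analysis above.

```python
# Sketch of conventional nameplate-based power provisioning.
# NAMEPLATE_W is a hypothetical per-server rating for illustration.
NAMEPLATE_W = 750
ASSUMED_UTILIZATION = 0.80   # the common 80%-of-nameplate assumption
NUM_SERVERS = 1600

# Budget each server at 80% of its nameplate, then size the supporting
# infrastructure (transformers, UPS, cooling) for the fleet total.
per_server_budget_w = NAMEPLATE_W * ASSUMED_UTILIZATION
provisioned_kw = NUM_SERVERS * per_server_budget_w / 1000

print(f"{per_server_budget_w:.0f} W per server, {provisioned_kw:.0f} kW total")
```

If actual draw averages well under that 600 W budget, every downstream piece of infrastructure sized from it is over-built.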
But is a manufacturer’s nameplate value the best place to start? We analyzed data provided by one of our customers, a cloud provider with a cluster containing over 1,600 nodes. We also came across a previous study by a team at Google which discovered a large gap (7% to 16%) between achieved and theoretical peak power usage for groups of thousands of servers. This gap gets even larger — up to 40% — when measured for an entire data center.
Unlike Google, our customer's infrastructure is very heterogeneous, which made it impossible to find even 500 systems of the same model (see figure 1 below). Even with this wide range of servers, our analysis showed the gap in achieved and theoretical peak power usage is more significant than the numbers quoted in Google’s study, and we suspect that this is true for other companies as well.
Figure 1: The customer's heterogeneous infrastructure.
We first looked for over-provisioning cases in our client’s infrastructure, comparing measured power draw with nameplate power values. From this perspective, a larger gap means lower operational efficiency based on a typical power efficiency curve. The "sweet spot" on the efficiency curve is typically in the 40% to 80% range, so machines running at 20% to 30% of nameplate are over-provisioned and operate in a low efficiency range.
Figure 2: Efficiency curve of a second generation Open Compute power supply. Even efficient power supplies have a lower efficiency when lightly loaded.
The following graph compares the average power consumption and peak power consumption with the nameplate power. As you can see, the majority of systems are operating below the sweet spot. Almost none of the systems are operating at the top of the sweet spot range, where they would be more efficient.
Figure 3: Histogram of average and peak power consumption normalized by nameplate value.
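The normalization behind that histogram is simple to sketch. The fleet entries below are made-up example readings, not the customer's data; the 40%–80% sweet spot band is the one described above.

```python
# Sketch: flag over-provisioned servers by comparing measured draw to
# the nameplate rating. Fleet values are illustrative, not real data.
SWEET_SPOT = (0.40, 0.80)  # efficient load range as fraction of nameplate

def classify(measured_w, nameplate_w):
    """Return the load fraction and a provisioning verdict."""
    frac = measured_w / nameplate_w
    if frac < SWEET_SPOT[0]:
        verdict = "over-provisioned"    # below the efficiency sweet spot
    elif frac <= SWEET_SPOT[1]:
        verdict = "well-matched"
    else:
        verdict = "under-provisioned"   # little headroom left
    return frac, verdict

fleet = [(180, 750), (420, 750), (300, 450)]  # (avg watts, nameplate watts)
for measured, nameplate in fleet:
    frac, verdict = classify(measured, nameplate)
    print(f"{frac:.0%} of nameplate -> {verdict}")
```

In the client's data, most systems landed in the first category.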
In this example, the power supplies could easily double their power output. Operating at power levels below the sweet spot results in a higher fraction of wasted energy, not to mention additional heat generated by the lower efficiency, which increases cooling costs. The same can be said for data center equipment like transformers and cooling equipment, which are most efficient at higher loads. This type of power provisioning isn’t economical or ecological.
A quick back-of-the-envelope calculation shows the economic impact of this over-provisioning. With 1,600 servers in its fleet and a utility rate of $0.10 / kWh, we calculated that the client could save over 300,000 kilowatt hours per year, at a savings of over $33,000 -- no small amount for a company scaling rapidly while trying to manage cost.
| Avg input power per server (watts) | 260 |
| Number of servers | 1,600 |
| Cost / kWh (based on 2016 industry data) | $0.10 |
| Power usage effectiveness (PUE) | 1.30 |
| Efficiency A (of customer’s existing servers) | 87.00% |
| Efficiency B (if operating at servers’ sweet spot) | 94.00% |
| Total kWh / year A | 4,737,408 |
| Total kWh / year B | 4,405,789 |
| Savings (kWh / year) | 331,619 |
| Total power cost A | $473,740.80 |
| Total power cost B | $440,578.94 |
| Savings | $33,161.86 |
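The table's figures can be reproduced in a few lines. This sketch uses the inputs from the table and, as the table does, treats the savings as the 7-point efficiency delta applied to the annual energy bill:

```python
# Back-of-the-envelope savings estimate, using the inputs from the table.
AVG_INPUT_W = 260        # average input power per server, watts
NUM_SERVERS = 1600
RATE_PER_KWH = 0.10      # dollars, 2016 industry data
PUE = 1.30
EFF_A = 0.87             # efficiency of the existing, lightly loaded supplies
EFF_B = 0.94             # efficiency at the supplies' sweet spot
HOURS_PER_YEAR = 8760

# Annual facility energy: server draw scaled by fleet size, hours, and PUE.
total_kwh_a = AVG_INPUT_W / 1000 * NUM_SERVERS * HOURS_PER_YEAR * PUE
# Savings modeled as the efficiency delta applied to that total.
savings_kwh = total_kwh_a * (EFF_B - EFF_A)
total_kwh_b = total_kwh_a - savings_kwh
savings_usd = savings_kwh * RATE_PER_KWH

print(round(total_kwh_a), round(total_kwh_b),
      round(savings_kwh), round(savings_usd, 2))
```

Running it recovers the table: about 4,737,408 kWh/year today, 4,405,789 kWh/year at the sweet spot, and roughly $33,000 in annual savings.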
There are even more ways to save on data center power and infrastructure which can be considered with further analysis:

- Right-sized power supplies are cheaper than larger power supplies, and they also lower a company’s CapEx
- Planning for actual loads means you can amortize data center power and cooling infrastructure across more servers
- Data center infrastructure (transformers and chillers) is also more efficient at higher loads
This study was the first time we used Coolan's data collection in a production environment, and it demonstrated a customer's inefficiencies. We learned that an IT organization shouldn't provision data center power based on a power supply's nameplate power, since the actual load is much lower on average.
Since this customer is a large software developer, this is likely the case for many cloud-based providers as well. By analyzing the power ratings from actual operations, our data shows that the power infrastructure in a data center is typically oversized. If it were scaled down correctly, the equipment — especially the power infrastructure blocks — would draw less power from the grid overall. Our client was leaving more than $30,000 on the table every year; extended across a larger data center, the savings could easily approach hundreds of thousands of dollars.
Provisioning can be a tricky line to walk. The best strategy is to measure workload and power, then provision against the measured load to avoid under-subscribing your power infrastructure. Collecting actual power consumption data for a variety of systems and applications, over a meaningful period of time and across different regimes of operation, is essential for cost-effective power provisioning at a data center. Such datasets can help, for example, with capacity planning, scheduling upgrades, and choosing better server density and power supply models for a given application type.