Industry Trend

Feasibility study of AI-based energy saving for cooling fans in a 1U server with multiple heat sources



■Research Background

As technology advances, the volume of data that must be stored, transmitted, and computed keeps growing. To simplify management and allow scalable growth, high-density server facilities are grouped into data centers. As the Internet of Things, cloud storage, big data, artificial intelligence, and 5G develop rapidly, demand for data centers as data and information processing platforms will keep rising, and data centers will generate high-density waste heat during operation. This waste heat comes mainly from the servers in each cabinet. If it is not dissipated effectively, the system will overheat and fail to operate normally. Effective cooling is therefore essential to stable server performance and reliable high-speed computing in the data center, and both the reliance on server systems and their complexity are increasing day by day.

According to the breakdown of electricity use in U.S. data centers in 2014 [1], 40% of the energy is spent dissipating the heat generated by the servers. Reducing the energy consumption of the cooling system has therefore become an important research topic in recent years: besides using energy more efficiently, it can significantly reduce energy costs. In the future, data centers will need to strike a balance between server performance requirements and energy costs.


■Research Method

1. Deep reinforcement learning

This research uses the deep deterministic policy gradient (DDPG) algorithm from deep reinforcement learning. The algorithm is based on a Markov decision process: the agent interacts repeatedly with the environment, and each interaction is recorded in a database (replay buffer). Once enough data has accumulated, random batches are drawn from the database for training. Model training involves two neural networks, a target network and an evaluation network. The evaluation network updates its parameters by gradient descent on the loss at every training step, whereas the target network is updated only slowly at each step. Compared with the evaluation network, the target network can therefore be regarded as a fixed reference point, which helps the evaluation network converge more stably. Finally, the actor network parameters are updated; optimizing the actor network in this way helps the agent find a suitable operating point for its decisions under different conditions.
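The training loop described above rests on three pieces that are standard in DDPG: a replay buffer from which random minibatches are drawn, a critic target computed with the target networks, and a soft (Polyak) update that lets the target network trail the evaluation network slowly. The Python sketch below shows only these pieces, with network parameters represented as plain lists of floats. The networks themselves, the discount factor `gamma = 0.99` and the soft-update rate `tau = 0.005` are illustrative assumptions, not values reported in this study.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

# One stored interaction: (state, action, reward, next_state).
Transition = Tuple[List[float], float, float, List[float]]


class ReplayBuffer:
    """Fixed-size store of past interactions; the oldest entries are dropped first."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def add(self, state: List[float], action: float, reward: float,
            next_state: List[float]) -> None:
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform random minibatch without replacement, which breaks the
        # correlation between consecutive interactions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


def td_target(reward: float, next_q: float, gamma: float = 0.99) -> float:
    """Critic regression target y = r + gamma * Q'(s', mu'(s')).

    next_q is the target critic's value of the next state, evaluated at the
    action chosen by the target actor (assumed to be computed elsewhere)."""
    return reward + gamma * next_q


def soft_update(target: Sequence[float], evaluation: Sequence[float],
                tau: float = 0.005) -> List[float]:
    """Move target parameters a small step toward the evaluation parameters:
    theta_target <- tau * theta_eval + (1 - tau) * theta_target."""
    return [tau * e + (1.0 - tau) * t for t, e in zip(target, evaluation)]
```

A small `tau` is what makes the target network behave like the near-fixed reference point described in the text: it moves only a fraction of the way toward the evaluation network on each update.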

During its interaction with the environment, the agent needs index parameters that describe the current conditions inside the server. These indicators fall into three main categories: heat source characteristics, environment and internal configuration, and fan configuration. Each category contains several parameters that serve as the observed feature values. In this study, the action output by the agent is the fan duty cycle, which sets the fan's current speed.
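As a rough illustration of how the three indicator categories and the duty-cycle action could be encoded, the sketch below flattens an observation into one state vector and maps a bounded actor output onto a duty cycle. The specific features (heat source temperatures and powers, inlet air temperature, current fan duty cycles), the `[-1, 1]` actor output range and the duty-cycle limits `0.2` to `1.0` are all hypothetical choices, not the study's actual parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerObservation:
    """Hypothetical grouping of the three indicator categories into one state."""
    # Heat source characteristics (assumed features).
    source_temps_c: List[float]
    source_powers_w: List[float]
    # Environment and internal configuration (assumed feature).
    inlet_air_temp_c: float
    # Fan configuration: current duty cycle of each fan, from 0 to 1.
    fan_duties: List[float]

    def to_vector(self) -> List[float]:
        """Flatten into the input vector fed to the actor and critic networks."""
        return [*self.source_temps_c, *self.source_powers_w,
                self.inlet_air_temp_c, *self.fan_duties]


def action_to_duty(a: float, duty_min: float = 0.2, duty_max: float = 1.0) -> float:
    """Map an actor output in [-1, 1] (e.g. from a tanh layer) linearly onto a duty cycle."""
    a = max(-1.0, min(1.0, a))  # clip in case exploration noise pushed it out of range
    return duty_min + (a + 1.0) / 2.0 * (duty_max - duty_min)
```

Keeping a nonzero lower bound on the duty cycle is one way to guarantee some baseline airflow regardless of what the actor outputs.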

After each interaction between the agent and the environment, a reward value is generated. This value judges how good the action was based on the outcome of the interaction, and it is also the main basis for training the critic network, which in turn shapes the agent's eventual behavior. To control heat dissipation effectively, the factors that govern it are considered in turn. The fin efficiency and heat transfer area are fixed by the heat sink design and cannot be changed by the controller. The heat transfer coefficient depends on the airflow rate driven by the fan, and by the fan laws, fan speed is the dominant factor in system power consumption (fan power scales with the cube of speed), making it the largest contributor to energy use. Finally, the effective temperature difference also depends on fan speed. This study requires the heat source temperatures to stay within the normal operating range; the reward assigned to each action is designed to enlarge the server's energy-saving margin while avoiding component overheating and damage.
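The reasoning above follows the heat sink energy balance Q = η_f · h · A · ΔT: with fin efficiency η_f and area A fixed, only the heat transfer coefficient h (through airflow) and the effective temperature difference ΔT remain controllable, and both depend on fan speed, whose power cost follows the fan affinity law P ∝ N³. A minimal sketch of a reward built on these ideas is shown below. It assumes fan speed is proportional to duty cycle, and the temperature limit (85 °C), the penalty weight and the normalization are hypothetical choices, not the reward actually used in this study.

```python
from typing import Sequence


def fan_power_w(duty: float, rated_power_w: float) -> float:
    """Fan affinity law P2/P1 = (N2/N1)^3, assuming speed is proportional to duty cycle."""
    return rated_power_w * duty ** 3


def reward(max_source_temp_c: float, duties: Sequence[float], rated_power_w: float,
           temp_limit_c: float = 85.0, penalty_per_k: float = 10.0) -> float:
    """Hypothetical reward: favor low fan power, penalize exceeding the temperature limit."""
    total = sum(fan_power_w(d, rated_power_w) for d in duties)
    # Normalize by the power of all fans at full speed so the energy term lies in [-1, 0].
    r = -total / (rated_power_w * len(duties))
    if max_source_temp_c > temp_limit_c:
        # The penalty grows with how far the hottest heat source overshoots its limit.
        r -= penalty_per_k * (max_source_temp_c - temp_limit_c)
    return r
```

Because power grows with the cube of speed, halving every fan's duty cycle cuts the energy term to one eighth, so a reward of this shape pushes the agent to run the hottest source close to, but not above, its limit.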


2. Server transient environment simulation

Figure 1 Actual server configuration


Figure 1 shows a commercially available server. Its configuration is complex and its internal space is limited, so this study uses a simplified heat transfer model of the server. First, the server is assumed to have a single inlet and outlet channel with no external flow. Under this condition, the fan static pressure equals the total pressure drop of the channel. Developing-flow effects in forced convection are neglected, and each heat sink is assumed to be cooled only by the air entering through the frontal area of its inlet. The cooling effect of the surrounding bypass channel is ignored, and the air entering a heat sink is assumed not to leak into the surrounding bypass channel as it passes through.
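Since the fan static pressure is taken to equal the total pressure drop of the single channel, the airflow at a given fan speed lies where the fan curve meets the channel's system curve. The sketch below finds that operating point by bisection. It assumes a quadratic full-speed fan curve scaled to other speeds with the fan laws (flow ∝ N, pressure ∝ N²) and a channel pressure drop of the form k·Q²; the curve shapes and all numbers are hypothetical, not the fan data of Figure 2.

```python
def fan_static_pressure_pa(q: float, speed_ratio: float, p_max: float, q_max: float) -> float:
    """Hypothetical fan curve p = p_max * (1 - (q / q_max)^2) at full speed, scaled to
    speed_ratio with the fan laws; this simplifies to s^2 * p_max - p_max * (q / q_max)^2."""
    return p_max * speed_ratio ** 2 - p_max * (q / q_max) ** 2


def channel_pressure_drop_pa(q: float, k: float) -> float:
    """Assumed system curve: total channel pressure drop grows with the square of flow."""
    return k * q ** 2


def operating_flow_m3_s(speed_ratio: float, p_max: float, q_max: float, k: float,
                        tol: float = 1e-12) -> float:
    """Bisect for the flow at which fan static pressure equals the channel pressure drop."""
    lo, hi = 0.0, speed_ratio * q_max  # the fan delivers zero pressure at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fan_static_pressure_pa(mid, speed_ratio, p_max, q_max) > channel_pressure_drop_pa(mid, k):
            lo = mid  # fan still has pressure to spare: the operating point is at higher flow
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with the hypothetical values `p_max = 100.0` Pa, `q_max = 0.01` m³/s and `k = 1e6`, the full-speed flow is about 0.00707 m³/s, and halving the speed halves the flow, as the fan laws predict for a quadratic system curve.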

In addition to the above assumptions, the distribution of heat sources in the server is also simplified. The heat source modules are assumed to be arranged in columns along the flow direction. Successive columns are connected in series, upstream to downstream, and each downstream inlet inherits the fluid properties of the upstream outlet; heat sinks in the same column sit side by side, forming several parallel channels. Under this assumption, the server interior is divided into multiple imaginary channels, each containing at most one heat sink. The heat sink does not fill the channel cross section, so part of the flow bypasses it. Neglecting differences in the side profiles of the channels, this geometry approximates the model studied by Jonsson and Moshfegh [2], so heat sink performance is described by empirical correlations for its pressure drop and Nusselt number.
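The series rule above (each downstream inlet inherits the upstream outlet) can be illustrated with a simple energy balance marched along one imaginary channel, T_out = T_in + Q / (ṁ · c_p). This sketch assumes that the heat sink stream and the bypass stream mix fully before the next column and that all module heat goes into the channel air; it does not reproduce the pressure drop and Nusselt number correlations of [2]. The mass flow and heat loads in the example are hypothetical.

```python
from typing import List, Sequence, Tuple

CP_AIR_J_KG_K = 1005.0  # approximate specific heat of air near room temperature


def march_channel(inlet_temp_c: float, mass_flow_kg_s: float,
                  heat_loads_w: Sequence[float]) -> Tuple[List[float], float]:
    """Return the inlet air temperature seen by each module in one channel
    (listed upstream to downstream) and the channel outlet temperature."""
    inlet_temps: List[float] = []
    t = inlet_temp_c
    for q in heat_loads_w:
        inlet_temps.append(t)
        # Energy balance over one module, assuming all its heat goes into the
        # channel air and the streams mix fully before the next column.
        t += q / (mass_flow_kg_s * CP_AIR_J_KG_K)
    return inlet_temps, t
```

Calling it once per parallel channel gives the inlet temperature each module sees; for example, 0.01 kg/s of 25 °C air passing a 100.5 W module and then a 201 W module reaches the second module at about 35 °C and leaves at about 55 °C, which is why downstream modules are harder to cool.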


■Preliminary Results

The preliminary control model has been completed, and Table 1 lists the range of 1U server environment and configuration parameters to which the model applies. Figure 1 is a schematic of the environmental parameters and configuration of the simulated server; each numbered block is a heat source module consisting of a heat source and a heat sink. Figure 2 shows the performance of the fan used in the simulation.


Chart 1 Server simulation environment configuration


Figure 2 Fan performance diagram


Comparison of traditional switching control and algorithm control

The simulation results in Fig. 2 and Fig. 3 show that both methods keep the temperature under control, but the traditional control method consumes 109% of the energy used by algorithm control; equivalently, algorithm control saves about 8% of the energy used by the traditional method (1 - 1/1.09 ≈ 0.083). Algorithm control holds the maximum heat source temperature close to its upper limit, exploiting the largest effective heat transfer temperature difference and relying on the fans as little as possible. Algorithm-based control therefore achieves a greater energy-saving effect.
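The article does not detail how the traditional switch control works. One common form is a thermostat-style two-level controller with a hysteresis band, sketched below for comparison together with the saving arithmetic used above. The thresholds (75 °C and 65 °C) and duty levels (1.0 and 0.3) are hypothetical, and this may differ from the baseline actually simulated.

```python
def switch_control(max_temp_c: float, current_duty: float,
                   t_on_c: float = 75.0, t_off_c: float = 65.0,
                   duty_high: float = 1.0, duty_low: float = 0.3) -> float:
    """Hypothetical two-level switching controller with a hysteresis band."""
    if max_temp_c >= t_on_c:
        return duty_high   # too hot: jump to the high setting
    if max_temp_c <= t_off_c:
        return duty_low    # cool enough: drop to the low setting
    return current_duty    # inside the band: keep the previous setting to avoid chattering


def relative_saving(energy_baseline: float, energy_new: float) -> float:
    """Fraction of the baseline energy saved by the new controller."""
    return 1.0 - energy_new / energy_baseline
```

With the reported ratio, `relative_saving(109.0, 100.0)` is 9/109, about 0.083. A controller like this wastes energy because it jumps straight to the high setting, where cubic fan power is most expensive, instead of settling at the lowest duty cycle that keeps the hottest source just below its limit.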


Although algorithm control effectively improves energy savings, there is still room for improvement. We therefore keep the previous architecture and vary only the fan control scheme to compare temperature control and energy savings. Figure 4 shows the result of adjusting only one fan at a time: the temperature response lags at first, but in the later stage temperature control is more stable and the fluctuations are smaller. In Figure 5, the fans are divided into several zones, each containing several fans, and each adjustment is applied to a whole zone. This responds faster to temperature changes, but because more fans change at once, the disturbance to the overall flow is larger, the temperature oscillates more severely, and the energy savings are comparatively poorer.
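The difference between the two schemes lies in how one action is applied to the fan array. The sketch below contrasts a per-fan action with a per-zone action and counts how many fans change in one step, a rough proxy for the flow disturbance described above. The six-fan, two-zone layout and the duty values are hypothetical.

```python
from typing import List, Sequence


def apply_single_fan(duties: Sequence[float], fan: int, duty: float) -> List[float]:
    """Per-fan scheme: each control step changes exactly one fan's duty cycle."""
    new = list(duties)
    new[fan] = duty
    return new


def apply_zone(duties: Sequence[float], zones: Sequence[Sequence[int]],
               zone: int, duty: float) -> List[float]:
    """Zone scheme: each control step sets every fan in one zone to the same duty cycle."""
    new = list(duties)
    for fan in zones[zone]:
        new[fan] = duty
    return new


def fans_changed(before: Sequence[float], after: Sequence[float]) -> int:
    """Number of fans whose setting changed in one step."""
    return sum(1 for b, a in zip(before, after) if b != a)
```

With six fans split into zones `[[0, 1, 2], [3, 4, 5]]`, a single-fan step changes one fan while a zone step changes three, which matches the observation that zone control reacts faster but disturbs the overall flow more.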



Training the model on a large database of interaction data effectively reduces the time cost of designing the server configuration, and it lets the controller choose more appropriate actions under different operating conditions, reducing the energy the server spends on heat dissipation. The current results show that the algorithm can control heat source temperatures, prevent overheating, and exploit the maximum effective temperature difference for heat dissipation, using the fans only as needed for additional cooling. This reduces fan energy consumption and enlarges the energy-saving margin of the overall system. Follow-up work will continue to optimize the intelligent control and extend it to cabinets and computer rooms.



[1] Energy Innovation: Policy and Technology LLC, 2020, "How Much Energy Do Data Centers Really Use?"

[2] H. Jonsson and B. Moshfegh, 2001, "Modeling of the thermal and hydraulic performance of plate fin, strip fin, and pin fin heat sinks - influence of flow bypass," IEEE Transactions on Components and Packaging Technologies, vol. 24, no. 2, pp. 142-149.




Professor Wang Qichuan

Expertise | Electronic heat dissipation, cloud computing energy management, desalination, development and application of non-traditional fluid machinery, refrigeration and air conditioning, supercritical fluid systems and heat exchangers (software and hardware), LED heat dissipation, micro-channel heat flow design (single- and two-phase fluid applications)