A Universal Self-Calibrating Dynamic Voltage and Frequency Scaling (DVFS) Scheme with Thermal Compensation for Energy Savings in FPGAs

Shuze Zhao¹, Ibrahim Ahmed¹, Carl Lamoureux¹, Ashraf Lotfi², Vaughn Betz¹ and Olivier Trescases¹

¹University of Toronto, Toronto, ON, Canada
10 King’s College Road, Toronto, ON, M5S 3G4, Canada
²Altera Corp., Hampton, NJ, 08827, USA
Email: szhao@ece.utoronto.ca

Abstract—Field Programmable Gate Arrays (FPGAs) are widely used in telecom, medical, military and cloud computing applications. Unlike in microprocessors, the routing and critical path delay of FPGAs are user dependent. The design tool suggests a maximum operating frequency based on the worst-case timing analysis of the critical paths at a fixed nominal voltage, which usually means there is significant voltage or frequency margin in a typical chip. This paper presents a universal offline self-calibration scheme, which automatically finds the FPGA frequency and core voltage operating limit at different self-imposed temperatures by monitoring design-specific critical paths. These operating points are stored in a calibration table and used to dynamically adjust the frequency and core voltage according to the FPGA temperature when the application circuit is running. The self-calibration process is demonstrated on an Altera Cyclone IV 60-nm FPGA with a digitally controlled dc-dc converter, leading to 40% power savings in a typical digital filter application.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) can outperform microprocessors and Digital Signal Processors (DSPs) in many applications, thanks to their ability to implement massively parallel algorithms [1]–[3]. Since FPGAs can be reprogrammed to accommodate evolving standards, they eliminate the custom manufacturing and resulting high Non-Recurring Engineering (NRE) costs and development time of Application-Specific Digital ICs (ASICs). Thus FPGAs are widely used in telecom, medical, military and cloud computing applications. However, the flexibility of FPGAs comes at a significant cost; they typically consume ten times the dynamic power of an ASIC performing the same task [4], making power reduction techniques crucial for FPGAs. Dynamic Voltage and Frequency Scaling (DVFS) has been widely deployed in microprocessor applications over the past decade [5]–[11]. The fact that an FPGA can be programmed to perform any digital function gives rise to some unique challenges in designing a DVFS control system. Unlike microprocessors, the speed-limiting paths of a specific FPGA IC are unknown at manufacturing time; hence mimicking the critical path and setting the minimum core voltage for the DVFS control system is a major challenge.

Currently, FPGA designers operate each IC at its rated nominal voltage, and must choose a clock frequency at or below the limit predicted by the Computer-Aided Design (CAD) tool’s timing analysis. This timing analysis is extremely conservative, using worst-case models for process corners, on-chip voltage drop, temperature and aging. In the vast majority of chips and systems, however, the supply voltage can be reduced significantly below nominal in order to obtain energy savings. Operating the IC at a lower voltage also reduces the impact of aging effects such as Bias Temperature Instability (BTI), and improves the chip lifetime [12], [13].

A. Prior Work

In [14], a DVS scheme with a Complex Programmable Logic Device (CPLD) load is presented. A CPLD is similar to an FPGA but has a non-volatile configuration memory. This scheme does not monitor logic errors caused by timing violations. In [15], a Logic Delay Measurement Circuit (LDMC) is used to determine the voltage at which the application circuit has a timing failure, and the supply voltage is adjusted accordingly. It assumes that the critical path in the application circuit can be exercised by randomly generated inputs during calibration, which is not valid in modern FPGA applications. The approaches in [14] and [15] also rely on the invalid assumption that the VCO/LDMC delay value perfectly tracks the delay variation in the application circuit critical paths with temperature and aging.

In [16]–[18], online timing slack measurement is achieved by using a phase-shifted clock and one shadow register for each critical path to determine timing headroom in a circuit during operation. This approach has several notable shortcomings:

1) the timing slack measurement is dependent on the input data, which cannot be controlled during normal operation,
2) the technique is limited to FPGA components where a second capture register can be added at the end of a critical path, which is not feasible for important ‘hard’ blocks such as the on-chip RAM,
3) the scheme requires extra logic elements (LEs) and clock resources, increasing circuit power and reducing the usable capacity of the FPGA.

In addition, the past works [15]–[18] do not employ a high-frequency digital dc-dc converter to generate the variable core supply voltage. Many important practical issues, such as the converter response time and quantization issues, are therefore ignored.

In this work, a new offline universal self-calibration scheme that requires close interaction with the digital dc-dc converter is proposed to automatically characterize the exact relationship between the maximum operating frequency for each core voltage and temperature corner. This information is saved in a calibration table that is used during normal operation for DVFS.

II. SELF-CALIBRATION CONCEPT WITH TWO-STEP CONFIGURATION

The proposed universal self-calibration process is intended to run on a system production line, or regularly during each power-up sequence of a system. It is therefore important that this process (1) be reasonably fast and (2) require the minimum possible FPGA resource overhead. The self-calibration method has three steps and requires the FPGA to be programmed twice, as shown in Fig. 1:

1) The user’s design is automatically analyzed by the augmented CAD tool to extract the logic paths having the most critical timing. A design-specific self-calibration configuration file is then created. The critical paths used in self-calibration are exact replicas of the critical paths in the application; they are placed and routed using the identical resources (routing wires, LEs, etc.). All inputs along the critical path are set to non-controlling values to guarantee that the path is sensitized. These non-controlling values are selected to mimic the worst-case rising and falling pattern reported by the tool. The CAD automation ensures that no additional designer effort is required.

2) The FPGA is programmed once with the self-calibration configuration file, as shown in Fig. 3(a). The on-chip configuration contains:

- the design-specific critical paths with error checking circuit,
- flip-flop chain based logic blocks configured as programmable heaters for temperature control,
- a temperature sensing circuit,
- a frequency synthesizer,
- a digital dc-dc controller,
- a calibration controller.

Each of the critical paths is exercised by toggling the source register and checking that the sink register captures the correct value. To identify the correct value, a fast path (a buffer or an inverter) that implements the same logic function as the critical path is synthesized for each critical path, and the outputs of the fast path and the critical path are compared. The fast path is designed such that it never fails timing, even at the maximum applied frequency.
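
The error check described above can be modeled behaviorally. The following is a toy timing sketch in Python, not the on-chip logic: the function names, the delay values and the rule that a path captures the new value only when its delay fits within one clock period are all assumptions made for illustration, and the paths are assumed to be non-inverting.

```python
def captured(prev_value, new_value, path_delay_ns, period_ns):
    """Value latched by a sink register one clock edge after the source toggles.

    Simplified model (assumption): the new value is captured only if the
    path delay fits in one period; otherwise the stale value is latched.
    """
    return new_value if path_delay_ns <= period_ns else prev_value


def err_flag(f_sys_mhz, critical_delay_ns, fast_delay_ns=0.5, cycles=8):
    """Return 1 if the critical path ever disagrees with the fast path."""
    period_ns = 1e3 / f_sys_mhz
    source = 0
    for _ in range(cycles):
        prev, source = source, source ^ 1  # toggle the source register
        reference = captured(prev, source, fast_delay_ns, period_ns)
        observed = captured(prev, source, critical_delay_ns, period_ns)
        if observed != reference:
            return 1
    return 0


# Hypothetical 8 ns critical path: passes at 100 MHz, fails at 150 MHz.
print(err_flag(100, 8.0), err_flag(150, 8.0))
```

In this model the comparison is only meaningful while the fast path itself meets timing, which is why the reference path must have margin at the highest frequency applied during calibration.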

Heater cells are distributed across the entire chip, and are used to create different die temperature conditions. The FPGA proceeds to run the calibration, using self-heating and automatic timing error checking, and populates the DVFS calibration table (CT). An ideal calibration table is illustrated in Fig. 2. The detailed calibration scheme is described in the following section.

3) Finally, when the self-calibration is complete, the FPGA is automatically programmed a second time with the user’s regular configuration file, as well as the DVFS control system which relies on the extracted calibration table, as shown in Fig. 3(b). Based on the clock frequency requirements and chip temperature, the DVFS control core refers to the calibration table and sets the corresponding core voltage, $V_{\text{core}}$.
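
The run-time lookup in step 3 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the DVFS core itself: the table layout (a dictionary keyed by integer temperature, each row a list of `(v_core, f_max)` pairs sorted by voltage), the function name `min_vcore` and all numeric values are hypothetical. For a non-integer temperature the sketch checks both neighboring rows and returns the higher voltage, so it does not rely on how $f_{max}$ varies with temperature.

```python
import math


def min_vcore(ct, temp_c, f_req_mhz):
    """Lowest calibrated V_core that supports f_req at temp_c, or None.

    ct: {temperature_in_degC: [(v_core, f_max_mhz), ...]} with each row
        sorted by increasing v_core (hypothetical layout).
    """
    candidates = []
    for t in {math.floor(temp_c), math.ceil(temp_c)}:
        row = ct.get(t)
        if row is None:
            return None  # temperature outside the calibrated range
        v = next((v for v, f_max in row if f_max >= f_req_mhz), None)
        if v is None:
            return None  # requested frequency not reachable at this temperature
        candidates.append(v)
    return max(candidates)  # conservative choice between neighboring rows


# Illustrative values only, not measured data.
ct = {
    40: [(0.832, 60.0), (0.848, 64.0), (0.864, 68.0)],
    41: [(0.832, 59.0), (0.848, 63.5), (0.864, 67.5)],
}
print(min_vcore(ct, 40.4, 63.8))
```

A practical controller would apply a guard band to the table before using it; this sketch uses the raw entries.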

III. SYSTEM LEVEL ARCHITECTURE

The system architecture is shown in Fig. 4 and includes a 60-nm CMOS Cyclone IV FPGA (EP4CE115F29C7N). The two-phase Buck converter has an input voltage of 5 V and regulates the FPGA core voltage, $V_{\text{core}}$, between 0.85 V and 1.35 V. The main phase, which delivers the majority of the FPGA power, is implemented using an Enpirion power module, ET4040QI. The main phase is rated at 10 W and operates in digital peak current mode control, where the peak current command, $I_{ref}[n]$, is generated within the FPGA and is converted to an analog reference, $I_{ref}(t)$, using a high-speed DAC. The outer voltage loop is also implemented on the FPGA, based on the sampled voltage error signal, $err[n]$. The controller is carefully optimized to operate down to the minimum FPGA core voltage, $V_{min}$. While the latest-generation Altera FPGAs include both on-chip temperature and core voltage sensing, these are implemented off-chip in this initial phase of the project.

The auxiliary phase, which has a lower power rating of 3 W, is controlled by a non-volatile CPLD to assist with the startup process when the main-phase controller in the FPGA is not powered. The auxiliary phase can also be used to improve the dynamic response, similar to [19]. The CPLD can be removed in future implementations, where the startup control can be integrated into the ET4040QI for example.

In an ideal application scenario, the self-calibration/application configurations and the CT would be stored in the on-board non-volatile memory. To simplify the process, the FPGA is manually programmed with different configurations.

The fully automated calibration process is shown in Fig. 5 and can be explained as follows. The heater blocks are first enabled to cause the die temperature to ramp up. Each heater cell is programmable and consists of $N_{heater} = 8$ chains with 88 flip-flops per chain switching at 100 MHz. With heater cells enabled and $V_{core} = 1.2$ V, the FPGA package reaches 85 °C. At every integer temperature value, CT entries are obtained and stored in the on-board Flash Memory. During each sweep, the dc-dc controller drops $V_{core}$ to $V_{min} = 0.832$ V and starts to increase the clock frequency, $f_{sys}$, from the lowest operating frequency. The increasing clock frequency, $f_{sys}$, is applied to the critical paths until a logic error is detected ($err\_flag = 1$, when $f_{sys} = f_{max}$) by the error-checking blocks. Each time an error is detected, $V_{core}$ is increased by $\Delta V = 16$ mV, and this repeats until $V_{core} = V_{max} = 1.328$ V at the end of the sweep. Since a higher voltage always allows for a higher frequency, the frequency range only needs to be swept once with this method. In between sweeps, $V_{core}$ is set to 1.2 V.
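
The single-pass sweep at one temperature can be sketched as below. This is a simplified Python model, not the calibration controller: `has_error` stands in for the error-checking blocks, the linear $f_{max}$ model in the usage example is invented for illustration, and the start, step and limit frequencies are hypothetical. Only $V_{min}$, $V_{max}$ and $\Delta V$ are taken from the text.

```python
def voltage_steps(v_min_mv=832, v_max_mv=1328, dv_mv=16):
    """Core-voltage set points in mV, from V_min to V_max in steps of dV."""
    return list(range(v_min_mv, v_max_mv + 1, dv_mv))


def sweep(has_error, voltages_mv, f_start_mhz, f_step_mhz, f_limit_mhz):
    """One temperature's sweep; returns [(v_core_mv, f_at_error_mhz), ...].

    has_error(v_mv, f_mhz) -> bool models err_flag (hypothetical callback).
    The frequency is never reset between voltage steps, so the frequency
    range is traversed only once.
    """
    table = []
    f = f_start_mhz
    for v in voltages_mv:
        while f < f_limit_mhz and not has_error(v, f):
            f += f_step_mhz
        # Store the frequency at which the error flag was raised
        # (or f_limit if no error occurred below the limit).
        table.append((v, f))
    return table


# Toy model (assumption): the path fails above 0.2 MHz per mV over 500 mV.
toy_error = lambda v_mv, f_mhz: f_mhz > 0.2 * (v_mv - 500)
table = sweep(toy_error, voltage_steps(), 50, 1, 200)
print(len(table), table[0], table[1])
```

With the values from the text, the sweep covers 32 voltage set points, since $0.832\ \text{V} + 31 \times 16\ \text{mV} = 1.328\ \text{V}$.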

The duration of the full calibration process is limited by the system’s thermal time response, which is considerably longer than the dc-dc converter dynamics. For each temperature value, one sweep of frequency and voltage takes less than 100 ms, while the entire temperature sweep takes approximately 2 minutes. The calibration time can be greatly reduced by optimizing the heater design, the voltage range and other calibration parameters depending on application needs.
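
A short calculation shows why the thermal response dominates. Assuming one sweep at every integer temperature over the 30 °C to 85 °C package range reported for the experiment, and taking the 100 ms per-sweep figure as an upper bound ($N_T$ and $t_{sweeps}$ are symbols introduced here for illustration):

$$N_T = 85 - 30 + 1 = 56, \qquad t_{sweeps} < 56 \times 100\ \text{ms} = 5.6\ \text{s}$$

Under these assumptions the electrical sweeps account for less than 5% of the approximately 2-minute calibration; the remainder is spent waiting for the die to heat up.
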

The automated self-calibration process was demonstrated using a common application, a digital FIR filter design. The package temperature ranges from 30 °C to 85 °C, with $V_{\text{core}}$ ranging from 0.832 V to 1.328 V. Fig. 6 shows the experimental setup. The power stage supplying $V_{\text{core}}$ on the DE2-115 is disconnected, and the customized dc-dc converter is mounted on top of the FPGA, with its output connected to the decoupling capacitors on the DE2-115 board through vias, giving a short path to supply $V_{\text{core}}$. The frequency generator feeds the input clock, $clk_{\text{ref}}$, through an SMA connector at a frequency of $f_{\text{sys}}/4$. The on-chip PLL multiplies this frequency by four to generate the system clock, $clk_{\text{sys}}$, at $f_{\text{sys}}$. The thermocouple measurement point is fixed on the FPGA package, as shown in Fig. 6.

Altera Quartus II Chip Planner provides the visualized FPGA on-chip configuration and shows exactly which logic elements are used by the circuit. The on-chip configuration of the self-calibration and the application are shown in Fig. 7(a) and Fig. 7(b), respectively. Each blue box represents a Logic Array Block (LAB) consisting of a number of logic elements. The darkness of the LAB represents the relative number of LEs used in the LAB. The LABs comprising the application’s most critical path are shown in red. The black arrow connecting the two red areas represents part of the critical path routing. This arrow is only a representation of the connection and does not reflect the actual routing. As shown in Fig. 7, the two on-chip configurations are significantly different; however, the critical path resources are identical, as in Fig. 3. Only the most critical path is monitored in this experiment to verify the concept, but in a real application, there might be multiple critical paths which have a similar delay. All these near-critical paths should be monitored to ensure the safe operation of the device.

The entire calibration process is shown in Fig. 8(a): each voltage spike (as noted by “⋆”) corresponds to one full sweep of $f_{\text{max}}$ versus $V_{\text{core}}$ at the given temperature. One such sweep is shown in Fig. 8(b), which reveals the converter dynamics.

During heating, $V_{\text{core}}$ is held at 1.2 V; once the target temperature is reached, it is ramped down slowly to 832 mV. All the heater circuits are turned on at this point to simulate the worst-case internal voltage drop in the application. $f_{\text{sys}}$ is increased until the failure indicator, $\text{err\_flag}$, goes high, at which time the frequency is stored with the corresponding core voltage $V_{\text{core}}$ in the CT. $V_{\text{ref}}[n]$ is then increased and the process repeats.

The stored CT data of $f_{\text{max}}$ versus $V_{\text{core}}$ versus $T$ is plotted in Fig. 9(a), with one curve per 5 °C temperature increment. The effect of temperature is more noticeable at higher $V_{\text{core}}$. For example, at $V_{\text{core}} = 1.3$ V, $f_{\text{max}}$ drops by 6.25% over a temperature range of $\Delta T = 55$ °C. In order to check the accuracy of the CT data, which is generated from the calibration configuration (i.e., Fig. 3(a)), the maximum clock frequency of the full FIR application (i.e., Fig. 3(b)) was independently checked using an exhaustive random data generator and error checking. The result is shown as the Benchmark result in Fig. 9(a) and matches very well with the CT results. The thick black curve is a conservative 5%-guard-banded operating target, based on the CT data, for the purpose of power consumption comparison. The measured power consumption of the benchmark is shown in Fig. 9(b), with constant voltage (black line) and DVFS based on the guard-banded CT (red line). The frequency axis is normalized to the maximum value specified by the timing analysis of the CAD tool, $f_{\text{crit}}$ (i.e., the best data currently available to designers). The green line corresponds to the minimum possible DVFS operation power without guard band.
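
One way to derive a guard-banded operating target from the CT is sketched below in Python. The text does not state whether the 5% margin is applied to frequency or to voltage; this sketch assumes a frequency derating, and the function names and table values are hypothetical.

```python
def guardbanded(ct_row, margin=0.05):
    """Derate each (v_core, f_max_mhz) entry by `margin` in frequency.

    Assumption: the guard band is a relative reduction of f_max at each
    calibrated voltage; the source does not specify how it is applied.
    """
    return [(v, f_max * (1.0 - margin)) for v, f_max in ct_row]


def normalized(f_mhz, f_crit_mhz):
    """Frequency relative to the CAD tool's timing limit f_crit."""
    return f_mhz / f_crit_mhz


# Illustrative values only, not measured data.
row = [(0.832, 80.0), (1.200, 150.0)]
target = guardbanded(row)
print(target)
```

The derated table can then be used in place of the raw CT whenever the DVFS controller selects a core voltage for a requested clock frequency.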

Several key points can be drawn from the data: (1) even with $V_{\text{core}}$ fixed at 1.2 V, the benchmark circuit can operate up to 50% above $f_{\text{crit}}$ (dashed line). This shows that the CAD tool timing is necessarily conservative as expected, since it must account for worst-case temperature and process variations; (2) using DVFS enables 40% power savings at $f_{\text{crit}}$ in this application; (3) for the same power consumption, DVFS enables a 25% increase in the clock frequency.
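
As a rough consistency check on point (2), and not a measured result, suppose the core power is dominated by dynamic switching power, $P \approx \alpha C V_{\text{core}}^2 f$, where the activity factor $\alpha$ and switched capacitance $C$ are symbols introduced here. Static power and converter losses are ignored. At a fixed frequency $f = f_{\text{crit}}$, with $V_{\text{nom}} = 1.2$ V:

$$\frac{P_{\text{DVFS}}}{P_{\text{nom}}} = \left(\frac{V_{\text{DVFS}}}{V_{\text{nom}}}\right)^2 = 0.6 \;\Rightarrow\; V_{\text{DVFS}} = \sqrt{0.6} \times 1.2\ \text{V} \approx 0.93\ \text{V}$$

Under these assumptions, a 40% saving corresponds to a core voltage of roughly 0.93 V, which lies inside the calibrated range of 0.832 V to 1.328 V. The actual operating point is the one given by the guard-banded CT.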

V. CONCLUSION

While DVFS is highly successful in microprocessors, it remains elusive in FPGAs, mainly due to the fundamental challenge of user-dependent critical paths. The proposed self-calibration technique can be universally applied to any user design, has a very low resource overhead and guarantees no logic errors during operation. The technique allows FPGA designers to safely operate each FPGA at its optimal performance point, reaching power savings on the order of 40%. This procedure is fast enough to be applied automatically at board burn-in/test time, or possibly even at each board power-up.

VI. ACKNOWLEDGEMENT

This work was supported by Altera Corporation, the Ontario Centres of Excellence, the Natural Sciences and Engineering Research Council of Canada, the Canada Foundation for Innovation and the Ontario Research Fund.

REFERENCES