Frontiers of Information Technology & Electronic Engineering www.zju.edu.cn/jzus; engineering.cae.cn; www.springerlink.com ISSN 2095-9184 (print); ISSN 2095-9230 (online) E-mail: jzus@zju.edu.cn



# Power-efficient dual-edge implicit pulse-triggered flip-flop with an embedded clock-gating scheme<sup>\*</sup>

Liang GENG, Ji-zhong SHEN<sup>‡</sup>, Cong-yuan XU

(College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310027, China)

E-mail: gengliang@zju.edu.cn; jzshen@zju.edu.cn; cyxu@zju.edu.cn

Received Sept. 8, 2015; Revision accepted Feb. 17, 2016; Crosschecked Aug. 15, 2016

**Abstract:** A novel dual-edge implicit pulse-triggered flip-flop with an embedded clock-gating scheme (DIFF-CGS) is proposed, which employs a transmission-gate-logic (TGL) based clock-gating scheme in the pulse generation stage. This scheme conditionally disables the inverter chain when the input data are kept unchanged, so redundant transitions of delayed clock signals and internal nodes of the latch are all eliminated, leading to high power efficiency. Based on SMIC 65 nm technology, extensive post-layout simulation results show that the proposed DIFF-CGS reduces power consumption by 41.39% to 56.21% compared with its counterparts at 10% data-switching activity. Also, full-swing operations in both implicit pulse generation and the static latch improve the robustness of the design. Thus, DIFF-CGS is suitable for low-power applications in very-large-scale integration (VLSI) designs with low data-switching activities.

**Key words:** Low power, Flip-flop, Implicit, Clock-gating scheme, Dual-edge

http://dx.doi.org/10.1631/FITEE.1500293

CLC number: TN432

# 1 Introduction

In many digital very-large-scale integration (VLSI) architectures, the power dissipation of the clock system, which comprises the clock distribution network and flip-flops (FFs), accounts for 30% to 60% of the overall system power, and 90% of the clock system power is dissipated by the FFs and the last sections of the clock distribution network that directly drive the FFs (Kawaguchi and Takayasu, 1998). Scaling down the technology causes an increase in chip densities and clock frequencies, which increases the importance of low-power circuit designs. In particular, several factors such as the demand for portable devices, thermal considerations, and environmental concerns have further increased the importance of low-power designs (Hyman et al., 2013). To reduce power dissipation in both clock distribution networks and FFs in modern digital VLSI designs, a wide range of techniques has been proposed to improve the performance of FFs, including clustered voltage scaling, dual-edge triggering, and clock gating. Usually, these techniques are combined to further improve the performance of the design. Based on these methods, a variety of high-performance low-power FFs have been proposed in the literature (Klass et al., 1999; Stojanovic and Oklobdzija, 1999; Ko and Balsara, 2000; Wu et al., 2000; Kong et al., 2001; Nedovic et al., 2002; Kulkarni and Sylvester, 2004; Zhao et al., 2004; Strollo et al., 2005; Teh et al., 2006; Goh et al., 2007; Teh et al., 2011; Phyu et al., 2011; Hwang et al., 2012; Wu and Shen, 2012; Judy and Kanchana Bhaaskaran, 2012; Shen et al., 2015).

<sup>‡</sup> Corresponding author

<sup>\*</sup> Project supported by the National Natural Science Foundation of China (Nos. 61071062 and 61471314) and the Zhejiang Provincial Natural Science Foundation of China (No. LY13F010001)

ORCID: Ji-zhong SHEN, http://orcid.org/0000-0002-9031-2379

© Zhejiang University and Springer-Verlag Berlin Heidelberg 2016

Pulse-triggered FFs (P-FFs) have gained greater popularity over conventional transmission gate (TG) and master-slave-based FFs for high-speed and low-power applications (Klass et al., 1999; Ko and Balsara, 2000; Strollo et al., 2005; Teh et al., 2011). A P-FF comprises a pulse-generating stage and a data-latching stage, and is characterized by zero or even negative setup time, which allows time borrowing across the clock edge. Besides this soft-edge property, its concise latch structure reduces the power consumption of the clock system. According to the pulse generation method, P-FFs can be divided into explicit (eP-FF) and implicit (iP-FF) types, which have different attributes. First, an iP-FF is often considered to be more power-efficient than an eP-FF, because the former merely controls the discharging clock branches while the latter needs to generate a pulse independently. Second, the pulse generator of an eP-FF can be shared by neighboring FFs, which helps distribute the power overhead of the pulse-generating stage across several FFs (Teh et al., 2006). However, when applying a clock-gating scheme in eP-FFs, the gating functions of the latches that share a pulse generator should be similar, and the pulse generator should be physically close to its latches to prevent pulse distortion. Also, the capacitive load of the pulse generator should be considered for the safety of pulse delivery from the clock source to the latches (Kim et al., 2011). These problems can be greatly alleviated in an iP-FF, where the pulse is generated inside each FF, close to its latch. As a consequence, on the basis of the clock triggering edge control technique proposed in our former publications (Xiang et al., 2013; Shen et al., 2015) and classical dual-edge triggering logic, a novel power-efficient dual-edge iP-FF with an embedded clock-gating scheme is proposed, which can block one or both triggering edges of the clock signal if they are redundant.

# 2 Review of low-power techniques for clock systems and state-of-the-art pulse-triggered flip-flops

Switching power consumption is one of the primary components of the total power consumption in complementary metal–oxide–semiconductor (CMOS) circuits, and is caused by charging and discharging the load capacitances. It can be expressed by (Zeitzoff and Chung, 2005; Shen *et al.*, 2015)

$$P = \alpha C V^2 f, \tag{1}$$

where C is the node capacitance, V is the supply voltage, f is the clock frequency, and $\alpha$ is the switching activity factor. Each of these factors offers a lever for reducing power, and various techniques have been proposed accordingly to reduce the power consumption of FFs.
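As a quick numerical illustration of Eq. (1), the following Python sketch evaluates the switching power of a single node and shows how each factor scales it. The 1 fF node capacitance and the function name `switching_power` are assumptions made for illustration only; they are not taken from the simulations reported in this paper.

```python
def switching_power(alpha, c_farads, vdd_volts, f_hertz):
    """Eq. (1): P = alpha * C * V^2 * f, returned in watts."""
    return alpha * c_farads * vdd_volts ** 2 * f_hertz

# Assumed, illustrative operating point: 1 fF node, 1.2 V supply, 1 GHz clock.
C_NODE, VDD, F_CLK = 1e-15, 1.2, 1e9

p_clocked = switching_power(1.0, C_NODE, VDD, F_CLK)       # node toggling every cycle
p_data = switching_power(0.1, C_NODE, VDD, F_CLK)          # node with 10% activity
p_half_vdd = switching_power(1.0, C_NODE, VDD / 2, F_CLK)  # supply voltage halved
p_half_f = switching_power(1.0, C_NODE, VDD, F_CLK / 2)    # clock frequency halved

print(f"clocked node : {p_clocked * 1e6:.3f} uW")
print(f"10% activity : {p_data / p_clocked:.2f} x")
print(f"half VDD     : {p_half_vdd / p_clocked:.2f} x")
print(f"half f       : {p_half_f / p_clocked:.2f} x")
```

The quadratic dependence on V is what motivates voltage scaling, while the linear dependence on f and $\alpha$ motivates dual-edge triggering and the suppression of redundant transitions, respectively.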

#### 2.1 Clustered voltage scaling

Clustered voltage scaling (CVS) is an effective way to decrease the power consumption, since the switching power is proportional to the square of the supply voltage (Kulkarni and Sylvester, 2004). In the CVS scheme, by using a low supply voltage (VDDL) in speed-insensitive paths and a high supply voltage (VDDH) in critical paths, the circuit can considerably reduce power consumption without degrading its performance. However, the positive-channel metal–oxide–semiconductor (PMOS) transistor of the VDDH block cannot be shut off completely if it is directly driven by the output of the VDDL block, which causes considerable static power dissipation. Therefore, a level-converting circuit is needed between these two blocks to convert the low-swing input into a high-swing output. Since this scheme can be combined with many other low-power techniques, it will not be discussed separately in this paper.

#### 2.2 Dual-edge triggering

Because a dual-edge FF captures data on both clock edges, the clock frequency can be cut by one half for the same data throughput, which saves approximately half of the power consumption on the clock distribution network. For example, Fig. 1 shows an improved static dual edge-triggered FF (SDETFF), which contains a head-end sampling stage and an XNOR-logic-based pulse generator (Goh *et al.*, 2007). The inverted clock signal CLKB is generated by only one inverter, which decreases the number of transistors. Inputs are delivered directly to T and TB during the transparent period, which leads to a concise latch. However, redundant pulses are still generated when the input stays unchanged, resulting in considerable dynamic power consumption.
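The throughput argument behind dual-edge triggering can be sketched with an idealized behavioral model in Python. The waveforms, the helper names, and the time grid of two steps per data slot are assumptions made for illustration; setup time, pulse width, and all analog effects are ignored.

```python
def single_edge_capture(clk, data):
    """Sample data on rising clock edges only."""
    return [data[i] for i in range(1, len(clk)) if clk[i - 1] == 0 and clk[i] == 1]

def dual_edge_capture(clk, data):
    """Sample data on both rising and falling clock edges."""
    return [data[i] for i in range(1, len(clk)) if clk[i - 1] != clk[i]]

def count_transitions(clk):
    """Number of clock toggles, a proxy for clock-network switching activity."""
    return sum(1 for i in range(1, len(clk)) if clk[i - 1] != clk[i])

stream = [1, 0, 0, 1, 1, 0, 1, 0]                      # one value per data slot
steps = 2 * len(stream)                                # two time steps per slot
data_wave = [stream[t // 2] for t in range(steps)]

clk_fast = [t % 2 for t in range(steps)]               # full-rate clock for the single-edge FF
clk_slow = [((t + 1) // 2) % 2 for t in range(steps)]  # half-rate clock for the dual-edge FF

captured_single = single_edge_capture(clk_fast, data_wave)
captured_dual = dual_edge_capture(clk_slow, data_wave)
toggles_fast = count_transitions(clk_fast)
toggles_slow = count_transitions(clk_slow)

print(captured_single == stream, captured_dual == stream)
print(toggles_fast, toggles_slow)
```

Both models recover the full data stream, but the half-rate clock toggles only about half as often within the same window, which is the source of the saving on the clock distribution network.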

#### 2.3 Conditional operation techniques

For most dynamic and semi-dynamic FFs, periodic clock signal triggering results in redundant transitions at internal nodes without changing the output when the input stays unchanged (Zhao *et al.*, 2004). So, reducing these redundant transitions is an effective way to minimize power consumption. For this purpose, we extensively study the various techniques proposed in the literature, which can be classified as conditional capture, conditional pre-charge, and conditional discharge.



Fig. 1 Static dual edge-triggered flip-flop circuit: (a) XNOR pulse generator; (b) sampling stage

#### 2.3.1 Conditional capture technique

This technique is proposed to disable redundant internal transitions, and has achieved significant power reduction with little delay penalty (Fig. 2). In this scheme, the path from the delayed clock to the discharge path of the internal node X is controlled by a Q-related signal. After sampling a high-input signal, the output will be charged high to keep the transparent window off, which prevents redundant transitions at the internal node. For example, the conditional capture FF (CCFF) (Kong *et al.*, 2001) achieves great power reduction by eliminating internal redundant transitions. However, the conditional capture technique still dissipates redundant power in the gates that control the delivery of the delayed clock signal to the first stage.

Fig. 2 Conditional capture technique

#### 2.3.2 Conditional pre-charge technique

To overcome the difficulties of the conditional capture technique, the conditional pre-charge technique is presented (Fig. 3), where a PMOS transistor controlled by a signal is inserted in the pre-charge path to avoid pre-charging the internal node X when the input D is kept high. For example, an improved FF called the dual-edge conditional pre-charge FF (DE-CPFF) employs this technique, with Q as its control signal (Nedovic *et al.*, 2002). However, this technique suits only implicit FFs and is difficult to use in a dual-edge triggering mechanism because it would need more transistors.



Fig. 3 Conditional pre-charge technique

#### 2.3.3 Conditional discharge technique

Fig. 4 shows the conditional discharge technique, which can be applied to both eP-FFs and iP-FFs. In this technique, a negative-channel metal–oxide–semiconductor (NMOS) transistor controlled by a signal is inserted in the discharging path, which not only reduces redundant transitions at the internal node, but also maintains a small *D*-QB delay. For example, a conditional discharge FF (CDFF) is proposed based on this technique (Fig. 5) (Zhao *et al.*, 2004). When the input changes from low to high, the control signal Qfb shuts down the discharging path of the first stage until the input changes again.



Fig. 4 Conditional discharge technique



Fig. 5 Conditional discharge flip-flop circuit (Zhao *et al.*, 2004)

#### 2.4 Reducing capacity of the clock load

Usually, the activity factor of clocked nodes is 100%, while that of non-clocked nodes is about 10% (Weste, 2006; Hwang *et al.*, 2012). So, minimizing the number of clocked transistors to reduce clock load is an effective way to reduce the power of a clock system. For example, an improved design proposed by Hwang *et al.* (2012), named conditional pulse-enhancement FF (CPEFF), uses only one inverter to generate the delayed clock signal (Fig. 6). Also, CPEFF uses the feedback Qfb to conditionally control the discharging path of node *X*.

#### 2.5 Clock-gating technique

In most cases, redundant switching of the clock results in a great deal of unnecessary power dissipation. So, the clock-gating technique is proposed, which suppresses the redundant transitions of the gated clock with respect to the master clock, leading to great power reduction (Fig. 7). This scheme has two clear advantages: there are no redundant transitions at the internal nodes of the FF in idle cycles, and there is no need for conditional pre-charge or discharge blocks, since the redundant clock signals are blocked when the input stays unchanged (Wu *et al.*, 2000).



Fig. 6 Conditional pulse-enhancement flip-flop circuit (Hwang *et al.*, 2012)



Fig. 7 Clock-gating technique

A variety of FFs based on this technique have been presented in the literature. For example, a clock-gated sense-amplifier FF (CG-SAFF) exhibits aggressive power reduction by adopting a clock-gating scheme that performs well at low switching activities (Phyu et al., 2011). However, because the output '1' signal is driven by only NMOS transistors, the internal nodes in the clock pulse generator suffer from the threshold voltage degradation problem (TVDP), which leads to a long delay. A negative-edge-triggered FF with the clock-gating feature (CG-NFF), which is designed for low-power applications (Judy and Kanchana Bhaaskaran, 2012), also suffers from the TVDP just like CG-SAFF, due to the single-pass transistor employed in the pulse generator. Furthermore, it should be noted that if D makes a transition when CLK=0, the comparator in the pulse generator changes its output from high to low, which causes the NOR-gate output to change from low to high and to produce a pulse at the output of the generator. So, the output of the latch will change regardless of the edge of the input CLK, leading to the race problem (Geng *et al.*, 2016).

From the analysis above, we can find that both conditional operation techniques and clock-gating techniques are efficient methods for reducing power by avoiding redundant transitions of internal nodes. So, to design an iP-FF with the target of low power, we have not only combined the above low-power techniques such as dual-edge triggering, reducing clock load, and clock gating, but also modified the embedded clock-gating scheme.

# 3 Dual-edge implicit flip-flop with an embedded clock-gating scheme

After thoroughly analyzing the merits and weaknesses of existing FF designs, a novel dual-edge iP-FF with an embedded clock-gating scheme (DIFF-CGS) is designed by combining a modified clock-gating scheme with the dual-edge triggering technique. The schematic diagram of DIFF-CGS is made up of two parts: the implicit pulse generation stage with an embedded clock-gating scheme and a static latch (Fig. 8).

In DIFF-CGS, the clock-gating scheme is implemented by embedding a control circuit in the adaptive clocking inverter chain, which gives the chain the capability to detect and suppress redundant delayed clock signals. Unlike the clock-gating scheme employed in CG-SAFF, whose pass-transistor-logic (PTL) based comparators produce weak '1' signals that can cause improper operation, especially at low supply voltages, the clock-gating scheme in our design employs a transmission-gate-logic (TGL) based comparator and is consequently free from the TVDP, which greatly improves the robustness of the design. Note that because the pulse and its latch are physically close in our implicit-pulse design, pulse distortion can be largely avoided, which makes it easier to preserve the clock shape when delivering the pulse from the clock source to the latch.

To ensure the efficiency of double-edge clock triggering in an implicit environment, the clock branches (N5, N6) and (N7, N8) are shared by the latch. The advantage of this sharing structure lies in the fact that the number of clocked transistors is reduced, which results in a great power reduction. Unlike the CDFF's latch, the proposed latch does not need the conditional discharge technique, since all unnecessary pulses are suppressed and there are therefore no redundant transitions at internal node X. A pseudo-NMOS transistor P2 (a weak PMOS transistor with the gate connected to the ground) is applied in the static latch, so the keeper circuit for node X can be omitted. Although P2 is always on, the short-circuit current occurs only once when D makes a 0–1 transition during the evaluation phase, and the discharging path stays on for just a short moment, resulting in only a little short-circuit power. Then the discharging path is shut down by delayed clock signals CLK3 or CLK4. Furthermore, the output keeper (cross-coupled inverters) offers protection against direct coupling noise and provides a feedback signal for the comparator in the implicit pulse generation stage.



Fig. 8 Dual-edge implicit flip-flop with an embedded clock-gating scheme

The operational principle of the proposed FF is explained as follows. When D and Q are different, the TGL-based comparator sends out a high logic signal, so that N1 and N2 in the clock inverter chain controlled by Y are turned on and P1 is turned off. As a result, the desired inverted and delayed clock signals (CLK3 and CLK4) are generated, controlling the clocked transistors N5 and N7. At the rising edge of the clock, CLK and CLK3 will be high for a short while, which turns on the left clock branch (N5 and N6). At the falling edge of the clock, the right clock branch (N7 and N8) will be turned on for a short time when CLK1 and CLK4 are both high. As a result, the FF is in an evaluation stage when either clock branch works. If D makes a 0-1 transition, the internal node X will be lowered through N3 and one of the clock branches, and then Q is raised to a high level by P3; if D makes a 1–0 transition, output Q is lowered through N4 and one of the clock branches. When D and Q are the same, the TGL-based comparator sends out a low logic signal, which shuts down N1 and N2 and turns on P1, so CLK2 will eventually be charged to a high level by P0, and CLK3 will be kept low. Meanwhile, node T is charged to a high level by P1 and then CLK4 is lowered down to the ground. So, the clocked transistors N5 and N7 are turned off by CLK3 and CLK4, respectively, and the FF remains unchanged until the input makes a transition again, resulting in a great reduction in power.
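The gating behavior described above can be summarized in a logic-level Python sketch. It is an abstraction made under stated assumptions, not a model of the transistor circuit in Fig. 8: the comparator is reduced to an XOR of D and Q, a clock branch is treated as conducting on every clock edge for which the comparator output Y is high, D is assumed stable before each edge, and the function names and test waveforms are hypothetical.

```python
def diff_cgs_step(d, q, clk_prev, clk):
    """One time step of an idealized DIFF-CGS.

    Returns the new Q and whether an internal pulse was generated.
    """
    y = d ^ q                    # comparator: high only when D and Q differ
    edge = clk_prev != clk       # rising edge (left branch) or falling edge (right branch)
    pulse = bool(y and edge)     # inverter chain enabled, so one clock branch conducts
    if pulse:
        q = d                    # latch evaluates: D=1 discharges X and raises Q; D=0 lowers Q
    return q, pulse

def simulate(data, clk, q_init=0):
    """Run the model over aligned data/clock waveforms."""
    q, pulses, outputs = q_init, 0, []
    for i in range(1, len(clk)):
        q, pulse = diff_cgs_step(data[i], q, clk[i - 1], clk[i])
        pulses += pulse
        outputs.append(q)
    return outputs, pulses

clk = [0, 1, 0, 1, 0, 1, 0, 1, 0]    # eight clock edges
data = [0, 1, 1, 1, 0, 0, 0, 1, 1]   # three data transitions

q_trace, pulse_count = simulate(data, clk)
clock_edges = sum(1 for i in range(1, len(clk)) if clk[i - 1] != clk[i])

print(q_trace)
print(f"{pulse_count} pulses generated for {clock_edges} clock edges")
```

In this example Q follows D on every edge, yet an internal pulse is produced only on the three edges where D differs from Q; on the remaining edges the model generates no pulse, which corresponds to CLK3 and CLK4 being held low.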

From the analysis above, it can be concluded that if the input D stays unchanged, the clock inverter chain is disabled and the redundant delayed clock signals are blocked, so unnecessary charging and discharging of the clocked transistors is reduced. In this condition both the clock branches are shut off, so the internal node X is kept at a high level, resulting in no redundant transitions at this node. Moreover, due to the embedded clock-gating scheme, the size of the clock inverter chain is reduced, which helps improve the power and delay performance of the design and save the layout area. As a result, DIFF-CGS exhibits a low-power characteristic when data activities are low.

# 4 Simulation results and comparisons

The performance of the proposed P-FF design is evaluated against existing designs. In Zhao *et al.* 

(2004), it has been proven that CDFF outperforms CCFF and DE-CPFF in both *D*-QB delay and power consumption. Pre-simulation has proved that CG-SAFF (Phyu *et al.*, 2011) has a very long delay and even an incorrect logic functionality, especially when applied to low-voltage systems, because the voltages of derived controlling nodes (*X* and *Y*) become too low to turn on the NMOS transistors. Also, CG-NFF (Judy and Kanchana Bhaaskaran, 2012) has race problems as described in Section 2. So, we will not compare these four FFs. The designs compared are SDETFF (Goh *et al.*, 2007), CDFF (Zhao *et al.*, 2004), and CPEFF (Hwang *et al.*, 2012).

Layout-level simulation results are obtained from HSPICE for the SMIC 65 nm logic low-leakage CMOS process technology at room temperature. All parasitic capacitances and resistances are extracted from the layouts so that the circuits can be simulated more accurately. The supply voltage VDD is 1.2 V. The clock frequency is 1 GHz for single-edge FFs and 500 MHz for dual-edge FFs, and the input D is pseudorandom data with an activity factor of 10%. The transistor sizes are optimized to minimize the power-delay product (PDP) of the FFs using an iterative procedure introduced by Stojanovic and Oklobdzija (1999), and the layout of the DIFF-CGS is shown in Fig. 9. We use the same simulation test-bench as introduced by Zhao et al. (2004). The inputs (data and clock) are driven by fixed buffers, and the outputs are required to drive an output load of 20 fF. Furthermore, we have used the statements '.IC V<sub>Q</sub>/V<sub>QB</sub>=VDD/0' in our HSPICE files to define the initial states of the output and set the worst case as the power parameter in the specific simulation.



Fig. 9 Layout of dual-edge implicit pulse-triggered flip-flop with an embedded clock-gating scheme

Fig. 10 shows snapshots of the transient waveforms for DIFF-CGS, which demonstrate that the proposed design has correct logic functionality and that redundant pulses are suppressed. As shown in Fig. 10, if D stays unchanged, CLK3 and CLK4 will be blocked and kept low. The delay parameter we use is the minimum D-QB delay, including both setup time and CLK-QB delay, so the delay characteristics can be reflected more appropriately. The D-QB delay is obtained by sweeping the time of the low-to-high and high-to-low transitions of the input data, and the minimum delay corresponding to the optimum setup time is recorded. Usually, the minimum D-QB delays differ for low-to-high and high-to-low transitions, and the worst minimum D-QB delay is chosen. To account for the loading effect on the previous stage and the clock tree, the total average power consumption includes the internal latching power of the FF, the local input clock driving power, and the local data driving power, but excludes the power dissipated on switching the output load capacitance. Note that the internal latching power includes dynamic power, short-circuit power, and leakage power (Geng et al., 2016). Moreover, DIFF-CGS and its counterparts (SDETFF and CDFF) all use dual-edge triggering logic. So, to make the power comparisons among different FFs fair and realistic, we carefully make sure that the FFs capture the input signal transitions equally at both the rising and falling edges of the clock.
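The delay-measurement procedure can be illustrated with a short Python sketch. The exponential CLK-QB model and all of its parameters are hypothetical stand-ins for the simulated curves and are not fitted to any data in this paper; only the procedure (sweep the data arrival time, take the minimum of setup time plus CLK-QB delay for each transition, then keep the worse of the two minima) follows the text.

```python
import math

def clk_qb_delay(setup_ps, t0=200.0, k=100.0, tau=20.0):
    """Hypothetical CLK-QB delay (ps) that grows as data arrives closer to the clock edge."""
    return t0 + k * math.exp(-setup_ps / tau)

def minimum_d_qb(setup_times_ps, **model):
    """Sweep the setup time and return (optimum setup, minimum D-QB delay), both in ps."""
    sweep = [(ts, ts + clk_qb_delay(ts, **model)) for ts in setup_times_ps]
    return min(sweep, key=lambda pair: pair[1])

sweep_range = range(-50, 151)   # data arrival swept in 1 ps steps (assumed range)

# Assumed models: the high-to-low transition is slightly slower than the low-to-high one.
setup_lh, d_qb_lh = minimum_d_qb(sweep_range, t0=200.0)
setup_hl, d_qb_hl = minimum_d_qb(sweep_range, t0=210.0)

worst_min_d_qb = max(d_qb_lh, d_qb_hl)   # the reported delay is the worse of the two minima

print(f"low-to-high: setup {setup_lh} ps, D-QB {d_qb_lh:.2f} ps")
print(f"high-to-low: setup {setup_hl} ps, D-QB {d_qb_hl:.2f} ps")
print(f"worst minimum D-QB delay: {worst_min_d_qb:.2f} ps")
```

Taking the minimum over the sweep captures the trade-off between launching data earlier (larger setup time) and the faster CLK-QB response that this allows.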

Table 1 summarizes the comparisons of the FFs in terms of transistor count, layout area, setup time, hold time, minimum *D*-QB delay, total average power, clock and data driving power, latching power, and optimal PDP at a typical corner. Even though DIFF-CGS has the most transistors, its layout area is not the largest due to the reduced-size implicit-pulse generation and the concise structure of the latch. In terms of delay, DIFF-CGS features the longest minimum D-QB delay because of its positive setup time. This is caused by the embedded clock-gating scheme; i.e., the output signal Y of the TGL-based comparator must be stable before the clock edge arrives, so that the inverter chain is turned on and the input clock can pass. The setup time is measured as the optimal time to minimize D-QB delay. The relationship of D-QB delay and CLK-QB delay with respect to the setup time is presented in Fig. 11. In terms of power, DIFF-CGS achieves the minimum total average power at a 10% data-switching activity, which is 53.33%, 56.21%, and 41.39% less than those of SDETFF, CDFF, and CPEFF, respectively. Due to the considerable savings in power consumption, the PDP of DIFF-CGS is improved by 18.51%, 33.32%, and 5.62% against SDETFF, CDFF, and CPEFF, respectively, under the same conditions.

Fig. 10 Transient waveforms of dual-edge implicit flip-flop with an embedded clock-gating scheme

Table 1 Comparison of various flip-flop designs\*

| Parameter                | SDETFF  | CDFF    | CPEFF  | DIFF-CGS |
|--------------------------|---------|---------|--------|----------|
| Number of transistors    | 18      | 30      | 19     | 31       |
| Layout area $(\mu m^2)$  | 31.57   | 42.33   | 30.93  | 37.86    |
| Setup time (ps)          | -128.27 | -126.07 | -51.46 | 78.67    |
| Hold time (ps)           | 217.39  | 186.51  | 174.45 | 180.39   |
| Minimum D-QB delay (ps)  | 167.30  | 191.80  | 181.39 | 292.12   |
| CLK driving power (µW)   | 5.927   | 3.420   | 7.414  | 1.433    |
| Data driving power (µW)  | 0.171   | 0.099   | 0.135  | 0.425    |
| Latching power (µW)      | 9.048   | 12.626  | 4.513  | 5.211    |
| Total average power (µW) | 15.146  | 16.145  | 12.062 | 7.069    |
| PDP (fJ)                 | 2.534   | 3.097   | 2.188  | 2.065    |

\* Power dissipation is measured when the data-switching activity is 10%. SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulse-enhancement flip-flop; DIFF-CGS: dual-edge implicit flip-flop with an embedded clock-gating scheme; PDP: power-delay product
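The derived figures quoted from Table 1 can be cross-checked with a few lines of Python. The sketch below only re-computes values from the table entries; the dictionary layout and helper name are illustrative choices, and the checks assume that PDP is the product of total average power and minimum D-QB delay and that total power is the sum of the three listed components.

```python
# Table 1 entries: powers in uW, delay in ps, PDP in fJ.
table1 = {
    "SDETFF":   {"clk": 5.927, "data": 0.171, "latch": 9.048,  "total": 15.146, "delay": 167.30, "pdp": 2.534},
    "CDFF":     {"clk": 3.420, "data": 0.099, "latch": 12.626, "total": 16.145, "delay": 191.80, "pdp": 3.097},
    "CPEFF":    {"clk": 7.414, "data": 0.135, "latch": 4.513,  "total": 12.062, "delay": 181.39, "pdp": 2.188},
    "DIFF-CGS": {"clk": 1.433, "data": 0.425, "latch": 5.211,  "total": 7.069,  "delay": 292.12, "pdp": 2.065},
}

def reduction_pct(reference, proposed):
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference

# Consistency checks: component sum versus total, and uW * ps = 1e-3 fJ versus PDP.
sum_error = {n: abs(r["clk"] + r["data"] + r["latch"] - r["total"]) for n, r in table1.items()}
pdp_error = {n: abs(r["total"] * r["delay"] / 1000.0 - r["pdp"]) for n, r in table1.items()}

ours = table1["DIFF-CGS"]
rivals = ("SDETFF", "CDFF", "CPEFF")
power_saving = {n: reduction_pct(table1[n]["total"], ours["total"]) for n in rivals}
pdp_saving = {n: reduction_pct(table1[n]["pdp"], ours["pdp"]) for n in rivals}

for name in rivals:
    print(f"{name}: power -{power_saving[name]:.2f}%, PDP -{pdp_saving[name]:.2f}%")
```

The recomputed savings agree with the percentages quoted in the text to within rounding of the last digit.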

Note that the leakage power of each FF in our simulation setup at a typical corner (VDD=1.2 V and T=25 °C) is in the picowatt (pW) range, which is less than 2% of the total average power. So, it can be concluded that the dynamic power is the main source of the total average power (here the short-circuit power can also be ignored). To evaluate the effect of process variations on all designs, Table 2 shows the comparison of leakage power consumption under different combinations of clock and input data for the worst-case condition (VDD=1.3 V and T=125 °C). As shown in Table 2, although the proposed design consists of more transistors, the leakage power of DIFF-CGS is at about the same level as those of CDFF and CPEFF, which is attributed mainly to the transistor-stacking effect in the implicit reduced-size pulse-generation stage and the concise full-swing static-latching stage. The SDETFF design experiences the worst leakage power consumption because of the non-full-swing internal nodes in its XNOR-logic-based pulse generator.

Fig. 11 Delay performances of various designs: (a) *D*-QB delay versus setup time settings; (b) CLK-QB delay versus setup time settings

SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulse-enhancement flip-flop; DIFF-CGS: dual-edge implicit flip-flop with an embedded clock-gating scheme

To characterize the power consumption and the PDP as a function of data-switching activities, five test patterns, which represent 0% (all-zero or all-one), 10%, 25%, and 50%, respectively, are applied (Fig. 12). The power dissipation of DIFF-CGS at different switching activities is shown in Fig. 12a. The simulation results show that the proposed design outperforms its rival designs when the data-switching activity is less than 37.5%. Also, note that the short-circuit power in the proposed latch is greatly reduced by using the split latch scheme where different input transitions are distributed at different stages. Based on all the above simulation results and corresponding discussions, it can be concluded that there are four specific reasons for its low-power characteristic. First, DIFF-CGS applies dual-edge triggering logic, which cuts the clock frequency by one half and greatly reduces the power in the clock network. Second, due to the embedded clock-gating scheme, not only the redundant delayed clock signals but also the redundant internal node transitions are suppressed to save more power. Third, the short-circuit current in the second stage is greatly reduced, since the short-circuit power of the latch occurs only at the pseudo-NMOS transistor (P2) when the input D makes a 0–1 transition and is also small (less than 5% of the total power). Fourth, the reduced number of clocked transistors due to the sharing scheme of both clock branches helps further reduce the power. However, DIFF-CGS consumes more power at high data-switching activities, because the clock signal is not blocked most of the time, and the power consumption of the increased number of transistors outweighs the power savings due to the reduced transitions. Fig. 12b shows the PDP of DIFF-CGS at different switching activities, and the simulation results show that the proposed design outperforms its counterparts when the data-switching activity is less than 13.2%. Usually the data activity factor of typical CMOS logic is in the range of 0.08–0.12, while the clock activity factor is 100% (Weste, 2006). So, the proposed design is quite suitable for non-critical paths with low data-switching activities.

Table 2 Leakage power comparison

| Specific values    | Leakage power (nW) |       |       |          |
|--------------------|----------|-------|-------|----------|
|                    | SDETFF   | CDFF  | CPEFF | DIFF-CGS |
| (CLK, Data)=(0, 0) | 20861.90 | 34.64 | 32.25 | 38.28    |
| (CLK, Data)=(0, 1) | 9603.72  | 42.38 | 39.91 | 37.37    |
| (CLK, Data)=(1, 0) | 20735.60 | 30.41 | 34.82 | 40.26    |
| (CLK, Data)=(1, 1) | 9588.75  | 38.15 | 33.77 | 30.78    |
| Average            | 15197.49 | 36.39 | 35.19 | 36.67    |

FF: flip-flop; SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulse-enhancement flip-flop; DIFF-CGS: dual-edge implicit flip-flop with an embedded clock-gating scheme

To compare the influence of process variations on the FFs, four FFs are simulated through different process corners under a condition of a 10% dataswitching activity. All FFs function correctly subject to process variations, and the results are shown in Fig. 13. It is obvious that DIFF-CGS gains power improvements in all four corners, but its delay is still the largest in all corners. Moreover, with the purpose of analyzing the robustness of the FFs against random process variations, Monte-Carlo (MC) simulations of power and delay performances are performed with process-voltage-temperature (PVT) variations, where three combinations of supply voltage (about 8.3% variation) and temperature are applied (Maxim and Gheorghe, 2001). Meanwhile, for each combination, 500 MC simulation iterations are conducted based on an MC model provided by the foundry's process design kit (PDK). The simulation results with process and PVT variations are shown in Table 3 and Table 4, respectively. The results show that DIFF-CGS has the lowest mean power and standard



Fig. 12 Different data-switching activities: (a) power dissipation; (b) PDP performance

SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulseenhancement flip-flop; DIFF-CGS: dual-edge implicit flipflop with an embedded clock-gating scheme



Fig. 13 Different process corners: (a) power dissipation; (b) delay

SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulseenhancement flip-flop; DIFF-CGS: dual-edge implicit flipflop with an embedded clock-gating scheme

| Design 1 | Mean power (µW) |              |               | Standard deviation (nW) |              |               |
|----------|-----------------|--------------|---------------|-------------------------|--------------|---------------|
|          | 1.3 V, −40 °C   | 1.2 V, 25 °C | 1.1 V, 125 °C | 1.3 V, −40 °C           | 1.2 V, 25 °C | 1.1 V, 125 °C |
| SDETFF   | 17.443          | 15.153       | 13.078        | 190.37                  | 195.37       | 224.60        |
| CDFF     | 18.694          | 16.158       | 13.667        | 87.47                   | 77.80        | 83.26         |
| CPEFF    | 14.223          | 11.998       | 9.536         | 104.77                  | 78.46        | 94.98         |
| DIFF-CGS | 8.270           | 7.044        | 5.999         | 44.91                   | 38.62        | 32.01         |

Table 3 Power analysis of process-voltage-temperature variability through Monte-Carlo simulations\*

\* The power parameters are measured when the input switching activity is 10%. SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulse-enhancement flip-flop; DIFF-CGS: dual-edge implicit flip-flop with an embedded clock-gating scheme

Table 4 Delay analysis of process-voltage-temperature variability through Monte-Carlo simulations\*

| Design   | Mean D-QB delay (ps) |              |               | Standard deviation (ps) |              |               |
|----------|----------------------|--------------|---------------|-------------------------|--------------|---------------|
|          | 1.3 V, −40 °C        | 1.2 V, 25 °C | 1.1 V, 125 °C | 1.3 V, −40 °C           | 1.2 V, 25 °C | 1.1 V, 125 °C |
| SDETFF   | 140.11               | 166.39       | 194.27        | 3.93                    | 5.67         | 9.02          |
| CDFF     | 156.99               | 188.44       | 261.14        | 4.14                    | 6.58         | 12.02         |
| CPEFF    | 141.66               | 177.72       | 217.87        | 3.83                    | 6.93         | 10.43         |
| DIFF-CGS | 239.73               | 286.21       | 369.83        | 4.18                    | 9.16         | 15.41         |

\* The delay parameters are measured when the output load is 20 fF. SDETFF: static dual edge-triggered flip-flop; CDFF: conditional discharge flip-flop; CPEFF: conditional pulse-enhancement flip-flop; DIFF-CGS: dual-edge implicit flip-flop with an embedded clock-gating scheme

deviation among the four designs, meaning that its power consumption is more robust against PVT variations. In terms of delay, DIFF-CGS still has the largest delay under PVT variations, owing to its large setup time.
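As a quick cross-check of the figures quoted above, the following minimal Python sketch recomputes the relative power saving and the delay ratio of DIFF-CGS from the mean values in Tables 3 and 4. The helper names are illustrative and not part of the original work, and the sketch assumes that 1.2 V is the nominal supply, so that the 1.3 V and 1.1 V conditions correspond to the quoted variation of about 8.3%. Because the inputs are Monte-Carlo means rather than typical-corner values, the resulting savings are expected to be close to, but not identical with, the 41.39% to 56.21% range reported for the typical corner.

```python
# Mean values transcribed from Tables 3 and 4; column order is
# (1.3 V, -40 °C), (1.2 V, 25 °C), (1.1 V, 125 °C).
CONDITIONS = ("1.3 V, -40 °C", "1.2 V, 25 °C", "1.1 V, 125 °C")

MEAN_POWER_UW = {
    "SDETFF": (17.443, 15.153, 13.078),
    "CDFF": (18.694, 16.158, 13.667),
    "CPEFF": (14.223, 11.998, 9.536),
    "DIFF-CGS": (8.270, 7.044, 5.999),
}

MEAN_DELAY_PS = {
    "SDETFF": (140.11, 166.39, 194.27),
    "CDFF": (156.99, 188.44, 261.14),
    "CPEFF": (141.66, 177.72, 217.87),
    "DIFF-CGS": (239.73, 286.21, 369.83),
}

PROPOSED = "DIFF-CGS"


def supply_variation_percent(nominal_v, corner_v):
    """Relative deviation of a supply voltage from the nominal value, in percent."""
    return 100.0 * abs(corner_v - nominal_v) / nominal_v


def power_saving_percent(reference_uw, proposed_uw):
    """Power saved by the proposed design relative to a reference design, in percent."""
    return 100.0 * (reference_uw - proposed_uw) / reference_uw


def delay_ratio(reference_ps, proposed_ps):
    """How many times longer the proposed design's delay is than the reference's."""
    return proposed_ps / reference_ps


if __name__ == "__main__":
    # Assumption: 1.2 V is the nominal supply voltage.
    print(f"Supply variation: {supply_variation_percent(1.2, 1.3):.1f}%")
    for idx, condition in enumerate(CONDITIONS):
        print(condition)
        for name in MEAN_POWER_UW:
            if name == PROPOSED:
                continue
            saving = power_saving_percent(
                MEAN_POWER_UW[name][idx], MEAN_POWER_UW[PROPOSED][idx]
            )
            ratio = delay_ratio(MEAN_DELAY_PS[name][idx], MEAN_DELAY_PS[PROPOSED][idx])
            print(f"  vs {name}: power saving {saving:.2f}%, delay ratio {ratio:.2f}x")
```

Reading the two quantities side by side makes the trade-off explicit: the power saving comes at the cost of a longer D-QB delay in every simulated condition.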

## 5 Conclusions

We proposed a novel dual-edge implicit pulse-triggered flip-flop with an embedded clock-gating scheme (DIFF-CGS). It achieves a substantial power reduction by employing a clock-gating scheme in the pulse generation stage, which conditionally disables the inverter chain when the input data are kept unchanged, thereby blocking the redundant delayed clock signals and reducing the redundant transitions of internal nodes. Based on SMIC 65 nm technology, extensive post-layout simulation results show that DIFF-CGS reduces power consumption by 41.39% to 56.21% compared with its counterparts at 10% data-switching activity in the typical corner. Also, full-swing operations in both implicit pulse generation and the static latch improve the robustness of the design. Therefore, the proposed DIFF-CGS is well suited to power-efficient applications in VLSI designs that are not sensitive to delay.

#### References

- Geng, L., Shen, J.Z., Xu, C.Y., 2016. Design of flip-flops with clock-gating and pull-up control scheme for power-constrained and speed-insensitive applications. *IET Comput. Dig. Techn.*, **10**(4):193-201. http://dx.doi.org/10.1049/iet-cdt.2015.0139
- Goh, W.L., Yeo, K.S., Zhang, W., *et al.*, 2007. A novel static dual edge-trigger flip-flop for high-frequency low-power application. IEEE Int. Symp. on Integrated Circuits, p.208-211. http://dx.doi.org/10.1109/ISICIR.2007.4441834
- Hwang, Y.T., Lin, J.F., Sheu, M.H., 2012. Low-power pulse-triggered flip-flop design with conditional pulse-enhancement scheme. *IEEE Trans. VLSI Syst.*, **20**(2):361-366. http://dx.doi.org/10.1109/TVLSI.2010.2096483
- Hyman, R., Ranganathan, N., Bingel, T., *et al.*, 2013. A clock control strategy for peak power and RMS current reduction using path clustering. *IEEE Trans. VLSI Syst.*, **21**(2):259-269. http://dx.doi.org/10.1109/TVLSI.2012.2186989

- Judy, D.J., Kanchana Bhaaskaran, V.S., 2012. Energy recovery clock gating scheme and negative edge triggering flip-flop for low power applications. Int. Conf. on Devices, Circuits and Systems, p.140-143. http://dx.doi.org/10.1109/ICDCSyst.2012.6188691
- Kawaguchi, H., Takayasu, S., 1998. A reduced clock-swing flip-flop (RCSFF) for 63% power reduction. *IEEE J. Sol.-State Circ.*, **33**(5):807-811. http://dx.doi.org/10.1109/4.668997
- Kim, S., Han, I., Paik, S., *et al.*, 2011. Pulser gating: a clock gating of pulsed-latch circuits. Proc. IEEE Asia South Pacific Design Automation Conf., p.190-195. http://dx.doi.org/10.1109/ASPDAC.2011.5722182
- Klass, F., Amir, C., Das, A., *et al.*, 1999. A new family of semidynamic and dynamic flip-flops with embedded logic for high-performance processors. *IEEE J. Sol.-State Circ.*, **34**(5):712-716. http://dx.doi.org/10.1109/ASPDAC.2011.5722182
- Ko, U., Balsara, P.T., 2000. High-performance energy-efficient D-flip-flop circuits. *IEEE Trans. VLSI Syst.*, **8**(1):94-98. http://dx.doi.org/10.1109/92.820765
- Kong, B.S., Kim, S.S., Jun, Y.H., 2001. Conditional-capture flip-flop for statistical power reduction. *IEEE J. Sol.-State Circ.*, **36**(8):1263-1271. http://dx.doi.org/10.1109/4.938376
- Kulkarni, S.H., Sylvester, D., 2004. High performance level conversion for dual V<sub>DD</sub> design. *IEEE Trans. VLSI Syst.*, **12**(9):926-936. http://dx.doi.org/10.1109/TVLSI.2004.833667

- Maxim, A., Gheorghe, M., 2001. A novel physical based model of deep submicron CMOS transistors mismatch for Monte Carlo SPICE simulation. IEEE Int. Symp. on Circuits and Systems, p.511-514. http://dx.doi.org/10.1109/ISCAS.2001.922097
- Nedovic, N., Aleksic, M., Oklobdzija, V.G., 2002. Conditional pre-charge techniques for power-efficient dual-edge clocking. Proc. Int. Symp. on Low Power Electronics and Design, p.56-59. http://dx.doi.org/10.1109/LPE.2002.146709
- Phyu, M.W., Fu, K., Goh, W.L., *et al.*, 2011. Power-efficient explicit-pulsed dual-edge triggered sense-amplifier flip-flops. *IEEE Trans. VLSI Syst.*, **19**(1):1-9. http://dx.doi.org/10.1109/TVLSI.2009.2029116
- Shen, J.Z., Geng, L., Wu, X.X., 2015. Low power pulse-triggered flip-flop based on clock triggering edge control technique. *J. Circ. Syst. Comput.*, **24**(07):1550094. http://dx.doi.org/10.1142/S0218126615500942
- Stojanovic, V., Oklobdzija, V.G., 1999. Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems. *IEEE J. Sol.-State Circ.*, **34**(4):536-548. http://dx.doi.org/10.1109/4.753687
- Strollo, A.G.M., de Caro, D., Napoli, E., *et al.*, 2005. A novel high-speed sense-amplifier-based flip-flop. *IEEE Trans. VLSI Syst.*, **13**(11):1266-1274. http://dx.doi.org/10.1109/TVLSI.2005.859586
- Teh, C.K., Hamada, M., Fujita, T., *et al.*, 2006. Conditional data mapping flip-flops for low-power and high-performance systems. *IEEE Trans. VLSI Syst.*, **14**(12):1379-1383. http://dx.doi.org/10.1109/TVLSI.2006.887833

- Teh, C.K., Fujita, T., Hara, H., *et al.*, 2011. A 77% energy-saving 22-transistor single-phase-clocking D-flip-flop with adaptive-coupling configuration in 40nm CMOS. IEEE Int. Solid-State Circuits Conf. on Digest of Technical Papers, p.338-340. http://dx.doi.org/10.1109/ISSCC.2011.5746344
- Weste, N.H.E., 2006. CMOS VLSI Design: a Circuits and Systems Perspective (3rd Ed.). Pearson Education, Noida, India.
- Wu, Q., Pedram, M., Wu, X., 2000. Clock-gating and its application to low power design of sequential circuits. *IEEE Trans. Circ. Syst.*, **47**(3):415-420. http://dx.doi.org/10.1109/81.841927
- Wu, X.X., Shen, J.Z., 2012. Low-power explicit-pulsed triggered flip-flop with robust output. *Electron. Lett.*, **48**(24):1523-1525. http://dx.doi.org/10.1049/el.2012.0943
- Xiang, G.P., Shen, J.Z., Wu, X.X., *et al.*, 2013. Design of a low-power pulse-triggered flip-flop with conditional clock technique. IEEE Int. Symp. on Circuits and Systems, p.121-124. http://dx.doi.org/10.1109/ISCAS.2013.6571797
- Zeitzoff, P.M., Chung, J.E., 2005. A perspective from the 2003 ITRS: MOSFET scaling trends, challenges, and potential solutions. *IEEE Circ. Dev. Mag.*, **21**(1):4-15. http://dx.doi.org/10.1109/MCD.2005.1388764
- Zhao, P., Darwish, T.K., Bayoumi, M.A., 2004. High-performance and low power conditional discharge flip-flop. *IEEE Trans. VLSI Syst.*, **12**(5):477-484. http://dx.doi.org/10.1109/TVLSI.2004.826192