# Efficient and optimized approximate GDI full adders based on dynamic threshold CNTFETs for specific least significant bits 

Ayoub SADEGHI ${ }^{1}$, Razieh GHASEMI ${ }^{2}$, Hossein GHASEMIAN ${ }^{\dagger \ddagger 3}$, Nabiollah SHIRI ${ }^{1}$<br>${ }^{1}$ Department of Electrical Engineering, Shiraz Branch, Islamic Azad University, Shiraz 7198774731, Iran<br>${ }^{2}$ School of Electrical Engineering, Iran University of Science and Technology, Tehran 1684613114, Iran<br>${ }^{3}$ Department of Electrical and Electronic Engineering, Shiraz University of Technology, Shiraz 7155713876, Iran<br>${ }^{\dagger}$ E-mail: H.ghasemian@sutech.ac.ir<br>Received Mar. 1, 2022; Revision accepted Sept. 13, 2022; Crosschecked Mar. 27, 2023


#### Abstract

Carbon nanotube field-effect transistors (CNTFETs) are reliable alternatives to conventional transistors, especially for use in approximate computing (AC) based error-resilient digital circuits. In this paper, CNTFET technology and the gate diffusion input (GDI) technique are merged, and three new AC-based full adders (FAs) are presented with 6, 6, and 8 transistors, respectively. The nondominated sorting based genetic algorithm II (NSGA-II) is used to attain the optimal performance of the proposed cells by considering the number of tubes and chirality vectors as its variables. The results confirm the circuits' improvement by about $50 \%$ in terms of power-delay-product (PDP) at the cost of area occupation. The Monte Carlo method (MCM) and $32-\mathrm{nm}$ CNTFET technology are used to evaluate the lithographic variations and the stability of the proposed circuits during the fabrication process, in which the higher stability of the proposed circuits compared to those in the literature is observed. The dynamic threshold (DT) technique in the transistors of the proposed circuits amends the possible voltage drop at the outputs. Circuitry performance and error metrics of the proposed circuits nominate them for the least significant bit (LSB) parts of more complex arithmetic circuits such as multipliers.


Key words: Carbon nanotube field-effect transistor (CNTFET); Optimization algorithm; Nondominated sorting based genetic algorithm II (NSGA-II); Gate diffusion input (GDI); Approximate computing
https://doi.org/10.1631/FITEE.2200077

## 1 Introduction

Digital circuits are vital parts of portable electronic devices, and in recent years, designers have tried to achieve high-performance circuits. The main challenge in integrated circuits (ICs) is the dimensions and number of transistors (Rafiee et al., 2021b). Shrinking the size of transistors toward the nanometer region poses significant challenges to reduce power, delay, and area (Cardenas et al., 2021). Reliable scaling down of transistors has faced vital issues, including short channel effects, drain-induced barrier lowering (DIBL), decreased gate controllability, and hot electron effects (Sadeghi et al., 2020). Hence, two fundamental solutions have been considered: the technology of transistors and the circuit design methodology. Carbon nanotube field-effect transistors (CNTFETs) (Deng and Wong, 2007a, 2007b) and fin field-effect transistors (FinFETs) have been proposed as alternatives for conventional metal-oxide-semiconductor field-effect transistors (MOSFETs). Therefore, an evaluation of circuit fabrication and transistor sizing is required to achieve an optimal sketch (Karimi and Rezai, 2016; Kordrostami et al., 2019). These are critical concepts in very-large-scale integration (VLSI) circuits, especially regarding approximate computing (AC) based arithmetic circuits (Strollo et al., 2020).

The basis of AC-based arithmetic circuits is the full adder (FA) (Mirzaei and Mohammadi, 2020). Various theories of transistor- and gate-level designs along with technology have been expressed in AC. In this research, AC-based FAs based on CNTFET technology are presented. The focus of arithmetic AC-based circuits is on area reduction and performance enhancement; thus, the design techniques are evaluated to this end. In this regard, a research gap is the lack of studies on the gate diffusion input (GDI) technique and its compatibility with CNTFET technology (Morgenshtein et al., 2002).

The contribution of this paper is the assessment of combining CNTFET technology and the GDI technique for designing AC-based arithmetic circuits. As shown in Fig. 1, the GDI technique is considered to design AC-based circuits using CNTFET technology. The most significant contribution of the AC concept is considered at the circuit level. By integrating pioneering design techniques with modern technology, high-efficiency specific-purpose chips, field programmable gate arrays (FPGAs), and systems-on-chip (SoCs) become accessible. To select the best option as a reliable and prospective technology, various factors are important, such as outstanding electrical characteristics including ballistic transport and low OFF-current, and these are seen in CNTFETs.

Carbon nanotubes are defined as two types of CNT transistors (Karimi and Rezai, 2017), which were described in Abdul Hadi et al. (2022) and Ghasemian et al. (2022) as alternatives for MOSFETs. Additionally, CNT transistors have a significant ability to adjust the threshold voltage $\left(V_{\mathrm{th}}\right)$; hence, they are used as reliable devices for multithreshold applications. Fig. 1 shows a CNTFET-based inverter layout consisting of P-CNTs and N-CNTs with 10 tubes and a chirality vector of $(38,0)$. The transistors' parameters, the width of the gate, pitch, $L_{\mathrm{ch}}, L_{\mathrm{dd}}$, and $L_{\mathrm{ss}}$ are shown in Fig. 1. In this paper, optimizing the performance of CNTFETs in a circuit, specifically when they are merged with the GDI technique, is covered by considering different values and theories for the geometric parameters. Therefore, the nondominated sorting genetic algorithm II (NSGA-II) (Abiri et al., 2020) is used as an optimization procedure, which improves the performance of a subject such as energy savings by considering two or more objectives simultaneously. In this regard, according to Fig. 1, the number of tubes and chirality vectors are critical parameters influencing the $V_{\text {th }}$ of transistors and their performance. However, there is a possibility of errors during fabrication; for example, the number of tubes may change (Ghorbani et al., 2022). Therefore, it is necessary to evaluate the optimized circuits considering lithography (Cho and Lombardi, 2016).

Fig. 1 Considered chart to attain future chips

From Fig. 1, the GDI cell is the selected technique, and there is no need to adjust the width of the CNTFETs to achieve equal rise/fall time (Ben-Jamaa et al., 2011). Hence, by changing the number of tubes and chirality vectors, the voltage transfer characteristics (VTC) change, as shown in Fig. 1. The GDI cells reduce the area, but they have a voltage swing drop at their outputs (Rafiee et al., 2021b). Although a single-swing restoration (SR) transistor has been proposed, it increases the area and jeopardizes one of the main goals of AC-based circuits, which lowers the complexity of the circuits in high-bit structures (Morgenshtein et al., 2014). Another solution is the dynamic threshold (DT) technique (Lindert et al., 1999). Fig. 2 indicates different GDI cells (e.g., for an AND gate), which use basic, DT, and SR transistor structures. When using the DT technique, each transistor substrate is dynamically aligned with the gate voltage, so the $V_{\text {th }}$ of the device is adjusted dynamically. When DT-CNT is ON, its $V_{\text {th }}$ decreases, and the current and speed increase. On the other hand, when it is OFF, $V_{\text {th }}$ increases, leakage current is reduced, and power dissipation is minimized (Homulle et al., 2018). Full-swing outputs increase the drivability of the circuit (Rafiee et al., 2022). As shown in Fig. 2, the highest output voltage under tube variations and fan-out 4 (FO4) belongs to the DT-based GDI AND gate. By the DT in CNTFET-based circuits, $V_{\text {th }}$ is controlled, and its drop at the output is reduced (Rafiee et al., 2021b; Sadeghi et al., 2022).

The contributions of this paper are as follows:

1. The DT-GDI technique and CNTFET technology are used to achieve three new approximate FAs.
2. FAs have low numbers of transistors $(6,6,8)$ and high performance. The best values of transistor dimension and threshold voltage $\left(V_{\mathrm{th}}\right)$ are considered. The FAs are implemented in the least significant bit (LSB) parts of the partial product reduction tree (PPRT) of the multipliers.
3. The FAs are optimized based on NSGA-II, and the best performances of the circuits for power, delay, energy, and output swings are extracted.
4. Finally, the proposed circuits are directly used in an error-tolerant application, image processing, and a reasonable trade-off between the circuit and accuracy parameters is confirmed.

Fig. 2 Implementations of the GDI-based AND gate: (a) basic; (b) DT; (c) SRT; (d) output voltage swing variations

## 2 Investigation of AC-based FAs

### 2.1 Analysis of previous AC-based FAs

Previous FAs are classified into full-custom and gate-level designs. However, a combination of them can be considered. In Gupta et al. (2013), approximate FAs were designed based on the general structure of a conventional mirror adder (CMA) to reduce the area and save more power. AMA1, AMA2, and AMA3 are implemented with this principle. Table 1 shows the implementation functions, the number of transistors, and the techniques in the literature.

Venkatachalam and Ko (2017) and Waris et al. (2019) proposed approximate FAs based on changing the conventional block diagram of an exact FA (XOR-XNOR-based). High area and high error rate are the defects of these circuits. Mahdiani et al. (2010) proposed a design known as lower-part OR adder (LOA), where an FA cell was designed as a part of a hardware implementation for most significant bit (MSB) or LSB computing, but the delay and power were not satisfactory. Similar to AMA1-AMA3, in Mirzaei and Mohammadi (2020; 2021), six other circuits were implemented based on the simplification of the CMA circuit. In this paper, all six designs are compared with the proposed circuits. These circuits include a high number of transistors and are designed to increase the stability against unintended variations during the fabrication process. Another circuit suggested in the literature is TGA2 (Yang et al., 2015), which combines transistor-level and gate-level design by integrating pass-transistor logic (PTL) and CMOS. Here, the drop in voltage swing is solved, but the power consumption is increased. An approximate CMOS FA consisting of NOR gates was presented in Waris et al. (2022) to reduce the error rate of output carry $\left(C_{\text {out }}\right)$. The circuit has one error in $C_{\text {out }}$ and two errors in Sum, which produce an appropriate error rate. As given in Table 1, the maximum number of transistors is for VAFA, while the minimum is for AFA5, followed by AFA1. AFA5 is similar to LOA, without input carry $\left(C_{\text {in }}\right)$.

Table 1 Comparison among designs in the literature

| Design | $C_{\text {out }}$ | Sum | Number of transistors | Technique | Reference |
| :--- | :---: | :---: | :---: | :---: | :--- |
| AMA1 | $A C_{\text {in }}+B$ | $(\overline{A \oplus B}) C_{\text {in }}$ | 20 | CMOS | Gupta et al., 2013 |
| AMA2 | $(A+B) C_{\text {in }}+A B$ | $\overline{(A+B) C_{\text {in }}+A B}$ | 14 | CMOS | Gupta et al., 2013 |
| AMA3 | $A C_{\text {in }}+B$ | $\overline{A C_{\text {in }}+B}$ | 11 | CMOS | Gupta et al., 2013 |
| VAFA | $(A+B) C_{\text {in }}$ | $(A+B) \oplus C_{\text {in }}$ | $24 \uparrow$ | CMOS | Venkatachalam and Ko, 2017 |
| NFAx | $\overline{\overline{A B} \overline{C_{\mathrm{in}}}}$ | $\overline{\overline{A B} C_{\text {in }}}$ | 14 | CMOS | Waris et al., 2019 |
| TGA2 | $A+B$ | $(\overline{A \oplus B}) C_{\text {in }}$ | 22 | Hybrid (TG-CMOS) | Yang et al., 2015 |
| LOA | $A B$ | $A+B$ | 12 | CMOS | Mahdiani et al., 2010 |
| AFA1 | $A\left(B+C_{\text {in }}\right)$ | $\overline{A\left(B+C_{\text {in }}\right)}$ | 8 | CMOS | Mirzaei and Mohammadi, 2020 |
| AFA2 | $(A+B) C_{\text {in }}+A B$ | $A+B$ | 18 | CMOS | Mirzaei and Mohammadi, 2020 |
| AFA3 | $A\left(B+C_{\text {in }}\right)$ | $A+B$ | 14 | CMOS | Mirzaei and Mohammadi, 2020 |
| AFA4 | $A B$ | $A \bar{B}+\bar{A} B+C_{\text {in }}$ | 17 | CMOS | Mirzaei and Mohammadi, 2021 |
| AFA5 | $A B$ | $\overline{A B}$ | $6 \downarrow$ | CMOS | Mirzaei and Mohammadi, 2021 |
| AFA6 | $A\left(B+C_{\text {in }}\right)$ | $A \bar{B}+\bar{A} B+C_{\text {in }}$ | 19 | CMOS | Mirzaei and Mohammadi, 2021 |
| NxFA | $\overline{A+B}+\overline{C_{\text {in }}}$ | $\overline{\overline{A+B}+C_{\text {in }}}$ | 14 | CMOS | Waris et al., 2022 |

$\uparrow$ and $\downarrow$ indicate the highest and lowest results, respectively

### 2.2 Proposed AC-based DT-GDI FAs

Three novel approximate FA cells are proposed (Fig. 3). These cells are based on different block diagrams and the GDI technique. From Fig. 3a, Proposed-1 has an XNOR with an AND including six transistors, the same number of transistors as AFA5. Compared to AFA5, Proposed-1 has $C_{\text {in }}$, so its implementations in complex structures do not require extra gates. Additionally, input $A$ is considered as $C_{\text {out }}$ for passing to other gates in a ripple-carry-based structure. To overcome the voltage swing drop due to the use of GDI-AND, the DT-CNT technique is used. However, in high fan-out conditions, the voltage drop problem is still observable. Therefore, in Proposed-2, instead of using AND at Sum, the F1 gate is used, and the main advantage of F1 is that the inverter is created internally.

Fig. 3 Structures of the proposed approximate full adders: (a) Proposed-1; (b) Proposed-2; (c) Proposed-3

The number of transistors is still 6, while the voltage swing of Sum is improved. Additionally, XOR is used instead of XNOR, and the $C_{\text {out }}$ of these two circuits is not changed. In terms of functioning, Proposed-1 with Sum $=(\overline{A \oplus B}) C_{\text {in }}$ is similar to TGA2 and AMA1. However, in TGA2 and AMA1, 16 and 20 transistors are used to produce Sum, respectively. In AMA1, Sum depends on $C_{\text {out }}$, which requires an inverter on the output for swing restoration. Additionally, using F1 and considering $C_{\text {in }}$ as the main input give Sum $=(A \oplus B) \overline{C_{\text {in }}}$. Considering Table 1, none of the circuits in the literature produce such an output.

Proposed-3 improves $C_{\text {out }}$ drivability (Fig. 3c). Its function is similar to that of Proposed-2. The only difference is a transmission gate (TG) for boosting the speed and strengthening the input $A$ as $C_{\text {out }}$. Since the XOR-based GDI uses an inverter for $A$, it is possible to use a TG with both PCNT and NCNT gate terminals connected to $\bar{A}$; therefore, an extra inverter is not needed, so Proposed-3 produces its outputs with eight transistors. TG is used in the $C_{\text {out }}$ path, because in cascading structures such as multipliers (Rafiee et al., 2021b; Sadeghi et al., 2022), inputs come from circuits such as compressors and may have a voltage swing drop, so TG overcomes this problem.

Table 2 shows the truth table of references and the proposed designs compared to their exact types. The highest error rate (ER) is for the circuits with 0.5 ER such as LOA, AFA2, AFA3, AFA5, Proposed-2, and Proposed-3. Considering the normalized mean error distance (NMED), AFA2, AFA3, AFA5, Proposed-2, and Proposed-3 have the maximum values. Transistor-level schematics of the proposed designs are shown in Figs. 4a-4c for Proposed-1 to Proposed-3. Additionally, the output waveforms under a frequency of 1 GHz are shown in Fig. 4d.
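The ER values of the proposed cells can be re-derived from their Boolean functions. The following is a minimal behavioural sketch, assuming $C_{\text{out}}=A$ for all three cells and the Sum expressions stated above (Proposed-3 is treated as logically identical to Proposed-2); the function names are illustrative and do not come from the paper, and no electrical behaviour is modelled.

```python
from itertools import product

def exact_fa(a, b, cin):
    """Exact full adder, returned as (cout, sum)."""
    total = a + b + cin
    return total >> 1, total & 1

def proposed_1(a, b, cin):
    """Assumed logic: Cout = A, Sum = XNOR(A, B) AND Cin."""
    return a, (1 - (a ^ b)) & cin

def proposed_2(a, b, cin):
    """Assumed logic: Cout = A, Sum = XOR(A, B) AND NOT Cin."""
    return a, (a ^ b) & (1 - cin)

def truth_column(cell):
    """'CS' strings for inputs A B Cin = 000 ... 111, as listed in Table 2."""
    return ["%d%d" % cell(a, b, c) for a, b, c in product((0, 1), repeat=3)]

def error_rate(cell):
    """Fraction of the 8 input patterns whose (cout, sum) differs from the exact FA."""
    inputs = list(product((0, 1), repeat=3))
    wrong = sum(cell(*x) != exact_fa(*x) for x in inputs)
    return wrong / len(inputs)

if __name__ == "__main__":
    for name, cell in (("P1", proposed_1), ("P2/P3", proposed_2)):
        print(name, truth_column(cell), error_rate(cell))
```

The sketch covers ER only; the NMED normalization used for Table 2 is not spelled out in this section, so it is not reproduced here.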

The proposed circuits benefit from a fast charge and discharge of internal capacitances, so high-speed outputs are expected. In Proposed-1, $T_{1}-T_{4}$ operate as XNOR. Conventional GDI-based XNOR uses an inverter for input $B$, but here it is used for input $A$ to activate or inactivate the TG of Proposed-3 without an extra inverter.

In Proposed-1, $T_{5}$ and $T_{6}$ act as an AND, in which $C_{\text {in }}$ is connected as the transistor activator and transmits GND and XNOR to the output. Due to the use of GND and the inherent properties of the GDI-AND gate, glitches are seen in the outputs of this circuit, which reduces speed or increases power consumption. Hence, F1 is used in Proposed-2, which prevents $V_{\text {th }}$ drop by generating fresh and inverted signals. With the inherent nature of the internal inverter of F1, $C_{\text {in }}$ is converted to $\bar{C}_{\text {in }}$, and it is then ANDed with XOR. This output produces strong 0 and 1 in all cases, except when $C_{\mathrm{in}}=0$ and XOR=0, where $T_{5}$ produces an output of $\left|V_{\text {thp }}\right|$ when the DT technique is not used. As an advantage, the GDI-based F1 function does not need SR transistors since the employed DT works efficiently; therefore, as shown in Fig. 4d, Sum is full-swing when $C_{\mathrm{in}}=0$ and XOR=0. Another important point of $T_{5}$ in Proposed-2 is its threshold voltage ($V_{\text {th-T5 }}$). As will be illustrated in Section 2.4, by the intelligent optimization algorithm, the best value of the chirality vector for this transistor will be obtained as $(35,0)$. Using this chirality vector, we have $D_{\text {CNT-T5 }}=2.741 \mathrm{~nm}$ and $V_{\text {th-T5 }}=0.156 \mathrm{~V}$, where $D_{\mathrm{CNT}}$ is the CNT diameter. This value of $V_{\mathrm{th}}$, compared to $V_{\mathrm{DD}}=0.9 \mathrm{~V}$, shows a decrease of only $17.33 \%$ when the voltage swing drop occurs (below 20\%), which confirms the high capability of DT for making this output full-swing with a desirable noise margin. In other words, the algorithm considers the largest value of the chirality vector and consequently the smallest value of $V_{\mathrm{th}}$ for $T_{5}$ (among all transistors of Proposed-2). Another contribution of this research is the way by which NSGA-II is used to consider full-swing output. That is, in addition to optimizing power, delay, and power-delay-product (PDP), the algorithm considers the output waveform and compares it with the ideal value to obtain the best swing performance.

Table 2 Truth table for the exact and approximate full adders

| Input | Exact | AMA1 | AMA2 | AMA3 | VAFA | NFAx | TGA2 | LOA |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $A B C_{\text {in }}$ | CS | CS | CS | CS | CS | CS | CS | CS |
| 000 | 00 | 00 | 01 | 01 | 00 | 01 | 00 | 00 |
| 001 | 01 | 01 | 01 | 01 | 01 | 10 | 01 | 00 |
| 010 | 01 | 10 | 01 | 10 | 01 | 01 | 10 | 01 |
| 011 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 01 |
| 100 | 01 | 00 | 01 | 01 | 01 | 01 | 10 | 01 |
| 101 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 01 |
| 110 | 10 | 10 | 10 | 10 | 01 | 11 | 10 | 11 |
| 111 | 11 | 11 | 10 | 10 | 10 | 11 | 11 | 11 |
| ER | / | $0.25 \downarrow$ | $0.25 \downarrow$ | 0.375 | $0.25 \downarrow$ | 0.375 | $0.25 \downarrow$ | $0.5 \uparrow$ |
| NMED | / | $0.083 \downarrow$ | $0.083 \downarrow$ | 0.125 | $0.083 \downarrow$ | 0.125 | $0.083 \downarrow$ | 0.125 |

| Input | AFA1 | AFA2 | AFA3 | AFA4 | AFA5 | AFA6 | NxFA | P1 | P2, P3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $A B C_{\text {in }}$ | CS | CS | CS | CS | CS | CS | CS | CS | CS |
| 000 | 01 | 00 | 00 | 00 | 01 | 00 | 00 | 00 | 00 |
| 001 | 01 | 00 | 00 | 01 | 01 | 01 | 00 | 01 | 00 |
| 010 | 01 | 01 | 01 | 01 | 01 | 01 | 01 | 00 | 01 |
| 011 | 01 | 11 | 01 | 01 | 01 | 01 | 10 | 00 | 00 |
| 100 | 01 | 01 | 01 | 01 | 01 | 01 | 01 | 10 | 11 |
| 101 | 10 | 11 | 11 | 01 | 01 | 11 | 10 | 10 | 10 |
| 110 | 10 | 11 | 11 | 10 | 10 | 10 | 01 | 10 | 10 |
| 111 | 10 | 11 | 11 | 11 | 10 | 11 | 10 | 11 | 10 |
| ER | 0.375 | $0.5 \uparrow$ | $0.5 \uparrow$ | $0.25 \downarrow$ | $0.5 \uparrow$ | $0.25 \downarrow$ | 0.375 | 0.375 | $0.5 \uparrow$ |
| NMED | 0.125 | $0.167 \uparrow$ | $0.167 \uparrow$ | $0.083 \downarrow$ | $0.167 \uparrow$ | $0.083 \downarrow$ | 0.125 | 0.125 | $0.167 \uparrow$ |

CS: $C_{\text {out }}$ Sum; ER: error rate; NMED: normalized mean error distance. $\uparrow$ and $\downarrow$ indicate the highest and lowest results, respectively. P1, P2, and P3 are the Proposed-1, Proposed-2, and Proposed-3 cells, respectively

Fig. 4 Proposed approximate full adders: (a) Proposed-1; (b) Proposed-2; (c) Proposed-3; (d) their output waveforms under a frequency of 1 GHz

In about $50 \%$ of the cases, the GDI cell operates as a regular CMOS inverter (Morgenshtein et al., 2002). On the other hand, using GDI AND, this possibility does not exist, so voltage swing drop and glitch appear in the output of Proposed-1. However, using DT, this issue is covered. In this case, only the DT technique operates as swing restoration, and full-swing output is obtained for all states (such as the table provided in Fig. 4).

In Proposed-3, since an inverter is used in XOR, the possibility of using TG is raised. Therefore, $C_{\text {out }}$ is highly dependent on the state of $A$. The two transistors used in the TG structure of this circuit ($T_{7}$ and $T_{8}$) are not ON (activated) at the same time, and their activation states are changed according to the state of $A$. Therefore, with the appropriate use of NCNT and PCNT transistors for producing 0 and 1, respectively, no voltage swing drop occurs, and the speed is improved when embedded in cascaded structures. In CNT-based TG gates, the rise and fall times are equal, and the on-resistances of P-type and N-type CNTFETs are equal ($R=R_{\mathrm{n}}=R_{\mathrm{p}}$) (Ben-Jamaa et al., 2011). This yields smaller CNTFET gates compared to CMOS gates in the implementation of the same function. Simultaneously, since equally sized P-type and N-type CNTFETs have the same ON-resistance, they are more compact than MOSFET-based gates (which usually use a transistor sizing methodology such as $W_{\text {PMOS }} \approx 3 W_{\text {NMOS }}$). CNTFET technology overcomes the challenge of static power by small-size devices and low $V_{\text {th }}$ (Majerus et al., 2013).

As shown in Fig. 4, Proposed-1 has sensitive glitches at Sum. Various procedures are considered to avoid AND's glitches, including delay balancing, hazard filtering, gate sizing, and transistor sizing. F1, as a kind of AND, improves the performance of Proposed-1 and introduces a solution to remove the glitches in Proposed-2 and Proposed-3.

By using F1, $C_{\text {in }}$, which reaches F1 faster than XOR, is first inverted in F1 without increasing the area, compared to well-known solutions (Vasantha Kumar et al., 2012; Yang et al., 2015), and causes a slight delay for $C_{\text {in }}$ to enter F1, resulting in delay balancing. The procedure used in Proposed-2 and Proposed-3 increases the total delay in comparison with Proposed-1, but it prevents glitches and increases drivability and accuracy.

Moreover, by looking at the $k$-map of Proposed-1, when $C_{\text {in }}=0, T_{5}$ is ON and passes GND to output, and it is PCNT and inappropriate for passing GND; therefore, the hazard of glitch can occur. In total, according to Fig. 4, Proposed-1 has the major failing efforts for producing the full-swing output, although by using DT, these defects are reduced, but not completely avoided. On the other hand, for Proposed-2, first the simultaneous arrival of inputs to F1 is covered by an inherent inverter, and second a minimum possible failing effort of non-full swing output appears and is covered by DT. In conclusion, Proposed-2 and Proposed-3 are better options for cascade structures such as RCAs.

Regarding the highest operating frequency, Proposed-1 suffers from possible glitches, while the two other circuits have much better performance. Proposed-3 with a 2-GHz operating frequency has a better condition. However, in this state, the voltage drop is equal to 0.211 V, which is approximately $20 \%$ of $V_{\mathrm{DD}}$ (still acceptable). For Proposed-2, frequencies greater than 1.5 GHz jeopardize the swings. As expected, Proposed-1 has weak performance against high frequencies, and according to noise margin, it has acceptable values until 1.25 GHz. Proposed-1, in contrast to two other circuits, has some drops regarding both high logic (e.g., 0.645 V) and low logic (e.g., 0.115 V). The peaks for the glitches are 0.48 V and 0.37 V. Another important point is the high stability of $C_{\text {out }}$ in Proposed-3, produced by a TG gate, which makes this cell suitable for ripple effect structures.

Based on the ER of the proposed circuits and their low number of transistors, they are suitable for the LSB parts. Several approximate multipliers have been proposed, in which LSB outputs are classified into truncated or approximate. In truncated multipliers, some of the partial products are not formed, leading to errors in the outputs (Strollo et al., 2020).

To increase the accuracy of approximate multipliers, instead of truncating the LSB outputs, the proposed circuits are used without a significant increase in power and energy (Sadeghi et al., 2022), and a higher accuracy is achieved at the cost of losing a small area. These circuits are used to generate the LSB outputs of $P_{0}$ to $P_{3}$ in an 8-bit multiplier. The proposed circuits are also used to generate the initial signals of the approximate multipliers in the final addition by an RCA. For this reason, in this study, different RCAs with different numbers of approximate bits (NABs) are used to evaluate the proposed circuits.
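A behavioural sketch of an RCA with a configurable NAB is given below. It assumes that the lowest `nab` bit positions use the Proposed-2 logic ($C_{\text{out}}=A$, Sum $=(A \oplus B) \overline{C_{\text{in}}}$) and that the remaining positions use exact FAs. The function names, the default 8-bit width, and the mean-error-distance helper are illustrative assumptions, not the evaluation setup of the paper.

```python
def exact_fa(a, b, cin):
    """Exact full adder, returned as (cout, sum)."""
    total = a + b + cin
    return total >> 1, total & 1

def proposed_2(a, b, cin):
    """Assumed approximate cell: Cout = A, Sum = XOR(A, B) AND NOT Cin."""
    return a, (a ^ b) & (1 - cin)

def rca(x, y, width=8, nab=0):
    """Ripple-carry addition with approximate cells in the `nab` lowest bits."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        cell = proposed_2 if i < nab else exact_fa
        carry, s = cell(a, b, carry)
        result |= s << i
    return result | (carry << width)  # final carry becomes the top bit

def mean_error_distance(width=8, nab=0):
    """Average |approximate - exact| over all operand pairs (exhaustive)."""
    n = 1 << width
    total = sum(abs(rca(x, y, width, nab) - (x + y))
                for x in range(n) for y in range(n))
    return total / (n * n)

if __name__ == "__main__":
    for nab in range(0, 5):
        print(nab, mean_error_distance(width=8, nab=nab))
```

Sweeping `nab` in this way shows how the arithmetic error grows as more LSB positions are approximated, which is the trade-off the NAB study is meant to expose.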

### 2.3 Optimization procedure using NSGA-II

Usually, VLSI circuits are faced with transistor sizing (Naseri and Timarchi, 2018). To date, a reliable mechanism has not been proposed to achieve the best (optimal) performance of CNTFET-based circuits in the literature regarding approximate cells. In this study, this issue is significantly addressed. Unlike many previous works on transistor sizing, we consider all circuit parameters with a specific priority. This arrangement is related to output voltage swing (indicating correct operation of the circuit during transistor sizing) and the best power saving, minimum delay, and minimum PDP. NSGA-II is used (Fig. 5). It considers several variables and objectives at the same time to optimize several objective functions (Deb et al., 2002). Using these procedures often results in a trade-off between the best circuit performance and area consumption (Abiri et al., 2020). Therefore, the intervals considered in the following are not jeopardizing the symmetry of the circuit too much. For the optimization of the cells, a direct mechanism is established between MATLAB and HSPICE tools.

This algorithm is carried out in three main steps. The CNTFET-based circuits are affected by the numbers of tubes and chirality vectors, and are considered for optimization. Before starting the algorithm, preknown data must be optimized. First, with an improved code in MATLAB, the desired circuit is simulated using HSPICE, and the results are stored so that they can be easily called when comparison becomes necessary. In this case, along with the power, delay, and PDP results, the output waveforms are stored as well. In the second step, the problem variables are considered, the number of variables is equal to $V=\left(\alpha_{i}+\beta_{i}\right) T$, where $\alpha_{i}$ and $\beta_{i}$ are the numbers of tubes and chirality vectors of transistor $i$, respectively, and $T$ is the number of transistors. For example, for Proposed-1, $V=12$.

Fig. 5 Optimization procedure using NSGA-II

Now, random populations, the optional number of generations, and the considered intervals for objectives with prior knowledge stored in the previous step are adjusted for the algorithm. The algorithm is initiated with the desired iterations. Genetic operators, including mutation and crossover (Deb et al., 2002), are used to generate different ranks to dominate each previous rank to attain the best fronts. In the third step, one-by-one and corresponding comparisons between the obtained results of waveforms, power, delay, and PDP are performed. If the results are correct, the next comparison is performed; otherwise, the population is reset for the simulation. Optimization results may be obtained in each iteration. In this case, if the algorithm is stopped, better results in the next generations that can be obtained may be ignored. To avoid that, again in MATLAB, a storage environment is provided only for all optimization results, the best of which are reported as the final results. This step is called "selection of the best optimization."
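The ranking into fronts that NSGA-II relies on can be illustrated with a small nondominated-sorting sketch. It uses a simple peel-off procedure rather than the fast bookkeeping of Deb et al. (2002), omits crowding distance and the genetic operators, and treats every objective as minimized. The candidate (power, delay) pairs in the usage section are hypothetical values for illustration only, and `num_variables` assumes two decision variables per transistor (number of tubes and chirality), which matches $V=12$ for Proposed-1.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points: List[Sequence[float]]) -> List[List[int]]:
    """Return fronts as lists of indices; front 0 is the Pareto-optimal set."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def num_variables(transistors: int, per_transistor: int = 2) -> int:
    """Decision-vector length, assuming tubes + chirality for each transistor."""
    return per_transistor * transistors

if __name__ == "__main__":
    # Hypothetical (power in uW, delay in ns) results of five candidate sizings.
    candidates = [(0.30, 2.05), (0.26, 2.10), (0.35, 2.00), (0.40, 2.20), (0.31, 2.06)]
    print(non_dominated_sort(candidates))
    print(num_variables(6), num_variables(8))
```

In the flow described above, each candidate would correspond to one HSPICE run, with the waveform check acting as a feasibility filter before the ranking is applied.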

### 2.4 Optimization procedure results

The explained mechanism is applied to the references and the proposed designs according to the specification provided in Table 3. The preknown data are attained according to constant conditions for the circuit parameters.

The optimization results are provided in Table 4. Additionally, the power, delay, and PDP results from the comparison between the nonoptimized and optimized versions of the proposed cells are given in Table 4.

Table 3 Optimization adjustment and desired conditions for the proposed approximate full adder cells

| Optimization parameter | Value |
| :--- | :---: |
| Population size | 100 |
| Crossover | 0.7 |
| Mutation |  |
| Number of variables | $V=\left(\alpha_{i}+\beta_{i}\right) T$ |
| Maximum number of iterations | 200 |
| Upper-lower bounds of chirality | $13-38(\neq 3 i)$ |
| Upper-lower bounds of the number of tubes | $2-100$ |

Proposed-1 suffers from delay, while the best optimization results belong to Proposed-1 for power and PDP with $2.10 \times$ and $2.09 \times$ improvements, respectively. Regarding delay, Proposed-2 shows a $0.36 \%$ improvement. A comparison between Proposed-2 and Proposed-3 shows better performance of Proposed-2 in terms of power and PDP.
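The figures quoted from Table 4 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the tabulated PDP is the product of the average power and the maximum delay, and that the percentage differences are expressed relative to the optimized value; both conventions are inferred from the numbers rather than stated in the text, and the helper names are illustrative.

```python
def pdp_fj(power_uw: float, delay_ns: float) -> float:
    """Power-delay product: uW x ns = 1e-6 W x 1e-9 s = 1e-15 J = fJ."""
    return power_uw * delay_ns

def improvement_factor(preknown: float, optimized: float) -> float:
    """Ratio form used for Proposed-1 (e.g. '2.10x')."""
    return preknown / optimized

def percent_saving(preknown: float, optimized: float) -> float:
    """Percentage form, relative to the optimized value (assumed convention)."""
    return 100.0 * (preknown - optimized) / optimized

if __name__ == "__main__":
    # Values taken from Table 4 (preknown, optimized).
    print(improvement_factor(0.5413, 0.2572))   # Proposed-1 power
    print(improvement_factor(1.0838, 0.5187))   # Proposed-1 PDP
    print(percent_saving(0.9307, 0.6847))       # Proposed-2 power
    print(percent_saving(0.9771, 0.7363))       # Proposed-3 power
    print(pdp_fj(0.2572, 2.0165))               # Proposed-1 optimized PDP
```

Under these assumptions the computed values agree with the tabulated ones to within rounding.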

### 2.5 Layout of approximate FAs

Electric VLSI 9.07 is a useful tool for MOSFET and CNTFET layouts (Huang JL et al., 2010, 2012). Fig. 6 shows the layout of the proposed FAs, where design rule checking (DRC), electrical rule checking (ERC), and layout versus schematic (LVS) are performed without error. These rules are based on $\lambda$ ($f=2 \lambda$), which exists in the tool as mocmos-cn (cn = carbon nanotube) technology. The values of 0.090, 0.080, and $0.113 \mu \mathrm{~m}^{2}$ are achieved as the area occupation of Proposed-1, Proposed-2, and Proposed-3, respectively, while for their optimized versions, these values are 0.109, 0.120, and $0.153 \mu \mathrm{~m}^{2}$, respectively. These results show $17.43 \%$, $33.33 \%$, and $26.14 \%$ larger areas for the optimized versions of Proposed-1, Proposed-2, and Proposed-3 in comparison with their nonoptimized versions, respectively, with each percentage expressed relative to the optimized area.

Table 4 Optimization results for the proposed approximate full adder cells

| Transistor pair | Number of tubes: Proposed-1 | Number of tubes: Proposed-2 | Number of tubes: Proposed-3 | Chirality vector: Proposed-1 | Chirality vector: Proposed-2 | Chirality vector: Proposed-3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $T_{1}, T_{2}$ | 11, 13 | 20, 27 | 7, 21 | $(25,0),(17,0)$ | $(16,0),(14,0)$ | $(23,0),(20,0)$ |
| $T_{3}, T_{4}$ | 13, 21 | 30, 23 | 28, 6 | $(23,0),(13,0)$ | $(17,0),(25,0)$ | $(22,0),(25,0)$ |
| $T_{5}, T_{6}$ | 24, 14 | 29, 32 | 24, 25 | $(29,0),(25,0)$ | $(35,0),(26,0)$ | $(24,0),(25,0)$ |
| $T_{7}, T_{8}$ |  |  | 20, 17 |  |  | $(22,0),(32,0)$ |

| Parameter | Proposed-1: preknown data | Proposed-1: optimized data | Proposed-1: difference | Proposed-2: preknown data | Proposed-2: optimized data | Proposed-2: difference | Proposed-3: preknown data | Proposed-3: optimized data | Proposed-3: difference |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Average power consumption ($\mu \mathrm{W}$) | 0.5413 | 0.2572 | $2.10 \times$ | 0.9307 | 0.6847 | 35.93\% | 0.9771 | 0.7363 | 32.70\% |
| Maximum propagation delay (ns) | 2.0019 | 2.0165 | -0.73\% | 3.9924 | 3.9782 | 0.36\% | 3.9924 | 3.9795 | 0.32\% |
| Maximum PDP (fJ) | 1.0838 | 0.5187 | $2.09 \times$ | 3.7158 | 2.7241 | 36.4\% | 3.9009 | 2.9301 | 33.13\% |

[^1]

Fig. 6 Layouts of Proposed-1 (a and d), Proposed-2 (b and e), and Proposed-3 (c and f): (a-c) nonoptimized; (d-f) optimized

## 3 Simulation setup and results

In this study, the 32-nm SPICE-compatible compact model is used, which describes enhancement-mode, unipolar MOSFETs with semiconducting single-walled carbon nanotubes as channels (Deng and Wong, 2007a, 2007b). Additionally, the Synopsys HSPICE-H-2013.03SP2 64-bit tool with Stanford University CNFETs Verilog-A Model v. 2.1.1 is used for the simulations. The mentioned technology has a running speed issue in the HSPICE tool; its Verilog-A model is used as a solution (Lee and Wong, 2015). The simulation parameters for the technology are given in Table 5. For constant simulation conditions, the chirality vector and the number of tubes are set to $(38,0)$ and 10 for each transistor, respectively, which results in $D_{\mathrm{CNT}}=2.97 \mathrm{~nm}$ and $V_{\mathrm{th}}=0.144 \mathrm{~V}$.
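The link between the chirality vector and the quoted diameter and threshold voltage can be illustrated with two approximations that are widely used in the CNTFET literature. They are assumptions here, since this section does not state which expressions were used: $D_{\mathrm{CNT}} \approx 0.0783 \sqrt{n^{2}+n m+m^{2}}$ nm and $V_{\mathrm{th}} \approx 0.43 / D_{\mathrm{CNT}}$ (with $D_{\mathrm{CNT}}$ in nm). The function names are illustrative.

```python
import math

# ASSUMED constants (common textbook approximations, not taken from this paper):
DIAMETER_COEFF_NM = 0.0783   # roughly a/pi for a graphene lattice constant a = 0.246 nm
VTH_COEFF_V_NM = 0.43        # V_th [V] ~= 0.43 / D_CNT [nm]


def cnt_diameter_nm(n, m):
    """Approximate nanotube diameter (nm) from the chirality vector (n, m)."""
    return DIAMETER_COEFF_NM * math.sqrt(n * n + n * m + m * m)


def threshold_voltage_v(n, m):
    """Approximate threshold voltage (V), inversely proportional to the diameter."""
    return VTH_COEFF_V_NM / cnt_diameter_nm(n, m)


if __name__ == "__main__":
    # Constant-condition setting used in the simulations: chirality (38, 0).
    print(f"D_CNT ~= {cnt_diameter_nm(38, 0):.3f} nm")
    print(f"V_th  ~= {threshold_voltage_v(38, 0):.4f} V")
```

Under these assumptions, the $(38,0)$ setting lands close to the quoted $D_{\mathrm{CNT}}$ and $V_{\mathrm{th}}$ values. A smaller chirality index gives a thinner tube and a higher threshold voltage, which is why the chirality vector is a useful optimization variable.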

Table 5 Parameters of the applied CNTFET technology

| Parameter | Value | Description |
| :---: | :---: | :---: |
| $L_{\text {ch }}$ | 32 nm | Physical channel length |
| $L_{\text {geff }}$ | 100 nm | Mean free path: intrinsic CNT |
| $L_{\text {ss }}$ | 32 nm | Source side extension regions |
| $L_{\text {dd }}$ | 32 nm | Length of doped CNT drain |
| $K_{\text {gate }}$ | 16 | Gate dielectric constant |
| $T_{\text {ox }}$ | 4 nm | Oxide thickness |
| $C_{\text {sub }}$ | $40 \mathrm{pF} / \mathrm{m}$ | Coupling capacitance between channel and substrate |
| $E_{\text {fi }}$ | 0.6 eV | Fermi level of the doped S/D tube |
| Pitch | 5 nm | Distance between tube centers |
| Chirality vector | $(38,0)$ | Arrangement of the carbon atom angle |
| Number of tubes | 10 | Number of tubes used for each transistor |

For the FAs, a circuit under test (CUT) is provided according to Fig. 7a (Hasan et al., 2020). The circuit inputs pass through two inverters, and a load capacitance equivalent to 1 fF is used at each output after a fan-out of 4 to check the drivability of the circuit (Kandpal et al., 2020). The average power consumption is measured from 0.01 ns over two periods at the intended operating frequency. The delay is measured along the path shown in the CUT, from before the buffers to the end of the path. Additionally, the worst propagation delay is estimated when the outputs reach $50 \%$ of $V_{\mathrm{DD}}$ in rising and falling conditions. All possible patterns of the FA truth table are applied for the measurements. Additionally, PDP is reported as a metric of circuit performance. To include the area consumption level based on the number of transistors, the power-delay-area-product (PDAP) is reported.
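The two composite metrics can be sketched as follows. The sketch assumes that PDP is the product of power and delay (µW × ns gives fJ) and that PDAP is PDP multiplied by the transistor count. The second assumption is inferred from the reported values, which it reproduces for the designs whose transistor counts are stated in the paper (6, 6, and 8 for the proposed cells, 22 for TGA2, 14 for NxFA); it is not quoted from a formula in the text.

```python
def pdp_fj(power_uw, delay_ns):
    """Power-delay product in fJ (1 uW x 1 ns = 1 fJ)."""
    return power_uw * delay_ns


def pdap(pdp_value_fj, n_transistors):
    """Power-delay-area product, ASSUMING area is approximated by the transistor count."""
    return pdp_value_fj * n_transistors


# Reported constant-condition results: design -> (power_uW, delay_ns, pdp_fJ, transistors)
REPORTED = {
    "Proposed-1": (0.5413, 2.0019, 1.0838, 6),
    "Proposed-2": (0.9307, 3.9924, 3.7158, 6),
    "Proposed-3": (0.9771, 3.9924, 3.9009, 8),
    "TGA2": (1.5947, 14.1290, 22.5310, 22),
    "NxFA": (1.7575, 4.0008, 7.0315, 14),
}

for name, (power, delay, pdp_reported, count) in REPORTED.items():
    print(f"{name}: PDP = {pdp_fj(power, delay):.4f} fJ "
          f"(reported {pdp_reported}), PDAP = {pdap(pdp_reported, count):.2f}")
```

Because PDAP scales linearly with the transistor count under this assumption, a cell with few transistors gains an advantage over larger cells even at similar PDP.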


Fig. 7 Simulation setup for the full adders (FAs) (a), RCA structure (b), gate-level structure of the exact FA (c), and transistor-level schematic of the exact FA (d) used in this paper

Additionally, ER, error distance (ED), mean error distance (MED), and NMED (Mirzaei and Mohammadi, 2021) are calculated. The RCA structure shown in Fig. 7b is used to evaluate the proposed circuits and those from the literature. To obtain a fair comparison between the approximate FAs when embedded in the RCA, different scenarios are considered. Different NABs, from NAB1 to NAB4, are applied to the 4-bit RCA. For example, NAB1 means that only the first exact FA, which produces the LSB output, is replaced with an approximate FA. For each NAB, all 512 possible input states are applied. These scenarios are used in HSPICE and MATLAB for circuitry performance and error evaluations, respectively. Since the proposed circuits are based on the DT-GDI technique, the exact cell is selected based on this technique as well. Figs. 7c and 7d show the block diagram and transistor-level schematic of the DT-GDI-based exact FA used for comparison and for RCA implementation with different NABs.
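The error evaluation described above can be sketched in a few lines. The approximate cell used here is a hypothetical placeholder (exact carry, Sum taken as the inverse of the carry), not one of the proposed FAs, whose truth tables are given elsewhere in the paper. NMED is assumed to be MED divided by the maximum possible output of the adder, which is a common definition but is not spelled out in this section.

```python
from itertools import product


def exact_fa(a, b, c):
    """Exact full adder: returns (sum, carry_out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def approx_fa_example(a, b, c):
    """HYPOTHETICAL approximate cell for illustration only:
    exact carry, Sum = NOT(Cout). Wrong for inputs 000 and 111."""
    cout = (a & b) | (a & c) | (b & c)
    return 1 - cout, cout


def rca_add(a, b, cin, nab, width=4):
    """Ripple-carry adder whose `nab` least significant cells are approximate."""
    carry, total = cin, 0
    for i in range(width):
        cell = approx_fa_example if i < nab else exact_fa
        s, carry = cell((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)


def error_metrics(nab, width=4):
    """ER, MED, and NMED over all input states (512 for a 4-bit RCA with carry-in)."""
    max_out = 2 * (2 ** width - 1) + 1          # largest exact result, 31 for 4 bits
    eds = [abs(rca_add(a, b, c, nab, width) - (a + b + c))
           for a, b, c in product(range(2 ** width), range(2 ** width), (0, 1))]
    n = len(eds)
    med = sum(eds) / n
    return {"states": n,
            "ER": sum(ed != 0 for ed in eds) / n,
            "MED": med,
            "NMED": med / max_out}


for nab in range(5):
    print(f"NAB{nab}:", error_metrics(nab))
```

Swapping `approx_fa_example` for the truth table of an actual approximate cell is the only change needed to evaluate another design under the same NAB scenarios.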

### 3.1 Approximate FA assessments

Simulations are carried out under the constant conditions provided in Table 6. The best performance regarding power, delay, PDP, and PDAP belongs to Proposed-1. VAFA has the highest power of $3.0946 \mu \mathrm{~W}$ due to its high number of transistors. TGA2, with 22 transistors, shows the highest delay, PDP, and PDAP. Proposed-2 and Proposed-3 also perform competitively: in terms of power, they are placed after AMA3, AFA1, AFA2, and AFA3, and regarding PDAP, Proposed-1, Proposed-2, AMA2, and Proposed-3 rank as the first four. The main difference between Proposed-1 and the other cells results from its low number of transistors. The main disadvantages of NxFA are its high number of transistors (i.e., 14), the use of an input inverter, and the existence of a high number of direct paths between $V_{\text {DD }}$ and GND. The results extracted from Table 6 show that, compared with the NxFA circuit, Proposed-1 has $69.20 \%, 49.96 \%, 84.50 \%$, and $93.39 \%$ better results for power, delay, PDP, and PDAP, respectively. Although NxFA has a good error rate (comparable to that of Proposed-1), its weak results in terms of power, delay, and specifically PDAP (the most important comparative factors between circuits) mean that it is not considered a main competitor of the proposed circuits.

Table 6 Approximate full adder simulation results under constant conditions*

| Design | Power $(\mu \mathrm{W})$ | Delay $(\mathrm{ns})$ | PDP $(\mathrm{fJ})$ | PDAP |
| :---: | :---: | ---: | ---: | ---: |
| AMA1 | 1.6476 | 6.1268 | 10.0950 | 201.90 |
| AMA2 | 0.9504 | 2.1206 | 2.0156 | 28.22 |
| AMA3 | 0.7581 | 6.1282 | 4.6461 | 51.11 |
| VAFA | $\downarrow 3.0946$ | 2.1043 | 6.5121 | 156.29 |
| NFAx | 2.3293 | 2.1090 | 4.9126 | 68.78 |
| TGA2 | 1.5947 | $\downarrow 14.1290$ | $\downarrow 22.5310$ | $\downarrow 495.68$ |
| LOA | 1.2608 | 6.1051 | 7.6972 | 92.37 |
| AFA1 | 0.6866 | 6.1380 | 4.2145 | 33.72 |
| AFA2 | 0.8888 | 2.1202 | 1.8846 | 33.92 |
| AFA3 | 0.7260 | 14.1280 | 10.2580 | 143.61 |
| AFA4 | 1.2587 | 10.1100 | 12.7250 | 216.33 |
| AFA5 | 1.1360 | 6.1097 | 6.9409 | 41.65 |
| AFA6 | 1.4160 | 6.1302 | 8.6802 | 164.92 |
| NxFA | 1.7575 | 4.0008 | 7.0315 | 98.44 |
| Proposed-1 | $\uparrow 0.5413$ | $\uparrow 2.0019$ | $\uparrow 1.0838$ | $\uparrow 6.50$ |
| Proposed-2 | 0.9307 | 3.9924 | 3.7158 | 22.29 |
| Proposed-3 | 0.9771 | 3.9924 | 3.9009 | 31.21 |

\* Frequency is 250 MHz, temperature is $25^{\circ} \mathrm{C}$, number of tubes is 10, the chirality vector is $(38,0)$, pitch is 4 nm, $T_{\mathrm{ox}}$ is 4 nm, gate length is 32 nm, and measurements cover two periods of time and all patterns. $\uparrow$ and $\downarrow$ indicate the best and worst results, respectively
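The ranking statements above can be checked directly against the Table 6 entries. The following sketch simply sorts the reported values (lower is better for every metric); the dictionary layout and helper name are illustrative choices.

```python
# Table 6 values: design -> (power_uW, delay_ns, pdp_fJ, pdap)
TABLE6 = {
    "AMA1": (1.6476, 6.1268, 10.0950, 201.90),
    "AMA2": (0.9504, 2.1206, 2.0156, 28.22),
    "AMA3": (0.7581, 6.1282, 4.6461, 51.11),
    "VAFA": (3.0946, 2.1043, 6.5121, 156.29),
    "NFAx": (2.3293, 2.1090, 4.9126, 68.78),
    "TGA2": (1.5947, 14.1290, 22.5310, 495.68),
    "LOA": (1.2608, 6.1051, 7.6972, 92.37),
    "AFA1": (0.6866, 6.1380, 4.2145, 33.72),
    "AFA2": (0.8888, 2.1202, 1.8846, 33.92),
    "AFA3": (0.7260, 14.1280, 10.2580, 143.61),
    "AFA4": (1.2587, 10.1100, 12.7250, 216.33),
    "AFA5": (1.1360, 6.1097, 6.9409, 41.65),
    "AFA6": (1.4160, 6.1302, 8.6802, 164.92),
    "NxFA": (1.7575, 4.0008, 7.0315, 98.44),
    "Proposed-1": (0.5413, 2.0019, 1.0838, 6.50),
    "Proposed-2": (0.9307, 3.9924, 3.7158, 22.29),
    "Proposed-3": (0.9771, 3.9924, 3.9009, 31.21),
}
COLUMN = {"power": 0, "delay": 1, "pdp": 2, "pdap": 3}


def rank(metric):
    """Design names sorted from best (lowest) to worst (highest) for one metric."""
    idx = COLUMN[metric]
    return sorted(TABLE6, key=lambda name: TABLE6[name][idx])


for metric in COLUMN:
    order = rank(metric)
    print(f"{metric}: best four = {order[:4]}, worst = {order[-1]}")
```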

The considered mechanism is useful for the fabrication process, during which, for example, the intended number of tubes may deviate. Therefore, before the optimized values are used, the circuits are first examined in terms of lithography and cost on the wafer. Process-voltage-temperature (PVT) variations are commonly used to evaluate circuit performance, but more accurate analyses are preferable when using CNTFETs. Two types of variations are considered: conventional process variations and CNT-specific process variations. The former include channel length, channel width, oxide thickness, and threshold voltage variations, while the latter concern the gate length and width of CNTFETs (Cho and Lombardi, 2016).

Here, for the simulations, parameters such as the number of tubes (which has a direct influence on the density of CNTFETs), gate length, and gate width (relevant to the lithography) of the CNTFETs are considered the variable parameters of the Monte Carlo method (MCM) for 1000 runs with a Gaussian distribution ($\pm 5 \%$ distribution at the $\pm 3 \sigma$ level) (Ghorbani et al., 2022). A lithographic process using gate length and width variations has a small impact on the gate capacitance of a CNTFET. Variations in the lithographic process and density do not affect the power dissipation because they do not change the current in a CNTFET. When the number of tubes is changed, the power is varied. Hence, considering variations in both lithography and the number of tubes (density), a comprehensive understanding of CNTFET-based fabrication on wafers arises.
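The sampling scheme behind such an analysis can be sketched as follows. A tolerance of $\pm 5 \%$ at the $\pm 3 \sigma$ level corresponds to a standard deviation of (0.05 × nominal)/3. The nominal gate length (32 nm) and tube count (10) follow the constant conditions of this paper, whereas the nominal gate width below is a hypothetical placeholder, and the sketch only draws parameter sets rather than running the HSPICE simulations themselves.

```python
import random
import statistics

# Nominal values; gate_width_nm is a HYPOTHETICAL placeholder, not from the paper.
NOMINALS = {"gate_length_nm": 32.0, "gate_width_nm": 50.0, "num_tubes": 10}


def sample_parameter(nominal, rng, tolerance=0.05, n_sigma=3.0):
    """Draw one Gaussian sample whose +/- n_sigma spread equals +/- tolerance."""
    sigma = nominal * tolerance / n_sigma
    return rng.gauss(nominal, sigma)


def monte_carlo_runs(n_runs=1000, seed=0):
    """Generate n_runs parameter sets for a Monte Carlo sweep."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        runs.append({
            "gate_length_nm": sample_parameter(NOMINALS["gate_length_nm"], rng),
            "gate_width_nm": sample_parameter(NOMINALS["gate_width_nm"], rng),
            # Tube counts are integers; with such a small sigma most samples
            # round back to the nominal value in this simplified sketch.
            "num_tubes": max(1, round(sample_parameter(NOMINALS["num_tubes"], rng))),
        })
    return runs


if __name__ == "__main__":
    runs = monte_carlo_runs()
    lengths = [r["gate_length_nm"] for r in runs]
    print(f"runs: {len(runs)}")
    print(f"gate length mean = {statistics.mean(lengths):.3f} nm, "
          f"stdev = {statistics.stdev(lengths):.3f} nm")
```

Each generated parameter set would then be passed to the circuit simulator, and the mean, minimum, and maximum of power, delay, and PDP would be collected over all runs.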

The mean, minimum, and maximum results for power, delay, PDP, and PDAP are shown in Fig. 8. According to Fig. 8a, the mean power consumption of Proposed-1 is the lowest, followed by AFA1, Proposed-2, and AFA2. All three proposed cells show the best mean PDAP results, owing to their low PDP and small number of transistors. The same conclusion holds for the minimum and maximum values of power, delay, and PDP, as shown in Figs. 8b and 8c.

Comparing the power, delay, and PDP distributions obtained using MCM (Fig. 9) confirms the stability of the proposed circuits. A histogram closer to the origin of the $X$ and $Y$ axes indicates higher stability in terms of circuitry performance. On the other hand, Fig. 8d shows the stability of the circuit outputs against the fabrication process, where Proposed-1 exhibits higher sensitivity than the other two cells. This is due to the use of an unstable gate, GDI-AND, in its structure. Overall, the stability of the proposed circuits against mismatches in the fabrication process is acceptable, and one can rely on the results obtained from the optimization procedure.

### 3.2 Approximate RCA evaluations

Here, a comprehensive investigation of the approximate RCAs based on the approximate FAs is provided. Fig. 10 shows the performance of the circuits versus different NABs in terms of power, delay, PDP, and PDAP. Regarding power, PDP, and PDAP, the proposed cells have the best results compared to the other cells when the optimized values are considered. Among the proposed cells, Proposed-2 has the highest delay for NAB1 to NAB4. The proposed cells are also appropriate for RCAs with a higher number of input bits, such as 15 bits, used as the final addition stages of multipliers or even in their PPRT.


Fig. 8 Lithographic variation results by MCM for mean (a), minimum (b), maximum (c), and output (d) waveforms

Higher NABs cause a higher possibility of error in the outputs. Fig. 11 depicts the performance of the RCAs in terms of NMED. As expected from Table 2, some circuits, such as AMA1, have higher output accuracy than the proposed circuits. Hence, AMA1, AMA2, VAFA, and AFA4 have lower NMED values than the proposed circuits.

In contrast, in terms of the circuitry behavior shown in Fig. 10, AMA1, AMA2, VAFA, and AFA4 have worse results. Therefore, an appropriate competency criterion is needed to evaluate the performance of approximate circuits. In this regard, Fig. 12 shows NMED versus PDP and PDAP. Fig. 12a shows the PDP versus NMED results extracted from all considered NABs of the RCAs. Plotting the trendline of the proposed circuits on a logarithmic scale demonstrates the best Pareto optimal curve, due mainly to their appropriate PDP. To illustrate the trade-off between circuitry and accuracy performance, Figs. 12b and 12c are shown. From Fig. 12b, in terms of average NMED and average PDP, the proposed cells have better conditions compared to AMA3, NFAx, LOA, AFA2, AFA1, AFA5, and AFA3. TGA2, AMA1, AFA4, AFA6, AMA2, VAFA, and NxFA are better only in NMED. The main conclusion here is the strength of the proposed cells regarding circuitry performance and NMED. The same conclusion holds for Fig. 12c, which shows PDAP versus NMED; the proposed cells have a better condition in both terms compared to most of the designs. Therefore, these results suggest the proposed cells as proper designs for implementation in specific-purpose future-generation chips that are error-tolerant and energy-efficient.
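The notion of Pareto optimality used in this comparison can be made concrete with a small sketch. A design is on the Pareto front if no other design is at least as good in both PDP and NMED and strictly better in one of them. The four data points below are made-up placeholders with generic names, chosen only to exercise the function; they are not results from this paper.

```python
def pareto_front(points):
    """Return the names of non-dominated designs, sorted by PDP.

    points: dict mapping name -> (pdp, nmed); lower is better for both.
    """
    front = []
    for name, (pdp, nmed) in points.items():
        dominated = any(
            o_pdp <= pdp and o_nmed <= nmed and (o_pdp < pdp or o_nmed < nmed)
            for other, (o_pdp, o_nmed) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])


# PLACEHOLDER data for illustration only (not measured values).
EXAMPLE_POINTS = {
    "cell_A": (1.0, 0.10),   # lowest PDP, highest error
    "cell_B": (2.0, 0.05),
    "cell_C": (3.0, 0.08),   # dominated by cell_B (worse in both metrics)
    "cell_D": (5.0, 0.01),   # highest PDP, lowest error
}

print(pareto_front(EXAMPLE_POINTS))
```

With real data, the (PDP, NMED) pairs of every design at every NAB would replace the placeholder dictionary, and the resulting front would correspond to the trade-off curve discussed for Fig. 12a.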

Fig. 9 Lithographic variation results by MCM analysis in terms of power (a), delay (b), and PDP (c)

Fig. 10 Performance of the optimized approximate RCAs versus different numbers of approximate bits: (a) power; (b) delay; (c) PDP; (d) PDAP

Fig. 11 NMED evaluations of the optimized approximate RCAs versus different numbers of approximate bits (NABs)

### 3.3 Case study: digital image addition

The proposed circuits are investigated in a real error-tolerant application, image addition (Huang JQ et al., 2021), based on the mechanism described in Sadeghi et al. (2022). Initially, gray input images (Figs. 13a and 13b) are considered and converted to binary images (Figs. 13c and 13d), which are then converted to binary equivalent signals in piecewise linear (PWL) format by codes developed in MATLAB. The attained signals are applied to the digital circuits in HSPICE, and the real performance of these circuits is extracted in terms of circuitry parameters such as power, delay, and PDP, along with image quality assessments such as the peak signal-to-noise ratio (PSNR) and structural similarity index metric (SSIM) according to Eqs. (1) and (2). Additionally, the figures of merit (FoMs) in Eqs. (3) and (4) give a better understanding of the designs. The mentioned procedure is carried out using an RCA with NAB3 for all reference and proposed cells. The output images obtained by Proposed-1, Proposed-2, and Proposed-3 are shown in Figs. 13e, 13f, and 13g, respectively. The output images of Proposed-2 and Proposed-3 have slightly blurred pixels compared to those of Proposed-1 due to their higher error rates.

$$
\begin{gather*}
\operatorname{PSNR}=10 \lg \frac{m p \mathrm{MAX}_{\mathrm{I}}^{2}}{\sum_{i=0}^{m-1} \sum_{j=0}^{p-1}[I(i, j)-K(i, j)]^{2}},  \tag{1}\\
\operatorname{SSIM}(x, y)=\frac{\left(2 \mu_{x} \mu_{y}+C_{1}\right)\left(2 \sigma_{x y}+C_{2}\right)}{\left(\mu_{x}^{2}+\mu_{y}^{2}+C_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2}\right)},  \tag{2}\\
\text { FoM1 }=\frac{\text { Normalized PDP } / \mathrm{pJ}}{\text { PSNR } / \mathrm{dB} \cdot \operatorname{SSIM}} \times 100 \%,  \tag{3}\\
\text { FoM2 }=\frac{\text { Normalized PDAP }}{\text { PSNR } / \mathrm{dB} \cdot \mathrm{SSIM}} \times 100 \% \tag{4}
\end{gather*}
$$

In Eqs. (1) and (2), $m$ and $p$ are the image dimensions, $\mathrm{MAX}_{\mathrm{I}}$ is the maximum value of each pixel, $I(i, j)$ and $K(i, j)$ are the exact and obtained values for each pixel, respectively, $\mu_{x}$ and $\mu_{y}$ are the pixel sample means of the $x$ and $y$ images, respectively, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are the variances of the $x$ and $y$ images, respectively, $\sigma_{x y}$ is the covariance of the $x$ and $y$ images, and $C_{1}=\left(K_{1} L\right)^{2}$ and $C_{2}=\left(K_{2} L\right)^{2}$ are two variables that stabilize the division when the denominator is weak, where $K_{1}$ and $K_{2}$ are small unitless constants generally equal to 0.01 and 0.03, respectively, and $L=255$ is the dynamic range of the pixel values.

Fig. 12 Comparison of the optimized approximate RCAs for different numbers of approximate bits (NABs) in terms of all NABs' PDP simultaneously (a), average PDP (b), and average PDAP (c) versus NMED

Fig. 13 Image addition application: (a,b) greyscale input images; (c, d) binary scale input images; (e) outputs of Proposed-1; (f) outputs of Proposed-2; (g) outputs of Proposed-3
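Eqs. (1) and (2) can be sketched in plain Python as follows. This is a simplified illustration, not the MATLAB code used in the paper: SSIM is evaluated once over the whole image (practical implementations usually average it over local windows), population variances are assumed, and images are represented as nested lists of pixel values.

```python
import math


def psnr(exact, approx, max_i=255):
    """Eq. (1): PSNR in dB between an exact and an approximate m x p image."""
    m, p = len(exact), len(exact[0])
    squared_error = sum((exact[i][j] - approx[i][j]) ** 2
                        for i in range(m) for j in range(p))
    if squared_error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(m * p * max_i ** 2 / squared_error)


def ssim_global(x, y, k1=0.01, k2=0.03, dyn_range=255):
    """Eq. (2) evaluated over the whole image (single-window ASSUMPTION)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    c1, c2 = (k1 * dyn_range) ** 2, (k2 * dyn_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [[0, 0], [0, 0]]
    degraded = [[0, 0], [0, 255]]   # one of four pixels is wrong
    print(f"PSNR = {psnr(reference, degraded):.4f} dB")
    print(f"SSIM = {ssim_global(reference, degraded):.4f}")
```

In the image addition case study, `exact` would hold the sum produced by the exact RCA and `approx` the sum produced by the RCA containing the approximate cells.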

In the results (Fig. 14), TGA2, the circuit with the worst results, is taken as the reference to which the other designs are normalized. The proposed cells, with a low number of transistors and appropriate PDP and PDAP, have the best performance in the conducted image processing application.


Fig. 14 FoM1 and FoM2 results for the approximate RCA with three approximate bits and different approximate full adders in image addition

Proposed-1, with FoM1 and FoM2 values of $8 \%$ and $3 \%$ relative to TGA2, is the best circuit. Proposed-2 performs similarly to Proposed-1 regarding FoM2, while its FoM1 is similar to that of AMA2. The results show that the proposed cells establish an appropriate trade-off between circuitry performance and accuracy for error-tolerant applications.

Approximate cells are usually connected as an RCA with a large number of input bits. In this regard, as a multibit evaluation, the proposed cells and those in the literature are used in 8-bit and 16-bit RCA implementations with $50 \%$ approximate bits to evaluate them in real circumstances (Huang JQ et al., 2021). Simulations are performed, and the results in terms of FoM3 (Eq. (5)), which combines circuitry and accuracy performance, are reported (Sabetzadeh et al., 2019), as shown in Fig. 15:

$$
\begin{equation*}
\text { FoM3 }=\frac{\text { PDP }}{1-\mathrm{NMED}} . \tag{5}
\end{equation*}
$$

A design with a smaller FoM3 value reaches a better trade-off between hardware and accuracy. The simulations are performed at an operating frequency of 250 MHz, and the inputs are applied to the circuit to cover all possible states. Additionally, the circuits used in the simulations are all based on the optimized values attained previously. Accordingly, the three proposed circuits have better conditions in terms of PDP and PDAP. However, in some cases, the delay and NMED values of the proposed circuits are worse than those of the others. As the number of input bits increases, the delay of Proposed-1 increases, whereas for Proposed-2 and Proposed-3 it is the NMED that increases. The reason for the poorer performance of Proposed-1 is the voltage swing drop at its Sum output. In general, the proposed circuits perform better in terms of FoMs that combine different parameters, such as power, delay, PDP, the number of transistors (through PDAP), and NMED, and are therefore more suitable for use in more complex circuits.
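Eq. (5) is simple enough to state as a one-line function. The sketch below adds only an input check, since the expression is undefined for NMED = 1; the numerical inputs in the usage lines are arbitrary illustrative values, not simulation results.

```python
def fom3(pdp, nmed):
    """Eq. (5): FoM3 = PDP / (1 - NMED); a smaller value means a better trade-off."""
    if not 0.0 <= nmed < 1.0:
        raise ValueError("NMED must lie in [0, 1)")
    return pdp / (1.0 - nmed)


# Arbitrary illustrative inputs (NOT results from the paper):
print(fom3(2.0, 0.5))   # an inaccurate design pays a penalty on top of its PDP
print(fom3(2.0, 0.0))   # an exact design keeps FoM3 equal to its PDP
```

Because the denominator stays close to 1 for small NMED values, FoM3 is dominated by PDP unless the error becomes large, which is consistent with the proposed cells ranking well despite their higher NMED.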


Fig. 15 Eight- and 16-bit RCA results versus FoM3 with 50\% NABs

## 4 Conclusions

In this paper, a new approach for designing approximate computing based arithmetic circuits is proposed. A reliable combination of CNTFETs and GDI is established as the principal technology and design technique. Three approximate full adders are proposed with a small area and a small number of transistors. Regarding performance optimization, one of the main challenges in using CNTFETs, the NSGA-II algorithm is applied with the number of tubes and chirality vectors of the transistors as its variables. The optimization results indicate an approximately $50 \%$ improvement in terms of power, delay, and PDP for some of the proposed cells compared to nonoptimized conditions. Additionally, lithography evaluation based on the Monte Carlo method is performed, and the stability and reliability of the proposed cells based on the GDI technique are confirmed. The results achieved are attributed to the dynamic threshold technique that is used for the transistors of the proposed cells. Investigations of error metrics in terms of the normalized mean error distance (NMED) and circuitry performance of the proposed cells, implemented in a ripple-carry adder (RCA) under different numbers of approximate bits, are carried out using MATLAB and HSPICE. The results obtained in an error-tolerant application, image addition, indicate the appropriate performance of the proposed circuits in terms of both circuitry and accuracy.

## Contributors

Ayoub SADEGHI designed the research. Razieh GHASEMI and Hossein GHASEMIAN processed the data. Ayoub SADEGHI and Hossein GHASEMIAN drafted the paper. Hossein GHASEMIAN and Nabiollah SHIRI supervised the study and revised and finalized the paper.

## Compliance with ethics guidelines

Ayoub SADEGHI, Razieh GHASEMI, Hossein GHASEMIAN, and Nabiollah SHIRI declare that they have no conflict of interest.

## Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## References

Abdul Hadi MF, Hussin H, Soin N, 2022. The impact of variation in diameter and dielectric materials of the CNT field-effect transistor. ECS J Sol State Sci Technol, 11(2):023002.
https://doi.org/10.1149/2162-8777/ac4ffc
Abiri E, Darabi A, Salehi MR, et al., 2020. Optimized gate diffusion input method-based reversible magnitude arithmetic unit using non-dominated sorting genetic algorithm II. Circ Syst Signal Process, 39(9):4516-4551.
https://doi.org/10.1007/s00034-020-01382-1
Ben-Jamaa MH, Mohanram K, de Micheli G, 2011. An efficient gate library for ambipolar CNTFET logic. IEEE Trans Comput-Aided Des Integr Circ Syst, 30(2):242-255.
https://doi.org/10.1109/tcad.2010.2085250
Cardenas JA, Lu SH, Williams NX, et al., 2021. In-place printing of flexible electrolyte-gated carbon nanotube transistors with enhanced stability. IEEE Electron Dev Lett, 42(3):367-370.
https://doi.org/10.1109/led.2021.3055787
Cho G, Lombardi F, 2016. Design and process variation analysis of CNTFET-based ternary memory cells. Integration, 54:97-108. https://doi.org/10.1016/j.vlsi.2016.02.003
Deb K, Pratap A, Agarwal S, et al., 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput, 6(2):182-197. https://doi.org/10.1109/4235.996017
Deng J, Wong HSP, 2007a. A compact SPICE model for carbon-nanotube field-effect transistors including nonidealities and its application-Part I: model of the intrinsic channel region. IEEE Trans Electron Dev, 54(12):3186-3194. https://doi.org/10.1109/ted.2007.909030
Deng J, Wong HSP, 2007b. A compact SPICE model for carbon-nanotube field-effect transistors including nonidealities and its application-Part II: full device model and circuit performance benchmarking. IEEE Trans Electron Dev, 54(12):3195-3205.
https://doi.org/10.1109/ted.2007.909043
Ghasemian A, Abiri E, Hassanli K, et al., 2022. HF-QSRAM: half-select free quaternary SRAM design with required peripheral circuits for IoT/IoVT applications. ECS J Sol State Sci Technol, 11(1):011002. https://doi.org/10.1149/2162-8777/ac4798
Ghorbani A, Dolatshahi M, Zanjani SM, et al., 2022. A new low-power dynamic-GDI full adder in CNFET technology. Integration, 83:46-59. https://doi.org/10.1016/j.vlsi.2021.12.001
Gupta V, Mohapatra D, Raghunathan A, et al., 2013. Low-power digital signal processing using approximate adders. IEEE Trans Comput-Aided Des Integr Circ Syst, 32(1):124-137. https://doi.org/10.1109/tcad.2012.2217962
Hasan M, Zaman HU, Hossain M, et al., 2020. Gate diffusion input technique based full swing and scalable 1-bit hybrid full adder for high performance applications. Eng Sci Technol Int J, 23(6):1364-1373.
https://doi.org/10.1016/j.jestch.2020.05.008
Homulle H, Song L, Charbon E, et al., 2018. The cryogenic temperature behavior of bipolar, MOS, and DTMOS transistors in standard CMOS. IEEE J Electron Dev Soc, 6:263-270. https://doi.org/10.1109/jeds.2018.2798281
Huang JL, Zhu MH, Gupta P, et al., 2010. A CAD tool for design and analysis of CNFET circuits. Proc IEEE Int Conf of Electron Devices and Solid-State Circuits, p.1-4.
https://doi.org/10.1109/edssc.2010.5713735
Huang JL, Zhu MH, Yang SQ, et al., 2012. A physical design tool for carbon nanotube field-effect transistor circuits. ACM J Emerg Technol Comput Syst, 8(3):25.
https://doi.org/10.1145/2287696.2287708
Huang JQ, Kumar TN, Almurib HAF, et al., 2021. Commutative approximate adders: analysis and evaluation. Proc IEEE/ACM Int Symp on Nanoscale Architectures, p.1-6. https://doi.org/10.1109/nanoarch53687.2021.9642233
Kandpal J, Tomar A, Agarwal M, et al., 2020. High-speed hybrid-logic full adder using high-performance 10-T XOR-XNOR cell. IEEE Trans Very Large Scale Integr Syst, 28(6):1413-1422. https://doi.org/10.1109/tvlsi.2020.2983850
Karimi A, Rezai A, 2016. Improved device performance in CNTFET using genetic algorithm. ECS J Sol State Sci Technol, 6(1):M9-M12. https://doi.org/10.1149/2.0101701jss
Karimi A, Rezai A, 2017. A design methodology to optimize the device performance in CNTFET. ECS J Sol State Sci Technol, 6(8):M97-M102. https://doi.org/10.1149/2.0181708jss
Kordrostami Z, Raeini AGN, Ghoddus H, 2019. Design and optimization of lightly doped CNTFET architectures based on NEGF method and PSO algorithm. ECS J Sol State Sci Technol, 8(4):M39-M44.
https://doi.org/10.1149/2.0121904jss
Lee C, Wong HSP, 2015. Stanford Virtual-Source Carbon Nanotube Field-Effect Transistors Model 1.0.1. NanoHUB. https://nanohub.org/publications/42/2 [Accessed on Mar. 1, 2022].
Lindert N, Sugii T, Tang S, et al., 1999. Dynamic threshold pass-transistor logic for improved delay at lower power supply voltages. IEEE J Sol-State Circ, 34(1):85-89. https://doi.org/10.1109/4.736659
Mahdiani HR, Ahmadi A, Fakhraie SM, et al., 2010. Bioinspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications. IEEE Trans Circ Syst I Regul Papers, 57(4):850-862. https://doi.org/10.1109/tcsi.2009.2027626
Majerus S, Merrill W, Garverick SL, 2013. Design and long-term operation of high-temperature, bulk-CMOS integrated circuits for instrumentation and control. Proc IEEE Energytech, p.1-6. https://doi.org/10.1109/energytech.2013.6645305
Mirzaei M, Mohammadi S, 2020. Process variation-aware approximate full adders for imprecision-tolerant applications. Comput Electr Eng, 87:106761. https://doi.org/10.1016/j.compeleceng.2020.106761
Mirzaei M, Mohammadi S, 2021. Low-power and variation-aware approximate arithmetic units for image processing applications. AEU Int J Electron Commun, 138:153825. https://doi.org/10.1016/j.aeue.2021.153825
Morgenshtein A, Fish A, Wagner I, 2002. Gate-diffusion input (GDI): a power-efficient method for digital combinatorial circuits. IEEE Trans Very Large Scale Integr Syst, 10(5):566-581. https://doi.org/10.1109/tvlsi.2002.801578
Morgenshtein A, Yuzhaninov V, Kovshilovsky A, et al., 2014. Full-swing gate diffusion input logic-case-study of low-power CLA adder design. Integration, 47(1):62-70. https://doi.org/10.1016/j.vlsi.2013.04.002
Naseri H, Timarchi S, 2018. Low-power and fast full adder by exploring new XOR and XNOR gates. IEEE Trans Very Large Scale Integr Syst, 26(8):1481-1493. https://doi.org/10.1109/tvlsi.2018.2820999
Rafiee M, Sadeghi Y, Shiri N, et al., 2021a. An approximate CNTFET 4:2 compressor based on gate diffusion input and dynamic threshold. Electron Lett, 57(17):650-652. https://doi.org/10.1049/ell2.12221
Rafiee M, Pesaran F, Sadeghi A, et al., 2021b. An efficient multiplier by pass transistor logic partial product and a modified hybrid full adder for image processing applications. Microelectron J, 118:105287.
https://doi.org/10.1016/j.mejo.2021.105287
Rafiee M, Shiri N, Sadeghi A, 2022a. High-performance 1-bit full adder with excellent driving capability for multistage structures. IEEE Embed Syst Lett, 14(1):47-50.
https://doi.org/10.1109/les.2021.3108474
Sabetzadeh F, Moaiyeri MH, Ahmadinejad M, 2019. A majority-based imprecise multiplier for ultra-efficient approximate image multiplication. IEEE Trans Circ Syst I Regul Papers, 66(11):4200-4208. https://doi.org/10.1109/tcsi.2019.2918241
Sadeghi A, Shiri N, Rafiee M, 2020. High-efficient, ultra-low-power and high-speed 4:2 compressor with a new full adder cell for bioelectronics applications. Circ Syst Signal Process, 39(12):6247-6275. https://doi.org/10.1007/s00034-020-01459-x
Sadeghi A, Shiri N, Rafiee M, et al., 2022. An efficient counter-based Wallace-tree multiplier with a hybrid full adder core for image blending. Front Inform Technol Electron Eng, 23(6):950-965. https://doi.org/10.1631/FITEE.2100432
Strollo AGM, Napoli E, de Caro D, et al., 2020. Comparison and extension of approximate 4-2 compressors for low-power approximate multipliers. IEEE Trans Circ Syst I Regul Papers, 67(9):3021-3034. https://doi.org/10.1109/tcsi.2020.2988353
Vasantha Kumar BVP, Murthy Sharma NS, Lal Kishore K, 2012. A technique to reduce glitch power during physical design stage for low power and less IR drop. Int J Comput Appl, 39(18):62-67. https://doi.org/10.5120/5086-7450
Venkatachalam S, Ko SB, 2017. Design of power and area efficient approximate multipliers. IEEE Trans Very Large Scale Integr Syst, 25(5):1782-1786. https://doi.org/10.1109/tvlsi.2016.2643639
Waris H, Wang CH, Liu WQ, 2019. High-performance approximate half and full adder cells using NAND logic gate. IEICE Electron Expr, 16(6):20190043. https://doi.org/10.1587/elex.16.20190043
Waris H, Wang CH, Liu WQ, et al., 2022. Hybrid partial product-based high-performance approximate recursive multipliers. IEEE Trans Emerg Top Comput, 10(1):507-513. https://doi.org/10.1109/tetc.2020.3013977
Yang ZX, Han J, Lombardi F, 2015. Transmission gate-based approximate adders for inexact computing. Proc IEEE/ACM Int Symp on Nanoscale Architectures, p.145-150. https://doi.org/10.1109/nanoarch.2015.7180603


[^0]:    ${ }^{\ddagger}$ Corresponding author
    ORCID: Ayoub SADEGHI, https://orcid.org/0000-0001-9904-9813; Hossein GHASEMIAN, https://orcid.org/0000-0002-8069-8845 © Zhejiang University Press 2023

[^1]:    The best optimization results are in bold

