# Supplementary materials for 

Ayoub SADEGHI, Nabiollah SHIRI, Mahmood RAFIEE, Mahsa TAHGHIGH, 2022. An efficient counter-based Wallace-tree multiplier with a hybrid full adder core for image blending. Front Inform Technol Electron Eng, 23(6): 950-965. https://doi.org/10.1631/FITEE.2100432

## 1 Full adders

Naseri and Timarchi (2018) proposed a full-swing transmission gate (TG) based XOR-XNOR cell with 10 transistors and an inverter at its input. Among the six structures they suggested, HFA-17T has the minimum area. Safaei Mehrabani and Eshghi (2016) proposed six new full adders (FAs) based on a pass transistor logic (PTL) XOR-XNOR gate and TG-based multiplexers (MUXs). The most reliable of these designs (NEW-ND-FA) has 24 transistors, with three inverters at the inputs. Although the XOR-XNOR gate of Safaei Mehrabani and Eshghi (2016) offers high performance, the large numbers of transistors and input inverters lead to high power consumption and area occupation. Unlike Safaei Mehrabani and Eshghi (2016) and Naseri and Timarchi (2018), Kandpal et al. (2020) proposed a different hybrid design (Design-4) with three modules and 20 transistors. In this cell, the XOR-XNOR signals are applied to TGs that take $C_{\text{in}}$ as their drain and source inputs to obtain Sum and $C_{\text{out}}$. The main problem is the threshold voltage drop at the output, which can damage the whole system. This drop, a major defect, forces the designers to use two other modules based on the complementary metal-oxide-semiconductor (CMOS) technique, with four and two transistors in a series structure. These modules include the power supply ($V_{\mathrm{DD}}$) and ground (GND) in their configuration, so a direct path is created and the short-circuit power consumption increases. An FA cell has also been designed in the gate-diffusion input (GDI) technique. Here, NAND, NOR, XNOR, and XOR gates form the initial stage and generate the essential signals of the 2:1 MUXs. Also, two output inverters improve the driving capability, and due to the GDI technique, the occupied area is small.
In this circuit, however, the high number of internal nodes caused by the series transistors leads to a long critical path, the direct paths between $V_{\mathrm{DD}}$ and GND cause static and dynamic power consumption, and the multiple inverters increase the power consumption and significantly reduce the speed. Moreover, because the GDI technique reduces the output voltage swing, it can be considered an unreliable choice.

## 2 $M$:3 ($4 \leq M \leq 7$) counters

Briefly, Mehrabi et al. (2013) proposed a 4:3 counter with 80 transistors that employs different logic gates such as AND, XOR, and MUX. All gates are based on the CMOS technique, which results in a high area occupation. Similarly, CMOS-based 5:3 and 6:3 counters with 80 and 112 transistors were proposed in Chowdhury et al. (2008) and Mehrabi et al. (2013), respectively. A 180-transistor CMOS-based 7:3 counter with high power consumption (Chang et al., 2005) was proposed in Mohd et al. (2013). These CMOS-based cells have appropriate output swings but suffer from high power consumption. Therefore, different designs using the GDI technique were presented in Mukherjee and Ghosal (2019) to achieve a small area. Different structures of 5:3 to 7:3 counters, built from AND, XOR, and OR gates, were presented as propagation gate (PG) blocks to be implemented in the body of the cells (Asif and Kong, 2015). Two different 7:3 counters with 260 and 160 transistors were configured by grouping the input bit patterns and removing the redundant carry generator (Saha et al., 2018). One of the most reliable 7:3 counter designs was described comprehensively in Veeramachaneni et al. (2007).

Generally, counters are used to count the number of 1's in the inputs. Table 1 in the main text shows the performance of a 4:3 digital counter, whose concept can be extended to larger counters. The truth tables of compressors and counters are similar. The essential difference between a counter and a compressor is that the counter has no inter-stage carries and no inter-stage interconnects. In contrast, the compressor has several carries that come from or go to the neighboring cells in the same stage (Bagherizadeh et al., 2017; Srinivasulu et al., 2020).
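The counting behavior described above can be illustrated with a small behavioral model. The following is only a sketch at the logic level, not the transistor-level cells of the paper; the function names and the output ordering (most significant bit first) are assumptions made for illustration.

```python
from itertools import product


def counter_m3(bits):
    """Behavioral model of an M:3 counter (4 <= M <= 7).

    Returns the number of 1's in `bits` as three output bits,
    most significant bit first (this ordering is an assumption).
    """
    if not 4 <= len(bits) <= 7:
        raise ValueError("M must be between 4 and 7")
    if any(b not in (0, 1) for b in bits):
        raise ValueError("every input must be 0 or 1")
    ones = sum(bits)  # at most 7, so three output bits are enough
    return (ones >> 2) & 1, (ones >> 1) & 1, ones & 1


def truth_table(m):
    """Map every input pattern of an M:3 counter to its 3-bit output."""
    return {bits: counter_m3(bits) for bits in product((0, 1), repeat=m)}


if __name__ == "__main__":
    for pattern, outputs in truth_table(4).items():
        print(pattern, "->", outputs)
```

Because the model has no carry-in or carry-out ports, it also reflects the stated difference from a compressor: all inputs belong to one cell and nothing is exchanged with neighboring cells.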

## 3 Physical comparison between the conventional and proposed multipliers

Note that using half adder (HA) cells in circuits such as 4:3 and 5:3 counters in the partial product reduction tree (PPRT) stage of a multiplier undesirably increases the area and, consequently, the complexity of this significant stage. The higher the complexity of the circuit, the higher the number of internal nodes (due to the higher number of transistors used). With regard to gate-level and transistor-level delay, rather than reducing the number of FAs and using low-order circuits such as 4:3 and 5:3 counters, it is recommended to lower the complexity by using high-order compression circuits such as 6:3 and 7:3 counters along with FAs; in this case, lower area, higher speed, and lower power can be obtained (Rahnamaei, 2020). Therefore, higher-order counters such as the 6:3 counter (with one HA on the Sum output path) and the 7:3 counter (with only FA circuits) are much better options in multipliers like the proposed one.

In general, increasing the width of the transistor will increase the value of $V_{\mathrm{th}}$ until it reaches a fixed point. In this case, if the hybrid FA does not have a full-swing output, this increase in $V_{\mathrm{th}}$ will cause a considerable drop in the output swing voltage, especially in the case of high fan-outs. As a result, the noise margin is compromised. On the other hand, increasing the width of the transistors will increase their current, which is very important with respect to static power. Despite these features, it is known that in some circuits the minimum energy dissipation is obtained with the minimum transistor width.

## 4 Simulation results and comparison

### 4.1 Transistor sizing determination

Transistor sizing is important for the design and implementation of high-performance and reliable circuits. Approximately equal rise and fall times help attain high-speed circuits. In CMOS circuits, the width of PMOS transistors is usually taken to be two to three times that of the NMOS type ($W_{\mathrm{p}}=2 W_{\mathrm{n}}$ to $3 W_{\mathrm{n}}$). However, this may not be the most effective way to implement circuits such as FAs, which are based on different logic styles and techniques such as hybrid ones.

Therefore, given the hybrid structure of the presented FA and of most of the compared circuits, the mentioned advantages can be achieved by setting $W_{\mathrm{n}}=W_{\mathrm{p}}$. Also, to minimize energy dissipation, the minimum width allowed by the technology is used. Accordingly, the PMOS and NMOS transistors of the proposed circuit and of the reference circuits are simulated with equal dimensions, $W_{\mathrm{p}}=W_{\mathrm{n}}=120 \mathrm{~nm}$ and $L_{\mathrm{p}}=L_{\mathrm{n}}=100 \mathrm{~nm}$.

### 4.2 Full adder cell tolerability evaluation

When manufacturing a die, parameters such as oxide thickness, dopant and mobility, transistor width/length ($W/L$), and existing resistance-capacitance (RC) variations will change, which affects the threshold voltage of transistors as follows:

$$
\begin{equation*}
V_{\mathrm{th}}=V_{\mathrm{T} 0}+\gamma\left(\sqrt{\phi_{\mathrm{s}}+V_{\mathrm{SB}}}-\sqrt{\phi_{\mathrm{s}}}\right), \tag{S1}
\end{equation*}
$$

where $V_{\mathrm{T} 0}$ is the threshold voltage when the source is at the body potential (so that $V_{\mathrm{th}}=V_{\mathrm{T} 0}$ at $V_{\mathrm{SB}}=0$), $\phi_{\mathrm{s}}$ is the surface potential at the threshold, $\gamma$ is the body effect coefficient, and $V_{\mathrm{SB}}$ is the applied voltage between the source and body. Therefore, a change in $V_{\mathrm{th}}$ will change the drain current $\left(I_{\mathrm{D}}\right)$ as follows:

$$
\begin{equation*}
I_{\mathrm{D}}=\frac{1}{2} \mu_{\mathrm{n}} C_{\mathrm{ox}} \frac{W}{L}\left(V_{\mathrm{GS}}-V_{\mathrm{th}}\right)^{2}, \tag{S2}
\end{equation*}
$$

where $V_{\mathrm{GS}}$ is the gate-source voltage, and the current flow through the channel directly depends upon the mobility $\left(\mu_{\mathrm{n}}\right)$, the oxide capacitance $C_{\mathrm{ox}}$ (and hence the thickness of the oxide, i.e., $t_{\mathrm{ox}}$), and the ratio $W/L$. The chip voltages can change for numerous reasons, such as the current-resistance (IR) drop, which is caused by the current flow over the parasitic resistances and can reduce the supply voltage below its intended value. As a result, a change in the chip voltages will cause the circuit to work slower or faster than before. Transistor characteristics such as carrier mobility are influenced by temperature. The junction temperature inside the chip can vary over a wide range; thus, the temperature variation must be considered. The relation between carrier mobility and temperature for metal-oxide-semiconductor field-effect transistors (MOSFETs) is shown in Eq. (S3):

$$
\begin{equation*}
\mu(T)=\mu\left(T_{\mathrm{r}}\right)\left(T / T_{\mathrm{r}}\right)^{-k_{\mu}} \tag{S3}
\end{equation*}
$$

where $T$ and $T_{\mathrm{r}}$ are the absolute temperature and the room temperature, respectively, and $k_{\mu}$ is a fitting parameter. In addition, $V_{\mathrm{th}}$ decreases nearly linearly with temperature and may be approximated by

$$
\begin{equation*}
V_{\mathrm{th}}(T)=V_{\mathrm{th}}\left(T_{\mathrm{r}}\right)-k_{\mathrm{vt}}\left(T-T_{\mathrm{r}}\right) \tag{S4}
\end{equation*}
$$

where $k_{\mathrm{vt}}$ is typically about 1 to 2 $\mathrm{mV} / \mathrm{K}$. Also, at high values of $V_{\mathrm{DD}}$, the ON state current $\left(I_{\mathrm{on}}\right)$ decreases as the temperature increases, while the subthreshold leakage current increases exponentially.
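To make the combined effect of Eqs. (S2) to (S4) concrete, the sketch below evaluates the drain current at two temperatures. It is a rough numerical illustration under stated assumptions, not a replacement for the HSPICE simulations: the room-temperature mobility, oxide capacitance, room-temperature threshold voltage, and the values chosen for $k_{\mu}$ and $k_{\mathrm{vt}}$ are all hypothetical. Only $W/L = 120/100$ and $V_{\mathrm{DD}} = 1.2$ V are taken from the text.

```python
T_ROOM = 300.0     # room temperature in K (assumed)
K_MU = 1.5         # mobility fitting parameter (assumed value)
K_VT = 1.5e-3      # V/K, inside the 1 to 2 mV/K range quoted in the text
MU_ROOM = 0.03     # m^2/(V s), hypothetical room-temperature mobility
C_OX = 1.5e-2      # F/m^2, hypothetical oxide capacitance per unit area
VTH_ROOM = 0.3     # V, hypothetical room-temperature threshold voltage
W_OVER_L = 120 / 100   # W = 120 nm, L = 100 nm, from the sizing section
V_DD = 1.2         # V, nominal supply in the 90 nm technology


def mobility(temp, mu_room=MU_ROOM, t_room=T_ROOM, k_mu=K_MU):
    """Eq. (S3): mobility falls as (T / T_r) ** (-k_mu)."""
    return mu_room * (temp / t_room) ** (-k_mu)


def threshold_voltage(temp, vth_room=VTH_ROOM, t_room=T_ROOM, k_vt=K_VT):
    """Eq. (S4): threshold voltage falls linearly with temperature."""
    return vth_room - k_vt * (temp - t_room)


def drain_current(v_gs, temp):
    """Eq. (S2) with temperature-dependent mobility and threshold."""
    v_th = threshold_voltage(temp)
    if v_gs <= v_th:
        return 0.0  # the square-law expression does not cover subthreshold
    return 0.5 * mobility(temp) * C_OX * W_OVER_L * (v_gs - v_th) ** 2


if __name__ == "__main__":
    for temp in (300.0, 350.0, 400.0):
        print(f"T = {temp:.0f} K: I_D = {drain_current(V_DD, temp):.3e} A")
```

With these assumed parameters and $V_{\mathrm{GS}} = V_{\mathrm{DD}}$, the mobility loss outweighs the lower threshold voltage, so the computed current falls as the temperature rises, in line with the statement about $I_{\mathrm{on}}$ at high $V_{\mathrm{DD}}$.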

### 4.3 Counter cells

The circuit in Mukherjee and Ghosal (2019) suffers from threshold voltage drop; thus, its suitability for more sophisticated circuits such as multipliers is questionable. Considering the transistor counts in Table S1, the proposed structures achieve transistor reductions of 51.11%, 51%, 52.14%, and 50% for the 4:3, 5:3, 6:3, and 7:3 counters, respectively, compared to their closest references. This results in an average area reduction of about 50% for the proposed structures compared to the state-of-the-art designs. Regarding the 4:3 circuits, since the design of Mukherjee and Ghosal (2019) cannot be implemented in a multiplier, it is not compared with the proposed cell in terms of transistor count. Instead, the proposed cell is compared with the 90-transistor design of Asif and Kong (2015).

Table S1 Transistors and gates comparison among different digital counters

| Design | Number of transistors |  |  |  | Number of gates |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 4:3 | 5:3 | 6:3 | 7:3 | 4:3 | 5:3 | 6:3 | 7:3 |
| Mehrabi et al., 2013 | 100 | - | 140 | 160 | 9 | - | 11 | 12 |
| Mukherjee and Ghosal, 2019 | 22 | 226 | - | - | 9 | 22 | - | - |
| Asif and Kong, 2015 | 90 | - | 224 | - | 10 | - | 21 | - |
| Chowdhury et al., 2008 | - | 100 | - | - | - | 8 | - | - |
| Fritz et al., 2017 | - | - | 190 | - | - | - | 26 | - |
| Saha et al., 2018 (design-1) | - | - | - | 260 | - | - | - | 26 |
| Saha et al., 2018 (design-2) | - | - | - | 180 | - | - | - | 22 |
| Veeramachaneni et al., 2007 | - | - | - | 144 | - | - | - | 12 |
| Proposed | 44 | 49 | 67 | 72 | 9 | 12 | 20 | 17 |
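
The reduction percentages quoted for Table S1 can be reproduced with a few lines of arithmetic. In the sketch below, the 4:3 reference (90 transistors, Asif and Kong, 2015) is named in the text; the references used for the 5:3, 6:3, and 7:3 counters (100, 140, and 144 transistors) are an inference, chosen because they are the Table S1 entries that reproduce the stated percentages.

```python
# Transistor counts taken from Table S1.
PROPOSED = {"4:3": 44, "5:3": 49, "6:3": 67, "7:3": 72}

# Closest references; only the 4:3 entry is named explicitly in the text,
# the other three are inferred from the quoted percentages.
REFERENCE = {"4:3": 90, "5:3": 100, "6:3": 140, "7:3": 144}


def reduction_percent(reference, proposed):
    """Relative transistor saving of the proposed cell, in percent."""
    return 100.0 * (reference - proposed) / reference


reductions = {
    size: round(reduction_percent(REFERENCE[size], PROPOSED[size]), 2)
    for size in PROPOSED
}
average = sum(reductions.values()) / len(reductions)

if __name__ == "__main__":
    for size, value in reductions.items():
        print(f"{size} counter: {value}% fewer transistors")
    print(f"average reduction: {average:.2f}%")
```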

### 4.4 Layout considerations

The post-layout simulation waveforms of the proposed FA are shown in Fig. 10 of the main text. As can be seen, the proposed circuit performs appropriately at a frequency of 500 MHz after all the validation tests of its layout have been passed. Owing to its configuration, the proposed circuit can produce high logic (1) with sufficient swing ($=V_{\mathrm{DD}}$) at the considered frequency. The major point to consider here is the $C_{\text{out}}$ generation for low logic (0) voltages. The proposed circuit produces a suitable logic 0 except when $ABC_{\text{in}}=001$, for which Sum $=1$ and $C_{\text{out}}=0$. Although a threshold voltage ($V_{\mathrm{th}}$) loss occurs in this case, its value is not high (the output is 0.26 V instead of approximately 0 V, as can be seen from Fig. 10 in the main text). The main reason for this outcome is the use of a PMOS (M14), which is unable to produce logic 0 appropriately. Note that the highest low-logic level at $C_{\text{out}}$ ($C_{\text{out\_Low\_logic\_max}}$) is equal to one threshold voltage of the PMOS (M14). In a circuit, 0 and 1 degradations usually occur due to inappropriate use of NMOS and PMOS transistors; in such cases, a much more expensive latch-based signal keeper circuit must be employed to solve the problem, or the transistor sizes must be increased. In the proposed design, on the other hand, only 0 degradation exists, caused by a PMOS (M14) in only one input combination ($ABC_{\mathrm{in}}=001$), where it can adversely affect the DC power. In this case, there is no need to tweak the size of the PMOS transistor (since the effect would be insignificant while consuming power) or to add an expensive latch-based signal keeper circuit; instead, an NMOS driven by the complement (here $\bar{X}$) of the input signal of M14 can be added. Also, the range of the output voltage swing can affect the minimum supply voltage.

## 5 Image blending

The multiplication can be performed based on

$$
\begin{equation*}
Q(i, j)=P_{1}(i, j) \times P_{2}(i, j) \tag{S5}
\end{equation*}
$$

where $Q$ is the multiplied output image, while $P_{1}$ and $P_{2}$ are the input images that are supposed to be multiplied pixel by pixel.
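
As a software reference for Eq. (S5), the sketch below multiplies two small grayscale images pixel by pixel using plain Python lists. It is a minimal illustration, not the authors' MATLAB code; the function names are hypothetical, and the final rescaling of the 16-bit products back to the 0 to 255 range (integer division by 255) is an assumption, since the text does not state how the product is normalized for display.

```python
def blend_multiply(p1, p2):
    """Eq. (S5): Q(i, j) = P1(i, j) * P2(i, j) for 8-bit pixel values."""
    if len(p1) != len(p2) or any(len(a) != len(b) for a, b in zip(p1, p2)):
        raise ValueError("input images must have the same dimensions")
    for row in list(p1) + list(p2):
        if any(not 0 <= pixel <= 255 for pixel in row):
            raise ValueError("pixel values must lie between 0 and 255")
    # Each product fits in 16 bits, like the output of an 8-bit multiplier.
    return [[a * b for a, b in zip(row1, row2)] for row1, row2 in zip(p1, p2)]


def rescale_to_8bit(q):
    """Map 16-bit products back to 0..255 (assumed normalization)."""
    return [[value // 255 for value in row] for row in q]


if __name__ == "__main__":
    image_1 = [[255, 128], [64, 0]]
    image_2 = [[255, 128], [255, 200]]
    product = blend_multiply(image_1, image_2)
    print(product)
    print(rescale_to_8bit(product))
```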

Here, the intended images for blending, which are grayscale, must first be converted into digital signals that the transistors in the multiplier can read. For this purpose, a MATLAB code is developed. Grayscale images have pixels with values between 0 and 255. Initially, a proportional voltage is assigned to each pixel for each possible value. To do this, $V_{\mathrm{DD}} / 255$ (if the input images have dimensions of $255 \times 255$) is considered as a step. For example, if the desired pixel has a value of 0, its voltage is 0, while if it is 255, its voltage is equal to 1.2 V, which is the nominal value of $V_{\mathrm{DD}}$ in the 90 nm technology. Next, to generate the digital signals from these pixels, the resulting voltage matrix must be readable by the HSPICE simulator.

So, the resulting voltage matrix is converted to a piecewise linear (PWL) signal, a $1 \times n$ matrix, and applied to the circuit as input. Also, to generate pulses commensurate with the values of the pixels, the rise time $\left(t_{\mathrm{r}}\right)$ and fall time $\left(t_{\mathrm{f}}\right)$ are set to 0.1 ns for the circuit benchmark at high frequencies. The mentioned signals are illustrated in Fig. 11 in the main text. Buffers are used to convert the signals to binary for better testing of the multipliers. By applying the resulting PWL signal to these buffers, the gray images are automatically converted to binary. The signals are then available for the multiplier. The output is obtained as an $m \times 1$ matrix, which is transferred to MATLAB and converted back to its initial dimensions, $255 \times 255$. The output is displayed, and image evaluation parameters such as the peak signal-to-noise ratio (PSNR) and the structural similarity index metric (SSIM) are calculated.
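
The pixel-to-voltage mapping and the PWL conversion described above can be sketched as follows. This is an illustrative Python version of the idea, not the MATLAB code used in the paper. The step $V_{\mathrm{DD}}/255$, $V_{\mathrm{DD}} = 1.2$ V, and the 0.1 ns edge time come from the text; the 2 ns hold time per pixel, the function names, and the source and node names in the generated SPICE line are assumptions.

```python
V_DD = 1.2        # V, nominal supply in the 90 nm technology (from the text)
T_EDGE = 0.1e-9   # s, rise and fall time (from the text)
T_PIXEL = 2.0e-9  # s, time slot per pixel (assumed)


def pixel_to_voltage(pixel, v_dd=V_DD):
    """Assign a proportional voltage using a step of V_DD / 255."""
    if not 0 <= pixel <= 255:
        raise ValueError("pixel value must lie between 0 and 255")
    return pixel * v_dd / 255


def flatten(image):
    """Turn a 2-D image into the 1 x n sequence used for the PWL signal."""
    return [pixel for row in image for pixel in row]


def build_pwl(pixels, t_pixel=T_PIXEL, t_edge=T_EDGE):
    """Return (time, voltage) breakpoints, one time slot per pixel."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    points = [(0.0, pixel_to_voltage(pixels[0]))]
    for index, pixel in enumerate(pixels):
        level = pixel_to_voltage(pixel)
        start = index * t_pixel
        if index > 0:
            # Ramp from the previous level to the new one within t_edge.
            points.append((start + t_edge, level))
        points.append((start + t_pixel, level))  # hold until the slot ends
    return points


def to_spice_source(points, name="VIN", node="in"):
    """Format the breakpoints as a SPICE PWL voltage source line."""
    body = " ".join(f"{t:.3e} {v:.4f}" for t, v in points)
    return f"{name} {node} 0 PWL({body})"


if __name__ == "__main__":
    print(to_spice_source(build_pwl(flatten([[0, 255], [128, 64]]))))
```

Each time slot holds one pixel level, and consecutive levels are joined by a 0.1 ns ramp, so the breakpoint times are strictly increasing as a PWL source requires.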

Like any advanced very-large-scale integration (VLSI) system, the proposed mechanism has a fault detection system. The multiplier output image and the expected image, obtained from the typical MATLAB operation, can be applied to a subtractor with a sufficient number of input bits, and the values of these two images can be subtracted pixel by pixel ($255 \times 255=65025$ pixels). In this way, the difference image between the VLSI implementation (the proposed multiplier with its mechanism) and the conventional image processing by MATLAB is obtained. Therefore, this mechanism makes it easy to evaluate the performance of multipliers in image blending applications.
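
A software analogue of this subtraction-based check is sketched below, together with the PSNR mentioned among the image evaluation parameters. It is a hedged illustration with hypothetical function names: the absolute difference is used so that the difference image stays non-negative, which is an assumption, and the PSNR uses the standard definition $10 \log_{10}(\mathrm{peak}^{2} / \mathrm{MSE})$.

```python
import math


def difference_image(actual, expected):
    """Pixel-by-pixel absolute difference between two images."""
    if len(actual) != len(expected) or any(
        len(a) != len(e) for a, e in zip(actual, expected)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        [abs(a - e) for a, e in zip(row_a, row_e)]
        for row_a, row_e in zip(actual, expected)
    ]


def count_faulty_pixels(diff):
    """Number of pixels where the multiplier output deviates."""
    return sum(1 for row in diff for value in row if value != 0)


def psnr(actual, expected, peak=255):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    diff = difference_image(actual, expected)
    pixel_count = sum(len(row) for row in diff)
    mse = sum(value * value for row in diff for value in row) / pixel_count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    expected = [[10, 20], [30, 40]]
    actual = [[10, 22], [30, 40]]
    diff = difference_image(actual, expected)
    print(diff, count_faulty_pixels(diff), round(psnr(actual, expected), 2))
```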

Also, Fig. S1 is obtained by comparing the plot profiles that result from subtracting the output images of the different multipliers from the expected MATLAB output image. The highest difference belongs to the output image by the multiplier of Ref1, while the lowest results are for the proposed multiplier. These results prove the stability and high efficiency of the proposed cells including FA, HA, counters, and also the 8-bit CBW multiplier.


Fig. S1 Subtraction results of the output images of multipliers and MATLAB

## References

Asif S, Kong YN, 2015. Design of an algorithmic Wallace multiplier using high speed counters. Proc $10^{\text{th}}$ Int Conf on Computer Engineering & Systems, p.133-138. https://doi.org/10.1109/icces.2015.7393033
Bagherizadeh M, Moaiyeri MH, Eshghi M, 2017. Digital counter cell design using carbon nanotube FETs. IEEE J Appl Res Technol, 15(3):211-222. https://doi.org/10.1016/j.jart.2016.12.005
Chang CH, Gu JM, Zhang MY, 2005. A review of 0.18-$\mu$m full adder performances for tree structured arithmetic circuits. IEEE Trans Very Large Scale Integr Syst, 13(6):686-695. https://doi.org/10.1109/tvlsi.2005.848806
Chowdhury SR, Banerjee A, Roy A, et al., 2008. Design, simulation and testing of a high speed low power 15-4 compressor for high speed multiplication applications. Proc $1^{\text{st}}$ Int Conf on Emerging Trends in Engineering and Technology, p.434-438. https://doi.org/10.1109/icetet.2008.151
Fritz C, Fam AT, 2017. Fast binary counters based on symmetric stacking. IEEE Trans Very Large Scale Integr Syst, 25(10):2971-2975. https://doi.org/10.1109/tvlsi.2017.2723475
Kandpal J, Tomar A, Agarwal M, et al., 2020. High-speed hybrid-logic full adder using high-performance 10-T XOR-XNOR cell. IEEE Trans Very Large Scale Integr Syst, 28(6):1413-1422. https://doi.org/10.1109/tvlsi.2020.2983850
Mehrabi S, Mirzaee RF, Zamanzadeh S, et al., 2013. Design, analysis, and implementation of partial product reduction phase by using wide $m$:3 ($4 \leq m \leq 10$) compressors. Int J High Perform Syst Arch, 4(4):231-241. https://doi.org/10.1504/ijhpsa.2013.058986
Mohd BJ, Abed S, Alouneh S, 2013. Carry-based reduction parallel counter design. Int J Electron, 100(11):1510-1528. https://doi.org/10.1080/00207217.2012.751320
Mukherjee B, Ghosal A, 2019. Counter based low power, low latency Wallace tree multiplier using GDI technique for on-chip digital filter applications. Proc Devices for Integrated Circuit, p.151-155. https://doi.org/10.1109/devic.2019.8783456
Naseri H, Timarchi S, 2018. Low-power and fast full adder by exploring new XOR and XNOR gates. IEEE Trans Very Large Scale Integr Syst, 26(8):1481-1493. https://doi.org/10.1109/tvlsi.2018.2820999
Rahnamaei A, 2020. CMOS high-performance 5-2 and 6-2 compressors for high-speed parallel multipliers. Inform MIDEM - J Microelectron Electron Compon Mater, 50(2):115-124. https://doi.org/10.33180/infmidem2020.204
Safaei Mehrabani Y, Eshghi M, 2016. Noise and process variation tolerant, low-power, high-speed, and low-energy full adders in CNFET technology. IEEE Trans Very Large Scale Integr Syst, 24(11):3268-3281. https://doi.org/10.1109/tvlsi.2016.2540071
Saha A, Pal R, Naik AG, et al., 2018. Novel CMOS multi-bit counter for speed-power optimization in multiplier design. AEU-Int J Electron Commun, 95:189-198. https://doi.org/10.1016/j.aeue.2018.08.015
Srinivasulu A, Kumar Saini J, Kumawat R, 2020. A full adder design with CNFETs for real time. Fault Tol Miss Crit Appl Electron, 24(2):66-74. https://doi.org/10.7251/els2024066s
Veeramachaneni S, Lingamneni A, Krishna MK, et al., 2007. Novel architectures for efficient ($m$, $n$) parallel counters. Proc $17^{\text{th}}$ ACM Great Lakes Symp on VLSI, p.188-191. https://doi.org/10.1145/1228784.1228833

