# ENGINEERING IN ADVANCED RESEARCHSCIENCE AND TECHNOLOGY ISSN 2352-8648 Vol.04, Issue.02 December-2021 Pages: -437-448 # LOW DENSITY SCALABLE BINARY COUNTER USING EFFICIENT SORTING NETWORK #### ¹KATHIMANDA ANJALI, ²S SAMBASIVARAO DANNINA <sup>1</sup>PG Scholar, Dept of ECE, Prasiddha College of Engineering & Technology, Anathavaram, Amalapuram <sup>2</sup>Associate Professor, of ECE, Prasiddha College of Engineering & Technology, Anathavaram, Amalapuram ABSTRACTParallel counters are enter components in numerous number juggling circuits, particularly quick multipliers. The summation of multiple operands in parallel forms part of the critical path in various digital signal processing units. To speedup the summation, high compression ratio counters and compressors are necessary. In this article, we present a novel method of fast saturated binary counters and exact/approximate (4:2) compressors based on the sorting network. The inputs of the counter are asymmetrically divided into two groups and fed into sorting networks to generate reordered sequences, which can be solely represented by one-hot code sequences. Between the reordered sequence and the one-hot code sequence, three special Boolean equations are established, which can significantly simplify the output Boolean expressions of the counter. Further, this project is enhanced by using parallel sorting algorithms for finding/ sorting M largest values from N inputs and then design scalable architectures based on proposed algorithms. For finding the largest values the iterative sorting techniques also proposed. Bitonic Sorting Is one type of efficient such algorithm for implementing with optimised parameters. **Keywords:**approximate (4:2) compressors, Sorting, asymmetric, Digital signal processing, Arithmetic Logic Unit. **INTRODUCTION:**Energy minimization is major requirements in almost any electronic systems, especially the portable ones such as smart phones, tablets, and different gadgets. It is extremely desired to attain this minimization with minimal performance (speed) penalty [1]. Digital signal processing (DSP) blocks are most wanted in transportable components for realizing various multimedia applications. The computational core of these blocks is the ALU where the multiplications and additions are the major part [6]. The multiplications plays foremost operation in the processing elements which can leads to high consumption of energy and power. Many of the DSP cores implement image and video processing algorithms where final outputs are either images or videos prepared for human consumptions. It facilitates to go for approximations for improving the speed and energy in the arithmetic circuits. This originates from the limited perceptual abilities in observing an image or a video for human beings. In addition to the image and video processing applications, there are other areas where the exactness of the arithmetic operations is not critical to the functionality of the system (see [2],[3]). Approximate computing provides an accuracy, speed and power/energy consumption. The advantage of approximate multiplier reduces the error rate and gain high speed. For correcting the division error compare operation and a memory look up is required for the each operand is required which increases the time delay for entire multiplication process [4]. At various level of abstraction including circuit, logic and architecture levels the approximation is processed [5]. In the category for approximation methods in function, a number of approximating arithmetic building blocks, such as adders and multipliers, at different design levels have been suggested in various structures [6],[7]. Broken array multiplier was designed for efficient VLSI implementation [8]. **LITERATURE SURVEY:** A traditional method to reduce the aging effects is overdesign which includes techniques like guard-banding ad gate oversizing. This approach can be area and power inefficient [8]. To avoid this problem, an NBTI- aware technology mapping technique wasproposed in [7] which guarantee the performance of the circuit during its lifetime. Another technique was an NBTI- aware sleep transistor in [3] which improve the lifetime stability of the power gated circuits under considerations. A joint logic restructuring and pin reordering method in [6] is based on detecting functional symmetries and transistor stacking effects. This approach is an NBTI optimization method that considered path sensitization. Dynamic voltage scaling and bogy-biasing techniques were proposed in [4] and [5] to reduce power or extend circuit life. These techniques require circuit modification or do not provide optimization of specific circuits. Every gate in any VLSI circuit has its own delay which reduces the performance of the chip. Traditional circuits use critical pathdelayas the overall circuit clock cycle in order to perform correctly. However, in many worst-case designs, the pMRSAbility that the critical pathdelay is activated is low. In such cases, the strategy of minimizing the worst-case conditions may lead to inefficient designs. For noncritical path, using the critical path delay as the overall cycle period will result in significant timing waste. Hence, the variable latency design was proposed to reduce the timing waste of traditional circuits. A short path activation function algorithm was proposed in [16] to improve the accuracy of the hold logic and to optimize the performance of the variable-latency circuit. An instruction scheduling algorithm was proposed in [17] to schedule the operations on nonuniform latency functional units and improve the performance of Very Long Instruction Word processors. In [18], a variable-latency pipelined multiplier architecture with a Booth algorithm was proposed. In [19], process-variation tolerant architecture for arithmetic units was proposed, where the effect of process-variation is considered to increase the circuit yield. In addition, the critical paths are divided into two shorter paths that could be unequal and the clock cycle is set to the delay of the longer one. These research designs were able to reduce the timing waste of traditional circuits to improve performance, but they did not consider the aging effect and could not adjust themselves during the runtime. **SORTING:**A sorting network is a collection of interconnected compare-and-exchange (CAE) block. Sorting networks consist of comparators, which are in fact hardware implementations of the CE operation. A comparator has two inputs and two outputs. Depending on the sense of ordering, comparators can be of two kinds as shown in figure 1 (a), (b). Fig1: Basic Sorting Unit **Sorting** is any process of arranging items according to a certain sequence or in different sets, and therefore, it has two common, yet distinct meanings: **PARALLEL SORTING NETWORKS**A sorting network is a collection of interconnected compareand-exchange (CAE) blocks that guides a parallel set of inputs to a parallel set of outputs in sorted order. Each CAE block has two inputs and two outputs. If the input values are already in order, they are directed to the corresponding outputs; otherwise, they are swapped. There are two types of CAE blocks, called increasing and decreasing CAE blocks, used in hardware-based sorting Fig:2parallel sorting units units. Fig.2 shows the high-level implementations (left) and schematic symbols (right) for three building blocks used in previous sorting units and in our designs. A decreasing CAE block, shown in Fig. 2b, outputs its inputs in descending order. Decreasing and increasing CAE blocks are identical, except for their wiring. Each CAE block contains a comparator and two multiplexers. We also define Max units which are used in our designs. A Max unit, shown in Fig. 1c, takes two inputs and returns the larger input. Note that the \_ and \_ symbols determine the type of the block in Fig. 2. A sorting network usually consists of a series of stages in which each stage contains a number of CAE blocks that operate in parallel. The latency of a sorting network is proportional to its depth (the number of consecutive CAE blocks). Two popular parallel sorting networks that currently have the lowest known latency for hardware implementation are the bitonic and odd-even merge sorting networks proposed by Batcher [25]. The structure of a sorting network is fixed, regardless of the value of the numbers being sorted and the results of the comparisons. Sorting networks are a common solution for hardwarebased sorting. Their parallel nature allows them to perform sorting much faster than the OðN \_ logðNPP time achievableby the fastest sequential software-based sorting algorithms. A sorting network may also be pipelined to further increase throughput. ## **EXISTING TECHNIQUE:** Fig:3 Overall (7,3) counter circuit The whole architecture is shown in above Fig. It is obvious that the paths from sequences H and I to C2, C1, and S are almost independent. Increases the parallelism of the circuit. However, as will be discussed later, the area of the proposed design will not increase with the parallelism. **IMPLEMENTATION:**First, as shown in above Fig.due to the fact that "1" is bigger than "0," all the "1"s are at the top of the sequence if there exist "1"s, and all the "0"s are at the bottom of the sequence if there exist "0"s. If there exist both "1"s and "0"s, there must be a position in the reordered sequence where there is the junction of "1" and "0." If there are only "1"s or "0"s, we can manage the sequence by padding fixed one bit "1" at the top and one bit "0" at the bottom of the reordered sequence to make sure that 0,1-junction always exits. Second, the reordered sequence has the same total number of "1"s and "0"s as the original sequence (the inputs of two sorting networks). Although the padded "1" would influence the total number of "1"s in the padded sequence, it is fixed, so we ignore it while counting. Fig.4Three- and four-way sorting networks. Fig.5Two-input binary sorter we give an input example: sequence [0, 1, 1, 1] represents the input of four-way sorting network (4 SN), and sequence [0, 1, 1] represents the input of three-way sorting network (3 SN). For both 4 SN and 3 SN, the input sequences are reordered in the form of the larger number at the top and the smaller number at the bottom after three layers of sorter. ## ONE-HOT CODE GENERATION Fig.6 Definition of a sequence. Fig. 7One-hot code generation circuit. 1) Asymmetric Prereorder: Both three- and four-way sorting networks require three layers of binary sorter (the two binary sorters on the same layer in fourway sorting network can be calculated in parallel). Each layer of binary sorters consumes one basic two-input logical gate layer, as shown in Fig. 3. This means that the time consumed by the three- and four-way sorting networks is almost the same. Based on this, we divide the seven inputs of a (7,3) counter into two parts. One part contains 4 bits, while the other contains 3 bits. 2) Find 0,1-Junction and One-Hot Code Sequence: The 0,1-junction can solely represent the reordered sequence under the promise of the extended fixed "0" and "1." Notice that the position of the 0,1-junction must be 1,0 from left to right. Therefore, we still utilize the fourway sorting network as an example, and then, This structure uses a Boolean expression (AB) to obtain a new sequence P0-P4. Because there is one and only one 0,1-junction in the reordered and extended sequence, there is one and only one "1" in sequence P0-P4. This means that sequence P0-P4 is one-hot code that satisfies ("|" represents TRUTH TABLE OF (7,3) COUNTER OUTPUTS | Num | $C_2$ | $C_1$ | S | Num | $C_2$ | $C_1$ | S | |-----|-------|-------|---|-----|-------|-------|---| | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | | 1 | 0 | 0 | 1 | 5 | 1 | 0 | 1 | | 2 | 0 | 1 | 0 | 6 | 1 | 1 | 0 | | 3 | 0 | 1 | 1 | 7 | 1 | 1 | 1 | "OR" and "&" represents "AND") P0|P1|P2|P3|P4 = 1. (1) If the sequence elements (P0-P4) are randomly divided into two groups, such as P0, P2, and P4 as group 1 and P1 and P3 as group 2, then, because of one and only one "1" in the sequence, we have P0|P2|P4 = P3|P4. (2) All results of random separation satisfy this rule. We also apply the same method on the three-way sorting network's output sequence and get the one-hot code sequence Q0-Q3. This sequence also satisfies the rule above. we have two sequences P and Q. P0 = 1 means that there is no "1" in the input sequence of four-way sorting network, P1 = 1 represents one "1" in it, and Pi = 1 represents i "1"s in it and so is the sequence Q. Here are some symbol conventions. The outputs of (7,3) counter are denoted as C2, C1, and S, and C2 has the most significant weight, while S has the lowest weight. Table I shows the total numbers of "1"s ("Num" column in the table) in the input 7 bits corresponding to outputs, i.e., Num = 22C2 + 21C1 + 20 S. The sequence output from four-way sorting network is denoted as sequence H, including H1-H4 from left to right in Fig. 4. The sequence output from three-way sorting network is denoted as sequence I, including I1-I3 from left to right. According to Table I, we know that at least four "1"s are in the input sequence of the (7,3) counter when C2 = 1. As discussed before, P4 = 1 means that four "1"s are in sequence H (also in the input sequence of 4 SN because sorting network has no effect on total number of "1"s), and Q0 = 1 means that no "1" is in sequence I. Thus, P4&Q0 = 1 means that there are totally 4 + 0 = 4 "1"s in the input 7 bits. As a result of this type of representation, C2 is equal to 1 when the summation of subscripts of P and Q is no less than 4. In this way, C2 can be expressed as $$C_2 = (P_4 \& (Q_0 | Q_1 | Q_2 | Q_3)) | (P_3 \& (Q_1 | Q_2 | Q_3)) |$$ $$(P_2 \& (Q_2 | Q_3)) | (P_1 \& Q_3).$$ (3) Notice that the sequence Q, with the same method in (2), satisfies Fig.8 C1 generating circuit. # HIGHER COMPRESSION RATIO COUNTERS: ## **CONSTRUCT (15,4) COUNTER** Fig.9 Overall (15,4) counter circuit Fig10 Eight-way sorting network. Seven- and Eight-Way Sorting Networks: Fig. shows an eight-way sorting network [14]. This sorting network consumes six layers of basic logic gates to output the result. Removing one bit from the eight-way sorting network can obtain a seven-way sorting network that also consumes six layers of basic logic gates. $$Q_{0}|Q_{2}|Q_{4}|Q_{6} = \overline{Q_{1}|Q_{3}|Q_{5}|Q_{7}}$$ $$P_{0}|P_{2}|P_{4}|P_{6}|P_{8} = \overline{P_{1}|P_{3}|P_{4}|P_{5}|P_{7}}$$ $$I_{i} = I_{i}|I_{i+j},$$ $$(i = 0, 1, ..., 6, 7; j \ge 0; i + j \le 8)$$ ## TRUTH TABLE OF (15,4) COUNTER OUTPUT TRUTH TABLE OF (15,4) COUNTER OUTPUTS | Num | $C_3$ | $C_2$ | $C_1$ | S | Num | $C_3$ | $C_2$ | $C_1$ | S | |-----|-------|-------|-------|---|-----|-------|-------|-------|---| | 0 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | | 1 | 0 | 0 | 0 | 1 | 9 | 1 | 0 | 0 | 1 | | 2 | 0 | 0 | 1 | 0 | 10 | 1 | 0 | 1 | 0 | | 3 | 0 | 0 | 1 | 1 | 11 | 1 | 0 | 1 | 1 | | 4 | 0 | 1 | 0 | 0 | 12 | 1 | 1 | 0 | 0 | | 5 | 0 | 1 | 0 | 1 | 13 | 1 | 1 | 0 | 1 | | 6 | 0 | 1 | 1 | 0 | 14 | 1 | 1 | 1 | 0 | | 7 | 0 | 1 | 1 | 1 | 15 | 1 | 1 | 1 | 1 | ## PROPOSED TECHNIQUE: **BITONIC SORTING:**Bitonic sort is one of the fastest sorting networks. A <u>sorting network</u> is a special kind of sorting algorithm, where the sequence of comparisons is not data-dependent. Fig.11 An increasing 8-input bitonic merging unit Each box performs the same operation as a blue box, but with the sort in the opposite direction. So, each green box could be replaced by a blue box followed by a crossover where all the wires move to the opposite position. This would allow all the arrows to point the same direction, but would prevent the horizontal lines from being straight. However, a similar crossover could be placed to the right of the bottom half of the outputs from any red block, and the sort would still work correctly, because the reverse of a bitonic sequence is still bitonic. If a red box then has a crossover before and after it, it can be rearranged internally so the two crossovers cancel, so the wires become straight again. Therefore, the following diagram is equivalent to the one above, where each green box has become a blue plus a crossover, and each orange box is a red box that absorbed two such crossovers. ## **RESULTS:** Fig: 12 Existing simulation output Fig: 13 Proposed simulation output **CONCLUSION AND FUTURE SCOPE:**In this paper, a plan strategy for region viable and speed effective counter is planned and reproduced. A twofold counter in view of a novel symmetric piece Aggregate and convey computation approach is proposed. In this process, a new counter design method based on a sorting network is proposed, and we construct (7,3), (15,4) counters based on this biotonic sorting method. Proposed counters are more flexible than existing designs because it achieves less delay when the speed is critical and performs in better in ADP. The image, videos and the clarity of videos can be checked by using approximate multipliers. And that are tired to match the clarity of the exact multipliers. For faster multiplication we have use Brent Kung Adder to reduce area and to increase speed we can use Kogge Stone Adder. **REFERENCE:**[1] Wen-Chang Yeh and Chein-Wei Jen, "High-speed Booth encoded parallel multiplier design," IEEE Trans. on Computers, vol. 49, isseu 7, pp. 692-701, July 2000. - [2] Jung-Yup Kang and Jean-Luc Gaudiot, "A simple high-speed multiplier design," IEEE Trans. on Computers, vol. 55, issue 10, Oct. pp. 1253-1258, 2006. - [3] Shiann-RongKuang, Jiun-Ping Wang and Cang-Yuan Guo, "Modified Booth multipliers with a regular partial product array," IEEE Trans. onCircuit and Systems, vol.56, Issue 5, pp. 404-408, May 2009. - [4] Li-rong Wang, Shyh-JyeJou and Chung-Len Lee, "A well-structured modified Booth multiplier design," Proc. of IEEE VLSI-DAT, pp. 85-88, April 2008. - [5] A. A. Khatibzadeh, K. Raahemifar and M. Ahmadi, "A 1.8V 1.1GHz Novel Digital Multiplier," Proc. of IEEE CCECE, pp. 686-689, May 2005. - [6] S. Hus, V. Venkatraman, S. Mathew, H. Kaul, M. Anders, S. Dighe, W. Burleson and R. Krishnamurthy, "A 2GHZ 13.6mW 12x9b mutiplier for energy efficient FFT accelerators," Proc. of IEEE ESSCIRC, pp. 199-202, Sept. 2005. - [7] Hwang-Cherng Chow and I-Chyn Wey, "A 3.3V 1GHz high speed pipelined Booth multiplier," Proc. of IEEE ISCAS, vol. 1, pp. 457-460, May 2002. - [8] M. Aguirre-Hernandez and M. Linarse-Aranda, "Energy-efficient high-speed CMOS pipelined multiplier," Proc. of IEEE CCE, pp. 460-464, Nov. 2008. - [9] Yung-chin Liang, Ching-ji Huang and Wei-bin Yang, "A 320-MHz 8bit x 8bit pipelined multiplier in ultra-low supply voltage," Proc. of IEEE A-SSCC, pp. 73-76, Nov. 2008. - [10] S. B. Tatapudi and J. G. Delgado-Frias, "Designing pipelined systems with a clock period approaching pipeline register delay," Proc. of IEEE MWSCAS, vol. 1, pp. 871-874, Aug. 2005. - [11] A. D. Booth, "A signed binary multiplication technique," Quarterly J. Mechanical and Applied Math, vol. 4, pp.236-240, 1951. - [12] M. D. Ercegovac and T. Lang, *Digital Arithmetic*, Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, 2003. - [13] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. On Computers, vol. BC13, pp. 14-17, Feb. 1964. - [14] M.D. Ercegovac et al., <sup>a</sup>Fast Multiplication without Carry- Propagate Addition, <sup>o</sup> IEEE Trans. Computers, vol. 39, no. 11, Nov. 1990. - [15] R.K. Kolagotla et al., <sup>a</sup>VLSI Implementation of a 200-Mhz 16 \_ 16 Left-to-Right Carry-Free Multiplier in 0.35\_m CMOS Technology for Next-Generation DSPs, <sup>a</sup> Proc. IEEE 1997 Custom Integrated Circuits Conf., pp. 469-472, 1997. - [16] P.F. Stelling and V.G. Oklobdzija, <sup>a</sup>Optimal Designs for Multipliers and Multiply-Accumulators, <sup>e</sup> Proc. 15th IMACS WorldCongress Scientific Computation, Modeling, and Applied Math., A. Sydow, ed., pp. 739-744, Aug. 1997. - [17] Passport 0.35 micron, 3.3 volt, Optimum Silicon SC Library, CB35OS142, Avant! Corporation, Mar. 1998. - [18] G. Goto et al., <sup>a</sup>A 4.1ns compact 54 \_ 54-b Multiplier UtilizingSign-Select Booth Encoders, <sup>o</sup> IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1,676-1,682, Nov. 1997. - [19] G. Goto et al., <sup>a</sup>A 54 \_ 54-b Regularly Structured Tree Multiplier, <sup>o</sup> IEEE J. Solid-State Circuits, vol. 27, no. 9, Sept. 1992. - [20] R. Fried, <sup>a</sup>Minimizing Energy Dissipation in High-Speed Multipliers, <sup>o</sup> Proc. 1997 Int'l Symp. Low Power Electronics and Design, pp. 214-219, 1997. - [21] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, second ed., chapter 8, p. 520. Addison Wesley, 1993. - [22] J. Fadavi-Ardekani, <sup>a</sup>M\_N Booth Encoded Multiplier Generator Using Optimized Wallace Trees, <sup>a</sup> IEEE Trans. VLSI Systems, vol. 1, no. 2, June 1993. - [23] A.A. Farooqui et al., <sup>a</sup>General Data-Path Organization of a MAC Unit for VLSI Implementation of DSP Processors, <sup>a</sup> Proc. 1998 IEEE Int'l Symp. Circuits and Systems, vol. 2, pp. 260-263, 1998.