On-Chip Implementation of Pipeline Digit-Slicing Multiplier-Less Butterfly for Fast Fourier Transform Architecture

Yazan Samir Algnabi (1), Rozita Teymourzadeh (1,2), Masuri Othman (1), Md Shabiul Islam (1), Mok Vee Hong
(1) Institute of MicroEngineering and Nanoelectronics (IMEN), VLSI Design Department, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
(2) Faculty of Engineering, Architecture and Built Environment, Electrical & Electronic Engineering Department, UCSI University, Kuala Lumpur, Malaysia

Corresponding Author: Rozita Teymourzadeh, Institute of Microengineering and Nanoelectronics (IMEN), VLSI Design Department, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia

Abstract: The need for wireless communication has driven communication systems towards high performance. However, the main bottleneck that affects communication capability is the Fast Fourier Transform (FFT), which is the core of most modulators. This study presents an on-chip implementation of a pipeline digit-slicing multiplier-less butterfly for the FFT structure. To reduce the computational complexity of the butterfly, a digit-slicing multiplier-less single-constant technique was used in the critical path of the radix-2 Decimation In Time (DIT) FFT structure. The proposed design focuses on the trade-off between speed and active silicon area for the chip implementation. The new architecture was investigated and simulated with MATLAB. Verilog HDL code describing the FFT butterfly functionality was developed in the Xilinx ISE environment and downloaded to a Virtex-II FPGA board; the Virtex-II FG456 Proto board was used to implement and test the design on real hardware. The synthesis report indicates a maximum clock frequency of 549.75 MHz with a total equivalent gate count of 31,159, a marked improvement over the conventional radix-2 FFT butterfly. By comparison, the conventional butterfly architecture runs at a maximum clock frequency of only 198.987 MHz and the conventional multiplier at only 220.160 MHz. The maximum clock frequency therefore rises to about 276.28% of the conventional value for the FFT butterfly and about 277.06% for the multiplier. It can be concluded that the on-chip implementation of the pipeline digit-slicing multiplier-less butterfly for the FFT structure is an enabler in solving problems that affect communication capability in the FFT and has considerable potential for future related work.

Key words: Pipelined digit-slicing multiplier-less; Fast Fourier Transform (FFT); Verilog HDL; Xilinx

INTRODUCTION

The FFT plays an important role in many Digital Signal Processing (DSP) applications such as communication systems and image processing. It is an efficient algorithm for computing the Discrete Fourier Transform (DFT). The DFT is the main procedure in data analysis, system design and implementation (Oppenheim and Rader, 1990). To reduce the computational complexity of the FFT algorithm, many modules have been designed and implemented on different platforms. These modules focus on the radix order or the twiddle factors to obtain a simple and efficient algorithm, and include the higher-radix FFT (Bergland, 1969), the mixed-radix FFT (Singleton, 1969), the prime-factor FFT (Kolba and Parks, 1977), the recursive FFT (Varkonyi-Koczy, 1995), the low-memory-reference FFT (Wang et al., 2007), multiplier-less FFTs (Zhou et al., 2007; Prasanthi et al., 2005; Mahmud and Othman, 2006) and Application-Specific Integrated Circuit (ASIC) systems such as that of Baas (1999). ASIC-based systems can meet real low-power or high-performance requirements; however, their function is fixed and hard to modify (Hsu and Lin, 2008).

The digit-slicing technique was studied for digital filters by Bin Nun and Woodward (1976), Peled and Liu (1976) and Sharrif (1980). The design and implementation of a digit-slicing FFT was discussed by Samad et al. (1998). This study proposes an idea similar to that of Samad et al. (1998), but uses a different algorithm and a different platform, which helps to improve performance and achieve higher speed. Recently, Field Programmable Gate Arrays (FPGAs) have become a practical option for direct hardware solutions in real-time applications. In this study, a digit-slicing architecture is proposed for designing the pipeline digit-slicing multiplier-less butterfly. The FFT butterfly multiplication is the most crucial contributor to delay in the computation of the FFT. Because the twiddle factors in the FFT processor are known in advance, we propose to replace the traditional butterfly in the FFT with the pipeline digit-slicing multiplier-less butterfly.

The paper is organized as follows: it describes the FFT architecture in brief, explains the conventional butterfly architecture, discusses the digit-slicing architecture, details the design of the pipeline digit-slicing multiplier-less butterfly architecture, and finally presents the implementation results and the conclusion.

Fast Fourier Transform (FFT): The DFT is a useful method for transforming between the time domain and the frequency domain in digital hardware. The N-point DFT of a complex data sequence x(n) is defined in Eq. 1:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

Where:
x(n) and X(k) = Complex numbers
$W_N^{kn} = e^{-j 2\pi kn / N}$ = The twiddle factor

The DFT of an N-point finite sequence represents harmonically related frequency components of x(n). Direct computation of Eq. 1 requires on the order of $N^2$ operations, where N is the transform size. Cooley and Tukey (1965) introduced a technique that reduces the order of complexity of the DFT from $N^2$ to $N \log_2 N$. Consequently, a large number of FFT algorithms have been developed, such as the radix-2, radix-4 and split-radix algorithms.

In general, higher-radix FFT algorithms require fewer complex multiplications, whereas the radix-2 FFT algorithm is the simplest form of all FFT algorithms. Furthermore, it has a regular structure that makes it suitable for VLSI implementation, as shown in Eq. 2, where $W_{N/2} = W_N^{2}$:

$$X[m] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nm} + W_N^{m} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nm} \tag{2}$$

The FFT algorithm relies on a 'divide-and-conquer' methodology, which divides the N coefficient points into smaller blocks in different stages. The first stage operates on groups of two coefficients, yielding N/2 blocks, each computing the addition and subtraction of the coefficients scaled by the corresponding twiddle factors. Such a block is called a butterfly for its cross-over appearance, as shown in Fig. 1. These results are used to compute the next stage of N/4 blocks, each of which combines the results of two previous blocks and therefore four coefficients. This process is repeated until one main block is formed, with a final computation over all N coefficients. Fig. 2 shows the 8-point radix-2 DIT FFT.

Fig. 1: Butterfly structure.
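The even/odd split of Eq. 2 can be sketched in a few lines of Python. This is only an illustrative floating-point model under stated assumptions: the language, the function names `dft` and `fft_radix2_dit`, and the test signal are not from the study, which used MATLAB and Verilog. The direct sum of Eq. 1 serves as the reference.

```python
import cmath


def dft(x):
    """Direct evaluation of Eq. 1: X(k) = sum over n of x(n) * W_N^(k*n)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT following the split of Eq. 2."""
    n_points = len(x)
    if n_points < 1 or n_points & (n_points - 1):
        raise ValueError("length must be a power of two")
    if n_points == 1:
        return list(x)
    even = fft_radix2_dit(x[0::2])  # N/2-point transform of x[2n]
    odd = fft_radix2_dit(x[1::2])   # N/2-point transform of x[2n+1]
    half = n_points // 2
    out = [0j] * n_points
    for m in range(half):
        w = cmath.exp(-2j * cmath.pi * m / n_points)  # twiddle factor W_N^m
        t = w * odd[m]
        out[m] = even[m] + t         # butterfly sum
        out[m + half] = even[m] - t  # butterfly difference
    return out


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 0, -1, -2, -3]  # arbitrary 8-point test signal
    fast, slow = fft_radix2_dit(signal), dft(signal)
    print(max(abs(a - b) for a, b in zip(fast, slow)))  # small rounding error
```

Each pass through the inner loop is one butterfly: a single complex multiplication by a twiddle factor followed by one addition and one subtraction, which is the operation the rest of the paper sets out to implement without a hardware multiplier.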
Radix-2, radix-4 and split-radix algorithms are mostly used in practical applications because of their simple structure and constant butterfly geometry.

Fig. 2: 8-point radix-2 Decimation In Time FFT.

Conventional butterfly architecture: The conventional radix-2 DIT butterfly architecture consists of complex data I/O, a complex multiplier, a complex adder and a complex subtractor, as shown in Fig. 3. Consider A and B as the complex input data and W = Wr - jWi as the complex twiddle factor; the complex outputs are X and Y. The indices r and i represent the real and imaginary parts respectively:

$$X = A + WB \tag{3}$$

$$Y = A - WB \tag{4}$$

$$(X_r + jX_i) = (A_r + jA_i) + (W_r - jW_i)(B_r + jB_i) \tag{5}$$

$$(Y_r + jY_i) = (A_r + jA_i) - (W_r - jW_i)(B_r + jB_i) \tag{6}$$

Fig. 3: Radix-2 DIT FFT butterfly architecture.

The complex multiplier requires four real multipliers and two real adders, as shown in Fig. 4. It is expanded in Eq. 7:

$$\begin{aligned}
(B_r + jB_i)(W_r - jW_i) &= (B_r W_r) - (B_r \cdot jW_i) + (jB_i \cdot W_r) - (jB_i \cdot jW_i) \\
&= [(B_r W_r) - (jB_i \cdot jW_i)] + [(jB_i \cdot W_r) - (B_r \cdot jW_i)] \\
&= [(B_r W_r) + (B_i W_i)] + [(jB_i \cdot W_r) - (B_r \cdot jW_i)]
\end{aligned} \tag{7}$$

The real and imaginary parts of the multiplication result are $[(B_r W_r) + (B_i W_i)]$ and $[(B_i W_r) - (B_r W_i)]$ respectively.

Fig. 4: Complex multiplier structure.

The complex adder requires two real adders to perform the addition, as shown in Fig. 5:

$$(A_r + jA_i) + (B_r + jB_i) = (A_r + B_r) + j(A_i + B_i) \tag{8}$$

Fig. 5: Complex adder structure.

Digit-slicing architecture: The concept behind the digit-slicing architecture is that any binary number can be sliced into a few blocks of shorter binary numbers, with each block carrying a different weight. In this study, fixed-point two's complement arithmetic has been chosen to represent the input data, which are signed numbers with absolute value less than one. The input data x of length B bits $(x^0, x^1, x^2, \ldots, x^{B-1})$ is represented in two's complement as:

$$x = \sum_{j=0}^{B-1} 2^{j} x^{j} \tag{9}$$

There are many different algorithms for representing the sliced data. Depending on the data type and word length, different structures can be introduced. In this study, the fundamental slicing algorithm is:

$$x = \left[ \sum_{k=0}^{b-1} 2^{pk} X_k \right] 2^{-(pb-1)} \tag{10}$$

Where:
b = Number of blocks into which x is sliced
p = Bit width per block

$$X_k = \sum_{j=0}^{p-1} 2^{j} X_{k,j} \tag{11}$$

Where the $X_{k,j}$ are all either ones or zeros, except $X_{k=b-1,\,j=p-1}$, which is zero or minus one.

The algorithm in Eq. 10 applies when the word length of the sliced data is $2^x$, such as $2^2 = 4$, $2^3 = 8$, 16... bits. To demonstrate how digit slicing operates, consider the decimal number -0.65625:

$$x = 1.0101100_2 = -0.65625_{10}$$

where the suffix 2 refers to an 8-bit binary fixed-point two's complement number and the suffix 10 refers to a decimal number. If x is sliced into two blocks, each four bits wide, that is b = 2 and p = 4:

$$X_0 = \sum_{j=0}^{3} 2^{j} X_{0,j} = 2^3 + 2^2 = 12$$

$$X_1 = \sum_{j=0}^{3} 2^{j} X_{1,j} = -2^3 + 2^1 = -6$$

$$x = \left[ \sum_{k=0}^{1} 2^{4k} X_k \right] 2^{-(8-1)} = \left[ 2^{4 \cdot 0} \cdot 12 + 2^{4 \cdot 1} \cdot (-6) \right] 2^{-7} = (12 - 96) \cdot 2^{-7} = -84 \cdot 2^{-7} = -0.65625$$

Fig. 6: The first digit-slicing algorithm for -0.65625.

Another algorithm represents sliced data with a word length of $2^x + 1$, such as $2^x + 1 = 5, 9, 17$... bits:

$$x = \sum_{k=0}^{b-1} 2^{-pk} X_k \tag{12}$$

Where x is a decimal number whose absolute value is less than one and which is sliced into b blocks, each p bits wide. The most significant block is k = 0, which contains only the sign bit of x plus leading dummy zeros to make up a block of length p bits (Samad et al., 1998):

$$X_{k=0} = 0 \text{ or } -1 \text{ only}; \qquad X_k = \sum_{j=0}^{p-1} 2^{j} X_{k,j}, \quad X_{k,j} = 0 \text{ or } 1 \text{ only, for } k \neq 0 \tag{13}$$

Let us assume that the decimal number -0.328125 is represented as a nine-bit two's complement number:

$$x = \sum_{k} 2^{-4k} X_k = [2^{-4 \cdot 0}](-1) + [2^{-4 \cdot 1}](2^3 + 2^1) + [2^{-4 \cdot 2}](2^3 + 2^2) = -1 + 2^{-1} + 2^{-3} + 2^{-5} + 2^{-6} = -0.328125$$

Fig. 7: The second digit-slicing algorithm for -0.328125.

Comparing the two algorithms, the second requires one extra block to deal with the sign bit, which makes the design more complicated and requires more hardware for the implementation. In this study, the first digit-slicing algorithm has been chosen to build the digit-slicing FFT butterfly structure. Therefore, any complex number F can be sliced into b smaller blocks, each with a shorter word length p, as illustrated in the following equations:

$$F = F_R + jF_I \tag{13}$$

$$F = \left[ \sum_{k=0}^{b-1} 2^{pk} F_{Rk} \right] 2^{-(pb-1)} + j \left[ \sum_{k=0}^{b-1} 2^{pk} F_{Ik} \right] 2^{-(pb-1)} \tag{14}$$

$$F_{Rk} = \sum_{j=0}^{p-1} 2^{j} F_{Rk,j} \tag{15}$$

$$F_{Ik} = \sum_{j=0}^{p-1} 2^{j} F_{Ik,j} \tag{16}$$

Where the values of $F_{Ik,j}$ and $F_{Rk,j}$ are either zero or one.

Pipeline digit-slicing multiplier-less butterfly architecture: The butterfly is the smallest component from which the FFT is built. As explained above, the butterfly structure contains one complex multiplier, one complex adder and one complex subtractor.

Multiplication is a complex and expensive operation. Many techniques have been introduced for reducing the size and improving the speed of multipliers. Some applications, such as digital signal processing, image processing and multiple-precision arithmetic in the design of compilers, require Constant Coefficient Multipliers, which are one of the most common solutions for speeding up the multiplication process. The multiplier can be designed for one constant, in which case it is termed a Single Constant Multiplier (SCM), or for many constants, in which case it is termed a Multiple Constant Multiplier (MCM). Since the twiddle factors in the FFT processor are known in advance, a special SCM design has been proposed that multiplies by the twiddle factor without using a traditional multiplier; it is termed Single Constant Multiplier Less (SCML).

The digit-slicing architecture has been applied to the butterfly input to slice the data into four groups, each carrying four bits, as shown in Fig. 8.

Fig. 8: Digit-slicing structure for the input A.
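The first slicing algorithm (Eq. 10 and 11) can be checked numerically with a short Python sketch. The helper names `slice_digits` and `unslice_digits` are hypothetical, and the code is a software model of the arithmetic only, not of the authors' MATLAB or Verilog implementation; it assumes the input is exactly representable in a word of p times b bits.

```python
def slice_digits(value, p, b):
    """Slice a fraction with |value| < 1 into b blocks of p bits (Eq. 10, 11).

    Blocks are returned least significant first. Every block is an unsigned
    p-bit digit except the top one, whose most significant bit has weight
    -2**(p-1), so that block may be negative.
    """
    word = p * b
    scale = 1 << (word - 1)
    scaled = value * scale
    if scaled != int(scaled) or not -scale <= scaled < scale:
        raise ValueError("value is not representable in the chosen word length")
    raw = int(scaled) & ((1 << word) - 1)  # two's complement bit pattern
    blocks = []
    for k in range(b):
        digit = (raw >> (p * k)) & ((1 << p) - 1)
        if k == b - 1 and digit >= 1 << (p - 1):
            digit -= 1 << p  # sign bit of the word lives in the top block
        blocks.append(digit)
    return blocks


def unslice_digits(blocks, p):
    """Rebuild the fraction from its blocks: x = (sum 2**(p*k) X_k) * 2**-(p*b-1)."""
    b = len(blocks)
    total = sum(digit * (1 << (p * k)) for k, digit in enumerate(blocks))
    return total / (1 << (p * b - 1))


if __name__ == "__main__":
    blocks = slice_digits(-0.65625, p=4, b=2)
    print(blocks)                        # [12, -6]
    print(unslice_digits(blocks, p=4))   # -0.65625
```

With p = 4 and b = 2 this reproduces the worked example, X_0 = 12 and X_1 = -6. Calling it with p = 4 and b = 4 gives the four 4-bit groups used for the 16-bit butterfly inputs.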
The sliced input A is given by:

$$A = \left[ \sum_{k=0}^{b-1} 2^{pk} A_k \right] 2^{-(pb-1)} \tag{17}$$

$$A_k = \sum_{j=0}^{p-1} 2^{j} A_{k,j} \tag{18}$$

where the $A_{k,j}$ are all either ones or zeros, except for $A_{k=b-1,\,j=p-1}$, which is zero or minus one. The same applies to the input B:

$$B = \left[ \sum_{k=0}^{b-1} 2^{pk} B_k \right] 2^{-(pb-1)} \tag{19}$$

$$B_k = \sum_{j=0}^{p-1} 2^{j} B_{k,j} \tag{20}$$

where the $B_{k,j}$ are all either ones or zeros, except for $B_{k=b-1,\,j=p-1}$, which is zero or minus one.

The multiplication functionality is regarded as the most important operation for most signal processing systems. The SCML consists of four lookup tables (ROMs) and an adder that forms the output, as shown in Fig. 9. To generate the lookup table data (the possible multiplication results), which are 16 different results for each ROM, a special MATLAB program was written that applies the digit-slicing algorithm to every possible 4-bit input, from "0000" to "1111". The SCML result is obtained by simple addition of the results from all the lookup tables. In the hardware implementation, the addition logic has been reduced: during the addition of the four products obtained from the lookup tables, the least significant digit (4 bits) of each level is always added to zero, so these bits are not affected or changed and are carried into the next column. Storing all these possibilities in four different ROMs allows the design to perform the multiplication without any real multiplier.

Fig. 9: Digit-Slicing Single Constant Multiplier (DSSCM) structure.

From Eq. 10 and 11, the digit-slicing multiplier is represented as follows, with p = 4 in this design:

$$BW = \left[ \sum_{k=0}^{b-1} 2^{pk}\, WB_k \right] 2^{-(pb-1)} \tag{21}$$

$$WB_k = \sum_{j=0}^{p-1} 2^{j}\, WB_{k,j} \tag{22}$$

where the $WB_{k,j}$ are all either ones or zeros, except for $WB_{k=b-1,\,j=p-1}$, which is zero or minus one, and where W is the constant.

The result of the multiplication is added to and subtracted from the complex input $A_r + jA_i$ to form the butterfly outputs.
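The lookup-table multiplication described above can be modelled in Python as follows. This is a sketch under several assumptions that are not taken from the paper: the names `build_luts` and `scml_multiply` are hypothetical, operands are treated as 16-bit two's complement integers, the constant 23170 (roughly cos(pi/4) scaled by 2**15) is only an illustrative twiddle value, and the final fixed-point scaling and the reduced adder logic of the hardware are left out.

```python
P, B = 4, 4  # four slices of four bits each, i.e. a 16-bit input word


def build_luts(w_int):
    """Precompute one 16-entry table per slice for the constant w_int.

    Entry addr of table k holds w_int times the value of slice k when its
    four bits equal addr. Only the top slice is signed, because it holds
    the sign bit of the whole word.
    """
    luts = []
    for k in range(B):
        table = []
        for addr in range(1 << P):
            digit = addr
            if k == B - 1 and addr >= 1 << (P - 1):
                digit -= 1 << P
            table.append(w_int * digit)
        luts.append(table)
    return luts


def scml_multiply(luts, b_int):
    """Multiply b_int by the constant using table lookups, shifts and adds only."""
    raw = b_int & ((1 << (P * B)) - 1)  # 16-bit two's complement pattern
    acc = 0
    for k in range(B):
        addr = (raw >> (P * k)) & ((1 << P) - 1)  # 4-bit slice addresses ROM k
        acc += luts[k][addr] << (P * k)           # weight 2**(4k) is a shift
    return acc


if __name__ == "__main__":
    W = 23170  # illustrative constant, not a value from the paper
    luts = build_luts(W)
    for sample in (-32768, -12345, 0, 1, 12345, 32767):
        assert scml_multiply(luts, sample) == W * sample
    print("lookup-table products match direct multiplication")
```

Because slice k is shifted left by 4k bits before the addition, its lowest 4k bits are zero, which corresponds to the observation in the text that the least significant digit of each level is always added to zero.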
The butterfly output X has been defined as:

$$X = \left[ \sum_{k=0}^{b-1} 2^{pk} X_k \right] 2^{-(pb-1)} \tag{23}$$

$$X_k = \sum_{j=0}^{p-1} 2^{j} X_{k,j} \tag{24}$$

where the $X_{k,j}$ are all either ones or zeros, except for $X_{k=b-1,\,j=p-1}$, which is zero or minus one.

By applying Eq. 17, 19 and 21 to Eq. 3:

$$\begin{aligned}
X &= A + WB \\
2^{-(pb-1)} \sum_{k=0}^{b-1} 2^{pk} X_k &= 2^{-(pb-1)} \sum_{k=0}^{b-1} 2^{pk} A_k + 2^{-(pb-1)} \sum_{k=0}^{b-1} 2^{pk}\, WB_k \\
X_k &= A_k + WB_k
\end{aligned} \tag{25}$$

$X_k$ is a complex number, $X_k = X_{rk} + jX_{ik}$, with $X_{rk} = \operatorname{Re}(A_k + WB_k)$ and $X_{ik} = \operatorname{Im}(A_k + WB_k)$.

The same steps are applied to obtain the output Y:

$$\begin{aligned}
Y &= A - WB \\
2^{-(pb-1)} \sum_{k=0}^{b-1} 2^{pk} Y_k &= 2^{-(pb-1)} \sum_{k=0}^{b-1} 2^{pk} A_k - 2^{-(pb-1)} \sum_{k=0}^{b-1} 2^{pk}\, WB_k \\
Y_k &= A_k - WB_k
\end{aligned} \tag{26}$$

$Y_k$ is a complex number, $Y_k = Y_{rk} + jY_{ik}$, with $Y_{rk} = \operatorname{Re}(A_k - WB_k)$ and $Y_{ik} = \operatorname{Im}(A_k - WB_k)$.

Finally, with W = Wr - jWi, the complex output is represented as follows:

$$X_{rk} = A_{rk} + W_r B_{rk} + W_i B_{ik} \tag{27}$$

$$X_{ik} = A_{ik} + W_r B_{ik} - W_i B_{rk} \tag{28}$$

$$Y_{rk} = A_{rk} - W_r B_{rk} - W_i B_{ik} \tag{29}$$

$$Y_{ik} = A_{ik} - W_r B_{ik} + W_i B_{rk} \tag{30}$$

The full digit-slicing single constant multiplier-less butterfly has been designed and tested in MATLAB, as shown in Fig. 10 and 11, and its result compared with that of the normal multiplier. For the addition and subtraction, parallel-prefix Kogge-Stone and Ling adders were used for high speed and better performance. The pipeline technique was applied to the full design for better performance.

Fig. 10: MATLAB design of the Digit-Slicing Single Constant Multiplier-Less for the butterfly.

Fig. 11: MATLAB design of the digit-slicing butterfly.

RESULT

Two different modules were implemented for the radix-2 DIT butterfly. The first module uses the conventional butterfly architecture, in which the twiddle factors are stored in ROM and fetched by the butterfly to be multiplied with the inputs using the dedicated high-speed multiplier of the Virtex-II FPGA. The other module uses the pipelined digit-slicing single constant multiplier-less architecture to perform the multiplication with the twiddle factor. Both modules were built and tested in MATLAB, as indicated in Fig. 10 and 11, and were then coded in Verilog and synthesized using the XST (Xilinx Synthesis Technology) tool. The target FPGA was the Xilinx Virtex-II XC2V500-6-FG456. The ModelSim simulation results for the pipelined digit-slicing multiplier-less radix-2 DIT butterfly are shown in Fig. 12 and 13, while the synthesis results for the two models are presented in Table 1, which gives the hardware specifications of the design. It indicates a maximum clock frequency of 549.75 MHz for the pipelined digit-slicing multiplier-less butterfly and of 609.980 MHz for the pipelined digit-slicing single constant multiplier-less for the butterfly. Fig. 14 and 15 show the RTL schematics for the pipeline digit-slicing single constant multiplier-less for the butterfly.

Fig. 12: Simulation result of the Pipeline Digit-slicing Single Constant Multiplier-Less for the butterfly.

Fig. 13: Simulation result of the digit-slicing butterfly.

Fig. 14: RTL schematic for the Pipeline Digit-slicing Single Constant Multiplier-Less for the butterfly.

Fig. 15: RTL schematic for the Pipeline Digit-slicing Single Constant Multiplier-Less lookup table (ROM) for the butterfly.

Table 1: Hardware specifications of the digit-slicing butterfly

Xilinx Virtex-II FPGA XC2V250-6FG456 | Total equivalent gate count for design | Maximum frequency (MHz)
Conventional butterfly | 18,408 | 198.987
Pipeline Digit-Slicing Multiplier-less Butterfly | 31,159 | 549.750
Conventional 16-bit multiplier | 4,131 | 220.160
Pipeline Digit-Slicing Single Constant Multiplier-Less (16 bits) for the butterfly | 6,483 | 609.980

CONCLUSION

This study presented an on-chip implementation of a pipeline digit-slicing multiplier-less butterfly for the FFT structure. The implementation was coded in the Verilog hardware description language and tested on a Xilinx Virtex-II XC2V500-6-FG456 prototyping FPGA board.
A maximum clock frequency of 549.75 MHz was obtained from the synthesis report for the pipeline digit-slicing multiplier-less butterfly, which is about 2.76 times the 198.987 MHz of the conventional butterfly. It can be concluded that the on-chip implementation of the pipeline digit-slicing multiplier-less butterfly for the FFT structure is an enabler in solving problems that affect communication capability in the FFT and has considerable potential for future related work.

REFERENCES

Baas, B.M., 1999. A low-power, high-performance, 1024-point FFT processor. IEEE J. Solid-State Circuits, 34: 380-387.
Bergland, G.D., 1969. A radix-eight fast-Fourier transform subroutine for real-valued series. IEEE Trans. Audio Electroacoust., 17: 138-144.
Bin Nun, M.A. and M.E. Woodward, 1976. A modular approach to the hardware implementation of digital filters. Radio Elect. Eng.
Cooley, J.W. and J.W. Tukey, 1965. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19: 297-301.
Hsu, Y.P. and S.Y. Lin, 2008. Parallel-computing approach for FFT implementation on Digital Signal Processor (DSP). World Acad. Sci., Eng. Technol., 42: 587-591.
Kolba, D.P. and T.W. Parks, 1977. A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech, Signal Process., 25: 281-294.
Mahmud, B. and M. Othman, 2006. FPGA implementation of a canonical signed digit multiplier-less based FFT processor for wireless communication applications. ICSE2006 Proc., Kuala Lumpur, Malaysia, pp: 641-645.
Oppenheim, A.V. and C.M. Rader, 1990. Discrete-Time Signal Processing. 2nd Edn., Prentice-Hall, Upper Saddle River, NJ, ISBN: 0137549202.
Peled, A. and B. Liu, 1976. Digital Signal Processing Theory, Design and Implementation. John Wiley and Sons, US.
Prasanthi, R., V. Anuradham, S.K. Sahoo and Chandra Shchar, 2005. Multiplier less FFT processor architecture for signal and image processing. ICISIP, 2005, Vol. 2, No. 1, pp: 35-45.
Samad, S.A., A. Ragoub, M. Othman and Z.A.M. Sheriff, 1998. Implementation of a high speed fast Fourier transform VLSI chip. Microelectronics J., 29: 881-887.
Sharrif, Z.A.M., 1980. Digit slicing architecture for real time digital filters. PhD thesis, Loughborough University, UK.
Sharrif, Z.A.M. and M. Othman, 1989. A novel modular architecture for VLSI Digital Signal Processing chip. CA-DSP'89, Oct. 11-14, IEE, Hong Kong.
Singleton, R.C., 1969. An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans. Audio Elect., 17: 93-103.
Varkonyi-Koczy, A.R., 1995. A recursive Fast Fourier Transform algorithm. IEEE Trans. Circuits Syst., 42: 614-616.
Wang, Y., Y. Tang, Y. Jiang, J.G. Chung and S.S. Song et al., 2007. Novel memory reference reduction methods for FFT implementation on DSP processors. IEEE Trans. Signal Process., 55: 2338-2349.
Zhou, Y., J.M. Noras and S.J. Shephend, 2007. Novel design of multiplier-less FFT processors. Signal Proc., 87: 1402-1407.
Electrical Engineering and Systems Science – arXiv (Cornell University)
Published: Jun 9, 2018