Number-Theoretic Transform with Constant Time Computation for Embedded Post-Quantum Cryptography

Acta Electrotechnica et Informatica, Vol. 22, No. 4, 2022, 30–37, DOI: 10.2478/aei-2022-0020
ISSN 1335-8243 (print), ISSN 1338-3957 (online), www.aei.tuke.sk

Eva KUPCOVÁ, Miloš DRUTAROVSKÝ
Department of Electronics and Multimedia Telecommunications, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic, E-mails: eva.kupcova@student.tuke.sk, milos.drutarovsky@tuke.sk

ABSTRACT
In this article, we describe the principles and advantages of using the Number-Theoretic Transform (NTT) in post-quantum cryptography. We deal with the use of NTT in post-quantum algorithms included in the competition announced by the National Institute of Standards and Technology. Attention is paid to the fast multiplication of polynomials using NTT and negacyclic convolution. We also focus on an existing implementation of NTT and its modifications in order to analyze the effectiveness of the individual modifications. Separate attention is paid to the Constant Time implementation of NTT, because a constant computation time of the transformation reduces the possibility of a side-channel attack. We describe measurements performed in the OS Linux Ubuntu 20.04 LTS environment in Linux kernel mode, with the highest attention paid to measurements executed on a microcontroller with a 32-bit ARM core. Measurements on the microcontroller unit are done using 32-bit and 16-bit arithmetic to demonstrate how to achieve a constant computation time of the transformation. We present the results and analysis of the measurements performed using the modified implementations.

Keywords: NTT transformation, post-quantum cryptography, negacyclic convolution, microcontroller, Montgomery reduction

1. INTRODUCTION

Cryptography is an area of mathematics and computer science that aims to protect the information content of a secret message using encryption methods, but does not conceal the existence of the message. Modern cryptography develops dynamically together with computer technology. It is connected with the electronic form of communication and the sending of information [1, 2].

In recent years, research has also focused on the development of quantum computers, machines that use quantum mechanical phenomena to efficiently solve mathematical problems that are currently very difficult or unsolvable for conventional computers. If quantum computers are constructed on a large scale, this will compromise the security of many commonly used cryptographic algorithms. Quantum computers could break many current public-key cryptosystems in a very short time, including RSA (Rivest-Shamir-Adleman Algorithm), DSA (Digital Signature Algorithm) and ECC (Elliptic Curve Cryptography) [3, 4]. These algorithms play a crucial role in ensuring the confidentiality and authenticity of communications on the internet and other networks.

The selection of suitable post-quantum algorithms is handled by the National Institute of Standards and Technology (NIST, https://www.nist.gov). The process of selecting and standardizing suitable algorithms takes several years, and it takes another few years before an algorithm can be considered safe and reliable. An algorithm gains reliability by going through a cryptanalysis process and being examined by several experts in the field, and it must also withstand the various proposed attacks. Therefore, NIST decided to find an optimal post-quantum algorithm for the future, which will serve as a replacement for several cryptosystems. In 2016, NIST initiated the process of developing and standardizing one or several post-quantum algorithms for public-key encapsulation mechanisms and digital signatures, like NewHope [5], Kyber [6], Falcon [7] or Dilithium [8]. The goal of the new cryptographic algorithms is to protect sensitive government information even after quantum computers are widely used [9]. In July 2022, NIST selected the Kyber algorithm for the key encapsulation mechanism because of its small keys and speed of operation. Dilithium and Falcon were selected for digital signatures because of their high efficiency [10].

Some post-quantum algorithms like NewHope, Kyber, Titanium and others use the Number Theoretic Transform (NTT) to multiply big-size polynomials in their implementations [9]. We perform experimental tests to determine the efficiency of NTT for the mentioned post-quantum algorithms. By optimizing NTT, it is possible to achieve an overall improvement of a given post-quantum algorithm. In some NTT implementations, negacyclic convolution [11] is used for optimization. We describe three specific implementations of NTT and the results they achieved. The goal of the experimentally created implementations is to achieve a constant time of NTT computation in terms of executed CPU cycles.

Side channels [12] change the overall view of system security. At present, it is no longer enough just to choose mathematically strong encryption; it is also necessary to pay attention to its implementation. Often we do not know about an unwanted side channel existing in the implementation. Therefore, achieving a constant computation time of NTT is very important, as it reduces the possibility of mounting some side-channel attacks.

The paper is structured in the following way. The second section describes the NTT itself, its mathematical foundations, principles and usage. The third section describes improvements in NTT computation. It mentions three types of convolution, and separate attention is paid to the connection between NTT and negacyclic convolution. The fourth section describes the procedure of the experiment, which modifies an existing NTT implementation. The individual optimizations used in a given implementation and the method of their modification are listed. The last section describes how the effectiveness of the individual NTT implementations is measured. The results of the individual measurements are presented together with their analysis. The conclusion summarizes the topic of the article and the achieved results.

2. NUMBER-THEORETIC TRANSFORM

NTT is a generalization of the Discrete Fourier Transform (DFT) [13]. The NTT structure is very similar to DFT but is defined over the finite field (Galois field) $GF(q)$, where $q$ is a prime number. NTT uses modular arithmetic. The basic intent of using modular arithmetic is to provide closure of operations between elements $a, b \in GF(q)$. Closure means that the results of operations are in the same finite field [4]. NTT allows fast convolution to be performed on integer sequences without any rounding errors and is also used to multiply big-size polynomials (e.g. sizes 512 and 1024 from Table 1). It is widely used in computer arithmetic and cryptography. The main advantage of the fast NTT is that an efficient algorithm reduces the computational complexity from $O(N^2)$ to $O(N \log N)$, where $N$ is the size of the NTT.

To use NTT it is necessary to choose a correct prime number $q = kN + 1$, where the size of the multiplicative group $G_\varphi$ is $\varphi(q) = q - 1 = kN$ and $N$ is a power of 2, and to specify $g$, the $(q-1)$-th primitive root of $G_\varphi$. The primitive root $g$ is a number for which $g^{q-1} \equiv 1 \pmod q$ and $g^{t} \not\equiv 1 \pmod q$ for $t = 1, 2, \ldots, q-2$ [11].

Using the primitive root $g$ it is possible to compute the $N$-th primitive root $\omega_N$:

$$\omega_N = g^{k} \pmod q \quad \text{and} \quad \omega_N^{N} = g^{kN} = g^{\varphi(q)} \equiv 1 \pmod q. \tag{1}$$

For a given $N$-th primitive root $\omega_N$ in the finite field, NTT is defined as:

$$X[i] = \sum_{j=0}^{N-1} x[j]\,\omega_N^{ij} \pmod q, \quad \text{where } i = 0, 1, \ldots, N-1, \tag{2}$$

and it can be described using the transformation matrix $A_N$, whose values in the $i$-th row determine the coefficients by which the input vector $x = [x[0], x[1], \ldots, x[N-1]]$ is multiplied when computing the $i$-th output $X[i]$ of the NTT:

$$A_N = \begin{bmatrix} \omega_N^{0} & \omega_N^{0} & \omega_N^{0} & \cdots & \omega_N^{0} \\ \omega_N^{0} & \omega_N^{1} & \omega_N^{2} & \cdots & \omega_N^{N-1} \\ \omega_N^{0} & \omega_N^{2} & \omega_N^{4} & \cdots & \omega_N^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \omega_N^{0} & \omega_N^{N-1} & \omega_N^{2(N-1)} & \cdots & \omega_N^{(N-1)(N-1)} \end{bmatrix}. \tag{3}$$

Inverse NTT (INTT) is defined as:

$$x[i] = N^{-1} \sum_{j=0}^{N-1} X[j]\,\omega_N^{-ij} \pmod q, \quad \text{where } i = 0, 1, \ldots, N-1, \tag{4}$$

and it can be described using the inverse transformation matrix $A_N^{-1}$, whose values in the $i$-th row determine the coefficients (up to the factor $N^{-1}$ from (4)) by which the input vector $X = [X[0], X[1], \ldots, X[N-1]]$ is multiplied when computing the $i$-th output $x[i]$ of the INTT:

$$A_N^{-1} = \begin{bmatrix} \omega_N^{-0} & \omega_N^{-0} & \omega_N^{-0} & \cdots & \omega_N^{-0} \\ \omega_N^{-0} & \omega_N^{-1} & \omega_N^{-2} & \cdots & \omega_N^{-(N-1)} \\ \omega_N^{-0} & \omega_N^{-2} & \omega_N^{-4} & \cdots & \omega_N^{-2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \omega_N^{-0} & \omega_N^{-(N-1)} & \omega_N^{-2(N-1)} & \cdots & \omega_N^{-(N-1)(N-1)} \end{bmatrix}. \tag{5}$$

NTT defined over a finite field can be implemented very efficiently if correct parameters ($N$, $q$, $\omega_N$) are used for the computation. As an illustration, Table 1 lists typical parameters that are widely used in some post-quantum algorithms for NTT computation.

Table 1 Parameters of post-quantum algorithms for NTT computation with specific input size N.

| Post-quantum algorithm | N | q | ω |
|---|---|---|---|
| Dilithium [8] | 256 | 8 380 417 | 80 |
| Falcon [7] | 1024 | 12 277 | 49 |
| Kyber512 [6] | 512 | 12 289 | 49 |
| Kyber1024 [6] | 1024 | 12 289 | 7 |
| NewHope512 [5] | 512 | 12 265 | 3 |
| NewHope1024 [5] | 1024 | 12 277 | 49 |

3. NEGACYCLIC CONVOLUTION IN CONTEXT OF NTT COMPUTATION

Polynomials are widely used in cryptography, but with increasing polynomial size the time required to perform operations (especially multiplication) also increases. The multiplication of two polynomials can be interpreted as a convolution of their coefficients. This approach is relevant for polynomial multiplication using the fast NTT, which is built on the same principles as the Fast Fourier Transform (FFT) [16].

The whole NTT computation is broken down into butterflies, which are used in two basic forms of DFT computation. A butterfly is a portion of the computation that combines the results of smaller discrete Fourier transforms (DFTs) into a larger DFT, or vice versa (breaking a larger DFT up into sub-transforms). The name "butterfly" comes from the shape of the data-flow diagram in the radix-2 case. The NTT computations are given in the form of the DIF (Decimation in Frequency) structure described by the Gentleman-Sande algorithm [15] and the DIT (Decimation in Time) structure described by the Cooley-Tukey algorithm [14], which have multiple forms of butterflies [16].

3.1. Different forms of convolution

Consider vectors $a = [a[0], a[1], \ldots, a[N-1]]$ and $b = [b[0], b[1], \ldots, b[N-1]]$ of length $N$. The linear convolution of the vectors $a$ and $b$ has length $(2N - 1)$ and is the vector $\tilde{z} = [\tilde{z}[0], \tilde{z}[1], \ldots, \tilde{z}[2N-2]]$, where $\tilde{z}$ can be computed as [16]:

$$\tilde{z}[i] = \sum_{n+m=i} a[n]\,b[m]. \tag{6}$$

Cyclic convolution of two vectors $a$ and $b$ is defined as:

$$z = [z[0], z[1], \ldots, z[N-1]], \tag{7}$$

where $z[i] = \tilde{z}[i] + \tilde{z}[i+N]$ and $i = 0, 1, \ldots, N-1$ (with $\tilde{z}[2N-1] = 0$).

Negacyclic convolution of two vectors $a$ and $b$ is defined as:

$$y = [y[0], y[1], \ldots, y[N-1]], \tag{8}$$

where $y[i] = \tilde{z}[i] - \tilde{z}[i+N]$ and $i = 0, 1, \ldots, N-1$ (with $\tilde{z}[2N-1] = 0$).

3.2. Fast polynomial multiplication by using negacyclic convolution

In cryptography, polynomial multiplication is usually defined in the ring $R_q = \mathbb{Z}_q[x]/\langle x^N + 1 \rangle$, i.e. modulo $x^N + 1$, where $N$ is a power of 2 [17].

Negacyclic convolution computed using NTT is especially used for fast multiplication of big-size polynomials.

Consider the polynomial $y(x) = \sum_{i=0}^{N-1} y[i]\,x^i$, which can be represented as the vector $y = [y[0], y[1], \ldots, y[N-1]]$. Let $a = [a[0], a[1], \ldots, a[N-1]]$ and $b = [b[0], b[1], \ldots, b[N-1]]$ correspond to polynomials $a(x)$ and $b(x)$ of length $N$ with elements in $\mathbb{Z}_q$, let $\omega_N \in \mathbb{Z}_q$ be the $N$-th primitive root and let $\psi_{2N} = \sqrt{\omega_N}$ (i.e. $\psi_{2N}^2 \equiv \omega_N \pmod q$) be the $2N$-th primitive root.

Negacyclic convolution of $a$ and $b$ is defined as $y = [y[0], y[1], \ldots, y[N-1]]$, where $y(x) = a(x) \cdot b(x) \bmod (x^N + 1)$. Consider $\bar{a} = [a[0], \psi_{2N}\,a[1], \ldots, \psi_{2N}^{N-1}\,a[N-1]]$ and $\bar{b} = [b[0], \psi_{2N}\,b[1], \ldots, \psi_{2N}^{N-1}\,b[N-1]]$. The final fast negacyclic convolution is computed as [18]:

$$y = [1, \psi_{2N}^{-1}, \psi_{2N}^{-2}, \ldots, \psi_{2N}^{-(N-1)}] \circ \mathrm{INTT}(\mathrm{NTT}(\bar{a}) \circ \mathrm{NTT}(\bar{b})), \tag{9}$$

where $\circ$ denotes component-wise multiplication and $\psi_{2N}^{-1}$ is the inverse element of $\psi_{2N}$. This operation enables fast polynomial multiplication, but it is necessary to realize that the result is reduced modulo $x^N + 1$.

The main idea is that the sequence entering the NTT must be pre-multiplied by powers of $\psi_{2N}$ and the output sequence from the INTT is then post-multiplied by powers of $\psi_{2N}^{-1}$, as shown in Figure 1.

Fig. 1 Fast computation of negacyclic convolution using NTT with dimension N over finite field GF(q).

The NTT-based negacyclic convolution is advantageous over linear convolution, as appending zeros is not required for negacyclic convolution. This means that negacyclic convolution reduces the required NTT size by half [18, 19].

4. IMPLEMENTATION OF NTT AND ITS OPTIMIZATION

We customized Michael Scott's implementations of NTT and INTT in the C programming language (http://indigo.ie/~mscott/ntt_ref.c) to our needs for experimental purposes. He describes several ways of optimizing NTT in [20], which were used in our experiments. In one of the cases, the aim is to reach a constant NTT and INTT computation time for different input data. One of the well-known optimizations used is Montgomery's reduction [21]. Our experiment was based on the consecutive elimination of the respective optimizations, resulting in three different NTT implementations: Lazy, Constant Time and Naive.

Scott's NTT implementation uses the well-known Cooley-Tukey algorithm and the INTT uses the Gentleman-Sande algorithm [20]. The use of these algorithms can be considered the first stage of optimization. The Naive implementation uses these algorithms without other optimizations. The core of the individual optimizations is the improvement of the NTT butterfly code in the innermost loops of the algorithms mentioned above.

4.1. Lazy implementation of NTT

The final implementation contains the optimization called Lazy. The main idea behind the Lazy optimization is to modify the function redc, which is shown in Extract 1 [20]. Constants defined for the computation of NTT and INTT with input sizes N = 512 and N = 1 024 are needed to get correct computation results. We chose these input sizes because they are typically used in post-quantum algorithms, as mentioned in Section 2 in Table 1. The implementation also contains the function modmul, which is shown in Extract 2.

Extract 1 Modular reduction function

    int_t redc(int_dt T) {
        /* N is presumably the precomputed Montgomery constant -q^(-1) mod 2^WL, not the NTT size */
        uint_t m = (uint_t)T * N;
        /* (m*q + T) is divisible by 2^WL; no final correction is applied here */
        int_t V = (int_dt)(((uint_dt)m * q + T) >> WL);
        return V;
    }

Extract 2 Modular multiplication function

    int_t modmul(int_t a, int_t b) {
        return redc((int_dt)a * b);   /* double-width product, then Montgomery reduction */
    }

Optimizations of the inner butterfly loop in the Cooley-Tukey NTT code and in the Gentleman-Sande INTT code use different principles. Hence it is necessary to modify the NTT and INTT butterfly codes individually. The optimized Lazy butterfly code for NTT is shown in Extract 3 and the optimized butterfly code for INTT is shown in Extract 4.

Extract 3 Lazy NTT optimization

    U = x[j];
    V = modmul(x[j+t], S);
    x[j] = U + V;             /* no reduction: result may exceed q */
    x[j+t] = U + 2*q - V;     /* 2*q keeps the difference non-negative */

Extract 4 Lazy INTT optimization

    U = x[j];
    V = x[j+t];
    x[j] = U + V;
    W = U + n*q - V;          /* n*q keeps the difference non-negative */
    x[j+t] = modmul(W, S);

4.2. Constant Time implementation of NTT

The Constant Time implementation uses Montgomery's reduction [20, 21]. Montgomery reduction is a technique that allows a more efficient implementation of modular multiplication. It is a method of reducing a positive integer T modulo q. Another integer M > q is needed, for which gcd(q, M) = 1, where gcd is the greatest common divisor. Montgomery describes a method for computing $T M^{-1} \bmod q$ without the use of other standard reduction algorithms. If M is chosen correctly, the computation can be very efficient. The Montgomery reduction is shown in Algorithm 1 [21].

Algorithm 1 Montgomery's reduction
Input: integers $q = (q_{n-1} \ldots q_1 q_0)_b$ with $\gcd(q, b) = 1$, $M = b^n$, $q' = -q^{-1} \bmod b$, and $T = (t_{2n-1} \ldots t_1 t_0)_b < qM$.
Output: $T M^{-1} \bmod q$
1. $A \leftarrow T$. Note: $A = (a_{2n-1} \ldots a_1 a_0)_b$.
2. for $i \leftarrow 0$; $i < n$; $i \leftarrow i + 1$
   2.1 $u_i \leftarrow a_i q' \bmod b$
   2.2 $A \leftarrow A + u_i q b^i$
3. $A \leftarrow A / b^n$.
4. if $A \geq q$
   4.1 $A \leftarrow A - q$
5. return $A$

In the NTT implementation, Montgomery's optimization is used to perform modular multiplication without using the division operation. On a hardware-optimized Central Processing Unit (CPU) architecture, where multiplication is constant in time and usually takes 1 to 2 CPU cycles, the NTT and INTT implementations can be executed in constant time. The requirement to transform the NTT input data into Montgomery's space and the output data back may be considered a drawback.

In the same way as the Lazy implementation in Section 4.1, this implementation relies on butterfly code optimization for the NTT and INTT computation. The implementation can be accelerated by pre-computing the NTT and INTT ω tables in Montgomery's form. The modified redc function is shown in Extract 5. The change does not concern the modmul function. The modified butterfly code for NTT is shown in Extract 6, and the INTT modification is shown in Extract 7.

Extract 5 Redc function Constant Time implementation

    int_t redc(int_dt T) {
        uint_t m = (uint_t)T * N;
        int_t V = ((uint_dt)m * q + T) >> WL;
        V -= q;                        /* V is negative if it was already below q */
        V += (V >> (WL-1)) & q;        /* sign mask adds q back without a branch */
        return V;
    }

Extract 6 Montgomery's NTT optimization

    U = x[j];
    V = modmul(x[j+t], S);
    W = U + V - q;
    x[j] = W + ((W >> (WL-1)) & q);    /* branch-free correction into [0, q) */
    W = U - V;
    x[j+t] = W + ((W >> (WL-1)) & q);

Extract 7 Montgomery's INTT optimization

    U = x[j];
    V = x[j+t];
    W = U + V - q;
    x[j] = W + ((W >> (WL-1)) & q);    /* branch-free correction into [0, q) */
    W = U + q - V;
    x[j+t] = modmul(W, S);

4.3. Naive implementation of NTT transformation

The last modification is the removal of Montgomery's method, which results in the Naive implementation. The Cooley-Tukey pseudocode in the C programming language [20] is the basis for the Naive NTT. All instances of mod q have been replaced by the % operator, which in C represents the modulo operation (the remainder of integer division [22]).

The basis for the INTT is the Gentleman-Sande algorithm [20]. Here, too, mod q is replaced by the % operation. In contrast to the previous implementations, this one does not use the redc function, and the function for modular multiplication is in its basic form. The modmul function used in the Naive implementation is shown in Extract 8. The butterfly code for the NTT computation is shown in Extract 9. Extract 10 contains the butterfly code for the INTT implementation.

Extract 8 Modular multiplication in Naive implementation

    int_t modmul(int_t a, int_t b) {
        return (int_t)(((int_dt)a * b) % q);   /* double-width product reduced by division */
    }

Extract 9 Naive NTT optimization

    U = x[j];
    V = modmul(x[j+t], S);
    x[j] = (U + V) % q;
    x[j+t] = (U + q - V) % q;

Extract 10 Naive INTT optimization

    U = x[j];
    V = x[j+t];
    x[j] = (U + V) % q;
    W = (U + q - V);
    x[j+t] = modmul(W, S);

4.4. Implementations in 16-bit arithmetic modification

Our customized implementations have been measured on different platforms. Among other measurements, some have been carried out on the microcontroller (MCU) STM32F103 with a 32-bit ARM Cortex-M3 core [23]. To save MCU memory, we customized Scott's implementation so that it supports 16-bit arithmetic. For NTT and INTT with input size N = 1 024 and modulus q = 12 289, 16-bit arithmetic is sufficient. We used the same values as mentioned in Section 2, Table 1. This way we saved a significant portion of MCU memory.

Scott's implementation defines the data types int_t and int_dt for number values, where int_t represents the base type and int_dt represents a type of double that size. The base size can be 16, 32 or 64 bits and technically determines the type of implemented arithmetic. Types uint_t and uint_dt represent the equivalent unsigned data types. The original 32-bit size has been changed to 16 bits, and the double-size types accordingly. It was necessary to re-compute Montgomery's constants for 16-bit arithmetic. We found that certain functions must be modified to suit the modified data types for correct NTT and INTT computation. Switching between 16-bit and 32-bit arithmetic is accomplished by two sets of global constants and requires recompilation.

5. EXPERIMENTAL VERIFICATION AND IMPLEMENTATION EFFICIENCY ANALYSIS

We measured the number of CPU cycles necessary for the individual NTT and INTT executions. The RDTSC instruction was used on the Intel platform to measure the number of cycles [24]. We executed 10 000 instances of the NTT and INTT measurements and computed the median and mean of the measured values. Each NTT used randomly generated numbers as input. To get the most valid results we compared the mean and median values. Using the median in the comparison avoids the influence of extreme measured values. We performed separate measurements for input sizes N = 512 and N = 1 024.

5.1. Experimental results on Intel CPU

We performed measurements in the OS Linux Ubuntu 20.04 LTS environment with an Intel Core i7-6700HQ CPU. The obtained numbers of cycles of the NTT and INTT computations are shown in Table 2. The results point to a successful acceleration of the NTT and INTT computations in terms of CPU cycles. The Lazy implementation needs the fewest CPU cycles to compute NTT and INTT. In comparison with the Naive NTT implementation with input size 512, the computation has been accelerated by almost 66%. Montgomery's method is capable of accelerating the NTT computation by 50%.

With this measurement approach we did not confirm that the Constant Time implementation is constant in OS Linux Ubuntu 20.04 LTS.

Table 2 Comparison of NTT and INTT implementations based on average and median number of CPU cycles for input size N = 512 and N = 1024 evaluated for 10 000 experiments on OS Linux in kernel mode.
Speed of various NTT and INTT implementations (all values in CPU cycles)

| Types of NTT and INTT | Statistic | NTT 512 | NTT 1024 | INTT 512 | INTT 1024 |
|---|---|---|---|---|---|
| Lazy | Average | 12 071 | 24 555 | 14 884 | 25 440 |
| Lazy | Median | 10 496 | 22 268 | 12 071 | 22 982 |
| Constant Time | Average | 16 779 | 34 308 | 15 714 | 31 739 |
| Constant Time | Median | 14 918 | 30 213 | 13 800 | 27 892 |
| Naive | Average | 30 489 | 55 652 | 18 601 | 34 527 |
| Naive | Median | 27 158 | 50 269 | 16 624 | 31 018 |

The measured values for the Constant Time implementation did not prove to be constant on the OS Linux platform. For that reason we conducted measurements in the kernel mode of OS Linux (KML). The goal of the measurements in KML was to minimize deviations caused by applications running in the background. The results in KML have shown that the deviations were significantly reduced, but with this way of measurement we still did not prove the Constant Time implementation to be constant.

5.2. Experimental results on embedded CPU

We verified the constant time of NTT and INTT on the MCU STM32F103 with a 32-bit ARM Cortex-M3 core [23]. The verification was performed in the Keil uVision ARM simulator, which provides information about the execution time of instructions. We created modified versions of Scott's NTT and INTT implementation for 32-bit and 16-bit arithmetic. We measured the Naive and Constant Time implementations for each arithmetic to compare the results. We performed the NTT and INTT computations separately for randomly generated input values and for input values equal to 0.

32-bit arithmetic failed to produce a constant NTT and INTT computation time; however, we revealed a few interesting facts. The Constant Time implementation is significantly faster than the Naive implementation. This is caused by the 64-bit division used by the Naive implementation. The 64-bit division runs as a software routine with a non-constant execution time. We performed measurements with input size N = 512 and the results are shown in Table 3.
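The contrast described above between division-based and Montgomery-based modular multiplication can be sketched in a few lines of C. The following is only a minimal illustration under stated assumptions, not the code measured in the paper: it assumes 16-bit base arithmetic with q = 12 289 and R = 2^16, and the names redc_ct, to_mont, modmul_via_montgomery and QINV_NEG, as well as the constant 12287 (derived here as -q^(-1) mod 2^16), are hypothetical. The sign mask is built from an unsigned shift so that the sketch does not depend on how the compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

#define Q        12289   /* modulus q used with N = 1024 in the paper */
#define WL       16      /* word length, so R = 2^WL */
#define QINV_NEG 12287u  /* -q^(-1) mod 2^16; hypothetical name, value derived for this sketch */

/* Montgomery reduction: T * R^(-1) mod q for 0 <= T < q * R, without division. */
static uint16_t redc_ct(uint32_t T)
{
    assert(T < ((uint32_t)Q << WL));                     /* debug-only precondition check */
    uint16_t m = (uint16_t)(T * QINV_NEG);               /* m = T * (-q^(-1)) mod R */
    int32_t v = (int32_t)(((uint32_t)m * Q + T) >> WL);  /* low 16 bits are zero; 0 <= v < 2q */
    v -= Q;                                              /* now -q <= v < q */
    int32_t mask = -(int32_t)((uint32_t)v >> 31);        /* all ones if v < 0, else 0 */
    v += mask & Q;                                       /* add q back without branching */
    return (uint16_t)v;
}

/* Conversion into Montgomery form, a * R mod q. The % here is setup-only and
   not constant time; real code would precompute tables in Montgomery form. */
static uint16_t to_mont(uint16_t a)
{
    return (uint16_t)((((uint32_t)a) << WL) % Q);
}

/* Montgomery multiplication of two values already in Montgomery form. */
static uint16_t modmul_ct(uint16_t a, uint16_t b)
{
    return redc_ct((uint32_t)a * b);
}

/* Division-based multiplication, as in the Naive variant. */
static uint16_t modmul_naive(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) % Q);
}

/* Plain-in, plain-out wrapper: convert, multiply, convert back. */
static uint16_t modmul_via_montgomery(uint16_t a, uint16_t b)
{
    uint16_t r = modmul_ct(to_mont(a), to_mont(b));      /* equals a*b*R mod q */
    return redc_ct(r);                                   /* leave Montgomery form */
}
```

Both multiplication routes return the same residues; they differ only in whether the reduction goes through a division. Whether the compiled redc_ct really runs in a fixed number of cycles depends on the compiler and the target CPU, which is the kind of property the measurements in this section are meant to check.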
Table 3 Implementation comparison based on CPU cycle time for input size N = 512 on MCU using 32-bit arithmetic for NTT and INTT separately for randomly generated inputs and input values equal 0.

Measurement of NTT and INTT on MCU (all values in CPU cycles)

| Implementation | NTT, random inputs | INTT, random inputs | NTT, inputs equal 0 | INTT, inputs equal 0 |
|---|---|---|---|---|
| Naive | 5 587 815 | 6 248 705 | 4 699 422 | 6 133 638 |
| Constant Time | 327 109 | 289 973 | 313 797 | 287 925 |

Table 4 Implementation comparison based on CPU cycle time for input size N = 1 024 on MCU using 16-bit arithmetic for NTT and INTT separately for randomly generated inputs and input values equal 0.

Measurement of NTT and INTT on MCU (all values in CPU cycles)

| Implementation | NTT, random inputs | INTT, random inputs | NTT, inputs equal 0 | INTT, inputs equal 0 |
|---|---|---|---|---|
| Naive | 362 479 | 358 536 | 329 903 | 346 616 |
| Constant Time | 558 264 | 469 205 | 558 264 | 469 205 |

We achieved a constant computation time of NTT and INTT using the Constant Time implementation and 16-bit arithmetic in embedded systems for different input values. The measured values are shown in Table 4.

The Naive implementation has also shown interesting results. 16-bit arithmetic uses hardware support for 32-bit operations. The measured results show an interesting paradox: the Naive implementation provides better results than the Constant Time implementation. This is caused by the 32-bit division operation, which is supported by hardware and is used by the Naive implementation.

By the measurements, we confirmed that the Constant Time implementation leads to a constant computation time for different input values in 16-bit arithmetic on the embedded STM32F103 MCU with a 32-bit ARM Cortex-M3 core.

6. CONCLUSION

We dealt with the number-theoretic transform typically used in post-quantum cryptography. It is possible to improve individual post-quantum algorithms using optimized NTT and INTT transformations. One optimization is the use of the NTT with negacyclic convolution, which can halve the required NTT dimension. The NTT and INTT implementations have been modified in the C language in such a way that it is possible to study the influence of the individual optimizations. Several experimental measurements on different platforms have been executed to determine the effectiveness of the individual optimizations. The results of these measurements have been summarized and analysed. The experiments have confirmed the suitability of the constant time implementation for embedded systems for reaching constant NTT and INTT execution times. If the conditions for constant NTT and INTT execution time are met, the possibility of an attack using a side channel significantly decreases [25].

ACKNOWLEDGMENT

The authors acknowledge the support of the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics".

REFERENCES

[1] LEVICKÝ, D.: 2018. APLIKOVANÁ KRYPTOGRAFIA od utajenia správ ku kybernetickej bezpečnosti. (In Slovak language). Košice: elfa, s.r.o. ISBN: 978-80-8086-265-7.
[2] PAAR, CH. – PELZL, J.: 2010. Understanding Cryptography: a Textbook for Students and Practitioners, Springer. ISBN: 978-3-642-04100-6.
[3] BERNSTEIN, D. J. – BUCHMANN, J. – DAHMEN, E.: 2009. Post-Quantum Cryptography, Springer. ISBN: 978-3-540-88701-0.
[4] DRUTAROVSKÝ, M.: 2017. Kryptografia pre vstavané procesorové systémy. (In Slovak language). Technical University of Košice. ISBN: 978-80-553-2805-8.
[5] NewHope. 2020. NewHope, post-quantum key encapsulation [online]. Available from: https://newhopecrypto.org/.
[6] CRYSTALS. 2020. CRYSTALS. Cryptographic Suite for Algebraic Lattices [online]. Available from: https://pq-crystals.org/kyber/index.shtml.
[7] FALCON. 2017. Falcon, Fast-Fourier Lattice-based Compact Signatures over NTRU [online]. Available from: https://falcon-sign.info/.
[8] CRYSTALS. 2021. CRYSTALS. Cryptographic Suite for Algebraic Lattices. Dilithium Home [online]. Available from: https://pq-crystals.org/dilithium/index.shtml.
[9] NIST. 2020. Status Report on the Second Round of the NIST Post-Quantum Cryptography Standardization Process [online]. Available from: https://nvlpubs.nist.gov/nistpubs/ir/2020/NIST.IR.8309.pdf.
[10] NIST. 2022. NIST Announces First Four Quantum-Resistant Cryptographic Algorithms [online]. Available from: https://www.nist.gov/news-events/news.
[11] LONGA, P. – NAEHRIG, M.: Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography [online]. [cit. 2022-04-09]. Available from: https://eprint.iacr.org/2016/504.pdf.
[12] ZHOU, Y. – FENG, D.: 2012. Side-Channel Attacks: Ten Years After Its Publication and the Impacts on Cryptographic Module Security Testing [online]. [cit. 2022-04-11]. Available from: https://csrc.nist.gov/csrc/media/events/physical-security-testing-workshop/documents/papers/physecpaper19.pdf.
[13] ALLEN, R. L. – MILLS, D. W.: 2003. Signal Analysis: Time, Frequency, Scale and Structure, IEEE Press. ISBN: 978-0-471-23441-8.
[14] COOLEY, J. W. – TUKEY, J. W.: 1965. An Algorithm for the Machine Calculation of Complex Fourier Series [online]. [cit. 2022-03-21]. Mathematics of Computation. Available from: http://web.stanford.edu/class/cme324/classics/cooley-tukey.pdf.
[15] SCHUPP, S.: 2003. Lifting a butterfly – A component-based FFT [online]. [cit. 2022-02-21]. Department of Computer Science, Rensselaer Polytechnic Institute. Available from: https://www.researchgate.net/publication/220060688_Lifting_a_butterfly_-_A_component-based_FFT.
[16] CHU, E. – GEORGE, A.: 2000. Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. CRC Press, Boca Raton, FL, USA. ISBN: 0-8493-0270-6.
[17] LYUBASHEVSKY, V. – PEIKERT, CH. – REGEV, O.: 2013. On Ideal Lattices and Learning with Errors Over Rings [online]. [cit. 2022-01-05]. Available from: https://eprint.iacr.org/2012/230.pdf.
[18] CRANDALL, R. – FAGIN, B.: 1994. Discrete weighted transforms and large-integer arithmetic. Mathematics of Computation 62(205), 305–324.
[19] PÖPPELMANN, T. – ODER, T. – GÜNEYSU, T.: 2015. High-Performance Ideal Lattice-Based Cryptography on 8-bit ATxmega Microcontrollers [online]. [cit. 2022-03-18]. Available from: https://eprint.iacr.org/2015/382.pdf.
[20] SCOTT, M.: 2017. A note on the implementation of the Number Theoretic Transform [online]. [cit. 2022-02-04]. Available from: https://eprint.iacr.org/2017/727.pdf.
[21] MENEZES, A. J. – van OORSCHOT, P. C. – VANSTONE, S. A.: 1996. Handbook of Applied Cryptography, Florida: CRC Press. ISBN: 0-8493-8523-7.
[22] Technotip. 2020. Modulus or Modulo Division In C Programming Language [online]. [cit. 2022-02-19]. Available from: https://www.geeksforgeeks.org/modulo-operator-in-c-cpp-with-examples/.
[23] KEIL, A.: STMicroelectronics STM32F103C8 [online]. Available from: https://www.keil.com/dd2/stmicroelectronics/bluenrg_2/.
[24] PAOLONI, G.: 2010. How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures [online]. [cit. 2022-05-1]. Available from: https://www.intel.com/content/dam/www/public.
[25] KUPCOVÁ, E.: 2021. NTT transformácia v post-kvantovej kryptografii. (In Slovak language). Bachelor thesis. Technical University of Košice.

Received September 12, 2022, accepted December 8, 2022

BIOGRAPHIES

Eva Kupcová was born in Rožňava, Slovakia in 1998. She is currently a master's degree student at the Department of Electronics and Multimedia Telecommunications at the Faculty of Electrical Engineering and Informatics at the Technical University of Košice. She finished her bachelor's degree in 2021 with the final thesis named NTT transform in post-quantum cryptography. She is expected to complete her engineering studies in 2023.

Miloš Drutarovský was born in Slovakia in 1965. He graduated from the Faculty of Electrical Engineering, Technical University of Košice in 1988. He obtained his Ph.D. degree in Radioelectronics from the same faculty in 1995. Currently he is a full professor of Telecommunications at the Faculty of Electrical Engineering and Informatics, Technical University of Košice, Košice, Slovak Republic. His research focuses on applied cryptography, algorithms and architectures for embedded cryptographic devices, digital signal processing, field programmable devices and soft microcontrollers embedded into FPGA circuits.

Number-Theoretic Transform with Constant Time Computation for Embedded Post-Quantum Cryptography



Publisher
de Gruyter
Copyright
© 2022 Eva Kupcová et al., published by Sciendo
ISSN
1335-8243
eISSN
1338-3957
DOI
10.2478/aei-2022-0020

Abstract

Acta Electrotechnica et Informatica, Vol. 22, No. 4, 2022, 30–37, DOI: 10.2478/aei-2022-0020
ISSN 1335-8243 (print), ISSN 1338-3957 (online), www.aei.tuke.sk

NUMBER-THEORETIC TRANSFORM WITH CONSTANT TIME COMPUTATION FOR EMBEDDED POST-QUANTUM CRYPTOGRAPHY

Eva KUPCOVÁ, Miloš DRUTAROVSKÝ
Department of Electronics and Multimedia Telecommunications, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic, E-mails: eva.kupcova@student.tuke.sk, milos.drutarovsky@tuke.sk

ABSTRACT
In this article, we describe the principles and advantages of using the Number-Theoretic Transform (NTT) in post-quantum cryptography. We deal with the use of NTT in post-quantum algorithms included in the competition announced by the National Institute of Standards and Technology. Attention is paid to the fast multiplication of polynomials using NTT and negacyclic convolution. We also take an existing implementation of NTT and modify it in order to analyze the effectiveness of the individual modifications. Separate attention is paid to the Constant Time implementation of NTT, because a constant computation time of the transformation reduces the possibility of a side-channel attack. We describe measurements performed in the OS Linux Ubuntu 20.04 LTS environment in Linux kernel mode, with the greatest attention paid to measurements executed on a microcontroller with a 32-bit ARM core. Measurements on the microcontroller are done using 32-bit and 16-bit arithmetic to demonstrate how to achieve a constant computation time of the transformation. We present the results and analysis of measurements performed using the modified implementations.

Keywords: NTT transformation, post-quantum cryptography, negacyclic convolution, microcontroller, Montgomery reduction

1. INTRODUCTION

Cryptography is an area of mathematics and computer science that aims to protect the information content of a secret message using encryption methods but does not conceal the existence of this message. Modern cryptography is developing dynamically with the development of computer technology. It is connected with the electronic form of communication and sending of information [1, 2].

In recent years, research has also focused on the development of quantum computers: machines that use quantum mechanical phenomena to efficiently solve mathematical problems that are currently very difficult or unsolvable by conventional computers. If quantum computers are constructed on a large scale, this will compromise the security of many commonly used cryptographic algorithms. Quantum computers could break many current public-key cryptosystems in a very short time, including RSA (Rivest-Shamir-Adleman Algorithm), DSA (Digital Signature Algorithm) and ECC (Elliptic Curve Cryptography) [3, 4]. These algorithms play a crucial role in ensuring the confidentiality and authenticity of communications on the internet and other networks.

The selection of suitable post-quantum algorithms is handled by the National Institute of Standards and Technology (NIST, https://www.nist.gov). The process of selecting and standardizing suitable algorithms takes several years, and it takes another few years before an algorithm can be considered safe and reliable. An algorithm gains reliability by going through a cryptanalysis process and being examined by several experts in the field, and it must also be able to withstand the various proposed attacks. Therefore, NIST has decided to find optimal post-quantum algorithms for the future, which will serve as a replacement for several cryptosystems. In 2016, NIST initiated the process of developing and standardizing one or several post-quantum algorithms for public key encapsulation mechanisms and digital signatures, like NewHope [5], Kyber [6], Falcon [7] or Dilithium [8]. The goal of the new cryptographic algorithms is to be able to protect sensitive government information, even after quantum computers are widely used [9]. In July 2022, NIST selected the Kyber algorithm for the key encapsulation mechanism because of its small keys and speed of operation. Dilithium and Falcon have been selected for digital signatures because of their high efficiency [10].

Some post-quantum algorithms like NewHope, Kyber, Titanium and others use the Number Theoretic Transform (NTT) to multiply big-size polynomials in their implementations [9]. We perform experimental tests to determine the efficiency of NTT for the mentioned post-quantum algorithms. By optimizing NTT, it is possible to achieve an overall improvement of a given post-quantum algorithm. In some NTT implementations, negacyclic convolution [11] is used for optimization. We describe three specific implementations of NTT and the results they achieve. The goal of the experimentally created implementations is to achieve a constant time of NTT computation in terms of executed CPU cycles.

Side channels [12] change the overall view of system security. At present, it is no longer enough just to choose mathematically strong encryption; it is also necessary to pay attention to its implementation. Often we do not know about an unwanted side channel existing in the implementation. Therefore, achieving a constant computation time of NTT is very important, as it reduces the possibility of deploying some attacks that use a side channel.

The paper is structured in the following way. The second section describes the NTT itself, its mathematical foundations, principles and usage. The third section describes improvements in NTT computation. It mentions three types of convolution, and separate attention is paid to the connection between NTT and negacyclic convolution. The fourth section describes the procedure of the experiment, which modifies the existing NTT implementation. The individual optimizations used in a given implementation and the method of their modification are listed. The last section describes how to measure the effectiveness of the individual NTT implementations. The results of the individual measurements are presented together with their analysis. The conclusion summarizes the topic of the article and our achieved results.

2. NUMBER-THEORETIC TRANSFORM

NTT is a generalization of the Discrete Fourier Transform (DFT) [13]. The NTT structure is very similar to DFT but is defined over the finite field (Galois field) $GF(q)$, where $q$ is a prime number. NTT uses modular arithmetic. The basic intent of using modular arithmetic is to provide closure in operations between elements $a, b \in GF(q)$. Closure means that the results of operations are in the same finite field [4]. NTT allows fast convolution to be performed on integer sequences without any rounding errors and is also used to multiply big-size polynomials (e.g. size 512 and 1024 from Table 1). It is widely used in computer arithmetic and cryptography. The main advantage of the fast NTT is that an efficient algorithm reduces the computational complexity from $O(N^2)$ to $O(N \log N)$, where $N$ is the size of the NTT.

To use NTT it is necessary to choose a suitable prime number $q = kN + 1$, where the size of the multiplicative group $G_\varphi$ is $\varphi(q) = q - 1 = kN$ and $N$ is a power of 2, and to specify $g$, a $(q-1)$-th primitive root of $G_\varphi$. The primitive root $g$ is a number for which $g^{q-1} \equiv 1 \pmod{q}$ and $g^{t} \not\equiv 1 \pmod{q}$ for $t = 1, 2, \ldots, q-2$ [11].

Using the primitive root $g$ it is possible to compute the $N$-th primitive root $\omega_N$:

$$\omega_N = g^{k} \pmod{q} \quad\text{and}\quad \omega_N^{N} = g^{kN} = g^{\varphi(q)} \equiv 1 \pmod{q}. \tag{1}$$

For a given $N$-th primitive root $\omega_N$ in the finite field, NTT is defined as:

$$X[i] = \sum_{j=0}^{N-1} x[j]\,\omega_N^{ij} \pmod{q}, \quad\text{where } i = 0, 1, \ldots, N-1, \tag{2}$$

and it can be described using the transformation matrix $A_N$, whose values in the $i$-th row determine the coefficients by which the input vector $x = [x[0], x[1], \ldots, x[N-1]]$ is multiplied when computing the $i$-th output $X[i]$ of the NTT:

$$A_N = \begin{bmatrix} \omega_N^{0} & \omega_N^{0} & \omega_N^{0} & \cdots & \omega_N^{0} \\ \omega_N^{0} & \omega_N^{1} & \omega_N^{2} & \cdots & \omega_N^{N-1} \\ \omega_N^{0} & \omega_N^{2} & \omega_N^{4} & \cdots & \omega_N^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \omega_N^{0} & \omega_N^{N-1} & \omega_N^{2(N-1)} & \cdots & \omega_N^{(N-1)(N-1)} \end{bmatrix}. \tag{3}$$

The inverse NTT (INTT) is defined as:

$$x[i] = N^{-1} \sum_{j=0}^{N-1} X[j]\,\omega_N^{-ij} \pmod{q}, \quad\text{where } i = 0, 1, \ldots, N-1, \tag{4}$$

and it can be described using the inverse transformation matrix $A_N^{-1}$, whose values in the $i$-th row determine the coefficients by which the input vector $X = [X[0], X[1], \ldots, X[N-1]]$ is multiplied when computing the $i$-th output $x[i]$ of the INTT:

$$A_N^{-1} = N^{-1} \begin{bmatrix} \omega_N^{-0} & \omega_N^{-0} & \omega_N^{-0} & \cdots & \omega_N^{-0} \\ \omega_N^{-0} & \omega_N^{-1} & \omega_N^{-2} & \cdots & \omega_N^{-(N-1)} \\ \omega_N^{-0} & \omega_N^{-2} & \omega_N^{-4} & \cdots & \omega_N^{-2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \omega_N^{-0} & \omega_N^{-(N-1)} & \omega_N^{-2(N-1)} & \cdots & \omega_N^{-(N-1)(N-1)} \end{bmatrix}. \tag{5}$$

NTT defined over a finite field can be implemented very efficiently if suitable parameters $(N, q, \omega_N)$ are used for the computation. As an illustration, Table 1 lists typical parameters that are widely used in some post-quantum algorithms for NTT computation.
Table 1 Parameters of post-quantum algorithms for NTT computation with specific input size N.

Post-quantum algorithm    N       q            ω
Dilithium [8]             256     8 380 417    80
Falcon [7]                1024    12 277       49
Kyber512 [6]              512     12 289       49
Kyber1024 [6]             1024    12 289       7
NewHope512 [5]            512     12 265       3
NewHope1024 [5]           1024    12 277       49

3. NEGACYCLIC CONVOLUTION IN CONTEXT OF NTT COMPUTATION

Polynomials are widely used in cryptography, but with increasing polynomial size the time required to perform operations (especially multiplication) also increases. The multiplication of two polynomials can be interpreted as a convolution between the coefficients of the polynomials. This approach is relevant for polynomial multiplication using the fast NTT, which is built on the same principles as the Fast Fourier Transform (FFT) [16].

The whole NTT computation is broken down into butterflies, which are used in two basic forms of DFT computation. A butterfly is a portion of the computation that combines the results of smaller discrete Fourier transforms (DFTs) into a larger DFT, or vice versa (breaking a larger DFT up into sub-transforms). The name "butterfly" comes from the shape of the data-flow diagram in the radix-2 case. The NTT computations are given in the form of a DIF (Decimation in Frequency) structure described by the Gentleman-Sande algorithm [15] and a DIT (Decimation in Time) structure described by the Cooley-Tukey algorithm [14], which have multiple forms of butterflies [16].

3.1. Different forms of convolution

Consider vector $a = [a[0], a[1], \ldots, a[N-1]]$ and vector $b = [b[0], b[1], \ldots, b[N-1]]$ of length $N$. The linear convolution of the vectors $a$ and $b$ has length $2N - 1$ and is the vector $\tilde{z} = [\tilde{z}[0], \tilde{z}[1], \ldots, \tilde{z}[2N-2]]$, where $\tilde{z}$ can be computed as [16]:

$$\tilde{z}[i] = \sum_{n+m=i} a[n]\,b[m]. \tag{6}$$

Cyclic convolution of two vectors $a$ and $b$ is defined as:

$$z = [z[0], z[1], \ldots, z[N-1]], \tag{7}$$

where $z[i] = \tilde{z}[i] + \tilde{z}[i+N]$ and $i = 0, 1, \ldots, N-1$.

Negacyclic convolution of two vectors $a$ and $b$ is defined as:

$$y = [y[0], y[1], \ldots, y[N-1]], \tag{8}$$

where $y[i] = \tilde{z}[i] - \tilde{z}[i+N]$ and $i = 0, 1, \ldots, N-1$. In both (7) and (8), the term $\tilde{z}[2N-1]$, which lies beyond the end of the linear convolution, is taken as zero.

3.2. Fast polynomial multiplication by using negacyclic convolution

In cryptography, polynomial multiplication is usually defined in the ring $R_q = \mathbb{Z}_q[x]/\langle x^N + 1 \rangle$, i.e. modulo $x^N + 1$, where $N$ is a power of 2 [17]. Negacyclic convolution computed using NTT is especially used for fast multiplication of big-size polynomials.

Consider a polynomial $y(x) = \sum_{i=0}^{N-1} y[i]\,x^{i}$, which can be represented as the vector $y = [y[0], y[1], \ldots, y[N-1]]$. Let $a = [a[0], a[1], \ldots, a[N-1]]$ and $b = [b[0], b[1], \ldots, b[N-1]]$ correspond to polynomials $a(x)$ and $b(x)$ of length $N$ with elements in $\mathbb{Z}_q$, let $\omega_N \in \mathbb{Z}_q$ be the $N$-th primitive root and let $\psi_{2N} = \sqrt{\omega_N}$ be the $2N$-th primitive root.

The negacyclic convolution of $a$ and $b$ is defined as $y = [y[0], y[1], \ldots, y[N-1]]$, where $y(x) = a(x) \cdot b(x) \bmod (x^N + 1)$. Consider $\hat{a} = [a[0], \psi_{2N}\,a[1], \ldots, \psi_{2N}^{N-1}\,a[N-1]]$ and $\hat{b} = [b[0], \psi_{2N}\,b[1], \ldots, \psi_{2N}^{N-1}\,b[N-1]]$. The final fast negacyclic convolution is computed as [18]:

$$y = [1, \psi_{2N}^{-1}, \psi_{2N}^{-2}, \ldots, \psi_{2N}^{-(N-1)}] \circ \mathrm{INTT}(\mathrm{NTT}(\hat{a}) \circ \mathrm{NTT}(\hat{b})), \tag{9}$$

where $\circ$ denotes component-wise multiplication and $\psi_{2N}^{-1}$ is the inverse of the element $\psi_{2N}$. This operation enables fast polynomial multiplication, but it is necessary to keep in mind that the result is reduced modulo $x^N + 1$.

The main idea is that the sequence entering the NTT must be pre-multiplied by powers of $\psi_{2N}$ and the output sequence from the INTT is then post-multiplied by powers of $\psi_{2N}^{-1}$, as shown in Figure 1.

Fig. 1 Fast computation of negacyclic convolution using NTT with dimension N over finite field GF(q).

The NTT-based negacyclic convolution is advantageous over linear convolution, as appending zeros is not required for negacyclic convolution.
It means that the negacyclic convolution is able to reduce the re- We customized Michael Scott’s implementations of quired NTT size by half [18, 19]. NTT and INTT in C programming language to our needs for experimental purposes. He describes several ways of NTT optimization in [20], which were used in our experi- http://indigo.ie/ mscott/ntt_ref.c ISSN 1335-8243 (print) ISSN 1338-3957 (online), www.aei.tuke.sk Acta Electrotechnica et Informatica, Vol. 22, No. 4, 2022 33 ments. In one of the cases, the aim is to reach a constant Extract 4 Lazy INTT optimalization NTT and INTT computation time for different input data. U=x[j]; One of the well-known optimizations which are used is V=x[j+t]; Montgomery’s reduction [21]. Our experiment was based x[j]=U+V; on consecutive elimination of respective optimizations re- W=U+n*q-V x[j+t]= modmul (W,S); sulting in three different NTT implementations: Lazy, Con- stant Time and Naive. Scott’s NTT implementation uses well known 4.2. Constant Time implementation of NTT Cooley-Tukey algorithm and INTT uses the Gentleman- Sande algorithm [20]. The usage of these algorithms can Constant Time implementation uses Montgomery’s be considered the first stage of optimization. The Naive reduction [20, 21]. Montgomery reduction is a technique implementation uses previously noted algorithms without that allows a more efficient implementation of modular other optimizations. The core of individual optimizations multiplication. It is a method of reduction T modulo q of is NTT butterfly code improvement in the innermost loops positive integer. Another integer M is needed M > q, for of algorithms mentioned above. which gcd(q, M) = 1, where gcd is greatest common di- visor. Montgomery describes a method for computation − 1 T M mod q without the use of other standard reduction algorithms. If M is chosen correctly, the computation can 4.1. Lazy implementation of NTT be very efficient. 
The Montgomery reduction is shown in The final implementation contains the optimization the Algorithm 1 [21]. called Lazy. The main idea behind Lazy optimization is to Algorithm 1 Montgomery’s reduction modify function redc, which is shown in Extract 1 [20]. Input: whole numbers q = (q ... q q ) n− 1 1 0 b Constants defined for the computation of NTT and INTT n ′ − 1 with gcd(q, b) = 1, M = b , q = − q mod q, with input sizes N = 512 and N = 1 024 are needed to and T = (t ... t t ) < qM. 2n− 1 1 0 b get correct computation results. We chose mentioned in- − 1 Output: T M mod q put size values because are typicall used in post-quantum 1. A ← T. Note: A = (a ... a a ) . algorithms as mentioned in section 2 in Table 1. The imple- 2n− 1 1 0 b 2. f or i ← 0; i < n; i ← i + 1 mentation also contains modmul, a function which is shown 2.1 u ← a q mod b in Extract 2. i i 2.2 A ← A + u qb 3. A ← A/b . 4. i f A ≥ q Extract 1 Modular reduction function 4.1 A ← A − q int_t redc ( int_dt T) { 5. ret urn A uint_t m=( uint_t )T*N; int_t V=( int_dt )((( uint_dt )m*q+T)>>WL; In NTT implementation, Montgomery’s optimization return V; is used to perform modular multiplication without using the division operation. Using hardware-optimized Central Processing Unit (CPU) architecture, where multiplication is constant in time and usually takes 1 - 2 CPU cycles, Extract 2 Modular multiplication function NTT and INTT implementations can be executed in con- stant time. The requirement to transform NTT input data int_t modmul ( int_t a, int_t b) { return redc (( int_dt )a*b); into Montgomery’s space and output data back may be con- sidered a drawback. In the same way as the previous Lazy implementation in section 4.1, this implementation refers to butterfly code Optimalizations of inner butterfly loop in Cooley- optimization for NTT and INTT computation. 
Implemen- Tukey NTT code implementation and Gentleman-Sande tation can be accelerated by pre-computing NTT and INTT INTT algorithm code use different principles. Hence it is ω tables in Montgomery’s form. Modified redc function necessary to modify individually NTT and INTT butterfly is shown in Extract 5. The change doesn’t concern modmul codes to reach their optimalizations. Optimized butterfly function. Modified butterfly code for NTT is shown in Ex- code Lazy for NTT is shown in Extract 3 and optimized tract 6, and INTT modification is shown in Extract 7. butterfly code for INTT is shown in Extract 4. Extract 5 Redc function Constant Time implementation int_t redc ( int_dt T) { Extract 3 Lazy NTT optimalization uint_t m=( uint_t )T*N; int_t V =(( uint_dt )m*q+T)>>WL; U=x[j]; V -=q; V= modmul (x[j+t],S); V +=(V >>(WL -1))& q; x[j]=U+V; return V;} x[j+t]=U+2*q-V; ISSN 1335-8243 (print) ISSN 1338-3957 (online), www.aei.tuke.sk 34 Number-Theoretic Transform with Constant Time Computation ..... 4.4. Implementations in 16-bit arithmetic modification Extract 6 Montgomery’s NTT optimalization U=x[j]; Our customized implementation measurements have V= modmul (x[j+t],S); been evaluated over different platforms. Amongst other W=U+V-q; x[j]=W +((W >>(WL -1))& q); measurements some have been tested on microcontroller W=U-V; (MCU) STM32F103 with 32-bit core ARM Cortex-M3 x[j+t]=W+(W >>(WL -1))& q); [23]. To save MCU’s memory, we customized Scott’s im- plementation in such way that it would support 16-bit arith- metic. During NTT and INTT measurements for input size N = 1 024, for which the input size modulo q = 12 289, 16- Extract 7 Montgomery’s INTT optimalization bit arithmetic is sufficient. We used same values as men- U=x[j]; tioned in section 2 Table 1. This way we saved a significant V=x[j+t]; W=U+V-q; portion of MCU memory. 
x[j]=W +((W >>(WL -1))& q); Scott’s implementation defines data types for num- W=U+q-V; x[j+t]= modmul (W,S); ber values with operators int_t and int_dt, where int_t represents the base value, int_dt represents double of that value size. The base value size can be 16, 32 or 64 bits, and technically describes the type of implemented arith- metic. Types uint_t and uint_dt represent equivalent data types for number values. The original 32-bit size has 4.3. Naive implementation of NTT transformation been modified into 16 bits as well as their double values. It was necessary to re-compute Montgomery’s constants for Last modification is the removal of Montgomery’s 16-bit arithmetic. We found that it is necessary to modify method, which results in Naive implementation. Cooley- certain functions to better suit the modified data types for Tukey pseudocode in programming language C [20] is the the correct NTT and INTT computations. Switching be- bases for Naive NTT. All instances of mod q have been re- tween 16 and 32-bit arithmetic is accomplished by two sets placed by modulo q (% operator), which in C language rep- of global constants and is required to recompilation. resents modulo operation (whole number division remain- der [22]). The base for INTT is Gentleman-Sande algorithm 5. EXPERIMENTAL VERIFICATION AND IMPLE- [20]. In C language mod q is replaced by % operation. By MENTATION EFFECTIVITY ANALYSIS contrast to the previous implementations this one does not use redc function and the function for modular multipli- We performed measurement of CPU cycle count neces- cation is in its base form. Modmul function used in Naive sary for individual NTT and INTT execution. The RTDSC implementation is shown in Extract 8. Butterfly code for instruction was used on the Intel platform to measure the NTT computation is shown in Extract 9. Extract 10 con- number of cycles [24]. We executed 10 000 instances of tains butterfly code for INTT implementation. 
NTT and INTT measurements and computed median and mean of the measured values. Each NTT used randomly generated numbers as input. To get the most valid results we compared the mean and median values. Using the me- Extract 8 Modular multiplication in Naive implementation dian value during the comparison we avoided influence by extreme measurement values. We performed separate mea- int_t modmul ( int_t a, int_t b) { surements for input value N = 512 and N = 1 024. return ( int_t )((( int_dt )a*b)%q );} 5.1. Experimental results on Intel CPU Extract 9 Naive NTT optimalization We performed measurements in the environment OS U=x[j]; Linux Ubuntu version 20.04 LTS with CPU Intel Core I7- V= modmul (x[j+t],S); 6700HQ. Obtained number of cycles NTT and INTT com- x[j ]=(U+V)%q; x[j+t ]=( U+q-V)%q; putations are shown in Table 2. The results point to a suc- cessful acceleration of NTT and INTT computations con- cerning CPU cycles. The Lazy implementation has a min- imal CPU cycles necessary to compute NTT and INTT. Extract 10 Naive INTT optimalization In comparison with Naive NTT implementation with in- put size 512, the speed rate has been accelerated by al- U=x[j]; V=x[j+t]; most 66%. Montgomery’s method is capable of acceler- x[j ]=(U+V)%q; ating NTT computation by 50%. W=(U+q-V); x[j+t]= modmul (W,S); By this measurement approach we did not confirm that Constant Time is constant in OS Linux, Ubuntu 20.04 LTS. ISSN 1335-8243 (print) ISSN 1338-3957 (online), www.aei.tuke.sk Acta Electrotechnica et Informatica, Vol. 22, No. 4, 2022 35 Table 2 Comparison of NTT and INTT implementations based on average and median number of CPU cycles for input size N = 512 and N = 1024 evaluated for 10 000 experiments on OS Linux in kernel mode. 
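The reason for reporting both statistics can be made concrete with a small sketch of how the mean and median of a series of cycle counts might be computed. This is an assumption-laden illustration, not the measurement code used here: the function names `cycles_mean` and `cycles_median` are hypothetical, the reading of the cycle counter itself is omitted, and the sample values in the usage note are made up to show the effect of one outlier.

```c
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two 64-bit cycle counts, for qsort(). */
int cmp_u64(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa;
    uint64_t b = *(const uint64_t *)pb;
    return (a > b) - (a < b);
}

/* Arithmetic mean of n samples (n > 0). */
double cycles_mean(const uint64_t *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}

/* Median of n samples (n > 0); sorts the array in place. */
uint64_t cycles_median(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    if (n % 2 == 1)
        return samples[n / 2];
    /* even count: average of the two middle samples, rounded down */
    return (samples[n / 2 - 1] + samples[n / 2]) / 2;
}
```

For the hypothetical samples 100, 102, 101, 5000 and 99, a single disturbed run pulls the mean up to 1080.4 cycles while the median stays at 101, which is why the median is the more robust figure when the operating system occasionally interrupts a measurement.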
Table 2 Comparison of NTT and INTT implementations based on average and median number of CPU cycles for input size N = 512 and N = 1024 evaluated for 10 000 experiments on OS Linux in kernel mode.

Speed of various NTT and INTT implementations (all values in CPU cycles)

Type of NTT and INTT    Statistic    NTT 512    NTT 1024    INTT 512    INTT 1024
Lazy                    Average      12 071     24 555      14 884      25 440
                        Median       10 496     22 268      12 071      22 982
Constant Time           Average      16 779     34 308      15 714      31 739
                        Median       14 918     30 213      13 800      27 892
Naive                   Average      30 489     55 652      18 601      34 527
                        Median       27 158     50 269      16 624      31 018

Measured values for the Constant Time implementation did not prove to be constant on the OS Linux platform. For that reason we conducted measurements in kernel mode of OS Linux (KML). The goal of the measurements in KML was to minimize deviations caused by applications running in the background. The results in KML have shown that the deviations were significantly reduced, but even with this way of measurement we did not prove the Constant Time implementation to be constant.

5.2. Experimental results on embedded CPU

We verified the constant time of NTT and INTT on the MCU STM32F103 with a 32-bit ARM Cortex-M3 core [23]. The verification was performed in Keil uVision's ARM simulator, which provides information about the execution time of instructions. We created modified versions of Scott's NTT and INTT implementation for 32-bit and 16-bit arithmetic. We measured the Naive and Constant Time implementations for each arithmetic to compare the results. We performed the NTT and INTT computations separately for randomly generated input values and for input values equal to 0.

32-bit arithmetic failed to produce a constant NTT and INTT computation time; however, we revealed a few interesting facts. The Constant Time implementation is significantly faster than the Naive implementation. This is caused by the 64-bit division used by the Naive implementation. 64-bit division runs as a software routine with non-constant execution time. We performed measurements with input size N = 512 and the results are shown in Table 3.
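The difference between the two modular multiplications compared above can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce Scott's code or the measured firmware: it fixes $q = 12\,289$ and $M = 2^{16}$, uses unsigned 32-bit intermediates, and the names `naive_modmul`, `mont_redc`, `to_mont` and `mont_modmul` are hypothetical. The constant 12287 equals $-q^{-1} \bmod 2^{16}$ because $12289 \cdot 12287 = 12288^2 - 1 \equiv -1 \pmod{2^{16}}$.

```c
#include <stdint.h>

#define MQ          12289u   /* modulus q (as in Table 1)               */
#define MQ_NEG_INV  12287u   /* -q^-1 mod 2^16 (see the lead-in above)  */

/* Naive variant: relies on the % operator, i.e. on a division. */
uint32_t naive_modmul(uint32_t a, uint32_t b)
{
    return (a * b) % MQ;                      /* a, b < q, so a*b fits in 32 bits */
}

/* Montgomery reduction: returns T * 2^-16 mod q for 0 <= T < q * 2^16.
 * Uses only multiplications, shifts, masks and additions; no branch, no division. */
uint32_t mont_redc(uint32_t T)
{
    uint32_t m = (T * MQ_NEG_INV) & 0xFFFFu;  /* m = T * q' mod 2^16              */
    uint32_t t = (T + m * MQ) >> 16;          /* exact division by 2^16, t < 2q   */
    uint32_t d = t - MQ;                      /* wraps around when t < q          */
    uint32_t mask = 0u - (d >> 31);           /* all ones if t < q, else zero     */
    return d + (mask & MQ);                   /* branch-free conditional add of q */
}

/* One-time conversion of a constant (e.g. a twiddle factor) into Montgomery form.
 * It may use % because it is done during table precomputation, not in the transform. */
uint32_t to_mont(uint32_t b)
{
    return (uint32_t)(((uint64_t)b << 16) % MQ);
}

/* a * b mod q, where b_mont = to_mont(b) was precomputed. */
uint32_t mont_modmul(uint32_t a, uint32_t b_mont)
{
    return mont_redc(a * b_mont);
}
```

For every pair of operands below $q$, `mont_modmul(a, to_mont(b))` returns the same value as `naive_modmul(a, b)`. The Montgomery path executes the same instruction sequence for every input, whereas the running time of the `%` in the naive path depends on how the target implements division.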
Table 3 Implementation comparison based on CPU cycle count for input size N = 512 on MCU using 32-bit arithmetic for NTT and INTT separately for randomly generated inputs and input values equal 0.

Measurement of NTT and INTT on MCU (all values in CPU cycles)

                  Randomly generated input values    Input values equal 0
Implementation    NTT           INTT                 NTT           INTT
Naive             5 587 815     6 248 705            4 699 422     6 133 638
Constant Time     327 109       289 973              313 797       287 925

We achieved a constant computation time of NTT and INTT using the Constant Time implementation and 16-bit arithmetic in embedded systems for different input values. The measured values are shown in Table 4.

Table 4 Implementation comparison based on CPU cycle count for input size N = 1 024 on MCU using 16-bit arithmetic for NTT and INTT separately for randomly generated inputs and input values equal 0.

Measurement of NTT and INTT on MCU (all values in CPU cycles)

                  Randomly generated input values    Input values equal 0
Implementation    NTT           INTT                 NTT           INTT
Naive             362 479       358 536              329 903       346 616
Constant Time     558 264       469 205              558 264       469 205

The Naive implementation has also shown interesting results. 16-bit arithmetic uses hardware support for 32-bit operations. The measured results show an interesting paradox: the Naive implementation provides better results than the Constant Time implementation. This is caused by the 32-bit division used within the Naive implementation, which is supported in hardware.

By these measurements, we confirmed that the Constant Time implementation leads to a constant computation time for different input values in 16-bit arithmetic on the embedded STM32F103 MCU with a 32-bit ARM Cortex-M3 core.

6. CONCLUSION

We deal with the number-theoretic transform typically used in post-quantum cryptography. It is possible to improve individual post-quantum algorithms using optimized NTT and INTT transformations. One optimization is the use of the NTT with negacyclic convolution, which can halve the required NTT dimension. NTT and INTT implementations have been modified in the C language in such a way that it is possible to study the influence of individual optimizations. Several experimental measurements on different platforms have been executed to determine the effectiveness of the individual optimizations. The results of these measurements have been summarized and analysed. The experiments have confirmed the suitability of the constant time implementation for embedded systems for reaching constant NTT and INTT execution times. If the conditions for constant NTT and INTT execution time are met, the possibility of an attack using a side channel significantly decreases [25].

ACKNOWLEDGMENT

The authors acknowledge the support of the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics".

REFERENCES

[1] LEVICKÝ, D.: 2018. APLIKOVANÁ KRYPTOGRAFIA od utajenia správ ku kybernetickej bezpečnosti. (In Slovak language). Košice: elfa, s.r.o. ISBN: 978-80-8086-265-7.
[2] PAAR, CH. – PELZL, J.: 2010. Understanding Cryptography: a Textbook for Students and Practitioners, Springer. ISBN 978-3-642-04100-6.
[3] BERNSTEIN, D. J. – BUCHMANN, J. – DAHMEN, E.: 2009. Post-Quantum Cryptography, Springer. ISBN: 978-3-540-88701-0.
[4] DRUTAROVSKÝ, M.: 2017. Kryptografia pre vstavané procesorové systémy. (In Slovak language). Technical University of Košice. ISBN: 978-80-553-2805-8.
[5] NewHope. 2020. NewHope, post-quantum key encapsulation. [online]. Available from: https://newhopecrypto.org/.
[6] CRYSTALS. 2020. CRYSTALS. Cryptographic Suite for Algebraic Lattices. [online]. Available from: https://pq-crystals.org/kyber/index.shtml.
[7] FALCON. 2017. Falcon, Fast-Fourier Lattice-based Compact Signatures over NTRU. [online]. Available from: https://falcon-sign.info/.
[8] CRYSTALS. 2021. CRYSTALS. Cryptographic Suite for Algebraic Lattices. Dilithium Home. [online]. Available from: https://pq-crystals.org/dilithium/index.shtml.
[9] NIST. 2020. Status Report on the Second Round of the NIST Post-Quantum Cryptography Standardization Process [online]. Available from: https://nvlpubs.nist.gov/nistpubs/ir/2020/NIST.IR.8309.pdf.
[10] NIST. 2022. NIST Announces First Four Quantum-Resistant Cryptographic Algorithms [online]. Available from: https://www.nist.gov/news-events/news.
[11] LONGA, P. – NAEHRIG, M.: Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography [online]. [cit. 2022-04-09]. Available from: https://eprint.iacr.org/2016/504.pdf.
[12] ZHOU, Y. – FENG, D.: 2012. Side-Channel Attacks: Ten Years After Its Publication and the Impacts on Cryptographic Module Security Testing [online]. [cit. 2022-04-11]. Available from: https://csrc.nist.gov/csrc/media/events/physical-security-testing-workshop/documents/papers/physecpaper19.pdf.
[13] ALLEN, R. L. – MILLS, D. W.: 2003. Signal Analysis: Time, Frequency, Scale and Structure, IEEE Press. ISBN: 978-0-471-23441-8.
[14] COOLEY, W. J. – TUKEY, J. W.: 1965. An Algorithm for the Machine Calculation of Complex Fourier Series [online]. [cit. 2022-03-21]. Mathematics of Computation. Available from: http://web.stanford.edu/class/cme324/classics/cooley-tukey.pdf
[15] SCHUPP, S.: 2003. Lifting a butterfly – A component-based FFT [online]. [cit. 2022-02-21]. Department of Computer Science, Rensselaer Polytechnic Institute. Available from: https://www.researchgate.net/publication/220060688_Lifting_a_butterfly_-_A_component-based_FFT.
[16] CHU, E. – GEORGE, A.: 2000. Inside the FFT Black Box Serial and Parallel Fast Fourier Transform Algorithms. CRC Press, Boca Raton, FL, USA. ISBN: 0-8493-0270-6.
[17] LYUBASHEVSKY, V. – PEIKERT, CH. – REGEV, O.: 2013. On Ideal Lattices and Learning with Errors Over Rings [online]. [cit. 2022-01-05]. Available from: https://eprint.iacr.org/2012/230.pdf.
[18] CRANDALL, R. – FAGIN, B.: 1994. Discrete weighted transforms and large-integer arithmetic. Mathematics of Computation 62(205), 305–324.
[19] PÖPPELMANN, T. – ODER, T. – GÜNEYSU, T.: 2015. High-Performance Ideal Lattice-Based Cryptography on 8-bit ATxmega Microcontrollers [online]. [cit. 2022-03-18]. Available from: https://eprint.iacr.org/2015/382.pdf.
[20] SCOTT, M.: 2017. A note on the implementation of the Number Theoretic Transform [online]. [cit. 2022-02-04]. Available from: https://eprint.iacr.org/2017/727.pdf.
[21] MENEZES, A. J. – van OORSCHOT, P. C. – VANSTONE, S. A.: 1996. Handbook of Applied Cryptography, Florida: CRC Press, ISBN: 0-8493-8523-7.
[22] Technotip. 2020. Modulus or Modulo Division In C Programming Language [online]. [cit. 2022-02-19]. Available from: https://www.geeksforgeeks.org/modulo-operator-in-c-cpp-with-examples/.
[23] KEIL, A.: STMicroelectronics STM32F103C8 [online]. Available from: https://www.keil.com/dd2/stmicroelectronics/bluenrg_2/.
[24] PAOLONI, G.: 2010. How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures [online]. [cit. 2022-05-1]. Available from: https://www.intel.com/content/dam/www/public.
[25] KUPCOVÁ, E.: 2021. NTT transformácia v post-kvantovej kryptografii. (In Slovak language). Bachelor thesis. Technical University of Košice.

Received September 12, 2022, accepted December 8, 2022

BIOGRAPHIES

Eva Kupcová was born in Rožňava, Slovakia in 1998. She is currently a master's degree student at the Department of Electronics and Multimedia Telecommunications at the Faculty of Electrical Engineering and Informatics at the Technical University of Košice. She finished her bachelor's degree in 2021 with the final thesis named NTT transform in post-quantum cryptography. She is expected to complete her engineering studies in 2023.

Miloš Drutarovský was born in Slovakia in 1965. He graduated from the Faculty of Electrical Engineering, Technical University of Košice in 1988. He obtained his Ph.D. degree in Radioelectronics from the same faculty in 1995. Currently he is a full professor of Telecommunications at the Faculty of Electrical Engineering and Informatics, Technical University of Košice, Košice, Slovak Republic. His research focuses on applied cryptography, algorithms and architectures for embedded cryptographic devices, digital signal processing, field programmable devices and soft microcontrollers embedded into FPGA circuits.

Journal

Acta Electrotechnica et Informatica, de Gruyter

Published: Dec 1, 2022

Keywords: NTT transformation; post-quantum cryptography; negacyclic convolution; microcontroller; Montgomery reduction
