From: olson@umbc.edu (Bryan G. Olson) Newsgroups: sci.crypt Subject: A Quick IDEA, was: Speed of DES/IDEA implementations Date: 7 Dec 1993 21:49:41 -0500 A while ago I posted a message claiming a speed of 238,000 bytes/sec for an implementation of IDEA on a 33Mh 486. Below is an explanation and some code to show how it works. The basic trick should be useful on many (but not all) processors. I expect only those familiar with IDEA and its reference implementation will be able to follow the discussion. See: Lai, Xueja and Massey, James L. A Proposal for a New Block Encryption Standard, Eurocrypt 90 For those who have been asking for the code, sorry I kept putting it off. I wanted to get it out of Turbo Pascal ideal-mode, but I never had the time. Colin Plum wrote IDEA-386 code which is included in PGP 2.3a and uses the same tricks. I don't know who's is faster, but I expect they will be very close. Now here's how it's done. A major bottleneck in software IDEA is the mul() routine, which is used 34 times per 64 bit block. The routine performs multiplication in the multiplicative group mod 2^16+1. The two factors are each in a 16 bit word, and the output is also in a 16 bit word. Note that 0 is not a member of the multiplicative group and 2^16 does not fit in 16 bits. We therefor use the 0 word to represent 2^16. Now group elements map one to one onto all possible 16 bit words, since 2^16+1 is prime. Here is (essentially) the reference implementation from [Lai]. unsigned mul( unsigned a, unsigned b ) { long int p ; long unsigned q ; if( a==0 ) p= 0x00010001 - b ; else if( b==0 ) p= 0x00010001 - a ; else { q= a*b; p= (q & 0xffff) - (q>>16) if( p<0 ) p= p + 0x00010001 ; } return (unsigned)(p & 0xffff) ; } Note the method of reducing a 32 bit word modulo 2^16-1. We subtract the high word from the low word, and add the modulus back if the result is less than 0. [Lai] contains a proof that this works, and you can convince yourself fairly easily. To speed up this routine, we note that the tests for a=0 and b=0 will rarely be false. With the possible exception of the first 2 of the 34 multiplications, 0 should be no more likely than any of the other 65535 numbers. Note that if (and only if) either a or b is 0 then q will also be 0, and we can check for this in one instruction if our processor sets a zero flag for multiplication (as the 68000 does but 80x86 does not). Fortunately p will also be zero after the subtraction if and only if either a or b is 0. Proof: r will be zero when the high order word of q equals the low order word, and that happens when q is divisible by 00010001 hex. Since 00010001h = 2^16+1 is prime, this happens if either a or b is a multiple of 2^16+1, and 0 is the only such multiple which will fit in a 16 bit word. The speed-up strategy is to proceed under the assumption that a and b are not 0, check to be sure in one instruction, and recompute if the assumption was wrong. Here's some 8086 assembler code: mov ax, [a] mul [b] ; ax is implied. q is now in DX AX sub ax, dx ; mod 2^16+1 jnz not0 ; Jump if neither op was 0. Usually taken. mov ax, 1 ; recompute result knowing one op is 0. sub ax, [a] sub ax, [b] jmp out ; Just jump over adding the carry. not0: adc ax, 0 ; If r<0 add 1, otherwise do nothing. out: ; Result is now in ax Note that when r<0 we add 1 instead of 2^16+1 since the 2^16 part overflows out of the result. The "adc ax, 0" does all the work of checking for a negative result and adding the modulus if needed. The multiplication takes 9 instructions, 4 of which are rarely executed. I believe similar tricks are possible on many processors. The one drawback to the check-after-multiply tactic is that we can't let the multiply overwrite the only copy of an operand. Note that most software implementations of IDEA will run at slightly different speeds when 0's come up in the multiply routine. The reference implementation is faster on 0, this one is faster on non-zero. This may be a problem for some real-time stuff, and also suggests an attack based on timing. Finally, below is an implementation of the complete encryption function in 8086 assembler, to replace the cipher_idea() function in PGP. It takes the same parameters as the function from PGP, and uses the c language calling conventions. I tested it using the debug features of the idea.c file in PGP. You will need to add segment/assume directives. This version uses no global data and should be reentrant. The handling of zero multipliers is outside the inner loop so that a short conditional jump can loop back to the beginning. Forward conditional jumps are usually not taken and backward jumps are usually taken, which is consistent with 586 branch prediction (or so I've heard). Stalls where the output of one instruction is needed for the next seem unavoidable. Last I heard, IDEA was patent pending. My code is up for grabs, although I would get a kick out being credited if you use it. On the other hand Colin's code is already tested and ready to assemble and link with PGP. --Bryan ____________________CODE STARTS BELOW THIS LINE_________ ; Called as: asmcrypt( inbuff, outbuff, zkey ) just like PGP PROC _asmcrypt ; establish parameter and local space on stack ; follow c language calling conventions ARG inblock:Word, outblock:Word, zkey:Word LOCAL sx1:Word,sx4:Word,skk:Word,done8:Word =stacksize push bp mov bp, sp sub sp, stacksize ; push ax ; My compiler assumes these are not saved. ; push bx ; push cx ; push dx push si push di ; Put the 16 bit sub-blocks in registers and/or local variables mov si, [inblock] mov ax, [si] mov [sx1], ax ; x1 is in ax and sx1 mov di, [si+2] ; x2 is in di mov bx, [si+4] ; x3 is in bx mov dx, [si+6] mov [sx4], dx ; x4 is in sx4 mov si, [zkey] ; si points to next subkey mov [done8], si add [done8], 96 ; we will be finished with 8 rounds ; when si=done8 @@loop: ; 8 rounds of this add di, [si+2] ; x2+=zkey[2] is in di add bx, [si+4] ; x3+=zkey[4] is in bx mul [Word si] ;x1 *= zkey[0] sub ax, dx jz @@x1 ; if 0, use special case multiply adc ax, 0 @@x1out: mov [sx1], ax ; x1 is in ax and sx1 xor ax, bx ; ax= x1^x3 mul [Word si+8] ; compute kk sub ax, dx ; if 0, use special case multiply jz @@kk adc ax, 0 @@kkout: mov cx, ax ; kk is in cx mov ax, [sx4] ; x4 *= zkey[6] mul [Word si+6] sub ax, dx jz @@x4 ; if 0, use special case multiply adc ax, 0 @@x4out: mov [sx4], ax ; x4 is in sx4 and ax xor ax, di ; x4^x2 add ax, cx ; kk+(x2^x4) mul [Word si+10] ; compute t1 sub ax, dx jz @@t1 ; if 0, use special case multiply adc ax, 0 @@t1out: ; t1 is in ax add cx, ax ; t2 is in cx kk+t1 xor [sx4], cx ; x4 in sx4 xor di, cx ; new x3 in di xor bx, ax ; new x2 in bx xchg bx, di ; x2 in di, x3 in bx xor ax, [sx1] ; x1 in ax mov [sx1], ax ; and [sx1] add si, 12 ; point to next subkey cmp si, [done8] jne @@loop jmp @@out8 ;------------------------------------------ ; Special case multiplications, when one factor is 0 @@x1: mov ax, 1 sub ax, [sx1] sub ax, [Word si] jmp @@x1out @@kk: mov ax, [sx1] ; rebuild overwritten operand xor ax, bx neg ax inc ax sub ax, [si+8] jmp @@kkout @@x4: mov ax, 1 sub ax, [sx4] sub ax, [Word si+6] jmp @@x4out @@t1: mov ax, [sx4] ; rebuild xor ax, di add ax, cx neg ax inc ax sub ax, [si+10] jmp @@t1out ;--------------------------------------------------- ; 8 rounds are done, now that extra pseudo-round @@out8: push di mov di, [outblock] mul [Word si] sub ax, dx jnz @@o1n ; jump over special case code mov ax, 1 sub ax, [sx1] sub ax, [si] jmp @@o1out @@o1n: adc ax, 0 @@o1out: mov [di], ax ; final ciphertext block 1 mov ax, [sx4] mul [Word si+6] sub ax, dx jnz @@o4n ; jump over special case code mov ax, 1 sub ax, [sx4] sub ax, [si+6] jmp @@o4out @@o4n: adc ax, 0 @@o4out: mov [di+6], ax ; final ciphertext block 4 add bx, [si+2] mov [di+2], bx ; final ciphertext block 2 pop ax add ax, [si+4] mov [di+4], ax ; final ciphertext block 3 ; Restore the stack and return pop di pop si ; pop dx ; pop cx ; pop bx ; pop ax mov sp, bp pop bp ret ENDP _asmcrypt