#!/usr/bin/env perl
#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# March, May, June 2010
#
# The module implements "4-bit" GCM GHASH function and underlying
# single multiplication operation in GF(2^128). "4-bit" means that it
# uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two
# code paths: vanilla x86 and vanilla MMX. Former will be executed on
# 486 and Pentium, latter on all others. MMX GHASH features so called
# "528B" variant of "4-bit" method utilizing additional 256+16 bytes
# of per-key storage [+512 bytes shared table].
# Performance results are for streamed GHASH subroutine and are
# expressed in cycles per processed byte, less is better:
#
#               gcc 2.95.3(*)   MMX assembler   x86 assembler
#
# Pentium       105/111(**)     -               50
# PIII          68 /75          12.2            24
# P4            125/125         17.8            84(***)
# Opteron       66 /70          10.1            30
# Core2         54 /67          8.4             18
#
# (*)   gcc 3.4.x was observed to generate few percent slower code,
#       which is one of reasons why 2.95.3 results were chosen,
#       another reason is lack of 3.4.x results for older CPUs;
#       comparison with MMX results is not completely fair, because C
#       results are for vanilla "256B" implementation, while
#       assembler results are for "528B";-)
# (**)  second number is result for code compiled with -fPIC flag,
#       which is actually more relevant, because assembler code is
#       position-independent;
# (***) see
#       comment in non-MMX routine for further details;
#
# To summarize, it's >2-5 times faster than gcc-generated code. To
# anchor it to something else SHA1 assembler processes one byte in
# 11-13 cycles on contemporary x86 cores. As for choice of MMX in
# particular, see comment at the end of the file...

# May 2010
#
# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
# The question is how close is it to theoretical limit? The pclmulqdq
# instruction latency appears to be 14 cycles and there can't be more
# than 2 of them executing at any given time. This means that single
# Karatsuba multiplication would take 28 cycles *plus* few cycles for
# pre- and post-processing. Then multiplication has to be followed by
# modulo-reduction.
# Given that aggregated reduction method [see
# "Carry-less Multiplication and Its Usage for Computing the GCM Mode"
# white paper by Intel] allows you to perform reduction only once in
# a while we can assume that asymptotic performance can be estimated
# as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction
# and Naggr is the aggregation factor.
#
# Before we proceed to this implementation let's have closer look at
# the best-performing code suggested by Intel in their white paper.
# By tracing inter-register dependencies Tmod is estimated as ~19
# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
# processed byte. As implied, this is quite optimistic estimate,
# because it does not account for Karatsuba pre- and post-processing,
# which for a single multiplication is ~5 cycles. Unfortunately Intel
# does not provide performance data for GHASH alone. But benchmarking
# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
# alone resulted in 2.46 cycles per byte out of 16KB buffer.
# Note that the result accounts even for pre-computing of degrees of
# the hash key H, but its portion is negligible at 16KB buffer size.
#
# Moving on to the implementation in question. Tmod is estimated as
# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
# 2.16. How is it possible that measured performance is better than
# optimistic theoretical estimate? There is one thing Intel failed
# to recognize. By serializing GHASH with CTR in same subroutine
# former's performance is really limited to above (Tmul + Tmod/Naggr)
# equation. But if GHASH procedure is detached, the modulo-reduction
# can be interleaved with Naggr-1 multiplications at instruction level
# and under ideal conditions even disappear from the equation. So that
# optimistic theoretical estimate for this implementation is ...
# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
# at least for such small Naggr.
# I'd argue that (28+Tproc/Naggr)/16,
# where Tproc is time required for Karatsuba pre- and post-processing,
# is more realistic estimate. In this case it gives ... 1.91 cycles.
# Or in other words, depending on how well we can interleave reduction
# and one of the two multiplications the performance should be between
# 1.91 and 2.16. As already mentioned, this implementation processes
# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
# - in 2.02. x86_64 performance is better, because larger register
# bank allows to interleave reduction and multiplication better.
#
# Does it make sense to increase Naggr? To start with it's virtually
# impossible in 32-bit mode, because of limited register bank
# capacity. Otherwise improvement has to be weighed against slower
# setup, as well as code size and complexity increase. As even
# optimistic estimate doesn't promise 30% performance improvement,
# there are currently no plans to increase Naggr.
#
# Special thanks to David Woodhouse <dwmw2@infradead.org> for
# providing access to a Westmere-based system on behalf of Intel
# Open Source Technology Centre.

# January 2010
#
# Tweaked to optimize transitions between integer and FP operations
# on same XMM register, PCLMULQDQ subroutine was measured to process
# one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on Westmere.
# The minor regression on Westmere is outweighed by ~15% improvement
# on Sandy Bridge. Strangely enough attempt to modify 64-bit code in
# similar manner resulted in almost 20% degradation on Sandy Bridge,
# where original 64-bit code processes one byte in 1.95 cycles.
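# ====================================================================
# Editorial sketch, not part of the original module. The per-byte
# figures quoted in the May 2010 notes all appear to follow from
# (Tmul + Toverhead/Naggr)/16, with 16 bytes per GHASH block. The
# helper below only reproduces that arithmetic. Its name and
# parameters are hypothetical, nothing in this script calls it, and
# it emits no assembly.
#
#   _ghash_cpb_estimate(28, 19, 4)   Intel code, Tmod~19, Naggr=4: ~2.05
#   _ghash_cpb_estimate(28, 13, 2)   this module, Tmod~13, Naggr=2: ~2.16
#   _ghash_cpb_estimate(28,  5, 2)   Tproc~5 instead of Tmod:       ~1.91
#   _ghash_cpb_estimate(28,  0, 1)   reduction fully hidden:         1.75
# ====================================================================
sub _ghash_cpb_estimate {
    my ($tmul, $overhead, $naggr) = @_;     # cycles, cycles, blocks per reduction
    return ($tmul + $overhead/$naggr)/16;   # cycles per processed byte
}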

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
push(@INC,"${dir}","${dir}../../perlasm");
require "x86asm.pl";

&asm_init($ARGV[0],"ghash-x86.pl",$x86only = $ARGV[$#ARGV] eq "386");

$sse2=0;
for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }

($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
$inp  = "edi";
$Htbl = "esi";

$unroll = 0;    # Affects x86 loop. Folded loop performs ~7% worse
                # than unrolled, which has to be weighed against
                # 2.5x x86-specific code size reduction.

sub x86_loop {
    my $off = shift;
    my $rem = "eax";

    &mov    ($Zhh,&DWP(4,$Htbl,$Zll));
    &mov    ($Zhl,&DWP(0,$Htbl,$Zll));
    &mov    ($Zlh,&DWP(12,$Htbl,$Zll));
    &mov    ($Zll,&DWP(8,$Htbl,$Zll));
    &xor    ($rem,$rem);    # avoid partial register stalls on PIII

    # shrd practically kills P4, 2.5x deterioration, but P4 has
    # MMX code-path to execute. shrd runs tad faster [than twice
    # the shifts, move's and or's] on pre-MMX Pentium (as well as
    # PIII and Core2), *but* minimizes code size, spares register
    # and thus allows to fold the loop...
    if (!$unroll) {
    my $cnt = $inp;
    &mov    ($cnt,15);
    &jmp    (&label("x86_loop"));
    &set_label("x86_loop",16);
        for($i=1;$i<=2;$i++) {
            &mov    (&LB($rem),&LB($Zll));
            &shrd   ($Zll,$Zlh,4);
            &and    (&LB($rem),0xf);
            &shrd   ($Zlh,$Zhl,4);
            &shrd   ($Zhl,$Zhh,4);
            &shr    ($Zhh,4);
            &xor    ($Zhh,&DWP($off+16,"esp",$rem,4));

            &mov    (&LB($rem),&BP($off,"esp",$cnt));
            if ($i&1) {
                &and    (&LB($rem),0xf0);
            } else {
                &shl    (&LB($rem),4);
            }

            &xor    ($Zll,&DWP(8,$Htbl,$rem));
            &xor    ($Zlh,&DWP(12,$Htbl,$rem));
            &xor    ($Zhl,&DWP(0,$Htbl,$rem));
            &xor
                    ($Zhh,&DWP(4,$Htbl,$rem));

            if ($i&1) {
                &dec    ($cnt);
                &js     (&label("x86_break"));
            } else {
                &jmp    (&label("x86_loop"));
            }
        }
    &set_label("x86_break",16);
    } else {
        for($i=1;$i<32;$i++) {
            &comment($i);
            &mov    (&LB($rem),&LB($Zll));
            &shrd   ($Zll,$Zlh,4);
            &and    (&LB($rem),0xf);
            &shrd   ($Zlh,$Zhl,4);
            &shrd   ($Zhl,$Zhh,4);
            &shr    ($Zhh,4);
            &xor    ($Zhh,&DWP($off+16,"esp",$rem,4));

            if ($i&1) {
                &mov    (&LB($rem),&BP($off+15-($i>>1),"esp"));
                &and    (&LB($rem),0xf0);
            } else {
                &mov    (&LB($rem),&BP($off+15-($i>>1),"esp"));
                &shl    (&LB($rem),4);
            }

            &xor    ($Zll,&DWP(8,$Htbl,$rem));
            &xor    ($Zlh,&DWP(12,$Htbl,$rem));
            &xor    ($Zhl,&DWP(0,$Htbl,$rem));
            &xor    ($Zhh,&DWP(4,$Htbl,$rem));
        }
    }
    &bswap  ($Zll);
    &bswap  ($Zlh);
    &bswap  ($Zhl);
    if (!$x86only) {
        &bswap  ($Zhh);
    } else {
        &mov    ("eax",$Zhh);
        &bswap  ("eax");
        &mov    ($Zhh,"eax");
    }
}

if ($unroll) {
    &function_begin_B("_x86_gmult_4bit_inner");
    &x86_loop(4);
    &ret    ();
    &function_end_B("_x86_gmult_4bit_inner");
}

sub deposit_rem_4bit {
    my $bias = shift;

    &mov    (&DWP($bias+0, "esp"),0x0000<<16);
    &mov    (&DWP($bias+4, "esp"),0x1C20<<16);
    &mov    (&DWP($bias+8, "esp"),0x3840<<16);
    &mov    (&DWP($bias+12,"esp"),0x2460<<16);
    &mov    (&DWP($bias+16,"esp"),0x7080<<16);
    &mov    (&DWP($bias+20,"esp"),0x6CA0<<16);
    &mov    (&DWP($bias+24,"esp"),0x48C0<<16);
    &mov    (&DWP($bias+28,"esp"),0x54E0<<16);
    &mov    (&DWP($bias+32,"esp"),0xE100<<16);
    &mov    (&DWP($bias+36,"esp"),0xFD20<<16);
    &mov    (&DWP($bias+40,"esp"),0xD940<<16);
    &mov    (&DWP($bias+44,"esp"),0xC560<<16);
    &mov    (&DWP($bias+48,"esp"),0x9180<<16);
    &mov    (&DWP($bias+52,"esp"),0x8DA0<<16);
    &mov    (&DWP($bias+56,"esp"),0xA9C0<<16);
    &mov    (&DWP($bias+60,"esp"),0xB5E0<<16);
}
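
# Editorial sketch, not part of the original module. The sixteen
# constants deposited by deposit_rem_4bit look regular: entry i equals
# the XOR of (0x1C20 << bit) over the set bits of i, so entry 8 is
# 0xE100, which is consistent with the 0xE1 leading byte of the GCM
# reduction constant. The hypothetical helper below recomputes one
# entry that way. Nothing in this script calls it and it emits no
# assembly; it is only meant as a cross-check of the table.
sub _rem_4bit_entry {
    my $i = shift;                  # 4-bit index, 0..15
    my $v = 0;
    for my $bit (0..3) {
        $v ^= (0x1C20 << $bit) if (($i >> $bit) & 1);
    }
    return $v;                      # deposit_rem_4bit stores this value <<16
}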

$suffix = $x86only ? "" : "_x86";

&function_begin("gcm_gmult_4bit".$suffix);
    &stack_push(16+4+1);            # +1 for stack alignment
    &mov    ($inp,&wparam(0));      # load Xi
    &mov    ($Htbl,&wparam(1));     # load Htable

    &mov    ($Zhh,&DWP(0,$inp));    # load Xi[16]
    &mov    ($Zhl,&DWP(4,$inp));
    &mov    ($Zlh,&DWP(8,$inp));
    &mov    ($Zll,&DWP(12,$inp));

    &deposit_rem_4bit(16);

    &mov    (&DWP(0,"esp"),$Zhh);   # copy Xi[16] on stack
    &mov    (&DWP(4,"esp"),$Zhl);
    &mov    (&DWP(8,"esp"),$Zlh);
    &mov    (&DWP(12,"esp"),$Zll);
    &shr    ($Zll,20);
    &and    ($Zll,0xf0);

    if ($unroll) {
        &call   ("_x86_gmult_4bit_inner");
    } else {
        &x86_loop(0);
        &mov    ($inp,&wparam(0));
    }

    &mov    (&DWP(12,$inp),$Zll);
    &mov    (&DWP(8,$inp),$Zlh);
    &mov    (&DWP(4,$inp),$Zhl);
    &mov    (&DWP(0,$inp),$Zhh);
    &stack_pop(16+4+1);
&function_end("gcm_gmult_4bit".$suffix);

&function_begin("gcm_ghash_4bit".$suffix);
    &stack_push(16+4+1);            # +1 for 64-bit alignment
    &mov    ($Zll,&wparam(0));      # load Xi
    &mov    ($Htbl,&wparam(1));     # load Htable
    &mov    ($inp,&wparam(2));      # load in
    &mov    ("ecx",&wparam(3));     # load len
    &add    ("ecx",$inp);
    &mov    (&wparam(3),"ecx");

    &mov    ($Zhh,&DWP(0,$Zll));    # load Xi[16]
    &mov    ($Zhl,&DWP(4,$Zll));
    &mov    ($Zlh,&DWP(8,$Zll));
    &mov    ($Zll,&DWP(12,$Zll));

    &deposit_rem_4bit(16);

    &set_label("x86_outer_loop",16);
    &xor    ($Zll,&DWP(12,$inp));   # xor with input
    &xor    ($Zlh,&DWP(8,$inp));
    &xor    ($Zhl,&DWP(4,$inp));
    &xor    ($Zhh,&DWP(0,$inp));
    &mov    (&DWP(12,"esp"),$Zll);  # dump it on stack
    &mov    (&DWP(8,"esp"),$Zlh);
    &mov    (&DWP(4,"esp"),$Zhl);
    &mov    (&DWP(0,"esp"),$Zhh);

    &shr    ($Zll,20);
    &and    ($Zll,0xf0);

    if ($unroll) {
        &call   ("_x86_gmult_4bit_inner");
    } else {
        &x86_loop(0);
        &mov    ($inp,&wparam(2));
    }
    &lea    ($inp,&DWP(16,$inp));
    &cmp    ($inp,&wparam(3));
    &mov    (&wparam(2),$inp)   if (!$unroll);
    &jb     (&label("x86_outer_loop"));

    &mov    ($inp,&wparam(0));      # load Xi
    &mov    (&DWP(12,$inp),$Zll);
    &mov    (&DWP(8,$inp),$Zlh);
    &mov    (&DWP(4,$inp),$Zhl);
    &mov    (&DWP(0,$inp),$Zhh);
    &stack_pop(16+4+1);
&function_end("gcm_ghash_4bit".$suffix);

if (!$x86only) {{{

&static_label("rem_4bit");

if (!$sse2) {{  # pure-MMX "May" version...

$S=12;          # shift factor for rem_4bit

&function_begin_B("_mmx_gmult_4bit_inner");
# MMX version performs 3.5 times better on P4 (see comment in non-MMX
# routine for further details), 100% better on Opteron, ~70% better
# on Core2 and PIII...
# In other words effort is considered to be well
# spent... Since initial release the loop was unrolled in order to
# "liberate" register previously used as loop counter. Instead it's
# used to optimize critical path in 'Z.hi ^= rem_4bit[Z.lo&0xf]'.
# The path involves move of Z.lo from MMX to integer register,
# effective address calculation and finally merge of value to Z.hi.
# Reference to rem_4bit is scheduled so late that I had to >>4
# rem_4bit elements. This resulted in 20-45% improvement
# on contemporary µ-archs.
{
    my $cnt;
    my $rem_4bit = "eax";
    my @rem = ($Zhh,$Zll);
    my $nhi = $Zhl;
    my $nlo = $Zlh;

    my ($Zlo,$Zhi) = ("mm0","mm1");
    my $tmp = "mm2";

    &xor    ($nlo,$nlo);    # avoid partial register stalls on PIII
    &mov    ($nhi,$Zll);
    &mov
(&LB($nlo),&LB($nhi)); 363392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &shl (&LB($nlo),4); 364392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &and ($nhi,0xf0); 365392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &movq ($Zlo,&QWP(8,$Htbl,$nlo)); 366392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &movq ($Zhi,&QWP(0,$Htbl,$nlo)); 367392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &movd ($rem[0],$Zlo); 368392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom 369392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom for ($cnt=28;$cnt>=-2;$cnt--) { 370392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom my $odd = $cnt&1; 371392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom my $nix = $odd ? $nlo : $nhi; 372392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom 373392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &shl (&LB($nlo),4) if ($odd); 374392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &psrlq ($Zlo,4); 375392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &movq ($tmp,$Zhi); 376392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &psrlq ($Zhi,4); 377392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &pxor ($Zlo,&QWP(8,$Htbl,$nix)); 378392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &mov (&LB($nlo),&BP($cnt/2,$inp)) if (!$odd && $cnt>=0); 379392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &psllq ($tmp,60); 380392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &and ($nhi,0xf0) if ($odd); 381392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &pxor ($Zhi,&QWP(0,$rem_4bit,$rem[1],8)) if ($cnt<28); 382392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &and ($rem[0],0xf); 383392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &pxor ($Zhi,&QWP(0,$Htbl,$nix)); 384392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &mov ($nhi,$nlo) if (!$odd && $cnt>=0); 385392aa7cc7d2b122614c5393c3e357da07fd07af3Brian Carlstrom &movd ($rem[1],$Zlo); 
		&pxor	($Zlo,$tmp);

		push	(@rem,shift(@rem));	# "rotate" registers
	}

	&mov	($inp,&DWP(4,$rem_4bit,$rem[1],8));	# last rem_4bit[rem]

	&psrlq	($Zlo,32);	# lower part of Zlo is already there
	&movd	($Zhl,$Zhi);
	&psrlq	($Zhi,32);
	&movd	($Zlh,$Zlo);
	&movd	($Zhh,$Zhi);
	&shl	($inp,4);	# compensate for rem_4bit[i] being >>4

	&bswap	($Zll);
	&bswap	($Zhl);
	&bswap	($Zlh);
	&xor	($Zhh,$inp);
	&bswap	($Zhh);

	&ret	();
}
&function_end_B("_mmx_gmult_4bit_inner");

&function_begin("gcm_gmult_4bit_mmx");
	&mov	($inp,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable

	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop("eax");
	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

	&movz	($Zll,&BP(15,$inp));

	&call	("_mmx_gmult_4bit_inner");

	&mov	($inp,&wparam(0));	# load Xi
	&emms	();
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(0,$inp),$Zhh);
&function_end("gcm_gmult_4bit_mmx");

# Streamed version performs 20% better on P4, 7% on Opteron,
# 10% on Core2 and PIII...
&function_begin("gcm_ghash_4bit_mmx");
	&mov	($Zhh,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable
	&mov	($inp,&wparam(2));	# load in
	&mov	($Zlh,&wparam(3));	# load len

	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop("eax");
	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

	&add	($Zlh,$inp);
	&mov	(&wparam(3),$Zlh);	# len to point at the end of input
	&stack_push(4+1);		# +1 for stack alignment

	&mov	($Zll,&DWP(12,$Zhh));	# load Xi[16]
	&mov	($Zhl,&DWP(4,$Zhh));
	&mov	($Zlh,&DWP(8,$Zhh));
	&mov	($Zhh,&DWP(0,$Zhh));
	&jmp	(&label("mmx_outer_loop"));

    &set_label("mmx_outer_loop",16);
	&xor	($Zll,&DWP(12,$inp));
	&xor	($Zhl,&DWP(4,$inp));
	&xor	($Zlh,&DWP(8,$inp));
	&xor	($Zhh,&DWP(0,$inp));
	&mov	(&wparam(2),$inp);
	&mov	(&DWP(12,"esp"),$Zll);
	&mov	(&DWP(4,"esp"),$Zhl);
	&mov	(&DWP(8,"esp"),$Zlh);
	&mov	(&DWP(0,"esp"),$Zhh);

	&mov	($inp,"esp");
	&shr	($Zll,24);

	&call	("_mmx_gmult_4bit_inner");

	&mov	($inp,&wparam(2));
	&lea	($inp,&DWP(16,$inp));
	&cmp	($inp,&wparam(3));
	&jb	(&label("mmx_outer_loop"));

	&mov	($inp,&wparam(0));	# load Xi
	&emms	();
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(0,$inp),$Zhh);

	&stack_pop(4+1);
&function_end("gcm_ghash_4bit_mmx");

}} else {{	# "June" MMX version...
		# ... has slower "April" gcm_gmult_4bit_mmx with folded
		# loop. This is done to conserve code size...
$S=16;		# shift factor for rem_4bit

sub mmx_loop() {
# MMX version performs 2.8 times better on P4 (see comment in non-MMX
# routine for further details), 40% better on Opteron and Core2, 50%
# better on PIII... In other words effort is considered to be well
# spent...
    my $inp = shift;
    my $rem_4bit = shift;
    my $cnt = $Zhh;
    my $nhi = $Zhl;
    my $nlo = $Zlh;
    my $rem = $Zll;

    my ($Zlo,$Zhi) = ("mm0","mm1");
    my $tmp = "mm2";

	&xor	($nlo,$nlo);	# avoid partial register stalls on PIII
	&mov	($nhi,$Zll);
	&mov	(&LB($nlo),&LB($nhi));
	&mov	($cnt,14);
	&shl	(&LB($nlo),4);
	&and	($nhi,0xf0);
	&movq	($Zlo,&QWP(8,$Htbl,$nlo));
	&movq	($Zhi,&QWP(0,$Htbl,$nlo));
	&movd	($rem,$Zlo);
	&jmp	(&label("mmx_loop"));

    &set_label("mmx_loop",16);
	&psrlq	($Zlo,4);
	&and	($rem,0xf);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
	&mov	(&LB($nlo),&BP(0,$inp,$cnt));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&dec	($cnt);
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
	&mov	($nhi,$nlo);
	&pxor	($Zlo,$tmp);
	&js	(&label("mmx_break"));

	&shl	(&LB($nlo),4);
	&and	($rem,0xf);
	&psrlq	($Zlo,4);
	&and	($nhi,0xf0);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
	&pxor	($Zlo,$tmp);
	&jmp	(&label("mmx_loop"));

    &set_label("mmx_break",16);
	&shl	(&LB($nlo),4);
	&and	($rem,0xf);
	&psrlq	($Zlo,4);
	&and	($nhi,0xf0);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
	&pxor	($Zlo,$tmp);

	&psrlq	($Zlo,4);
	&and	($rem,0xf);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
	&pxor	($Zlo,$tmp);

	&psrlq	($Zlo,32);	# lower part of Zlo is already there
	&movd	($Zhl,$Zhi);
	&psrlq	($Zhi,32);
	&movd	($Zlh,$Zlo);
	&movd	($Zhh,$Zhi);

	&bswap	($Zll);
	&bswap	($Zhl);
	&bswap	($Zlh);
	&bswap	($Zhh);
}

&function_begin("gcm_gmult_4bit_mmx");
	&mov	($inp,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable

	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop("eax");
	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

	&movz	($Zll,&BP(15,$inp));

	&mmx_loop($inp,"eax");

	&emms	();
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(0,$inp),$Zhh);
&function_end("gcm_gmult_4bit_mmx");

######################################################################
# Below subroutine is "528B" variant of "4-bit" GCM GHASH function
# (see gcm128.c for details). It provides further 20-40% performance
# improvement over above mentioned "May" version.

&static_label("rem_8bit");

&function_begin("gcm_ghash_4bit_mmx");
{ my ($Zlo,$Zhi) = ("mm7","mm6");
  my $rem_8bit = "esi";
  my $Htbl = "ebx";

    # parameter block
    &mov	("eax",&wparam(0));		# Xi
    &mov	("ebx",&wparam(1));		# Htable
    &mov	("ecx",&wparam(2));		# inp
    &mov	("edx",&wparam(3));		# len
    &mov	("ebp","esp");			# original %esp
    &call	(&label("pic_point"));
    &set_label	("pic_point");
    &blindpop	($rem_8bit);
    &lea	($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit));

    &sub	("esp",512+16+16);		# allocate stack frame...
    &and	("esp",-64);			# ...and align it
    &sub	("esp",16);			# place for (u8)(H[]<<4)

    &add	("edx","ecx");			# pointer to the end of input
    &mov	(&DWP(528+16+0,"esp"),"eax");	# save Xi
    &mov	(&DWP(528+16+8,"esp"),"edx");	# save inp+len
    &mov	(&DWP(528+16+12,"esp"),"ebp");	# save original %esp

    { my @lo  = ("mm0","mm1","mm2");
      my @hi  = ("mm3","mm4","mm5");
      my @tmp = ("mm6","mm7");
      my ($off1,$off2,$i) = (0,0);	# declare all three lexically; $i is set by the loop

      &add	($Htbl,128);			# optimize for size
      &lea	("edi",&DWP(16+128,"esp"));
      &lea	("ebp",&DWP(16+256+128,"esp"));

      # decompose Htable (low and high parts are kept separately),
      # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
      for ($i=0;$i<18;$i++) {

	&mov	("edx",&DWP(16*$i+8-128,$Htbl))	if ($i<16);
	&movq	($lo[0],&QWP(16*$i+8-128,$Htbl))	if ($i<16);
	&psllq	($tmp[1],60)	if ($i>1);
	&movq	($hi[0],&QWP(16*$i+0-128,$Htbl))	if ($i<16);
	&por	($lo[2],$tmp[1])	if ($i>1);
	&movq	(&QWP($off1-128,"edi"),$lo[1])	if ($i>0 && $i<17);
	&psrlq	($lo[1],4)	if ($i>0 && $i<17);
	&movq	(&QWP($off1,"edi"),$hi[1])	if ($i>0 && $i<17);
	&movq	($tmp[0],$hi[1])	if ($i>0 && $i<17);
	&movq	(&QWP($off2-128,"ebp"),$lo[2])	if ($i>1);
	&psrlq	($hi[1],4)	if ($i>0 && $i<17);
	&movq	(&QWP($off2,"ebp"),$hi[2])	if ($i>1);
	&shl	("edx",4)	if ($i<16);
	&mov	(&BP($i,"esp"),&LB("edx"))	if ($i<16);

	unshift	(@lo,pop(@lo));			# "rotate" registers
	unshift	(@hi,pop(@hi));
	unshift	(@tmp,pop(@tmp));
	$off1 += 8	if ($i>0);
	$off2 += 8	if ($i>1);
      }
    }

    &movq	($Zhi,&QWP(0,"eax"));
    &mov	("ebx",&DWP(8,"eax"));
    &mov	("edx",&DWP(12,"eax"));		# load Xi

&set_label("outer",16);
  { my $nlo = "eax";
    my $dat = "edx";
    my @nhi = ("edi","ebp");
    my @rem = ("ebx","ecx");
    my @red = ("mm0","mm1","mm2");
    my $tmp = "mm3";

    &xor	($dat,&DWP(12,"ecx"));		# merge input data
    &xor	("ebx",&DWP(8,"ecx"));
    &pxor	($Zhi,&QWP(0,"ecx"));
    &lea	("ecx",&DWP(16,"ecx"));		# inp+=16
    #&mov	(&DWP(528+12,"esp"),$dat);	# save inp^Xi
    &mov	(&DWP(528+8,"esp"),"ebx");
    &movq	(&QWP(528+0,"esp"),$Zhi);
    &mov	(&DWP(528+16+4,"esp"),"ecx");	# save inp

    &xor	($nlo,$nlo);
    &rol	($dat,8);
    &mov	(&LB($nlo),&LB($dat));
    &mov	($nhi[1],$nlo);
    &and	(&LB($nlo),0x0f);
    &shr	($nhi[1],4);
    &pxor	($red[0],$red[0]);
    &rol	($dat,8);			# next byte
    &pxor	($red[1],$red[1]);
    &pxor	($red[2],$red[2]);

    # Just like in "May" version modulo-schedule for critical path in
    # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor'
    # is scheduled so late that rem_8bit[] has to be shifted *right*
    # by 16, which is why last argument to pinsrw is 2, which
    # corresponds to <<32=<<48>>16...
    for ($j=11,$i=0;$i<15;$i++) {

      if ($i>0) {
	&pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
	&rol	($dat,8);				# next byte
	&pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));

	&pxor	($Zlo,$tmp);
	&pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
	&xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)
      } else {
	&movq	($Zlo,&QWP(16,"esp",$nlo,8));
	&movq	($Zhi,&QWP(16+128,"esp",$nlo,8));
      }

	&mov	(&LB($nlo),&LB($dat));
	&mov	($dat,&DWP(528+$j,"esp"))	if (--$j%4==0);

	&movd	($rem[0],$Zlo);
	&movz	($rem[1],&LB($rem[1]))	if ($i>0);
	&psrlq	($Zlo,8);				# Z>>=8

	&movq	($tmp,$Zhi);
	&mov	($nhi[0],$nlo);
	&psrlq	($Zhi,8);

	&pxor	($Zlo,&QWP(16+256+0,"esp",$nhi[1],8));	# Z^=H[nhi]>>4
	&and	(&LB($nlo),0x0f);
	&psllq	($tmp,56);

	&pxor	($Zhi,$red[1])	if ($i>1);
	&shr	($nhi[0],4);
	&pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2)	if ($i>0);

	unshift	(@red,pop(@red));			# "rotate" registers
	unshift	(@rem,pop(@rem));
	unshift	(@nhi,pop(@nhi));
    }

    &pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
    &pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));
    &xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)

    &pxor	($Zlo,$tmp);
    &pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
    &movz	($rem[1],&LB($rem[1]));

    &pxor	($red[2],$red[2]);			# clear 2nd word
    &psllq	($red[1],4);

    &movd	($rem[0],$Zlo);
    &psrlq	($Zlo,4);				# Z>>=4

    &movq	($tmp,$Zhi);
    &psrlq	($Zhi,4);
    &shl	($rem[0],4);				# rem<<4

    &pxor	($Zlo,&QWP(16,"esp",$nhi[1],8));	# Z^=H[nhi]
    &psllq	($tmp,60);
    &movz	($rem[0],&LB($rem[0]));

    &pxor	($Zlo,$tmp);
    &pxor	($Zhi,&QWP(16+128,"esp",$nhi[1],8));

    &pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2);
    &pxor	($Zhi,$red[1]);

    &movd	($dat,$Zlo);
    &pinsrw	($red[2],&WP(0,$rem_8bit,$rem[0],2),3);	# last is <<48

    &psllq	($red[0],12);				# correct by <<16>>4
    &pxor	($Zhi,$red[0]);
	&psrlq	($Zlo,32);
	&pxor	($Zhi,$red[2]);

	&mov	("ecx",&DWP(528+16+4,"esp"));		# restore inp
	&movd	("ebx",$Zlo);
	&movq	($tmp,$Zhi);				# 01234567
	&psllw	($Zhi,8);				# 1.3.5.7.
	&psrlw	($tmp,8);				# .0.2.4.6
	&por	($Zhi,$tmp);				# 10325476
	&bswap	($dat);
	&pshufw	($Zhi,$Zhi,0b00011011);			# 76543210
	&bswap	("ebx");

	&cmp	("ecx",&DWP(528+16+8,"esp"));		# are we done?
	&jne	(&label("outer"));
  }

	&mov	("eax",&DWP(528+16+0,"esp"));		# restore Xi
	&mov	(&DWP(12,"eax"),"edx");
	&mov	(&DWP(8,"eax"),"ebx");
	&movq	(&QWP(0,"eax"),$Zhi);

	&mov	("esp",&DWP(528+16+12,"esp"));		# restore original %esp
	&emms	();
}
&function_end("gcm_ghash_4bit_mmx");
}}

if ($sse2) {{
######################################################################
# PCLMULQDQ version.
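#
# Reading aid, not the author's own derivation: the clmul64x64_T2 and
# clmul64x64_T3 helpers that follow appear to use the Karatsuba
# scheme, which builds a 128x128-bit carry-less product from three
# 64x64-bit PCLMULQDQ operations instead of four. Writing
# X = X1*x^64 + X0 and H = H1*x^64 + H0 over GF(2), where "+" is XOR:
#
#	X*H =  X1*H1 * x^128
#	    + [(X1+X0)*(H1+H0) + X1*H1 + X0*H0] * x^64
#	    +  X0*H0
#
# In the helpers, immediate 0x00 selects the low halves (X0*H0) and
# 0x11 the high halves (X1*H1); the third multiplication takes the
# XOR-folded halves produced by the pshufd+pxor pairs. The middle term
# is then split by psrldq/pslldq and XORed into the high and low
# 128-bit halves of the result.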

$Xip="eax";
$Htbl="edx";
$const="ecx";
$inp="esi";
$len="ebx";

($Xi,$Xhi)=("xmm0","xmm1");	$Hkey="xmm2";
($T1,$T2,$T3)=("xmm3","xmm4","xmm5");
($Xn,$Xhn)=("xmm6","xmm7");

&static_label("bswap");

sub clmul64x64_T2 {	# minimal "register" pressure
my ($Xhi,$Xi,$Hkey)=@_;

	&movdqa		($Xhi,$Xi);		#
	&pshufd		($T1,$Xi,0b01001110);
	&pshufd		($T2,$Hkey,0b01001110);
	&pxor		($T1,$Xi);		#
	&pxor		($T2,$Hkey);

	&pclmulqdq	($Xi,$Hkey,0x00);	#######
	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
	&pclmulqdq	($T1,$T2,0x00);		#######
	&xorps		($T1,$Xi);		#
	&xorps		($T1,$Xhi);		#

	&movdqa		($T2,$T1);		#
	&psrldq		($T1,8);
	&pslldq		($T2,8);		#
	&pxor		($Xhi,$T1);
	&pxor		($Xi,$T2);		#
}

sub clmul64x64_T3 {
# Even though this subroutine offers visually better ILP, it
# was empirically found to be a tad slower than the version above.
# At least in gcm_ghash_clmul context. But it's just as well,
# because loop modulo-scheduling is possible only thanks to
# minimized "register" pressure...
my ($Xhi,$Xi,$Hkey)=@_;

	&movdqa		($T1,$Xi);		#
	&movdqa		($Xhi,$Xi);
	&pclmulqdq	($Xi,$Hkey,0x00);	#######
	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
	&pshufd		($T2,$T1,0b01001110);	#
	&pshufd		($T3,$Hkey,0b01001110);
	&pxor		($T2,$T1);		#
	&pxor		($T3,$Hkey);
	&pclmulqdq	($T2,$T3,0x00);		#######
	&pxor		($T2,$Xi);		#
	&pxor		($T2,$Xhi);		#

	&movdqa		($T3,$T2);		#
	&psrldq		($T2,8);
	&pslldq		($T3,8);		#
	&pxor		($Xhi,$T2);
	&pxor		($Xi,$T3);		#
}

if (1) {		# Algorithm 9 with <<1 twist.
			# Reduction is shorter and uses only two
			# temporary registers, which makes it a better
			# candidate for interleaving with 64x64
			# multiplication. Pre-modulo-scheduled loop
			# was found to be ~20% faster than Algorithm 5
			# below. Algorithm 9 was therefore chosen for
			# further optimization...

sub reduction_alg9 {	# 17/13 times faster than Intel version
my ($Xhi,$Xi) = @_;

	# 1st phase
	&movdqa		($T1,$Xi);		#
	&psllq		($Xi,1);
	&pxor		($Xi,$T1);		#
	&psllq		($Xi,5);		#
	&pxor		($Xi,$T1);		#
	&psllq		($Xi,57);		#
	&movdqa		($T2,$Xi);		#
	&pslldq		($Xi,8);
	&psrldq		($T2,8);		#
	&pxor		($Xi,$T1);
	&pxor		($Xhi,$T2);		#

	# 2nd phase
	&movdqa		($T2,$Xi);
	&psrlq		($Xi,5);
	&pxor		($Xi,$T2);		#
	&psrlq		($Xi,1);		#
	&pxor		($Xi,$T2);		#
	&pxor		($T2,$Xhi);
	&psrlq		($Xi,1);		#
	&pxor		($Xi,$T2);		#
}

&function_begin_B("gcm_init_clmul");
	&mov		($Htbl,&wparam(0));
	&mov		($Xip,&wparam(1));

	&call		(&label("pic"));
&set_label("pic");
	&blindpop	($const);
	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu		($Hkey,&QWP(0,$Xip));
	&pshufd		($Hkey,$Hkey,0b01001110);# dword swap

	# <<1 twist
	&pshufd		($T2,$Hkey,0b11111111);	# broadcast uppermost dword
	&movdqa		($T1,$Hkey);
	&psllq		($Hkey,1);
	&pxor		($T3,$T3);		#
	&psrlq		($T1,63);
	&pcmpgtd	($T3,$T2);		# broadcast carry bit
	&pslldq		($T1,8);
	&por		($Hkey,$T1);		# H<<=1

	# magic reduction
	&pand		($T3,&QWP(16,$const));	# 0x1c2_polynomial
	&pxor		($Hkey,$T3);		# if(carry) H^=0x1c2_polynomial

	# calculate H^2
	&movdqa		($Xi,$Hkey);
	&clmul64x64_T2	($Xhi,$Xi,$Hkey);
	&reduction_alg9	($Xhi,$Xi);

	&movdqu		(&QWP(0,$Htbl),$Hkey);	# save H
	&movdqu		(&QWP(16,$Htbl),$Xi);	# save H^2

	&ret		();
&function_end_B("gcm_init_clmul");

&function_begin_B("gcm_gmult_clmul");
	&mov		($Xip,&wparam(0));
	&mov		($Htbl,&wparam(1));

	&call		(&label("pic"));
&set_label("pic");
	&blindpop	($const);
	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu		($Xi,&QWP(0,$Xip));
	&movdqa		($T3,&QWP(0,$const));
	&movups		($Hkey,&QWP(0,$Htbl));
	&pshufb		($Xi,$T3);

	&clmul64x64_T2	($Xhi,$Xi,$Hkey);
	&reduction_alg9	($Xhi,$Xi);

	&pshufb		($Xi,$T3);
	&movdqu		(&QWP(0,$Xip),$Xi);

	&ret		();
&function_end_B("gcm_gmult_clmul");

&function_begin("gcm_ghash_clmul");
	&mov		($Xip,&wparam(0));
	&mov		($Htbl,&wparam(1));
	&mov		($inp,&wparam(2));
	&mov		($len,&wparam(3));

	&call		(&label("pic"));
&set_label("pic");
	&blindpop	($const);
	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu		($Xi,&QWP(0,$Xip));
	&movdqa		($T3,&QWP(0,$const));
	&movdqu		($Hkey,&QWP(0,$Htbl));
	&pshufb		($Xi,$T3);

	&sub		($len,0x10);
	&jz		(&label("odd_tail"));

	#######
	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
	#	[(H*Ii+1) + (H*Xi+1)] mod P =
	#	[(H*Ii+1) + H^2*(Ii+Xi)] mod P
	#
	&movdqu		($T1,&QWP(0,$inp));	# Ii
	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb		($T1,$T3);
	&pshufb		($Xn,$T3);
	&pxor		($Xi,$T1);		# Ii+Xi

	&clmul64x64_T2	($Xhn,$Xn,$Hkey);	# H*Ii+1
	&movups		($Hkey,&QWP(16,$Htbl));	# load H^2

	&lea		($inp,&DWP(32,$inp));	# i+=2
	&sub		($len,0x20);
	&jbe		(&label("even_tail"));

&set_label("mod_loop");
	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
	&movdqu		($T1,&QWP(0,$inp));	# Ii
	&movups		($Hkey,&QWP(0,$Htbl));	# load H

	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor		($Xhi,$Xhn);

	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb		($T1,$T3);
	&pshufb		($Xn,$T3);

	&movdqa		($T3,$Xn);		#&clmul64x64_TX	($Xhn,$Xn,$Hkey); H*Ii+1
	&movdqa		($Xhn,$Xn);
	&pxor		($Xhi,$T1);		# "Ii+Xi", consume early

	&movdqa		($T1,$Xi);		#&reduction_alg9($Xhi,$Xi); 1st phase
	&psllq		($Xi,1);
	&pxor		($Xi,$T1);		#
	&psllq		($Xi,5);		#
	&pxor		($Xi,$T1);		#
	&pclmulqdq	($Xn,$Hkey,0x00);	#######
	&psllq		($Xi,57);		#
	&movdqa		($T2,$Xi);		#
	&pslldq		($Xi,8);
	&psrldq		($T2,8);		#
	&pxor		($Xi,$T1);
	&pshufd		($T1,$T3,0b01001110);
	&pxor		($Xhi,$T2);		#
	&pxor		($T1,$T3);
	&pshufd		($T3,$Hkey,0b01001110);
	&pxor		($T3,$Hkey);		#

	&pclmulqdq	($Xhn,$Hkey,0x11);	#######
	&movdqa		($T2,$Xi);		# 2nd phase
	&psrlq		($Xi,5);
	&pxor		($Xi,$T2);		#
	&psrlq		($Xi,1);		#
	&pxor		($Xi,$T2);		#
	&pxor		($T2,$Xhi);
	&psrlq		($Xi,1);		#
	&pxor		($Xi,$T2);		#

	&pclmulqdq	($T1,$T3,0x00);		#######
	&movups		($Hkey,&QWP(16,$Htbl));	# load H^2
	&xorps		($T1,$Xn);		#
	&xorps		($T1,$Xhn);		#

	&movdqa		($T3,$T1);		#
	&psrldq		($T1,8);
	&pslldq		($T3,8);		#
	&pxor		($Xhn,$T1);
	&pxor		($Xn,$T3);		#
	&movdqa		($T3,&QWP(0,$const));

	&lea		($inp,&DWP(32,$inp));
	&sub		($len,0x20);
	&ja		(&label("mod_loop"));

&set_label("even_tail");
	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)

	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor		($Xhi,$Xhn);

	&reduction_alg9	($Xhi,$Xi);

	&test		($len,$len);
	&jnz		(&label("done"));

	&movups		($Hkey,&QWP(0,$Htbl));	# load H
&set_label("odd_tail");
	&movdqu		($T1,&QWP(0,$inp));	# Ii
	&pshufb		($T1,$T3);
	&pxor		($Xi,$T1);		# Ii+Xi

	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
	&reduction_alg9	($Xhi,$Xi);

&set_label("done");
	&pshufb		($Xi,$T3);
	&movdqu		(&QWP(0,$Xip),$Xi);
&function_end("gcm_ghash_clmul");

} else {		# Algorithm 5. Kept for reference purposes.

sub reduction_alg5 {	# 19/16 times faster than Intel version
my ($Xhi,$Xi)=@_;

	# <<1
	&movdqa		($T1,$Xi);		#
	&movdqa		($T2,$Xhi);
	&pslld		($Xi,1);
	&pslld		($Xhi,1);		#
	&psrld		($T1,31);
	&psrld		($T2,31);		#
	&movdqa		($T3,$T1);
	&pslldq		($T1,4);
	&psrldq		($T3,12);		#
	&pslldq		($T2,4);
	&por		($Xhi,$T3);		#
	&por		($Xi,$T1);
	&por		($Xhi,$T2);		#

	# 1st phase
	&movdqa		($T1,$Xi);
	&movdqa		($T2,$Xi);
	&movdqa		($T3,$Xi);		#
	&pslld		($T1,31);
	&pslld		($T2,30);
	&pslld		($Xi,25);		#
	&pxor		($T1,$T2);
	&pxor		($T1,$Xi);		#
	&movdqa		($T2,$T1);		#
	&pslldq		($T1,12);
	&psrldq		($T2,4);		#
	&pxor		($T3,$T1);

	# 2nd phase
	&pxor		($Xhi,$T3);		#
	&movdqa		($Xi,$T3);
	&movdqa		($T1,$T3);
	&psrld		($Xi,1);		#
	&psrld		($T1,2);
	&psrld		($T3,7);		#
	&pxor		($Xi,$T1);
	&pxor		($Xhi,$T2);
	&pxor		($Xi,$T3);		#
	&pxor		($Xi,$Xhi);		#
}

&function_begin_B("gcm_init_clmul");
	&mov		($Htbl,&wparam(0));
	&mov		($Xip,&wparam(1));

	&call		(&label("pic"));
&set_label("pic");
	&blindpop	($const);
	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu		($Hkey,&QWP(0,$Xip));
	&pshufd		($Hkey,$Hkey,0b01001110);# dword swap

	# calculate H^2
	&movdqa		($Xi,$Hkey);
	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
	&reduction_alg5	($Xhi,$Xi);

	&movdqu		(&QWP(0,$Htbl),$Hkey);	# save H
	&movdqu		(&QWP(16,$Htbl),$Xi);	# save H^2

	&ret		();
&function_end_B("gcm_init_clmul");

&function_begin_B("gcm_gmult_clmul");
	&mov		($Xip,&wparam(0));
	&mov		($Htbl,&wparam(1));

	&call		(&label("pic"));
&set_label("pic");
	&blindpop	($const);
	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu		($Xi,&QWP(0,$Xip));
	&movdqa		($Xn,&QWP(0,$const));
	&movdqu		($Hkey,&QWP(0,$Htbl));
	&pshufb		($Xi,$Xn);

	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
	&reduction_alg5	($Xhi,$Xi);

	&pshufb		($Xi,$Xn);
	&movdqu		(&QWP(0,$Xip),$Xi);

	&ret		();
&function_end_B("gcm_gmult_clmul");

&function_begin("gcm_ghash_clmul");
	&mov		($Xip,&wparam(0));
	&mov		($Htbl,&wparam(1));
	&mov		($inp,&wparam(2));
	&mov		($len,&wparam(3));

	&call		(&label("pic"));
&set_label("pic");
	&blindpop	($const);
	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu		($Xi,&QWP(0,$Xip));
	&movdqa		($T3,&QWP(0,$const));
	&movdqu		($Hkey,&QWP(0,$Htbl));
	&pshufb		($Xi,$T3);

	&sub		($len,0x10);
	&jz		(&label("odd_tail"));

	#######
	# Xi+2 = [H*(Ii+1 + Xi+1)] mod P =
	#	 [(H*Ii+1) + (H*Xi+1)] mod P =
	#	 [(H*Ii+1) + H^2*(Ii+Xi)] mod P
	#
	&movdqu		($T1,&QWP(0,$inp));	# Ii
	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb		($T1,$T3);
	&pshufb		($Xn,$T3);
	&pxor		($Xi,$T1);		# Ii+Xi

	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
	&movdqu		($Hkey,&QWP(16,$Htbl));	# load H^2

	&sub		($len,0x20);
	&lea		($inp,&DWP(32,$inp));	# i+=2
	&jbe		(&label("even_tail"));

&set_label("mod_loop");
	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
	&movdqu		($Hkey,&QWP(0,$Htbl));	# load H

	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor		($Xhi,$Xhn);

	&reduction_alg5	($Xhi,$Xi);

	#######
	&movdqa		($T3,&QWP(0,$const));
	&movdqu		($T1,&QWP(0,$inp));	# Ii
	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb		($T1,$T3);
	&pshufb		($Xn,$T3);
	&pxor		($Xi,$T1);		# Ii+Xi

	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
	&movdqu		($Hkey,&QWP(16,$Htbl));	# load H^2

	&sub		($len,0x20);
	&lea		($inp,&DWP(32,$inp));
	&ja		(&label("mod_loop"));

&set_label("even_tail");
	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)

	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor		($Xhi,$Xhn);

	&reduction_alg5	($Xhi,$Xi);

	&movdqa		($T3,&QWP(0,$const));
	&test		($len,$len);
	&jnz		(&label("done"));

	&movdqu		($Hkey,&QWP(0,$Htbl));	# load H
&set_label("odd_tail");
	&movdqu		($T1,&QWP(0,$inp));	# Ii
	&pshufb		($T1,$T3);
	&pxor		($Xi,$T1);		# Ii+Xi

	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
	&reduction_alg5	($Xhi,$Xi);

	&movdqa		($T3,&QWP(0,$const));
&set_label("done");
	&pshufb		($Xi,$T3);
	&movdqu		(&QWP(0,$Xip),$Xi);
&function_end("gcm_ghash_clmul");

}

&set_label("bswap",64);
	&data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
	&data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2);	# 0x1c2_polynomial
}}	# $sse2

&set_label("rem_4bit",64);
	&data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
	&data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
	&data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
	&data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
&set_label("rem_8bit",64);
	&data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
	&data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
	&data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E);
	&data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E);
	&data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E);
	&data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E);
	&data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E);
	&data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E);
	&data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE);
	&data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE);
	&data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE);
	&data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE);
	&data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E);
	&data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E);
	&data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE);
	&data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE);
	&data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E);
	&data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E);
	&data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E);
	&data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E);
	&data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E);
	&data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E);
	&data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E);
	&data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E);
	&data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE);
	&data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE);
	&data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE);
	&data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE);
	&data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E);
	&data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
	&data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
	&data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
}}}	# !$x86only

&asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
&asm_finish();

# A question was raised about the choice of vanilla MMX. Or rather, why
# wasn't SSE2 chosen instead? In addition to the fact that MMX runs on
# legacy CPUs such as PIII, the "4-bit" MMX version was observed to
# provide better performance than the *corresponding* SSE2 one even on
# contemporary CPUs. SSE2 results were provided by Peter-Michael Hager.
# He maintains an SSE2 implementation featuring the full range of
# lookup-table sizes, but with per-invocation lookup table setup. The
# latter means that the table size is chosen depending on how much data
# is to be hashed in every given call: more data, larger table. The best
# reported result for Core2 is ~4 cycles per processed byte out of a
# 64KB block. This number accounts even for the 64KB table setup
# overhead. As discussed in gcm128.c, we choose to be more conservative
# with respect to lookup table sizes, but how do the results compare?
# The minimalistic "256B" MMX version delivers ~11 cycles on the same
# platform. As also discussed in gcm128.c, the next in line "8-bit
# Shoup's" or "4KB" method should deliver twice the performance of the
# "256B" one, in other words not worse than ~6 cycles per byte. It
# should also be noted that in the SSE2 case the improvement can be
# "super-linear," i.e. more than twice, mostly because >>8 maps to a
# single instruction on an SSE2 register. This is unlike the "4-bit"
# case, where >>4 maps to the same number of instructions in both the
# MMX and SSE2 cases. The bottom line is that a switch to SSE2 is
# considered justifiable only if we choose to implement the "8-bit"
# method...