README.txt revision 9118dbc7bd82a2f8a3c3d5fd38d3265afb6a5774
1//===---------------------------------------------------------------------===// 2// Random ideas for the ARM backend. 3//===---------------------------------------------------------------------===// 4 5Reimplement 'select' in terms of 'SEL'. 6 7* We would really like to support UXTAB16, but we need to prove that the 8 add doesn't need to overflow between the two 16-bit chunks. 9 10* implement predication support 11* Implement pre/post increment support. (e.g. PR935) 12* Coalesce stack slots! 13* Implement smarter constant generation for binops with large immediates. 14 15* Consider materializing FP constants like 0.0f and 1.0f using integer 16 immediate instructions then copy to FPU. Slower than load into FPU? 17 18//===---------------------------------------------------------------------===// 19 20Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the 21time regalloc happens, these values are now in a 32-bit register, usually with 22the top-bits known to be sign or zero extended. If spilled, we should be able 23to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part 24of the reload. 25 26Doing this reduces the size of the stack frame (important for thumb etc), and 27also increases the likelihood that we will be able to reload multiple values 28from the stack with a single load. 29 30//===---------------------------------------------------------------------===// 31 32The constant island pass is in good shape. Some cleanups might be desirable, 33but there is unlikely to be much improvement in the generated code. 34 351. There may be some advantage to trying to be smarter about the initial 36placement, rather than putting everything at the end. 37 382. There might be some compile-time efficiency to be had by representing 39consecutive islands as a single block rather than multiple blocks. 40 413. Use a priority queue to sort constant pool users in inverse order of 42 position so we always process the one closed to the end of functions 43 first. This may simply CreateNewWater. 44 45//===---------------------------------------------------------------------===// 46 47We need to start generating predicated instructions. The .td files have a way 48to express this now (see the PPC conditional return instruction), but the 49branch folding pass (or a new if-cvt pass) should start producing these, at 50least in the trivial case. 51 52Among the obvious wins, doing so can eliminate the need to custom expand 53copysign (i.e. we won't need to custom expand it to get the conditional 54negate). 55 56This allows us to eliminate one instruction from: 57 58define i32 @_Z6slow4bii(i32 %x, i32 %y) { 59 %tmp = icmp sgt i32 %x, %y 60 %retval = select i1 %tmp, i32 %x, i32 %y 61 ret i32 %retval 62} 63 64__Z6slow4bii: 65 cmp r0, r1 66 movgt r1, r0 67 mov r0, r1 68 bx lr 69 70//===---------------------------------------------------------------------===// 71 72Implement long long "X-3" with instructions that fold the immediate in. These 73were disabled due to badness with the ARM carry flag on subtracts. 74 75//===---------------------------------------------------------------------===// 76 77We currently compile abs: 78int foo(int p) { return p < 0 ? -p : p; } 79 80into: 81 82_foo: 83 rsb r1, r0, #0 84 cmn r0, #1 85 movgt r1, r0 86 mov r0, r1 87 bx lr 88 89This is very, uh, literal. This could be a 3 operation sequence: 90 t = (p sra 31); 91 res = (p xor t)-t 92 93Which would be better. This occurs in png decode. 94 95//===---------------------------------------------------------------------===// 96 97More load / store optimizations: 981) Look past instructions without side-effects (not load, store, branch, etc.) 99 when forming the list of loads / stores to optimize. 100 1012) Smarter register allocation? 102We are probably missing some opportunities to use ldm / stm. Consider: 103 104ldr r5, [r0] 105ldr r4, [r0, #4] 106 107This cannot be merged into a ldm. Perhaps we will need to do the transformation 108before register allocation. Then teach the register allocator to allocate a 109chunk of consecutive registers. 110 1113) Better representation for block transfer? This is from Olden/power: 112 113 fldd d0, [r4] 114 fstd d0, [r4, #+32] 115 fldd d0, [r4, #+8] 116 fstd d0, [r4, #+40] 117 fldd d0, [r4, #+16] 118 fstd d0, [r4, #+48] 119 fldd d0, [r4, #+24] 120 fstd d0, [r4, #+56] 121 122If we can spare the registers, it would be better to use fldm and fstm here. 123Need major register allocator enhancement though. 124 1254) Can we recognize the relative position of constantpool entries? i.e. Treat 126 127 ldr r0, LCPI17_3 128 ldr r1, LCPI17_4 129 ldr r2, LCPI17_5 130 131 as 132 ldr r0, LCPI17 133 ldr r1, LCPI17+4 134 ldr r2, LCPI17+8 135 136 Then the ldr's can be combined into a single ldm. See Olden/power. 137 138Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a 139double 64-bit FP constant: 140 141 adr r0, L6 142 ldmia r0, {r0-r1} 143 144 .align 2 145L6: 146 .long -858993459 147 .long 1074318540 148 1495) Can we make use of ldrd and strd? Instead of generating ldm / stm, use 150ldrd/strd instead if there are only two destination registers that form an 151odd/even pair. However, we probably would pay a penalty if the address is not 152aligned on 8-byte boundary. This requires more information on load / store 153nodes (and MI's?) then we currently carry. 154 1556) struct copies appear to be done field by field 156instead of by words, at least sometimes: 157 158struct foo { int x; short s; char c1; char c2; }; 159void cpy(struct foo*a, struct foo*b) { *a = *b; } 160 161llvm code (-O2) 162 ldrb r3, [r1, #+6] 163 ldr r2, [r1] 164 ldrb r12, [r1, #+7] 165 ldrh r1, [r1, #+4] 166 str r2, [r0] 167 strh r1, [r0, #+4] 168 strb r3, [r0, #+6] 169 strb r12, [r0, #+7] 170gcc code (-O2) 171 ldmia r1, {r1-r2} 172 stmia r0, {r1-r2} 173 174In this benchmark poor handling of aggregate copies has shown up as 175having a large effect on size, and possibly speed as well (we don't have 176a good way to measure on ARM). 177 178//===---------------------------------------------------------------------===// 179 180* Consider this silly example: 181 182double bar(double x) { 183 double r = foo(3.1); 184 return x+r; 185} 186 187_bar: 188 sub sp, sp, #16 189 str r4, [sp, #+12] 190 str r5, [sp, #+8] 191 str lr, [sp, #+4] 192 mov r4, r0 193 mov r5, r1 194 ldr r0, LCPI2_0 195 bl _foo 196 fmsr f0, r0 197 fcvtsd d0, f0 198 fmdrr d1, r4, r5 199 faddd d0, d0, d1 200 fmrrd r0, r1, d0 201 ldr lr, [sp, #+4] 202 ldr r5, [sp, #+8] 203 ldr r4, [sp, #+12] 204 add sp, sp, #16 205 bx lr 206 207Ignore the prologue and epilogue stuff for a second. Note 208 mov r4, r0 209 mov r5, r1 210the copys to callee-save registers and the fact they are only being used by the 211fmdrr instruction. It would have been better had the fmdrr been scheduled 212before the call and place the result in a callee-save DPR register. The two 213mov ops would not have been necessary. 214 215//===---------------------------------------------------------------------===// 216 217Calling convention related stuff: 218 219* gcc's parameter passing implementation is terrible and we suffer as a result: 220 221e.g. 222struct s { 223 double d1; 224 int s1; 225}; 226 227void foo(struct s S) { 228 printf("%g, %d\n", S.d1, S.s1); 229} 230 231'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and 232then reload them to r1, r2, and r3 before issuing the call (r0 contains the 233address of the format string): 234 235 stmfd sp!, {r7, lr} 236 add r7, sp, #0 237 sub sp, sp, #12 238 stmia sp, {r0, r1, r2} 239 ldmia sp, {r1-r2} 240 ldr r0, L5 241 ldr r3, [sp, #8] 242L2: 243 add r0, pc, r0 244 bl L_printf$stub 245 246Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves? 247 248* Return an aggregate type is even worse: 249 250e.g. 251struct s foo(void) { 252 struct s S = {1.1, 2}; 253 return S; 254} 255 256 mov ip, r0 257 ldr r0, L5 258 sub sp, sp, #12 259L2: 260 add r0, pc, r0 261 @ lr needed for prologue 262 ldmia r0, {r0, r1, r2} 263 stmia sp, {r0, r1, r2} 264 stmia ip, {r0, r1, r2} 265 mov r0, ip 266 add sp, sp, #12 267 bx lr 268 269r0 (and later ip) is the hidden parameter from caller to store the value in. The 270first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1, 271r2 into the address passed in. However, there is one additional stmia that 272stores r0, r1, and r2 to some stack location. The store is dead. 273 274The llvm-gcc generated code looks like this: 275 276csretcc void %foo(%struct.s* %agg.result) { 277entry: 278 %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1] 279 %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1] 280 cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2] 281 call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 ) 282 cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2] 283 call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 ) 284 cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1] 285 call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 ) 286 ret void 287} 288 289llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from 290constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated 291into a number of load and stores, or 2) custom lower memcpy (of small size) to 292be ldmia / stmia. I think option 2 is better but the current register 293allocator cannot allocate a chunk of registers at a time. 294 295A feasible temporary solution is to use specific physical registers at the 296lowering time for small (<= 4 words?) transfer size. 297 298* ARM CSRet calling convention requires the hidden argument to be returned by 299the callee. 300 301//===---------------------------------------------------------------------===// 302 303We can definitely do a better job on BB placements to eliminate some branches. 304It's very common to see llvm generated assembly code that looks like this: 305 306LBB3: 307 ... 308LBB4: 309... 310 beq LBB3 311 b LBB2 312 313If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can 314then eliminate beq and and turn the unconditional branch to LBB2 to a bne. 315 316See McCat/18-imp/ComputeBoundingBoxes for an example. 317 318//===---------------------------------------------------------------------===// 319 320Register scavenging is now implemented. The example in the previous version 321of this document produces optimal code at -O2. 322 323//===---------------------------------------------------------------------===// 324 325Pre-/post- indexed load / stores: 326 3271) We should not make the pre/post- indexed load/store transform if the base ptr 328is guaranteed to be live beyond the load/store. This can happen if the base 329ptr is live out of the block we are performing the optimization. e.g. 330 331mov r1, r2 332ldr r3, [r1], #4 333... 334 335vs. 336 337ldr r3, [r2] 338add r1, r2, #4 339... 340 341In most cases, this is just a wasted optimization. However, sometimes it can 342negatively impact the performance because two-address code is more restrictive 343when it comes to scheduling. 344 345Unfortunately, liveout information is currently unavailable during DAG combine 346time. 347 3482) Consider spliting a indexed load / store into a pair of add/sub + load/store 349 to solve #1 (in TwoAddressInstructionPass.cpp). 350 3513) Enhance LSR to generate more opportunities for indexed ops. 352 3534) Once we added support for multiple result patterns, write indexed loads 354 patterns instead of C++ instruction selection code. 355 3565) Use FLDM / FSTM to emulate indexed FP load / store. 357 358//===---------------------------------------------------------------------===// 359 360We should add i64 support to take advantage of the 64-bit load / stores. 361We can add a pseudo i64 register class containing pseudo registers that are 362register pairs. All other ops (e.g. add, sub) would be expanded as usual. 363 364We need to add pseudo instructions (i.e. gethi / getlo) to extract i32 registers 365from the i64 register. These are single moves which can be eliminated if the 366destination register is a sub-register of the source. We should implement proper 367subreg support in the register allocator to coalesce these away. 368 369There are other minor issues such as multiple instructions for a spill / restore 370/ move. 371 372//===---------------------------------------------------------------------===// 373 374Implement support for some more tricky ways to materialize immediates. For 375example, to get 0xffff8000, we can use: 376 377mov r9, #&3f8000 378sub r9, r9, #&400000 379 380//===---------------------------------------------------------------------===// 381 382We sometimes generate multiple add / sub instructions to update sp in prologue 383and epilogue if the inc / dec value is too large to fit in a single immediate 384operand. In some cases, perhaps it might be better to load the value from a 385constantpool instead. 386 387//===---------------------------------------------------------------------===// 388 389GCC generates significantly better code for this function. 390 391int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) { 392 int i = 0; 393 394 if (StackPtr != 0) { 395 while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768))) 396 Line[i++] = Stack[--StackPtr]; 397 if (LineLen > 32768) 398 { 399 while (StackPtr != 0 && i < LineLen) 400 { 401 i++; 402 --StackPtr; 403 } 404 } 405 } 406 return StackPtr; 407} 408 409//===---------------------------------------------------------------------===// 410 411This should compile to the mlas instruction: 412int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; } 413 414//===---------------------------------------------------------------------===// 415 416At some point, we should triage these to see if they still apply to us: 417 418http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598 419http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560 420http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016 421 422http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831 423http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826 424http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825 425http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824 426http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 427http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820 428http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982 429 430http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242 431http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 432http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760 433http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759 434http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703 435http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702 436http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663 437 438http://www.inf.u-szeged.hu/gcc-arm/ 439http://citeseer.ist.psu.edu/debus04linktime.html 440 441//===---------------------------------------------------------------------===// 442 443gcc generates smaller code for this function at -O2 or -Os: 444 445void foo(signed char* p) { 446 if (*p == 3) 447 bar(); 448 else if (*p == 4) 449 baz(); 450 else if (*p == 5) 451 quux(); 452} 453 454llvm decides it's a good idea to turn the repeated if...else into a 455binary tree, as if it were a switch; the resulting code requires -1 456compare-and-branches when *p<=2 or *p==5, the same number if *p==4 457or *p>6, and +1 if *p==3. So it should be a speed win 458(on balance). However, the revised code is larger, with 4 conditional 459branches instead of 3. 460 461More seriously, there is a byte->word extend before 462each comparison, where there should be only one, and the condition codes 463are not remembered when the same two values are compared twice. 464 465//===---------------------------------------------------------------------===// 466 467More register scavenging work: 468 4691. Use the register scavenger to track frame index materialized into registers 470 (those that do not fit in addressing modes) to allow reuse in the same BB. 4712. Finish scavenging for Thumb. 4723. We know some spills and restores are unnecessary. The issue is once live 473 intervals are merged, they are not never split. So every def is spilled 474 and every use requires a restore if the register allocator decides the 475 resulting live interval is not assigned a physical register. It may be 476 possible (with the help of the scavenger) to turn some spill / restore 477 pairs into register copies. 478 479//===---------------------------------------------------------------------===// 480 481More LSR enhancements possible: 482 4831. Teach LSR about pre- and post- indexed ops to allow iv increment be merged 484 in a load / store. 4852. Allow iv reuse even when a type conversion is required. For example, i8 486 and i32 load / store addressing modes are identical. 487 488 489//===---------------------------------------------------------------------===// 490 491This: 492 493int foo(int a, int b, int c, int d) { 494 long long acc = (long long)a * (long long)b; 495 acc += (long long)c * (long long)d; 496 return (int)(acc >> 32); 497} 498 499Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies 500two signed 32-bit values to produce a 64-bit value, and accumulates this with 501a 64-bit value. 502 503We currently get this with v6: 504 505_foo: 506 mul r12, r1, r0 507 smmul r1, r1, r0 508 smmul r0, r3, r2 509 mul r3, r3, r2 510 adds r3, r3, r12 511 adc r0, r0, r1 512 bx lr 513 514and this with v4: 515 516_foo: 517 stmfd sp!, {r7, lr} 518 mov r7, sp 519 mul r12, r1, r0 520 smull r0, r1, r1, r0 521 smull lr, r0, r3, r2 522 mul r3, r3, r2 523 adds r3, r3, r12 524 adc r0, r0, r1 525 ldmfd sp!, {r7, pc} 526 527This apparently occurs in real code. 528 529//===---------------------------------------------------------------------===// 530