README.txt revision 1a9da0d66cbb7742e60ed957a7670a6547911de1
//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks.

* Implement predication support.
* Implement pre/post increment support. (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions and then copying to the FPU. Slower than a load
  into the FPU?

//===---------------------------------------------------------------------===//

The constant island pass is in good shape. Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1. There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2. The handling of 2-byte padding for Thumb is overly conservative. There
would be a small gain to keeping accurate track of the padding (which would
require aligning functions containing constant pools to 4-byte boundaries).

3. There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

4. Use a priority queue to sort constant pool users in inverse order of
   position so we always process the one closest to the end of the function
   first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

We need to start generating predicated instructions. The .td files have a way
to express this now (see the PPC conditional return instruction), but the
branch folding pass (or a new if-cvt pass) should start producing these, at
least in the trivial case.

Among the obvious wins, doing so can eliminate the need to custom expand
copysign (i.e. we won't need to custom expand it to get the conditional
negate).

This allows us to eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in. These
were disabled due to badness with the ARM carry flag on subtracts.

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal. This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t)-t

Which would be better. This occurs in png decode.
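
A minimal C rendering of that three-operation sequence (the function name is
made up for illustration), assuming the usual arithmetic right shift for
signed int, which ARM's asr provides:

int iabs(int p) {
  int t = p >> 31;      /* 0 if p >= 0, all ones (-1) if p < 0 */
  return (p ^ t) - t;   /* xor conditionally flips the bits; -t adds back 1 */
}

When p is negative, t is -1, so (p ^ t) - t computes ~p + 1 = -p; when p is
non-negative, both operations are no-ops.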

//===---------------------------------------------------------------------===//

More load / store optimizations:

1) Look past instructions without side-effects (not load, store, branch, etc.)
   when forming the list of loads / stores to optimize.

2) Smarter register allocation?
We are probably missing some opportunities to use ldm / stm. Consider:

        ldr r5, [r0]
        ldr r4, [r0, #4]

This cannot be merged into an ldm. Perhaps we will need to do the
transformation before register allocation, and then teach the register
allocator to allocate a chunk of consecutive registers.

3) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

4) Can we recognize the relative position of constantpool entries? i.e. treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

as

        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARM v4 gcc uses ldmia to load a pair of 32-bit values
representing a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

5) Can we make use of ldrd and strd? Generate ldrd / strd instead of
ldm / stm if there are only two destination registers that form an odd/even
pair. However, we would probably pay a penalty if the address is not aligned
on an 8-byte boundary. This requires more information on load / store nodes
(and MIs?) than we currently carry.

6) Struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).
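
A sketch of the word-wise copy we would like to emit instead (hypothetical
function, and it ignores strict-aliasing rules for the sake of illustration):
struct foo is 8 bytes with 4-byte alignment, so two 32-bit moves suffice,
which is exactly what the gcc ldmia / stmia pair expresses:

void cpy_words(struct foo *a, struct foo *b) {
  unsigned *dst = (unsigned *)a;  /* view the 8-byte struct as two words */
  unsigned *src = (unsigned *)b;
  dst[0] = src[0];                /* together:  ldmia r1, {r1-r2} */
  dst[1] = src[1];                /*            stmia r0, {r1-r2} */
}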

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        sub sp, sp, #16
        str r4, [sp, #+12]
        str r5, [sp, #+8]
        str lr, [sp, #+4]
        mov r4, r0
        mov r5, r1
        ldr r0, LCPI2_0
        bl _foo
        fmsr f0, r0
        fcvtsd d0, f0
        fmdrr d1, r4, r5
        faddd d0, d0, d1
        fmrrd r0, r1, d0
        ldr lr, [sp, #+4]
        ldr r5, [sp, #+8]
        ldr r4, [sp, #+12]
        add sp, sp, #16
        bx lr

Ignore the prologue and epilogue stuff for a second. Note

        mov r4, r0
        mov r5, r1

the copies to callee-saved registers, and the fact that they are only used by
the fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call, placing its result in a callee-saved DPR register. The two
mov ops would then not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
        double d1;
        int s1;
};

void foo(struct s S) {
        printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of the stmia, ldmia, and ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller to store the value
in. The first ldmia loads the constants into r0, r1, r2. The last stmia
stores r0, r1, r2 into the address passed in. However, there is one
additional stmia that stores r0, r1, and r2 to some stack location. That
store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia. I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be
returned by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
        ...
LBB4:
        ...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.
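
A sketch of the improved layout, reusing the labels from the snippet above:

LBB4:
        ...
        bne LBB2
LBB3:
        ...

Falling through from LBB4 into LBB3 makes the beq unnecessary, and inverting
the condition turns the unconditional branch into the single bne.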

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Register scavenging is now implemented. The example in the previous version
of this document produces optimal code at -O2.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block in which we are performing the
optimization, e.g.:

        mov r1, r2
        ldr r3, [r1], #4
        ...

vs.

        ldr r3, [r2]
        add r1, r2, #4
        ...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable at DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we have added support for multiple result patterns, write indexed
   load patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.

//===---------------------------------------------------------------------===//

We should add i64 support to take advantage of the 64-bit load / stores.
We can add a pseudo i64 register class containing pseudo registers that are
register pairs. All other ops (e.g. add, sub) would be expanded as usual.

We need to add pseudo instructions (i.e. gethi / getlo) to extract i32
registers from the i64 register. These are single moves which can be
eliminated if the destination register is a sub-register of the source. We
should implement proper subreg support in the register allocator to coalesce
these away.

There are other minor issues such as multiple instructions for a spill /
restore / move.

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates. For
example, to get 0xffff8000, we can use:

        mov r9, #&3f8000
        sub r9, r9, #&400000
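
For reference, a hedged C sketch of the encodability test this relies on (the
helper name is made up): an ARM data-processing immediate is an 8-bit value
rotated right by an even amount. Both 0x3f8000 and 0x400000 pass the test,
while 0xffff8000 does not:

int is_arm_immediate(unsigned v) {
  for (int r = 0; r < 32; r += 2) {
    /* Rotate v left by r bits, undoing a right rotation of the 8-bit field. */
    unsigned rot = r ? ((v << r) | (v >> (32 - r))) : v;
    if ((rot & ~0xffu) == 0)
      return 1;
  }
  return 0;
}

A constant that fails the test but is the sum or difference of two encodable
values can then be synthesized with a mov plus an add or sub, as in the
sequence above (0x3f8000 - 0x400000 wraps to 0xffff8000).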

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in the
prologue and epilogue if the inc / dec value is too large to fit in a single
immediate operand. In some cases, perhaps it might be better to load the
value from a constantpool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function:

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number if *p==4
or *p>6, and one more if *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame indexes materialized into
   registers (those that do not fit in addressing modes) to allow reuse in
   the same BB.
2. Finish scavenging for Thumb.
3. We know some spills and restores are unnecessary. The issue is that once
   live intervals are merged, they are never split. So every def is spilled
   and every use requires a restore if the register allocator decides the
   resulting live interval is not assigned a physical register. It may be
   possible (with the help of the scavenger) to turn some spill / restore
   pairs into register copies.

//===---------------------------------------------------------------------===//

Teach LSR about ARM addressing modes.
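
A hypothetical example of what this would buy us: in the loop below, a[i] can
be addressed directly with the register-plus-scaled-register mode, so LSR's
cost model should know that keeping the scaled index is free and that
rewriting the subscript into a separately incremented pointer may buy nothing:

int sum(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];          /* ldr rX, [rA, rI, lsl #2] */
  return s;
}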