//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add does not carry from the low 16-bit chunk into the high one (see the
  sketch after this list).

* Implement predication support.
* Implement pre/post increment support.  (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions and then copying them to the FPU.  Is that slower
  than a load into the FPU?
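
A minimal C sketch (hypothetical function) of the kind of packed-halfword
accumulation UXTAB16 could cover; the single 32-bit add below is only
equivalent to the packed add when the low halfword sum cannot carry into
bit 16:

/* Add the two bytes in bits [7:0] and [23:16] of 'bytes' to the two 16-bit
   halves of 'acc'.  UXTAB16 does this as two independent halfword adds. */
unsigned acc16x2(unsigned acc, unsigned bytes) {
  return acc + (bytes & 0x00FF00FFu);
}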

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2.  The handling of 2-byte padding for Thumb is overly conservative.  There
would be a small gain to keeping accurate track of the padding (which would
require aligning functions containing constant pools to 4-byte boundaries).

3.  There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

4.  Use a priority queue to sort constant pool users in inverse order of
    position so we always process the one closest to the end of the function
    first.  This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

We need to start generating predicated instructions.  The .td files have a way
to express this now (see the PPC conditional return instruction), but the
branch folding pass (or a new if-cvt pass) should start producing these, at
least in the trivial case.

Among the obvious wins, doing so can eliminate the need to custom expand
copysign (i.e. we won't need to custom expand it to get the conditional
negate).

This allows us to eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in.  These
were disabled due to badness with the ARM carry flag on subtracts.
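
A trivial example of the pattern in question; assuming the usual little-endian
split of a long long across r0 (low) / r1 (high), folding the immediate would
give roughly:

/* long long subtract with a small immediate; with folding this is roughly
     subs r0, r0, #3
     sbc  r1, r1, #0
   The subtlety is that ARM's carry flag is the inverse of borrow on
   subtracts, which is the "badness" mentioned above. */
long long sub3(long long x) { return x - 3; }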

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal.  A 3-operation sequence would be better:
  t = (p sra 31);
  res = (p xor t) - t
This occurs in png decode.
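
In C, the branchless form is simply (a sketch, assuming an arithmetic right
shift for signed ints):

int abs_branchless(int p) {
  int t = p >> 31;      /* 0 for non-negative p, -1 for negative p */
  return (p ^ t) - t;   /* identity when t == 0, negate when t == -1 */
}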

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Look past instructions without side-effects (not load, store, branch, etc.)
   when forming the list of loads / stores to optimize.

2) Smarter register allocation?
We are probably missing some opportunities to use ldm / stm. Consider:

ldr r5, [r0]
ldr r4, [r0, #4]

This cannot be merged into an ldm. Perhaps we will need to do the
transformation before register allocation and then teach the register
allocator to allocate a chunk of consecutive registers.
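
A hypothetical source fragment behind such a pair of loads; both words are
adjacent and live at the same time, so they could become a single ldmia if the
allocator handed out an ascending register pair:

struct pair { int a, b; };

int sum_pair(struct pair *p) {
  /* loads from [r0] and [r0, #4]; mergeable into ldmia r0, {rN, rN+1}
     only if the two values land in consecutive, ascending registers */
  return p->a + p->b;
}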

3) Better representation for block transfer? This is from Olden/power:

	fldd d0, [r4]
	fstd d0, [r4, #+32]
	fldd d0, [r4, #+8]
	fstd d0, [r4, #+40]
	fldd d0, [r4, #+16]
	fstd d0, [r4, #+48]
	fldd d0, [r4, #+24]
	fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
That would need a major register allocator enhancement, though.

4) Can we recognize the relative position of constantpool entries? i.e. Treat

	ldr r0, LCPI17_3
	ldr r1, LCPI17_4
	ldr r2, LCPI17_5

   as
	ldr r0, LCPI17
	ldr r1, LCPI17+4
	ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARMv4, gcc uses ldmia to load a pair of 32-bit values
representing a 64-bit double FP constant:

	adr	r0, L6
	ldmia	r0, {r0-r1}

	.align 2
L6:
	.long	-858993459
	.long	1074318540

5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
ldrd / strd when there are only two destination registers and they form an
even/odd pair. However, we would probably pay a penalty if the address is not
aligned on an 8-byte boundary. This requires more information on load / store
nodes (and MIs?) than we currently carry.
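
A hypothetical ldrd / strd candidate is a simple 8-byte copy; if the pointers
are known to be 8-byte aligned this could be a single ldrd/strd pair, otherwise
the two-instruction (or ldm/stm) form may be safer:

void copy64(long long *dst, const long long *src) {
  /* ideally: ldrd r2, r3, [r1]  /  strd r2, r3, [r0]
     (the first register of the pair must be even) */
  *dst = *src;
}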

6) Struct copies appear to be done field by field
instead of word by word, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2):
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2):
        ldmia   r1, {r1-r2}
        stmia   r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as having a
large effect on size, and possibly on speed as well (we don't have a good way
to measure that on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
	sub sp, sp, #16
	str r4, [sp, #+12]
	str r5, [sp, #+8]
	str lr, [sp, #+4]
	mov r4, r0
	mov r5, r1
	ldr r0, LCPI2_0
	bl _foo
	fmsr f0, r0
	fcvtsd d0, f0
	fmdrr d1, r4, r5
	faddd d0, d0, d1
	fmrrd r0, r1, d0
	ldr lr, [sp, #+4]
	ldr r5, [sp, #+8]
	ldr r4, [sp, #+12]
	add sp, sp, #16
	bx lr

Ignore the prologue and epilogue stuff for a second. Note
	mov r4, r0
	mov r5, r1
the copies to callee-saved registers and the fact that they are only used by
the fmdrr instruction. It would have been better to schedule the fmdrr before
the call and place the result in a callee-saved DPR register; the two mov
instructions would then not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack and
then reloads them into r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

	stmfd	sp!, {r7, lr}
	add	r7, sp, #0
	sub	sp, sp, #12
	stmia	sp, {r0, r1, r2}
	ldmia	sp, {r1-r2}
	ldr	r0, L5
	ldr	r3, [sp, #8]
L2:
	add	r0, pc, r0
	bl	L_printf$stub

Instead of an stmia, an ldmia, and an ldr, wouldn't it be better to do three
moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

	mov	ip, r0
	ldr	r0, L5
	sub	sp, sp, #12
L2:
	add	r0, pc, r0
	@ lr needed for prologue
	ldmia	r0, {r0, r1, r2}
	stmia	sp, {r0, r1, r2}
	stmia	ip, {r0, r1, r2}
	mov	r0, ip
	add	sp, sp, #12
	bx	lr

r0 (and later ip) is the hidden parameter from the caller pointing to where the
return value should be stored. The first ldmia loads the constants into r0, r1,
r2. The last stmia stores r0, r1, r2 to the address passed in. However, there
is one additional stmia that stores r0, r1, and r2 to some stack location; that
store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
	%S = alloca %struct.s, align 4		; <%struct.s*> [#uses=1]
	%memtmp = alloca %struct.s		; <%struct.s*> [#uses=1]
	cast %struct.s* %S to sbyte*		; <sbyte*>:0 [#uses=2]
	call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
	cast %struct.s* %agg.result to sbyte*		; <sbyte*>:1 [#uses=2]
	call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
	cast %struct.s* %memtmp to sbyte*		; <sbyte*>:2 [#uses=1]
	call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
	ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constant pool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower small memcpy's to
ldmia / stmia. I think option 2 is better, but the current register allocator
cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at lowering
time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm-generated assembly code that looks like this:

LBB3:
 ...
LBB4:
...
  beq LBB3
  b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Register scavenging is now implemented.  The example in the previous version
of this document produces optimal code at -O2.

//===---------------------------------------------------------------------===//

Pre- / post-indexed loads / stores:

1) We should not perform the pre- / post-indexed load / store transform if the
base pointer is guaranteed to be live beyond the load / store. This can happen
if the base pointer is live out of the block in which we are performing the
optimization, e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops (see the loop
   sketch after this list).

4) Once we add support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.
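
As a concrete (hypothetical) case for item 3, a pointer-walking loop is the
classic candidate for a post-indexed load:

int sum_list(const int *p, int n) {
  int s = 0;
  while (n-- > 0)
    s += *p++;    /* could be ldr r3, [r0], #4: load, then bump the base */
  return s;
}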

//===---------------------------------------------------------------------===//

We should add i64 support to take advantage of the 64-bit load / stores.
We can add a pseudo i64 register class containing pseudo registers that are
register pairs. All other ops (e.g. add, sub) would be expanded as usual.

We need to add pseudo instructions (i.e. gethi / getlo) to extract i32 registers
from the i64 register. These are single moves which can be eliminated if the
destination register is a sub-register of the source. We should implement proper
subreg support in the register allocator to coalesce these away.

There are other minor issues such as multiple instructions for a spill / restore
/ move.

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000
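
(Checking the arithmetic, with & as the assembler's hex prefix: 0x003f8000 -
0x00400000 = -0x8000, which is 0xffff8000 as a 32-bit value.)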

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in the
prologue and epilogue if the inc / dec value is too large to fit in a single
immediate operand. In some cases, perhaps it might be better to load the value
from a constant pool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a binary
tree, as if it were a switch; the resulting code needs one fewer
compare-and-branch when *p<=2 or *p==5, the same number if *p==4 or *p>6, and
one more if *p==3.  So it should be a speed win (on balance).  However, the
revised code is larger, with 4 conditional branches instead of 3.

More seriously, there is a byte->word extend before each comparison, where
there should be only one, and the condition codes are not remembered when the
same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame indices materialized into
   registers (those that do not fit in addressing modes) to allow reuse within
   the same BB.
2. Finish scavenging for Thumb.
3. We know some spills and restores are unnecessary. The issue is that once
   live intervals are merged, they are never split. So every def is spilled
   and every use requires a restore if the register allocator decides the
   resulting live interval is not assigned a physical register. It may be
   possible (with the help of the scavenger) to turn some spill / restore
   pairs into register copies.

//===---------------------------------------------------------------------===//

Teach LSR about ARM addressing modes.
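
For example (a hypothetical loop), the register-plus-scaled-register addressing
mode can absorb the index computation that LSR would otherwise strength-reduce
into a separate pointer increment:

int sum_array(const int *a, int n) {
  int i, s = 0;
  for (i = 0; i < n; ++i)
    s += a[i];    /* could be ldr r3, [r0, r2, lsl #2] with i in r2 */
  return s;
}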