//===---------------------------------------------------------------------===//
// Random notes about and ideas for the SystemZ backend.
//===---------------------------------------------------------------------===//

The initial backend is deliberately restricted to z10.  We should add support
for later architectures at some point.

--

SystemZDAGToDAGISel::SelectInlineAsmMemoryOperand() is passed "m" for all
inline asm memory constraints; it doesn't get to see the original constraint.
This means that it must conservatively treat all inline asm constraints
as the most restricted type, "R".
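
For illustration, the memory operand below reaches the hook only as "m",
even if the source used a more specific constraint such as "S" or "T"
(a hypothetical example, in the style of the asm-*.ll tests):

define void @f1(i32 *%ptr) {
  call void asm "blah $0", "=*m" (i32 *%ptr)
  ret void
}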

--

If an inline asm ties an i32 "r" result to an i64 input, the input
will be treated as an i32, leaving the upper bits uninitialised.
For example:

define void @f4(i32 *%dst) {
  %val = call i32 asm "blah $0", "=r,0" (i64 103)
  store i32 %val, i32 *%dst
  ret void
}

from CodeGen/SystemZ/asm-09.ll will use LHI rather than LGHI to load 103.
This seems to be a general target-independent problem.

--

The tuning of the choice between LOAD ADDRESS (LA) and addition in
SystemZISelDAGToDAG.cpp is suspect.  It should be tweaked based on
performance measurements.
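
For example, a plain 64-bit addition:

define i64 @f1(i64 %a, i64 %b) {
  %add = add i64 %a, %b
  ret i64 %add
}

could be implemented either as "agr %r2, %r3" or as "la %r2, 0(%r3,%r2)";
AGR sets the condition code while LA does not, and the relative cost of
the two forms is what needs measuring.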

--

We don't support tail calls at present.

--

We don't support prefetching yet.

--

There is no scheduling support.

--

We don't use the BRANCH ON COUNT or BRANCH ON INDEX families of instructions.
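
For example, the decrement-and-branch at the bottom of a counted loop
like the following could become a single BRCTG (a sketch, assuming the
trip count is nonzero):

define void @f1(i32 *%dst, i64 %count) {
entry:
  br label %loop
loop:
  %i = phi i64 [ %count, %entry ], [ %next, %loop ]
  store volatile i32 0, i32 *%dst
  %next = sub i64 %i, 1
  %cont = icmp ne i64 %next, 0
  br i1 %cont, label %loop, label %exit
exit:
  ret void
}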

--

We might want to use BRANCH ON CONDITION for conditional indirect calls
and conditional returns.
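
For example, the early return below currently needs a branch to a block
that ends in "br %r14"; a conditional BRANCH ON CONDITION to %r14
("ber %r14") would avoid that.  A sketch:

define void @f1(i32 *%dst, i32 %val) {
entry:
  %cmp = icmp eq i32 %val, 0
  br i1 %cmp, label %exit, label %store
store:
  store i32 %val, i32 *%dst
  br label %exit
exit:
  ret void
}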

--

We don't use the combined COMPARE AND BRANCH instructions.
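
For example, a sequence like (illustrative only):

        chi     %r2, 0
        je      .Ltarget

could be fused into the single instruction:

        cije    %r2, 0, .Ltarget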

--

We don't use the condition code results of anything except comparisons.

Implementing this may need something more finely grained than the z_cmp
and z_ucmp that we have now.  It might (or might not) also be useful to
have a mask of "don't care" values in conditional branches.  For example,
integer comparisons never set CC to 3, so the bottom bit of the CC mask
isn't particularly relevant.  JNLH and JE are equally good for testing
equality after an integer comparison, etc.
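
To spell out that example: each bit of the 4-bit CC mask selects one
condition code value:

        8  ->  CC 0  (equal after an integer comparison)
        4  ->  CC 1  (low)
        2  ->  CC 2  (high)
        1  ->  CC 3  (never set by integer comparisons)

JE tests mask 8 and JNLH tests mask 9 (CC 0 or CC 3); since CC 3 cannot
occur here, both test the same reachable conditions.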

--

We don't use the LOAD AND TEST or TEST DATA CLASS instructions.

--

We could use the generic floating-point forms of LOAD COMPLEMENT,
LOAD NEGATIVE and LOAD POSITIVE in cases where we don't need the
condition codes.  For example, we could use LCDFR instead of LCDBR.
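
For example, a plain negation:

define double @f1(double %a) {
  %neg = fsub double -0.000000e+00, %a
  ret double %neg
}

currently becomes LCDBR, which clobbers the condition code; LCDFR would
flip the sign without touching CC.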

--

We don't optimize block memory operations.

It's definitely worth using things like MVC, CLC, NC, XC and OC with
constant lengths.  MVCIN may be worthwhile too.

We should probably implement things like memcpy using MVC with EXECUTE,
and likewise memcmp using CLC.  MVCLE and CLCLE could be useful too.
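
For the constant-length case, a call such as (a sketch):

define void @f1(i8 *%dest, i8 *%src) {
  call void @llvm.memcpy.p0i8.p0i8.i64(i8 *%dest, i8 *%src,
                                       i64 16, i32 1, i1 false)
  ret void
}

declare void @llvm.memcpy.p0i8.p0i8.i64(i8 *, i8 *, i64, i32, i1)

could become a single "mvc 0(16,%r2),0(%r3)"; lengths above 256 bytes
would need multiple MVCs, and variable lengths would need EXECUTE.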

--

We don't optimize string operations.

MVST, CLST, SRST and CUSE could be useful here.  Some of the TRANSLATE
family might be too, although they are probably more difficult to exploit.
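
For example, a strlen call (sketch below) could use SRST with 0 in %r0
as the search character, subtracting the start address from the
resulting pointer:

declare i64 @strlen(i8 *)

define i64 @f1(i8 *%str) {
  %len = call i64 @strlen(i8 *%str)
  ret i64 %len
}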

--

We don't take full advantage of builtins like fabsl because the calling
conventions require f128s to be returned by invisible reference.

--

ADD LOGICAL WITH SIGNED IMMEDIATE could be useful when we need to
produce a carry.  SUBTRACT LOGICAL IMMEDIATE could be useful when we
need to produce a borrow.  (Note that there are no memory forms of
ADD LOGICAL WITH CARRY and SUBTRACT LOGICAL WITH BORROW, so the high
part of 128-bit memory operations would probably need to be done
via a register.)
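
For example, incrementing a 128-bit value in memory:

define void @f1(i128 *%ptr) {
  %val = load i128 *%ptr
  %add = add i128 %val, 1
  store i128 %add, i128 *%ptr
  ret void
}

ALGSI could add the 1 directly to the low doubleword in memory and leave
the carry in the condition code, with the high doubleword then updated
through a register as described above.  (A sketch, not a tested lowering.)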

--

We don't use the halfword forms of LOAD REVERSED and STORE REVERSED
(LRVH and STRVH).
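
For example (a sketch):

define i16 @f1(i16 *%src) {
  %val = load i16 *%src
  %swapped = call i16 @llvm.bswap.i16(i16 %val)
  ret i16 %swapped
}

declare i16 @llvm.bswap.i16(i16)

could use a single "lrvh %r2, 0(%r2)" for the load and swap, modulo
whatever extension the ABI requires of the result.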

--

We could take advantage of the various ... UNDER MASK instructions,
such as ICM and STCM.

--

We could make more use of the ROTATE AND ... SELECTED BITS instructions.
At the moment we only use RISBG, and then only for subword atomic operations.
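
For example, a shift-and-mask bitfield extraction:

define i64 @f1(i64 %a) {
  %shr = lshr i64 %a, 4
  %and = and i64 %shr, 255
  ret i64 %and
}

could be a single "risbg %r2, %r2, 56, 191, 60": rotate left by 60
(i.e. right by 4), select bits 56-63 and zero the rest (191 is 128+63,
the 128 setting the "zero remaining bits" flag).  A sketch; the operand
encoding should be checked against the PoP.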

--

DAGCombiner can detect integer absolute, but there's not yet an associated
ISD opcode.  We could add one and implement it using LOAD POSITIVE.
Negated absolutes could use LOAD NEGATIVE.
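
The canonical select-based form is:

define i32 @f1(i32 %a) {
  %neg = sub i32 0, %a
  %cmp = icmp slt i32 %a, 0
  %abs = select i1 %cmp, i32 %neg, i32 %a
  ret i32 %abs
}

which could become a single "lpr %r2, %r2" (LPGR for i64).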

--

DAGCombiner doesn't yet fold truncations of extended loads.  Functions like:

    unsigned long f (unsigned long x, unsigned short *y)
    {
      return (x << 32) | *y;
    }

therefore end up as:

        sllg    %r2, %r2, 32
        llgh    %r0, 0(%r3)
        lr      %r2, %r0
        br      %r14

but truncating the load would give:

        sllg    %r2, %r2, 32
        lh      %r2, 0(%r3)
        br      %r14

--

Functions like:

define i64 @f1(i64 %a) {
  %and = and i64 %a, 1
  ret i64 %and
}

ought to be implemented as:

        lhi     %r0, 1
        ngr     %r2, %r0
        br      %r14

but two-address optimisations reverse the order of the AND and force:

        lhi     %r0, 1
        ngr     %r0, %r2
        lgr     %r2, %r0
        br      %r14

CodeGen/SystemZ/and-04.ll has several examples of this.

--

Out-of-range displacements are usually handled by loading the full
address into a register.  In many cases it would be better to create
an anchor point instead.  E.g. for:

define void @f4a(i128 *%aptr, i64 %base) {
  %addr = add i64 %base, 524288
  %bptr = inttoptr i64 %addr to i128 *
  %a = load volatile i128 *%aptr
  %b = load i128 *%bptr
  %add = add i128 %a, %b
  store i128 %add, i128 *%aptr
  ret void
}

(from CodeGen/SystemZ/int-add-08.ll) we load %base+524288 and %base+524296
into separate registers, rather than using %base+524288 as a base for both.

--

Dynamic stack allocations round the size to 8 bytes and then allocate
that rounded amount.  It would be simpler to subtract the unrounded
size from the copy of the stack pointer and then align the result.
See CodeGen/SystemZ/alloca-01.ll for an example.
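
In C-like pseudocode, the two computations are:

        /* current: round the size, then subtract */
        newsp = sp - ((size + 7) & ~7);

        /* simpler: subtract the unrounded size, then align down */
        newsp = (sp - size) & ~7;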

--

Atomic loads and stores use the default compare-and-swap based implementation.
This is much too conservative in practice, since the architecture guarantees
that 1-, 2-, 4- and 8-byte loads and stores to aligned addresses are
inherently atomic.
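
For example (a sketch):

define i64 @f1(i64 *%src) {
  %val = load atomic i64 *%src monotonic, align 8
  ret i64 %val
}

could be a plain "lg %r2, 0(%r2)" rather than a compare-and-swap loop.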

--

If needed, we can support 16-byte atomics using LPQ, STPQ and CSDG.

--

We might want to model all access registers and use them to spill
32-bit values.