History log of /art/compiler/dex/quick/local_optimizations.cc
Revision Date Author Comments
20f85597828194c12be10d3a927999def066555e 19-Mar-2015 Vladimir Marko <vmarko@google.com> Fixed layout for dex caches in boot image.

Define a fixed layout for dex cache arrays (type, method,
string and field arrays) for dex caches in the boot image.
This gives those arrays fixed offsets from the boot image
code and allows PC-relative addressing of their elements.

Use the PC-relative load on arm64 for relevant instructions,
i.e. invoke-static, invoke-direct, const-string,
const-class, check-cast and instance-of. This reduces the
arm64 boot.oat on Nexus 9 by 1.1MiB.
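
A minimal sketch of the addressing idea, with hypothetical constants
and names (not taken from this CL): with a fixed layout, the offset of
any dex cache array element from the boot image code is a link-time
constant, so the backend can address it relative to the PC.

  #include <cstdint>

  // Assumed layout constants; in the real CL these come from the
  // image writer's fixed dex cache arrays layout.
  constexpr uint32_t kResolvedMethodsOffset = 0x1000;  // hypothetical
  constexpr uint32_t kPointerSize = 8;                 // arm64

  // Offset of a resolved-method slot from the dex cache arrays base.
  constexpr uint32_t MethodSlotOffset(uint32_t method_idx) {
    return kResolvedMethodsOffset + method_idx * kPointerSize;
  }

  // On arm64 the backend can then emit a pc-relative pair such as:
  //   adrp x0, <dex_cache_arrays_base + slot>   ; page address
  //   ldr  x0, [x0, #:lo12:<slot>]              ; low 12 bits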

This CL provides the infrastructure and shows, on arm64, the
gains we can achieve by having a fixed dex cache array
layout. To fully use this for the boot images, we
need to implement the PC-relative addressing for other
architectures. To achieve similar gains for apps, we need
to move the dex cache arrays to a .bss section of the oat
file. These changes will be implemented in subsequent CLs.

(Also remove some compiler_driver.h dependencies to reduce
incremental build times.)

Change-Id: Ib1859fa4452d01d983fd92ae22b611f45a85d69b
0b9203e7996ee1856f620f95d95d8a273c43a3df 23-Jan-2015 Andreas Gampe <agampe@google.com> ART: Some Quick cleanup

Make several fields const in CompilationUnit. May benefit some Mir2Lir
code that repeats tests, and in general immutability is good.

Remove compiler_internals.h and refactor some other headers to reduce
overly broad imports (and thus forced recompiles on changes).

Change-Id: I898405907c68923581373b5981d8a85d2e5d185a
a7894cdb063edb88f1420a42207e0c4bd27ab4f9 06-Aug-2014 Dave Allison <dallison@google.com> Fix checks for kLiteral in local optimizations.

The check for kLiteral (literal load) just tested the kLiteral
bit in the def mask. The kEncodeAll mask also has the kLiteral
bit set, so this check was triggering spuriously. The fix is to
check that only the kLiteral bit is set and no other special bits.

The semantics of the special bits in the use/def mask are that
at most one of them can be set at a time.
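
A minimal sketch of the fixed predicate, with assumed bit assignments
(the real masks live in the use/def encoding):

  #include <cstdint>

  // Hypothetical encoding: a few high "special" bits, at most one of
  // which may be set in a well-formed mask.
  constexpr uint64_t kLiteralBit  = UINT64_C(1) << 63;
  constexpr uint64_t kSpecialBits = UINT64_C(0xF) << 60;
  constexpr uint64_t kEncodeAll   = ~UINT64_C(0);  // kLiteralBit set too

  // Buggy check: also true for kEncodeAll.
  inline bool IsLiteralBuggy(uint64_t def_mask) {
    return (def_mask & kLiteralBit) != 0;
  }

  // Fixed check: kLiteral must be the only special bit set.
  inline bool IsLiteral(uint64_t def_mask) {
    return (def_mask & kSpecialBits) == kLiteralBit;
  }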

Bug: 16824330

Change-Id: I0f1c1157e017870414ffef11767e5433d1fd4401
adc73cbe869f9560cf84bda2e953a2b267b1438f 06-Aug-2014 Dave Allison <dallison@google.com> Fix checks for kLiteral in local optimizations.

The check for kLiteral (literal load) just tested the kLiteral
bit in the def mask. The kEncodeAll mask also has the kLiteral
bit set, so this check was triggering spuriously. The fix is to
check that only the kLiteral bit is set and no other special bits.

The semantics of the special bits in the use/def mask are that
at most one of them can be set at a time.

Bug: 16824330

Change-Id: I0f1c1157e017870414ffef11767e5433d1fd4401
63999683329612292d534e6be09dbde9480f1250 15-Jul-2014 Serban Constantinescu <serban.constantinescu@arm.com> Revert "Revert "Enable Load Store Elimination for ARM and ARM64""

This patch refactors the implementation of the LoadStoreElimination
optimisation pass. Please note that this pass was disabled and not
functional for any of the backends.

The current implementation tracks aliases and handles DalvikRegs as well
as heap memory regions. It has been tested and is known to optimise
out the following redundant pairs (see the sketch after the list):
* Load - Load
* Store - Load
* Store - Store
* Load Literals
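
A minimal sketch of the core redundancy check, with simplified types and
the alias analysis elided (names assumed, not the pass's real structure):

  #include <cstdint>

  struct LIR {
    bool is_load;
    bool is_store;
    uint64_t location_mask;  // resource mask of the accessed location
    bool is_nop = false;
  };

  // If this_lir made the value at a location available (by loading or
  // storing it) and check_lir re-loads the same location with no
  // intervening aliasing write (check elided here), the second load is
  // redundant and can be turned into a register copy.
  inline void TryEliminateReload(const LIR& this_lir, LIR& check_lir) {
    bool same_location = this_lir.location_mask == check_lir.location_mask;
    if (same_location && check_lir.is_load &&
        (this_lir.is_load || this_lir.is_store)) {
      check_lir.is_nop = true;  // the real pass emits a move instead
    }
  }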

Change-Id: I3aadb12a787164146a95bc314e85fa73ad91e12b
c32447bcc8c36ee8ff265ed678c7df86936a9ebe 27-Jul-2014 Bill Buzbee <buzbee@android.com> Revert "Enable Load Store Elimination for ARM and ARM64"

On extended testing, I'm seeing a CHECK failure at utility_arm.cc:1201.

This reverts commit fcc36ba2a2b8fd10e6eebd21ecb6329606443ded.

Change-Id: Icae3d49cd7c8fcab09f2f989cbcb1d7e5c6d137a
fcc36ba2a2b8fd10e6eebd21ecb6329606443ded 15-Jul-2014 Serban Constantinescu <serban.constantinescu@arm.com> Enable Load Store Elimination for ARM and ARM64

This patch refactors the implementation of the LoadStoreElimination
optimisation pass. Please note that this pass was disabled and not
functional for any of the backends.

The current implementation tracks aliases and handles DalvikRegs as well
as heap memory regions. It has been tested and is known to optimise
out the following:
* Load - Load
* Store - Load
* Store - Store
* Load Literals

Change-Id: Iefae9b696f87f833ef35c451ed4d49c5a1b6fde0
af263df7f643e699abf622c64447d31bacc14c34 12-Jul-2014 Andreas Gampe <agampe@google.com> ART: Change GenPCUseDefEncoding(), turn on Load Hoisting for ARM64

This defines the PC resource mask as empty, as the PC is not
accessible on ARM64.

Unify code paths with x86 in LoadStoreElimination and LoadHoisting.
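
A minimal sketch of the change, with an assumed signature (the real code
returns a backend resource mask):

  #include <cstdint>

  using ResourceMask = uint64_t;
  constexpr ResourceMask kEncodeNone = 0;

  // On ARM64 the PC is not a generally accessible register, so no
  // instruction is modeled as using or defining it; an empty mask lets
  // the common LSE/hoisting paths treat ARM64 like x86.
  ResourceMask GenPCUseDefEncoding() {
    return kEncodeNone;
  }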

Change-Id: Iea8b9e666f306c7a6ff52b6c5bf7e05b35346b2c
8dea81ca9c0201ceaa88086b927a5838a06a3e69 06-Jun-2014 Vladimir Marko <vmarko@google.com> Rewrite use/def masks to support 128 bits.

Reduce LIR memory usage by holding masks in the LIR by pointer
rather than by value, and by using pre-defined const masks for
the common cases, allocating very few on the arena.
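
A minimal sketch of the scheme, with simplified names (the real class has
more operations):

  #include <cstdint>

  // 128-bit mask; cheap to compare and intersect.
  class ResourceMask {
   public:
    constexpr ResourceMask(uint64_t lo, uint64_t hi) : bits_{lo, hi} {}
    bool Intersects(const ResourceMask& other) const {
      return (bits_[0] & other.bits_[0]) != 0 ||
             (bits_[1] & other.bits_[1]) != 0;
    }
   private:
    uint64_t bits_[2];
  };

  // Shared constants for the common cases; most LIRs point at one of
  // these instead of owning a mask, so little is allocated on the arena.
  constexpr ResourceMask kEncodeNone(0, 0);
  constexpr ResourceMask kEncodeAll(~UINT64_C(0), ~UINT64_C(0));

  struct LIR {
    const ResourceMask* use_mask;
    const ResourceMask* def_mask;
  };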

Change-Id: I0f6d27ef6867acd157184c8c74f9612cebfe6c16
091cc408e9dc87e60fb64c61e186bea568fc3d3a 31-Mar-2014 buzbee <buzbee@google.com> Quick compiler: allocate doubles as doubles

Significant refactoring of register handling to unify usage across
all targets & 32/64 backends.

Reworked RegStorage encoding to allow expanded use of
x86 xmm registers; removed vector registers as a separate
register type. Reworked RegisterInfo to describe aliased
physical registers. Eliminated quite a bit of target-specific code
and generalized common code.

Use of RegStorage instead of int for registers now propagated down
to the NewLIRx() level. In future CLs, the NewLIRx() routines will
be replaced with versions that are explicit about what kind of
operand they expect (RegStorage, displacement, etc.). The goal
is to eventually use RegStorage all the way to the assembly phase.

TBD: MIPS needs verification.
TBD: Re-enable liveness tracking.

Change-Id: I388c006d5fa9b3ea72db4e37a19ce257f2a15964
6a58cb16d803c9a7b3a75ccac8be19dd9d4e520d 02-Apr-2014 Dmitry Petrochenko <dmitry.petrochenko@intel.com> art: Handle x86_64 architecture equal to x86

This patch forces FE/ME to treat x86_64 exactly like x86.
The x86_64-specific logic will be revised later, when the assembler is ready.

Change-Id: I4a92477a6eeaa9a11fd710d35c602d8d6f88cbb6
Signed-off-by: Dmitry Petrochenko <dmitry.petrochenko@intel.com>
2700f7e1edbcd2518f4978e4cd0e05a4149f91b6 07-Mar-2014 buzbee <buzbee@google.com> Continuing register cleanup

Ready for review.

Continue the process of using RegStorage rather than
ints to hold register values in the top layers of codegen.
Given the huge number of changes in this CL, I've attempted
to minimize the number of actual logic changes. With this
CL, the use of ints for registers has largely been eliminated
except in the lowest utility levels. "Wide" utility routines
have been updated to take a single RegStorage rather than
a pair of ints representing low and high registers.
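
A minimal sketch of the idea behind RegStorage, with the encoding details
simplified (the real class packs this into bit fields):

  // One value that can name either a solo register or a register pair,
  // so "wide" utilities take a single argument instead of lo/hi ints.
  class RegStorage {
   public:
    static RegStorage Solo(int reg) { return RegStorage(reg, -1, false); }
    static RegStorage MakePair(int low, int high) {
      return RegStorage(low, high, true);
    }
    bool IsPair() const { return is_pair_; }
    int GetLowReg() const { return low_; }
    int GetHighReg() const { return high_; }
   private:
    RegStorage(int low, int high, bool is_pair)
        : low_(low), high_(high), is_pair_(is_pair) {}
    int low_;
    int high_;
    bool is_pair_;
  };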

Upcoming CLs will be smaller and more targeted. My expectations:
o Allocate floating-point double registers as a single double rather than
a pair of single-precision registers.
o Refactor to push code which assumes long and double Dalvik
values are held in a pair of registers to the target-dependent
layer.
o Clean-up of the xxx_mir.h files to reduce the amount of #defines
for registers. May also do a register renumbering to bring all
of our targets' register naming more consistent. Possibly
introduce a target-independent float/non-float test at the
RegStorage level.

Change-Id: I646de7392bdec94595dd2c6f76e0f1c4331096ff
99ad7230ccaace93bf323dea9790f35fe991a4a2 26-Feb-2014 Razvan A Lupusoru <razvan.a.lupusoru@intel.com> Relaxed memory barriers for x86

x86 provides stronger memory-ordering guarantees, so its memory barriers
can be optimized. This patch ensures that all memory barriers for x86 are
treated as scheduling barriers, and in the one case where a real barrier
is needed (the StoreLoad case), an mfence is used.
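
A minimal sketch of the dispatch, with assumed names (the real code emits
LIR through the x86 backend):

  enum MemBarrierKind { kLoadLoad, kLoadStore, kStoreStore, kStoreLoad };

  // Returns true if an actual instruction was generated. On x86 only
  // StoreLoad reordering is possible, so only that case needs mfence;
  // every other barrier merely constrains the instruction scheduler.
  bool GenMemBarrier(MemBarrierKind kind) {
    if (kind == kStoreLoad) {
      // NewLIR0(kX86Mfence);  // emit the real fence
      return true;
    }
    return false;  // scheduling barrier only, no code emitted
  }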

Change-Id: I13d02bf3f152083ba9f358052aedb583b0d48640
Signed-off-by: Razvan A Lupusoru <razvan.a.lupusoru@intel.com>
83cc7ae96d4176533dd0391a1591d321b0a87f4f 12-Feb-2014 Vladimir Marko <vmarko@google.com> Create a scoped arena allocator and use that for LVN.

This saves more than 0.5s of boot.oat compilation time
on Nexus 5.

TODO: Move other stuff to the scoped allocator. This CL
alone increases the peak memory allocation. By reusing
the memory for other parts of the compilation we should
reduce this overhead.
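
A minimal sketch of a scoped arena of the kind described, single chunk
and simplified (the real allocator recycles chunks through a pool):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  class ScopedArena {
   public:
    explicit ScopedArena(size_t capacity) : storage_(capacity), top_(0) {}
    void* Alloc(size_t bytes) {
      size_t aligned = (bytes + 7u) & ~static_cast<size_t>(7u);
      if (top_ + aligned > storage_.size()) return nullptr;
      void* result = &storage_[top_];
      top_ += aligned;
      return result;
    }
    // No per-object free: LVN allocates throughout its run and the
    // whole arena is reclaimed at once when the scope ends.
   private:
    std::vector<uint8_t> storage_;
    size_t top_;
  };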

Change-Id: Ifbc00aab4f3afd0000da818dfe68b96713824a08
1da1e2fceb0030b4b76b43510b1710a9613e0c2e 15-Nov-2013 buzbee <buzbee@google.com> More compile-time tuning

Another round of compile-time tuning, this time yielding in the
vicinity of 3% total reduction in compile time (which means about
double that for the Quick Compile portion).

Primary improvements are skipping the basic block combine optimization
pass when using Quick (because we already have big blocks), combining
the null check elimination and type inference passes, and limiting
expensive local value number analysis to only those blocks which
might benefit from it.

Following this CL, the actual compile phase consumes roughly 60%
of the total dex2oat time on the host, and 55% on the target (Note,
I'm subtracting out the Deduping time here, which the timing logger
normally counts against the compiler).

A sample breakdown of the compilation time follows (this taken on
PlusOne.apk w/ a Nexus 4):

39.00% -> MIR2LIR: 1374.90 (Note: includes local optimization & scheduling)
10.25% -> MIROpt:SSATransform: 361.31
8.45% -> BuildMIRGraph: 297.80
7.55% -> Assemble: 266.16
6.87% -> MIROpt:NCE_TypeInference: 242.22
5.56% -> Dedupe: 196.15
3.45% -> MIROpt:BBOpt: 121.53
3.20% -> RegisterAllocation: 112.69
3.00% -> PcMappingTable: 105.65
2.90% -> GcMap: 102.22
2.68% -> Launchpads: 94.50
1.16% -> MIROpt:InitRegLoc: 40.94
1.16% -> Cleanup: 40.93
1.10% -> MIROpt:CodeLayout: 38.80
0.97% -> MIROpt:ConstantProp: 34.35
0.96% -> MIROpt:UseCount: 33.75
0.86% -> MIROpt:CheckFilters: 30.28
0.44% -> SpecialMIR2LIR: 15.53
0.44% -> MIROpt:BBCombine: 15.41

(cherry pick of 9e8e234af4430abe8d144414e272cd72d215b5f3)

Change-Id: I86c665fa7e88b75eb75629a99fd292ff8c449969
409fe94ad529d9334587be80b9f6a3d166805508 11-Oct-2013 buzbee <buzbee@google.com> Quick assembler fix

This CL re-instates the select pattern optimization disabled by
CL 374310, and fixes the underlying problem: improper handling of
the kPseudoBarrier LIR opcode. The bug was introduced in the
recent assembler restructuring. In short, LIR pseudo opcodes (which
have values < 0), should always have size 0 - and thus cause no
bits to be emitted during assembly. In this case, bad logic caused
us to set the size of a kPseudoBarrier opcode via lookup through the
EncodingMap.

Because all pseudo ops are < 0, this meant we did an array underflow
load, picking up whatever garbage was located before the EncodingMap.
This explains why this error showed up recently - we'd previously just
gotten a lucky layout.

This CL corrects the faulty logic, and adds DCHECKs to uses of
the EncodingMap to ensure that we don't try to access w/ a
pseudo op. Additionally, the existing is_pseudo_op() macro is
replaced with IsPseudoLirOp(), named similar to the existing
IsPseudoMirOp().
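
A minimal sketch of the corrected size computation (structure assumed):

  #include <cassert>
  #include <cstddef>

  struct EncodingEntry { size_t size; /* opcode, fmt, ... */ };
  extern const EncodingEntry EncodingMap[];  // valid for opcodes >= 0 only

  inline bool IsPseudoLirOp(int opcode) { return opcode < 0; }

  // Pseudo ops emit no bits, so their size is always 0; the EncodingMap
  // must never be indexed with a negative opcode (the old array
  // underflow).
  inline size_t GetInsnSize(int opcode) {
    if (IsPseudoLirOp(opcode)) {
      return 0;
    }
    assert(opcode >= 0);  // DCHECK in the real code
    return EncodingMap[opcode].size;
  }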

Change-Id: I46761a0275a923d85b545664cadf052e1ab120dc
b48819db07f9a0992a72173380c24249d7fc648a 15-Sep-2013 buzbee <buzbee@google.com> Compile-time tuning: assembly phase

Not as much compile-time gain from reworking the assembly phase as I'd
hoped, but still worthwhile. Should see ~2% improvement thanks to
the assembly rework. On the other hand, expect some huge gains for some
applications thanks to better detection of large machine-generated init
methods. Thinkfree shows a 25% improvement.

The major assembly change was to thread the LIR nodes that
require fixup into a fixup chain. Only those are processed during the
final assembly pass(es). This doesn't help for methods which only
require a single pass to assemble, but does speed up the larger methods
which require multiple assembly passes.
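
A minimal sketch of the threading step (field names assumed):

  struct LIR {
    LIR* next;        // full instruction list
    LIR* fixup_next;  // chain of fixup candidates only
    bool needs_fixup;
  };

  // Walk the full list once and link together only the LIRs that may
  // need pc-relative fixup; retry passes then traverse this short
  // chain instead of every instruction.
  inline LIR* ThreadFixups(LIR* head) {
    LIR* chain = nullptr;
    LIR** tail = &chain;
    for (LIR* lir = head; lir != nullptr; lir = lir->next) {
      if (lir->needs_fixup) {
        *tail = lir;
        tail = &lir->fixup_next;
      }
    }
    *tail = nullptr;
    return chain;
  }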

Also replaced the block_map_ basic block lookup table (which contained
space for a BasicBlock* for each dex instruction unit) with a block id
map - cutting its space requirements by half in a 32-bit pointer
environment.

Changes:
o Reduce size of LIR struct by 12.5% (one of the big memory users)
o Repurpose the use/def portion of the LIR after optimization is complete.
o Encode instruction bits to LIR
o Thread LIR nodes requiring pc fixup
o Change follow-on assembly passes to only consider fixup LIRs
o Switch on pc-rel fixup kind
o Fast-path for small methods - single pass assembly
o Avoid using cb[n]z for null checks (almost always exceed displacement)
o Improve detection of large initialization methods.
o Rework def/use flag setup.
o Remove a sequential search from FindBlock using a lookup table of 16-bit
block ids rather than full block pointers.
o Eliminate pcRelFixup and use fixup kind instead.
o Add check for 16-bit overflow on dex offset.

Change-Id: I4c6615f83fed46f84629ad6cfe4237205a9562b4
252254b130067cd7a5071865e793966871ae0246 09-Sep-2013 buzbee <buzbee@google.com> More Quick compile-time tuning: labels & branches

This CL represents a roughly 3.5% performance improvement for the
compile phase of dex2oat. Most of the gain comes from avoiding
the generation of dex boundary LIR labels unless a debug listing
is requested. The other significant change is moving from a basic block
ending branch model of "always generate a fall-through branch, and then
delete it if we can" to a "only generate a fall-through branch if we need
it" model.

The data motivating these changes follow. Note that two areas of
potentially attractive gain remain: restructuring the assembler model and
reworking the register handling utilities. These will be addressed
in subsequent CLs.

--- data follows

The Quick compiler's assembler has shown up on profile reports a bit
more than seems reasonable. We've tried a few quick fixes to apparently
hot portions of the code, but without much gain. So, I've been looking at
the assembly process at a somewhat higher level. There look to be several
potentially good opportunities.

First, an analysis of the makeup of the LIR graph showed a surprisingly
high proportion of LIR pseudo ops. Using the boot classpath as a basis,
we get:

32.8% of all LIR nodes are pseudo ops.
10.4% are LIR instructions which require pc-relative fixups.
11.8% are LIR instructions that have been nop'd by the various
optimization passes.

Looking only at the LIR pseudo ops, we get:
kPseudoDalvikByteCodeBoundary 43.46%
kPseudoNormalBlockLabel 21.14%
kPseudoSafepointPC 20.20%
kPseudoThrowTarget 6.94%
kPseudoTarget 3.03%
kPseudoSuspendTarget 1.95%
kPseudoMethodExit 1.26%
kPseudoMethodEntry 1.26%
kPseudoExportedPC 0.37%
kPseudoCaseLabel 0.30%
kPseudoBarrier 0.07%
kPseudoIntrinsicRetry 0.02%
Total LIR count: 10167292

The standout here is the Dalvik opcode boundary marker. This is just a
label inserted at the beginning of the codegen for each Dalvik bytecode.
If we're also doing a verbose listing, this is also where we hang the
pretty-print disassembly string. However, this label was also
being used as a convenient way to find the target of switch case
statements (and was, I think, at one point used in the Mir->GBC conversion
process).

This CL moves the use of kPseudoDalvikByteCodeBoundary labels to only
verbose listing runs, and replaces the codegen uses of the label with
the kPseudoNormalBlockLabel attached to the basic block that contains the
switch case target. Great savings here - 14.3% reduction in the number of
LIR nodes needed. After this CL, our LIR pseudo proportions drop to 21.6%
of all LIR. That's still a lot, but much better. Possible further
improvements via combining normal labels with kPseudoSafepointPC labels
where appropriate, and also perhaps reduce memory usage by using a
short-hand form for labels rather than a full LIR node. Also, many
of the basic block labels are no longer branch targets by the time
we get to assembly - cheaper to delete, or just ignore?

Here's the "after" LIR pseudo op breakdown:

kPseudoNormalBlockLabel 37.39%
kPseudoSafepointPC 35.72%
kPseudoThrowTarget 12.28%
kPseudoTarget 5.36%
kPseudoSuspendTarget 3.45%
kPseudoMethodEntry 2.24%
kPseudoMethodExit 2.22%
kPseudoExportedPC 0.65%
kPseudoCaseLabel 0.53%
kPseudoBarrier 0.12%
kPseudoIntrinsicRetry 0.04%
Total LIR count: 5748232

Not done in this CL, but it will be worth experimenting with actually
deleting LIR nodes from the graph when they are optimized away, rather
than just setting the NOP bit. Keeping them around is invaluable
during debugging - but when not debugging it may pay off if the cost of
node removal is less than the cost of traversing through dead nodes
in subsequent passes.

Next up (and partially in this CL - but mostly to be done in follow-on
CLs) is the overall assembly process. Inherited from the trace JIT,
the Quick compiler has a fairly simple-minded approach to instruction
assembly. First, a pass is made over the LIR list to assign offsets
to each instruction. Then, the assembly pass is made - which generates
the actual machine instruction bit patterns and pushes the instruction
data into the code_buffer. However, the code generator takes the "always
optimistic" approach to instruction selection and emits the shortest
instruction. If, during assembly, we find that a branch or load doesn't
reach, that short-form instruction is replaced with a longer sequence.

Of course, this invalidates the previously-computed offset calculations.
Assembly thus is an iterative process: compute offsets and then assemble
until we survive an assembly pass without invalidation. This seems
like a likely candidate for improvement. First, I analyzed the
number of retries required, and the reason for invalidation over the
boot classpath load.
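
A minimal sketch of the optimistic assemble-and-retry loop described
above (names assumed):

  enum AssemblerStatus { kSuccess, kRetryAll };

  extern void AssignOffsets();                    // pass 1: offset layout
  extern AssemblerStatus AssembleInstructions();  // pass 2: emit bits

  // Emit short forms optimistically; whenever one doesn't reach and is
  // widened, every later offset is stale, so lay out and assemble again.
  void AssembleLIR() {
    AssignOffsets();
    while (AssembleInstructions() == kRetryAll) {
      AssignOffsets();
    }
  }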

The results: more than half of methods don't require a retry, and
very few require more than 1 extra pass:

5 or more: 6 of 96334
4 or more: 22 of 96334
3 or more: 140 of 96334
2 or more: 1794 of 96334 - 2%
1 or more: 40911 of 96334 - 40%
0 retries: 55423 of 96334 - 58%

The interesting group here is the one that requires 1 retry. Looking
at the reason, we see three typical reasons:

1. A cbnz/cbz doesn't reach (only 7 bits of offset)
2. A 16-bit Thumb1 unconditional branch doesn't reach.
3. An unconditional branch which branches to the next instruction
is encountered, and deleted.

The first 2 cases are the cost of the optimistic strategy - nothing
much to change there. However, the interesting case is #3 - dead
branch elimination. A further analysis of the single retry group showed
that 42% of the methods (16305) that required a single retry did so
*only* because of dead branch elimination. The big question here is
why so many dead branches survive to the assembly stage. We have
a dead branch elimination pass which is supposed to catch these - perhaps
it's not working correctly, should be moved later in the optimization
process, or perhaps run multiple times.

Other things to consider:

o Combine the offset generation pass with the assembly pass. Skip
pc-relative fixup assembly (other than assigning offset), but push
LIR* for them into a work list. Following the main pass, zip through
the work list and assemble the pc-relative instructions (now that we
know the offsets). This would significantly cut back on traversal
costs.

o Store the assembled bits into both the code buffer and the LIR.
In the event we have to retry, only the pc-relative instructions
would need to be assembled, and we'd finish with a pass over the
LIR just to dump the bits into the code buffer.

Change-Id: I50029d216fa14f273f02b6f1c8b6a0dde5a7d6a6
56c717860df2d71d66fb77aa77f29dd346e559d3 06-Sep-2013 buzbee <buzbee@google.com> Compile-time tuning

Specialized the dataflow iterators and did a few other minor tweaks.
Showing ~5% compile-time improvement in a single-threaded environment;
less in multi-threaded (presumably because we're blocked by something
else).

Change-Id: I2e2ed58d881414b9fc97e04cd0623e188259afd2
f6c4b3ba3825de1dbb3e747a68b809c6cc8eb4db 25-Aug-2013 Mathieu Chartier <mathieuc@google.com> New arena memory allocator.

Before, we were creating arenas for each method. The issue with doing this
is that we needed to memset each memory allocation. This can be improved
if you start out with arenas that contain all zeroed memory and recycle
them for each method. When you give memory back to the arena pool you do
a single memset to zero out all of the memory that you used.

Always inlined the fast path of the allocation code.

Removed the "zero" parameter since the new arena allocator always returns
zeroed memory.
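
A minimal sketch of the recycling pool (simplified to fixed-size,
single-chunk arenas, allocation fast path elided; names assumed):

  #include <cstddef>
  #include <cstring>
  #include <vector>

  struct Arena {
    std::vector<unsigned char> memory;
    size_t bytes_allocated = 0;
    explicit Arena(size_t size) : memory(size) {}  // zero-initialized
  };

  class ArenaPool {
   public:
    Arena* Acquire() {
      if (free_.empty()) return new Arena(1u << 20);
      Arena* arena = free_.back();
      free_.pop_back();
      return arena;  // already zeroed by the last Release()
    }
    void Release(Arena* arena) {
      // One memset over the bytes actually used, instead of zeroing
      // every individual allocation up front.
      std::memset(arena->memory.data(), 0, arena->bytes_allocated);
      arena->bytes_allocated = 0;
      free_.push_back(arena);
    }
   private:
    std::vector<Arena*> free_;
  };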

Host dex2oat time on target oat apks (2 samples each).
Before:
real 1m11.958s
user 4m34.020s
sys 1m28.570s

After:
real 1m9.690s
user 4m17.670s
sys 1m23.960s

Target device dex2oat samples (Mako, Thinkfree.apk):
Without new arena allocator:
0m26.47s real 0m54.60s user 0m25.85s system
0m25.91s real 0m54.39s user 0m26.69s system
0m26.61s real 0m53.77s user 0m27.35s system
0m26.33s real 0m54.90s user 0m25.30s system
0m26.34s real 0m53.94s user 0m27.23s system

With new arena allocator:
0m25.02s real 0m54.46s user 0m19.94s system
0m25.17s real 0m55.06s user 0m20.72s system
0m24.85s real 0m55.14s user 0m19.30s system
0m24.59s real 0m54.02s user 0m20.07s system
0m25.06s real 0m55.00s user 0m20.42s system

Correctness of Thinkfree.apk.oat verified by diffing both of the oat files.

Change-Id: I5ff7b85ffe86c57d3434294ca7a621a695bf57a9
0cd7ec2dcd8d7ba30bf3ca420b40dac52849876c 18-Jul-2013 Brian Carlstrom <bdc@google.com> Fix cpplint whitespace/blank_line issues

Change-Id: Ice937e95e23dd622c17054551d4ae4cebd0ef8a2
2ce745c06271d5223d57dbf08117b20d5b60694a 18-Jul-2013 Brian Carlstrom <bdc@google.com> Fix cpplint whitespace/braces issues

Change-Id: Ide80939faf8e8690d8842dde8133902ac725ed1a
7940e44f4517de5e2634a7e07d58d0fb26160513 12-Jul-2013 Brian Carlstrom <bdc@google.com> Create separate Android.mk for main build targets

The runtime, compiler, dex2oat, and oatdump are now in separate trees
to prevent dependency creep. They can now be individually built
without rebuilding the rest of the art projects. dalvikvm and jdwpspy
were already this way. Builds in the art directory should behave as
before, building everything including tests.

Change-Id: Ic6b1151e5ed0f823c3dd301afd2b13eb2d8feb81