History log of /external/squashfs-tools/squashfs-tools/process_fragments.c
Revision Date Author Comments
2eb7df04ccd4e2752e7d3fd876a2a45f9cfafc78 06-May-2014 Phillip Lougher <phillip@squashfs.org.uk> process_fragments: remove commented out debugging SQUASHFS_TRACE

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
/external/squashfs-tools/squashfs-tools/process_fragments.c
943acada7a60a048b4be53a4df1e94e8b10e08a6 17-Apr-2014 Phillip Lougher <phillip@squashfs.org.uk> mksquashfs: fix a potential non-default option deadlock

Fix a potential deadlock in Mksquashfs that may be triggered when
using non-default options.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
/external/squashfs-tools/squashfs-tools/process_fragments.c
8bb17b0275fa35318ad35c8fd477023004f940aa 31-Mar-2014 Phillip Lougher <phillip@squashfs.org.uk> Mksquashfs: significantly optimise fragment duplicate checking

Remove the last remaining parallelisation bottleneck in
Mksquashfs: fragment duplicate checking, which was previously done
on the single main thread.

Back in 2006 when I first parallelised Mksquashfs, doing this on the
main thread was not initially considered to be much of an issue. If
you don't have (m)any duplicates then you avoid the issue full stop.
Even when you do have fragments which need to be checked, the
necessary work of memcmp (memory compare) is not too arduous: it is
much faster than the upstream file reader thread, and much much
faster than the downstream fragment compressor thread(s). So if
Mksquashfs is running slow, this is "not the bottleneck you're
looking for"; the bottleneck is going to be either fragment
compression or file reading. This reasoning holds on the basis that
most duplicates are local and the fragment referenced can be found
in the fragment cache.

But often when I ran Mksquashfs on a source filesystem with a lot of
duplicates, the performance would be disappointing. Without
duplicates I expected to get full processor utilisation, but with
duplicates you might get roughly 200% or even 100% (i.e. one
processor core), at least while it was hitting a run of duplicates
in the source filesystem. Increasing the size of the fragment cache
would reduce the performance hit, which gave a substantial hint that
the problem was fragment cache misses causing fragment blocks to be
read back off disk and decompressed on the single main thread. But
it was evident that wasn't the whole story.

The culprit has always self-evidently been the single-threaded
duplicate checking on the main thread; this has been apparent almost
since the initial parallelisation of Mksquashfs in 2006. But
although I've had my suspicions as to why (the hint above), with the
demands/prioritisation of extra functionality, this has remained on
my TODO list until now.

Analysis now has shown the problem to be a triple whammy:

1. With duplicates (and even without), there are substantial fragment
cache misses, which make the single main thread spend a lot of its
duplicate checking time reading fragment blocks back off disk,
decompressing them, and then memcmp'ing them for a match (see the
sketch after this list). This is because with a large filesystem,
many fragments match at the checksum level even though they're not
actually a match at the byte level - the checksums eliminate most
files, but on a large filesystem that still leaves multiple files
which match, and these matches are random and do not follow locality
of reference. So invariably these fragment blocks are no longer in
the fragment cache: if you're compressing 1Gbyte of files and have a
64Mbyte (default) fragment cache, most checksum matches will not be
in the cache, because the matches can literally be anywhere in the
part of the filesystem already compressed and written to disk. The
checksum matches could in theory be reduced by improving the
discriminating power of the checksums, but this is a zero sum game;
the extra processing overhead of computing a more sophisticated
checksum for *all* blocks would easily outweigh the benefit of fewer
checksum matches.

2. Even with the knowledge that the main thread spends a lot of its
time reading in and decompressing fragment blocks, we're left with
the fact that the main thread has enough "bandwidth" to do this
without becoming a bottleneck, so there's more to the story.

The "more to the story" is that the main thread spends most of its
time asleep! As fragment compression is the bottleneck in any
Mksquashfs run, we run out of "empty fragment blocks" because all
of the fragment blocks become filled and get queued on the fragment
compression threads waiting to be compressed. So the main thread
sleeps waiting for an "empty fragment block" even though it has a
queue of files which it could be duplicate checking.

3. When the main thread does wake up having got an "empty fragment
block" and starts to do duplicate checking, if that duplicate
checking takes a long time (because it has to read in fragment
blocks and decompress them), then it 1. stops passing fragments to
the fragment compressor threads, and 2. stops taking fragments from
the reader thread. So both the fragment compressor threads and the
reader thread starve.

Now, because both the reader thread and the fragment compressor
threads have deep queues, this doesn't happen instantaneously, but
only if the main thread hits a run of files which need multiple
fragment blocks to be read off disk and decompressed. Unfortunately,
that *does* happen.
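
As a minimal sketch of the two-level duplicate check described in
point 1 (all names - frag_entry, get_fragment() - are invented for
illustration, not the actual squashfs-tools API):

    #include <string.h>

    struct frag_entry {
        unsigned short checksum;  /* cheap per-fragment checksum */
        int block;                /* fragment block holding the data */
        int offset, size;         /* position within that block */
        struct frag_entry *next;  /* hash chain of same-checksum entries */
    };

    /* Assumed helper: returns the uncompressed fragment block,
     * reading it back off disk and decompressing it on a cache
     * miss - the expensive path described in point 1. */
    extern char *get_fragment(int block);

    static int is_duplicate(struct frag_entry *chain,
        unsigned short checksum, char *data, int size)
    {
        for (; chain; chain = chain->next) {
            if (chain->checksum != checksum || chain->size != size)
                continue;  /* cheap rejection, no I/O needed */

            /* Checksum match: fetch the block and compare bytes */
            if (memcmp(get_fragment(chain->block) + chain->offset,
                    data, size) == 0)
                return 1;  /* byte-level duplicate found */
        }

        return 0;
    }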

So we end up with the situation where the main thread doesn't
duplicate check files ahead of time because it is blocked on the
fragment compressor threads. When it does wake up and do duplicate
checking (because it didn't do it ahead of time), it ends up
starving the fragment compressor threads and the reader thread for
that duration - hence we get a CPU utilisation of 100% *or less*,
because only the main thread is running.

The solution is to move duplicate checking to multiple front-end
processing threads, one per core, running ahead of the main thread
(interposed between the reader thread and the main thread). The
front-end threads do duplicate checking on behalf of the main
thread. This eliminates the main thread bottleneck at a stroke,
because the front-end threads can duplicate check ahead of time,
even though the main thread is blocked on the fragment
compressors.
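
A rough sketch of the new topology (thread, queue and helper names
are invented for illustration; the real code in process_fragments.c
differs):

    #include <pthread.h>

    struct queue;         /* blocking queue, assumed */
    struct file_buffer;

    extern struct queue *reader_queue, *to_main_queue;
    extern struct file_buffer *queue_get(struct queue *queue);
    extern void queue_put(struct queue *queue,
        struct file_buffer *buffer);
    extern void duplicate_check(struct file_buffer *buffer);

    /* Front-end thread body: duplicate check on behalf of the main
     * thread, so checking proceeds even while the main thread is
     * blocked waiting for an empty fragment block. */
    static void *frag_process_thrd(void *arg)
    {
        while (1) {
            struct file_buffer *buffer = queue_get(reader_queue);

            duplicate_check(buffer);
            queue_put(to_main_queue, buffer);
        }
    }

    /* One front-end thread per core, interposed between the reader
     * thread and the main thread. */
    void start_frag_process_threads(int processors)
    {
        pthread_t thread;
        int i;

        for (i = 0; i < processors; i++)
            pthread_create(&thread, NULL, frag_process_thrd, NULL);
    }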

In theory simple, in practice extremely difficult. Two issues have
to be dealt with:

1. It introduces a level of fragment cache synchronisation hitherto
avoided due to clever design in Mksquashfs. Mksquashfs
parallelisation is coded on the producer-consumer principle: the
producer thread creates buffers in the cache, fills them in, and
then passes them to the consumer thread via a queue. The consumer
thread only "sees" a buffer when it is read from the queue, at which
point the consumer and producer have inherently synchronised,
because the consumer only gets the buffer once the producer thread
has explicitly finished with it. This technique AFAIK was introduced
in CSP (communicating sequential processes) and was adopted in its
largely forgotten descendant OCCAM. It eliminates explicit buffer
locking.

The front-end threads break this model, because we get multiple
threads opportunistically looking up fragments in the fragment
cache, and then creating them if they're not available. So we get
the problem that threads can look up buffers and get them whilst
they're still being filled in, and we get races where two threads
can simultaneously create the same buffer. This can, obviously, be
dealt with by introducing the concept of "locked" buffers etc., but
it means adding an additional set of cache APIs only for the
fragment processing threads (sketched after these two issues).

2. The front-end threads have to synchronise with the main thread
to do duplicate checking. At the time a front-end thread duplicate
checks a file, the file's duplicate may not yet "exist" because it
is itself still being duplicate checked. Think of the case where we
have two files alphabetically one after another, say "a" and "b":
"a" goes to front-end thread 1, and "b" goes to front-end thread 2.
At this time neither file "exists" because each is still being
duplicate checked, so thread 2 cannot determine that file "b" is a
duplicate of file "a".

This has to be done without introducing an inherent synchronisation
point on the main thread, which would only reintroduce the main
thread bottleneck "by the back door".

But it is actually more complex than that. There are two additional
points where synchronisation with the main thread "by the back door"
has to be avoided to get optimum performance. You'll have to look at
the code for those, because this commit entry is too long as it is.
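
To make the "locked buffer" idea in issue 1 concrete, here is a
minimal sketch under assumed names (struct cache, lookup(),
allocate() - not the actual cache.c API) of the kind of lookup the
front-end threads need:

    #include <pthread.h>

    struct file_buffer {
        int locked;  /* set while the creating thread fills it in */
        /* ... buffer data ... */
    };

    struct cache {
        pthread_mutex_t mutex;
        pthread_cond_t wait;
        /* ... hash table of buffers ... */
    };

    /* Assumed internal helpers standing in for the real cache code */
    extern struct file_buffer *lookup(struct cache *cache,
        long long index);
    extern struct file_buffer *allocate(struct cache *cache,
        long long index);

    /* Return a finished buffer, wait on one still being filled in,
     * or atomically create it "locked" so a racing thread cannot
     * create the same buffer twice.  The creator is expected to
     * clear locked and broadcast cache->wait when it has finished
     * filling the buffer in. */
    struct file_buffer *frag_lookup_or_create(struct cache *cache,
        long long index, int *created)
    {
        struct file_buffer *buffer;

        pthread_mutex_lock(&cache->mutex);
        buffer = lookup(cache, index);
        if (buffer == NULL) {
            buffer = allocate(cache, index);
            buffer->locked = 1;  /* caller fills it in, then unlocks */
            *created = 1;
        } else {
            while (buffer->locked)  /* still being filled in - wait */
                pthread_cond_wait(&cache->wait, &cache->mutex);
            *created = 0;
        }
        pthread_mutex_unlock(&cache->mutex);

        return buffer;
    }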

The upshot of this improvement is that Mksquashfs speeds up by 10% -
60% depending on the ratio of duplicate files to non-duplicate
files in the source filesystem, which is a significant improvement.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
/external/squashfs-tools/squashfs-tools/process_fragments.c
c6424dc4d55a711ff862409fa07b9c143049aba7 13-Mar-2014 Phillip Lougher <phillip@squashfs.org.uk> process_fragments: move fragment checksumming to the process fragment threads

Move fragment checksumming to the process fragment threads, and away
from the main thread.

This has a few minor side effects:

1. Tail end fragments (fragments belonging to files larger than the
block size) are now checksummed up front.

2. This means add_non_dup() gains an extra checksum_frag_flag,
because we now have a combination of True/False statuses for the
block checksum and the fragment checksum. Previously, we either had
both the block checksum and the fragment checksum, or neither.
(Fragments pre-existing on disk on append are not checksummed up
front.) A sketch of the resulting statuses follows this list.

3. duplicate() no longer needs the fragment checksum to be passed,
because it is contained within the file_buffer structure which
is always passed.
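
A minimal sketch of the status combinations point 2 describes
(field names invented; the actual mksquashfs.c structures differ):

    /* A duplicate entry now tracks the validity of the two
     * checksums independently.  On append, fragments pre-existing
     * on disk are not checksummed up front, so have_frag_checksum
     * can start false even when have_checksum is true. */
    struct dup_entry {
        unsigned short checksum;           /* block checksum */
        unsigned short fragment_checksum;  /* tail-end fragment */
        int have_checksum;       /* block checksum computed? */
        int have_frag_checksum;  /* fragment checksum computed? */
    };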

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
/external/squashfs-tools/squashfs-tools/process_fragments.c
9de84acd2fcc5992704a433c0de402fcd8a5d143 20-Feb-2014 Phillip Lougher <phillip@squashfs.org.uk> Mksquashfs: introduce additional per CPU fragment process threads

The following commit

commit 287f3ef6b1cd670f3ff347dcaaa7b0bade03096b
Date: Sun May 19 04:27:38 2013 +0100
mksquashfs: queue fragment and empty file buffers directly
to main thread

moved fragment and empty file buffer processing away from the
deflate thread(s), and queued them directly to the main thread.
This was done to ensure that when the "queue dump" facility added
to Mksquashfs showed the "reader -> deflate queue size", it was
showing the number of *file buffers* waiting to be compressed by
the deflate thread(s). Previously, when the queue also held
fragments (because sparse checking of fragments was piggy-backed
onto the deflate threads), the queue size shown included both
fragments and file buffers waiting to be compressed. This limited
the usefulness of the queue dump in determining what was happening
within Mksquashfs.

A consequence of this, however, was that sparse checking of
fragments was moved to the main thread. In hindsight, this could
cause a performance regression in certain cases, because there is
only one main thread (it is not per CPU) and it is tasked with
overall coordination of Mksquashfs; adding the additional burden of
sparse checking could bottleneck the main thread, leading to
bottlenecking of the other threads and a loss of parallelism.

This change also broke the general rule that the main thread should
only deal with coordination, with processing done on other per-CPU
threads.

So fix this by introducing an additional per CPU fragment processing
thread which deals with sparse checking of fragments.

Note: it is expected that moving sparse checking to the main thread
may only have caused a problem in combination with duplicate
checking, where the main thread becomes otherwise engaged, and even
then only in the extremely rare case that many fragments are sparse.
In general, fragments are trivially verifiable as non-sparse (the
first few bytes are non-zero), and the sparse check is therefore
extremely fast.
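
For illustration, a sparse check with the early exit described
above might look like this (a sketch, not necessarily the actual
implementation):

    /* Returns 1 if the buffer is entirely zero-filled (sparse).
     * Most fragments have a non-zero byte within the first few
     * bytes, so the common case exits almost immediately. */
    static int is_sparse(char *buffer, int size)
    {
        int i;

        for (i = 0; i < size; i++)
            if (buffer[i] != 0)
                return 0;  /* non-zero byte: not sparse */

        return 1;
    }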

Moving sparse checking to the fragment processing threads is a
first-stage iteration; the aim is to move the other, more CPU
intensive, fragment processing currently done on the main thread to
the new fragment processing threads too, as this should improve
performance.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
/external/squashfs-tools/squashfs-tools/process_fragments.c