8bb17b0275fa35318ad35c8fd477023004f940aa |
|
31-Mar-2014 |
Phillip Lougher <phillip@squashfs.org.uk> |
Mksquashfs: significantly optimise fragment duplicate checking

Remove the last remaining parallelisation bottleneck in Mksquashfs - fragment duplicate checking, which was previously done on the single main thread.

Back in 2006 when I first parallelised Mksquashfs, doing this on the main thread was initially not considered to be too much of an issue. If you don't have (m)any duplicates then you avoid the issue full stop. But even when you do have fragments which need to be checked, the necessary work of memcmp (memory compare) is not too arduous, and is much faster than the upstream file reader thread, and much much faster than the downstream fragment compressor thread(s). So if Mksquashfs is running slow then this is "not the bottleneck you're looking for" - that's going to be either fragment compression or file reading. This is on the basis that most duplicates are local and the fragment referenced can be found in the fragment cache.

But often when I ran Mksquashfs and had a lot of duplicates, the performance of Mksquashfs would be disappointing. Normally, without duplicates, I expected to get full processor utilisation, but with duplicates you might get roughly 200% or even 100% (i.e. one processor core), at least for the time it was hitting a run of duplicates in the source filesystem. Increasing the size of the fragment cache would reduce the performance hit, which gave a substantial hint that the problem was fragment cache misses causing fragment blocks to be read back off disk and decompressed on the single main thread. But it was evident that wasn't the whole story.

The culprit has always self-evidently been the single threaded duplicate checking on the main thread - this has been apparent almost since the initial parallelisation of Mksquashfs in 2006 - but although I've had my suspicions as to why (the hint above), with the demands/prioritisation of extra functionality this has remained on my TODO list until now.

Analysis now has shown the problem to be a triple whammy:

1. With duplicates (and even without), there are substantial fragment cache misses, which make the single main thread spend a lot of the time duplicate checking reading in fragment blocks off disk, decompressing them, and then memcmp'ing them for a match (see the sketch below). This is because with a large filesystem many fragments match at the checksum level even though they're not actually a match at the byte level - the checksums eliminate most files, but if you've got a large filesystem that still leaves multiple files which match, and this match is random and does not follow locality of reference. So invariably these fragment blocks are no longer in the fragment cache (if you're compressing 1 Gbyte of files and have a 64 Mbyte (default) fragment cache, most checksum matches will invariably not be in the cache, because they do not follow the "locality of reference rules" - the checksum matches can literally be anywhere in the part of the filesystem already compressed and written to disk). The checksum matches could in theory be reduced by improving the discriminating power of the checksums, but this is a zero sum game - the extra processing overhead of computing a more sophisticated checksum for *all* blocks would easily outweigh the benefit of fewer checksum matches.

2. Even with the knowledge that the main thread spends a lot of the time reading in and decompressing fragment blocks, we're left with the fact that the main thread has enough "bandwidth" to do this without becoming a bottleneck, so there's more to the story.

   The "more to the story" is that the main thread spends most of its time asleep! As fragment compression is the bottleneck in any Mksquashfs run, we run out of "empty fragment blocks" because all of the fragment blocks become filled and get queued on the fragment compression threads waiting to be compressed. So the main thread sleeps waiting for an "empty fragment block" even though it has a queue of files which it could be duplicate checking.

3. When the main thread does wake up having got an "empty fragment block" and it starts to do duplicate checking, if that duplicate checking takes a long time (because it has to read in fragment blocks and decompress them), then it (1) stops passing fragments to the fragment compressor threads, and (2) stops taking fragments from the reader thread... So both the fragment compressor threads and the reader thread starve. Now, because both the reader thread and the fragment compressor threads have deep queues, this doesn't happen instantaneously, but only if the main thread hits a run of files which need multiple fragment blocks to be read off disk and decompressed. Unfortunately, that *does* happen.

So, we end up with the situation where the main thread doesn't duplicate check files ahead of time because it is blocked on the fragment compressor threads. When it does wake up and do duplicate checking (because it didn't do it ahead of time), it ends up starving the fragment compressor threads and the reader thread for that duration - hence we get a CPU utilisation of 100% *or less*, because only that main thread is running.
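To make the cost concrete: the duplicate check boils down to a cheap checksum filter followed by a byte-for-byte memcmp against the earlier copy, and it is fetching that earlier copy (fragment cache miss, read back off disk, decompress) which is expensive. Below is a minimal sketch of that flow; the type and helper names (frag_candidate, cache_lookup, read_and_decompress_fragment) are illustrative placeholders, not the actual Mksquashfs functions.

    #include <string.h>

    /* Placeholder declarations for the cache/IO machinery this sketch assumes. */
    struct cache;
    extern struct cache *fragment_cache;
    extern char *cache_lookup(struct cache *cache, int fragment);
    extern char *read_and_decompress_fragment(int fragment);

    struct frag_candidate {
        int fragment;                 /* fragment block the earlier copy lives in */
        int offset;                   /* offset of the earlier copy in that block */
        int size;                     /* size of the earlier copy                 */
        struct frag_candidate *next;  /* next entry with the same checksum        */
    };

    /* Return non-zero if 'data' (size bytes) duplicates an already written
     * fragment entry.  'candidates' is the chain of entries whose cheap
     * checksum matched - the expensive part is verifying each with memcmp. */
    static int is_duplicate(char *data, int size, struct frag_candidate *candidates)
    {
        struct frag_candidate *c;

        for (c = candidates; c; c = c->next) {
            char *buffer;

            if (c->size != size)
                continue;

            /* Checksum matches do not follow locality of reference, so this
             * lookup frequently misses ... */
            buffer = cache_lookup(fragment_cache, c->fragment);
            if (buffer == NULL)
                /* ... forcing the fragment block to be re-read from the
                 * output and decompressed - the costly path that used to
                 * run on the single main thread. */
                buffer = read_and_decompress_fragment(c->fragment);

            if (memcmp(data, buffer + c->offset, size) == 0)
                return 1;   /* real byte-level duplicate */
        }

        return 0;           /* checksum collision(s) only - not a duplicate */
    }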
The "more to the story" is that the main thread spends most of its time asleep! As fragment compression is the bottleneck in any Mksquashfs run, we run out of "empty fragment blocks" because all of the fragment blocks become filled, and get queued on the fragment compression threads waiting for them to be compressed. So the main thread sleeps waiting for an "empty fragment block" even though it has a queue of files which it could be duplicate checking. 3. When the main thread does wake up having got an "empty fragment block" and it starts to do duplicate checking, if that duplicate checking takes a long time (because it has to read in fragment blocks and decompress them), then it 1. stops passing fragments to the fragment decompressor threads, and 2. stops taking fragments from the reader thread... So both the fragment compressor threads and the reader thread starve. Now, because both the reader thread and the fragment compressor threads have deep queues, this doesn't happen instantaenously, but only if the main thread hits a run of files which need multiple fragment blocks to be read off disk, and decompressed. Unfortunately, that *does* happen. So, we end up with the situation the main thread doesn't duplicate check files ahead of time because it is blocked on the fragment compressor threads. When it does wake up and do duplicate checking (because it didn't do it ahead of time), it ends up starving the fragment compressor threads and reader thread for that duration - hence we get a CPU utilisation of 100% *or less* because only that main thread is running. The solution is to move duplicate checking to multiple one per-core front end processing threads ahead of the main thread (interposed between the reader thread and the main thread). So the front-end threads do duplicate checking on behalf of the main thread. This eliminates the main thread bottleneck at a stroke, because the front-end threads can duplicate check ahead of time, even though the main thread is blocked on the fragment compressors. In theory simple, in practice extremely difficult. Two issues have to be dealt with: 1. It introduces a level of fragment cache synchronisation hitherto avoided due to clever design in Mksquashfs. Mksquashfs parallelisation is coded on the producer-consumer principle. The producer thread creates buffers in the cache, fills them in, and then passes them to the consumer thread via a queue, the consumer thread only "sees" the buffers when they're read from the queue, at which time the consumer and producer has inherently synchronised, because the consumer only gets them once the producer thread has explicitly done with the buffer. This technique AFAIK was introduced in CSP (communicating sequential processes) and was adopted in the largely forgotten about descendant OCCAM. This technique eliminates explicit buffer locking. The front-end threads break this model because we get multiple threads opportunistically looking up fragments in the fragment cache, and then creating them if they're not available. So we get the problem threads can lookup buffers and get them whilst they're still being filled in, and we get races where two threads can simultaneously create the same buffers. This can, obviously, be dealt with by introducing the concept of "locked" buffers etc. but it means adding an additional set of cache APIs only for the fragment processing threads. 2. The front-end threads have to synchronise with the main thread to do duplicate checking. 
But the upshot of this improvement is that Mksquashfs speeds up by 10% - 60%, depending on the ratio of duplicate to non-duplicate files in the source filesystem, which is a significant improvement.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
/external/squashfs-tools/squashfs-tools/process_fragments.c
|
9de84acd2fcc5992704a433c0de402fcd8a5d143 |
|
20-Feb-2014 |
Phillip Lougher <phillip@squashfs.org.uk> |
Mksquashfs: introduce additional per CPU fragment process threads

The following commit

    commit 287f3ef6b1cd670f3ff347dcaaa7b0bade03096b
    Date:   Sun May 19 04:27:38 2013 +0100

        mksquashfs: queue fragment and empty file buffers directly to main thread

moved fragment and empty file buffer processing away from the deflate thread(s), and queued them directly to the main thread. This was done to ensure that the "queue dump" facility added to Mksquashfs, when it showed the "reader -> deflate queue size", was showing the number of *file buffers* waiting to be compressed by the deflate thread(s). Previously, when the queue also held fragments (because sparse checking of fragments was piggy-backed onto the deflate threads), the queue size shown included both fragments and file buffers waiting to be compressed. This limited the usefulness of the queue dump in determining what was happening within Mksquashfs.

A consequence of this, however, was that sparse checking of fragments was moved to the main thread. In hindsight, this could cause a performance regression in certain cases because there is only one main thread (it is not per CPU) and it is tasked with overall coordination of Mksquashfs; adding the additional burden of sparse checking could bottleneck the main thread, leading to bottlenecking of the other threads and a loss of parallelism. In general this change broke the general rule that the main thread should only deal with coordination, and processing should be done on other per CPU threads.

So fix this by introducing an additional per CPU fragment processing thread which deals with sparse checking of fragments.

Note: it is expected that moving sparse checking to the main thread may only have caused a problem with duplicate checking, where the main thread becomes otherwise engaged, and even then only in the extremely rare case that many fragments are sparse. In general, fragments are mostly trivially verifiable as non-sparse (first few bytes non-zero), and the sparse check is therefore extremely fast.

Moving sparse checking to fragment processing threads is a first stage iteration; the aim is to move the other more CPU intensive fragment processing which is currently done on the main thread to the new fragment processing threads too, as this should improve performance.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
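As an illustrative aside on why the sparse check is normally so fast: a buffer only counts as sparse if every byte is zero, so the scan can return as soon as it hits a non-zero byte, which for typical data happens within the first few bytes. A minimal sketch of such a check, with a hypothetical helper name rather than the actual process_fragments.c code:

    /* Illustrative sketch only.  Returns 1 if the buffer is entirely
     * zero-filled (a sparse candidate), 0 otherwise.  For typical data
     * this returns 0 after examining only the first few bytes. */
    static int is_sparse(const char *buffer, int size)
    {
        int i;

        for (i = 0; i < size; i++)
            if (buffer[i] != 0)
                return 0;   /* non-zero byte found - not sparse */

        return 1;           /* every byte zero - sparse */
    }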
/external/squashfs-tools/squashfs-tools/process_fragments.c
|