bcb10462e854ba7172bab840329b6d3126a2fdd4 |
|
16-May-2016 |
Tobias Thierer <tobiast@google.com> |
Fix race condition in accept() on socket that is concurrently closed. This CL is the result of long digging into the rabbit hole after observing a flaky test. I've added a unit test for behavior at the ServerSocket level, although the root cause of the bug sits lower. ServerSocket.accept() checks for a closed connection at the Java level before descending into deeper (including native) levels. The problem was that if the socket was closed() concurrently *after* this initial check at the Java level, later native code didn't deal with the closed state correctly and could block indefinitely. Specifically, a closed socket internally sets the value of its fd field to -1. Native code reads that field via (*env)->GetIntField(env, fdObj, IO_fd_fdID); and passes it to NET_Timeout, which then blocks. Other places in the code read the same field and pass it to NET_Accept and NET_Connect, which do not block when they get an fd value of -1. There are 7 places that I found which call NET_Timeout, and they all get their fd value from the Java field using the code snippet above. NET_Timeout is documented to be a "Wrapper for poll(s, timeout)". However, it was never a particularly thin wrapper: unlike NET_Poll, NET_Timeout already has a bunch of extra logic on top of the poll syscall, namely ignoring other signals and adjusting the timeout. All of this suggests that NET_Timeout is the right place to handle the case of fd AKA s being < 0. As discussed on the bug, observability in the sense of ServerSocketConcurrentCloseTest failing was caused by: https://android.googlesource.com/platform/libcore/+/eeb3b74 It's therefore probably a good idea to investigate whether our current mechanics for closing sockets are the right ones. Establishing correctness of such an alternative change would require very careful analysis, though. Therefore I went for this fix instead, which is much easier to verify as safe and corrects all observable issues that we currently know about. Tested: - Ran the test many times from Android Studio (adjusted test's packageName). This is the only test that I also did on devices without the fix. - on build with the fix, the test passes 25/25 times. - on build without the fix, the test fails 12/15 times. - Also ran a couple of times from CTS (25181x pass, 19x skipped): make cts && cts-tradefed run cts -m CtsLibcoreTestCases \ --test java.net.ServerSocketConcurrentCloseTest - Ran all of the libcore tests once through CTS (25181 tests pass, 19 tests not run): make cts && cts-tradefed run cts -m CtsLibcoreTestCases - make cts && cts-tradefed run cts -m CtsNetTestCases Tests were run on a Nexus 6P. I previously verified that the bug also occurs on Nexus 5X and Nexus 6 running N. Bug: 27763633 Change-Id: I074acebe3d3c8a252bf5107bf45be9589d098c6a
|