[proxy] web.archive.org← back | site home | direct (HTTPS) ↗ | proxy home | ◑ dark◐ light

Issue 39995: test_concurrent_futures: ProcessPoolSpawnExecutorDeadlockTest.test_crash() fails with OSError: [Errno 9] Bad file descriptor

Created on 2020-03-17 17:01 by vstinner, last changed 2020-04-29 09:27 by tomMoral.

Messages (32) msg364451 - (view) Author: STINNER Victor (vstinner) * Date: 2020-03-17 17:01
AMD64 Ubuntu Shared 3.x:
https://buildbot.python.org/all/#/builders/101/builds/532

test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) ... 
Stdout:
15.51s 

Stderr:
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Dangling thread: <Thread(QueueFeederThread, stopped daemon 140540322510592)>
Dangling thread: <_MainThread(MainThread, started 140540525950528)>
Dangling thread: <_ExecutorManagerThread(Thread-111, stopped daemon 140540401628928)>

(...)

======================================================================
ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [exit at task unpickle]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
    executor.shutdown(wait=True)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 721, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 93, in wakeup
    self._writer.send_bytes(b"")
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Stdout:
15.51s 

Stderr:
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Dangling thread: <Thread(QueueFeederThread, stopped daemon 140540322510592)>
Dangling thread: <_MainThread(MainThread, started 140540525950528)>
Dangling thread: <_ExecutorManagerThread(Thread-111, stopped daemon 140540401628928)>

--

On the same build, test_concurrent_futures timed out after 15 min, while running test_ressources_gced_in_workers():

0:29:08 load avg: 1.46 Re-running test_concurrent_futures in verbose mode
(...)
test_map_exception (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
test_map_timeout (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
test_max_workers_negative (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
test_max_workers_too_large (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... skipped 'Windows-only process limit'
test_no_stale_references (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... ok
Timeout (0:15:00)!
Thread 0x00007f38bf766700 (most recent call first):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 303 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/queues.py", line 227 in _feed
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 882 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 944 in _bootstrap_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007f38bff67700 (most recent call first):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/selectors.py", line 415 in select
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 936 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 372 in wait_result_broken_or_wakeup
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 319 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 944 in _bootstrap_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007f38c7128640 (most recent call first):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/threading.py", line 303 in wait
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/_base.py", line 434 in result
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 955 in test_ressources_gced_in_workers
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 616 in _callTestMethod
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 659 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 719 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 122 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 84 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 176 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 2079 in _run_suite
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 2201 in run_unittest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 209 in _test_module
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 234 in _runtest_inner2
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 270 in _runtest_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 153 in _runtest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/runtest.py", line 193 in runtest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 318 in rerun_failed_tests
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 691 in _main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 634 in main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/libregrtest/main.py", line 712 in main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/__main__.py", line 2 in <module>
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 87 in _run_code
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 194 in _run_module_as_main
test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolForkProcessPoolExecutorTest) ... Makefile:1171: recipe for target 'buildbottest' failed
make: *** [buildbottest] Error 1

command timed out: 1200 seconds without output running [b'make', b'buildbottest', b'TESTOPTS=-j2 --junit-xml test-results.xml ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=900'], attempting to kill
program finished with exit code 2
elapsedTime=3849.667593
msg364546 - (view) Author: STINNER Victor (vstinner) * Date: 2020-03-18 18:24
Same bug on AMD64 FreeBSD Non-Debug 3.x:
https://buildbot.python.org/all/#/builders/214/builds/472

======================================================================
ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [crash during func execution on worker]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/test/test_concurrent_futures.py", line 1119, in test_crash
    executor.shutdown(wait=True)
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 721, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/concurrent/futures/process.py", line 93, in wakeup
    self._writer.send_bytes(b"")
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-9e36.nondebug/build/Lib/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Stdout:
8.35s 

Stderr:
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Dangling thread: <_MainThread(MainThread, started 34369908736)>
Dangling thread: <_ExecutorManagerThread(Thread-114, stopped daemon 34396434432)>
Dangling thread: <Thread(QueueFeederThread, stopped daemon 34377508352)>

----------------------------------------------------------------------
msg364552 - (view) Author: STINNER Victor (vstinner) * Date: 2020-03-18 18:31
> Same bug on AMD64 FreeBSD Non-Debug 3.x:
> https://buildbot.python.org/all/#/builders/214/builds/472

Oh, test_crash failed twice, but not on the same test case:

* ERROR: test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) [crash during func execution on worker]
* ERROR: test_crash (test.test_concurrent_futures.ProcessPoolForkExecutorDeadlockTest) [crash during func execution on worker]

The second failure was when test_concurrent_futures was re-run sequentially.
msg367434 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 16:02
See also bpo-30966 "Add multiprocessing.SimpleQueue.close()".
msg367449 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 19:36
New changeset 5d1f32d33ba24d0aa87235ae40207bb57778388b by Victor Stinner in branch 'master':
bpo-39995: Split test_concurrent_futures.test_crash() into sub-tests (GH-19739)
https://github.com/python/cpython/commit/5d1f32d33ba24d0aa87235ae40207bb57778388b
msg367450 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 19:46
AMD64 Ubuntu Shared 3.x:
https://buildbot.python.org/all/#/builders/101/builds/809

======================================================================
ERROR: test_shutdown_no_wait (test.test_concurrent_futures.ProcessPoolForkserverProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
    executor.shutdown(wait=False)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 188, in send_bytes
    self._check_closed()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 141, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed

(...)
0:32:37 load avg: 1.64 Re-running test_concurrent_futures in verbose mode
(...)

======================================================================
ERROR: test_shutdown_no_wait (test.test_concurrent_futures.ProcessPoolForkProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_concurrent_futures.py", line 542, in test_shutdown_no_wait
    executor.shutdown(wait=False)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 724, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
msg367451 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 19:54
> See also bpo-30966 "Add multiprocessing.SimpleQueue.close()".

I pushed a commit 1a275013d1ecc2e3778d64fda86174b2f13d6969: 
"Process.shutdown(wait=True) of concurrent.futures now closes
explicitly the result queue."

test_shutdown_deadlock_pickle() still rely on the queue to be closed implicitly. Queue created at:

  (...)
  File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", lineno 1196
    with self.executor_type(max_workers=2,
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", lineno 637
    self._result_queue = mp_context.SimpleQueue()
  File "/home/vstinner/python/master/Lib/multiprocessing/context.py", lineno 113
    return SimpleQueue(ctx=self.get_context())
  File "/home/vstinner/python/master/Lib/multiprocessing/queues.py", lineno 341
    self._reader, self._writer = connection.Pipe(duplex=False)
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", lineno 539
    c2 = Connection(fd2, readable=False)
msg367453 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 20:09
AMD64 Fedora Stable Clang Installed 3.x:
https://buildbot.python.org/all/#/builders/127/builds/679

0:04:21 load avg: 1.29 [423/423/1] test_concurrent_futures failed (2 min 39 sec)
Warning -- threading_cleanup() failed to cleanup -1 threads (count: 0, dangling: 3)
Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
Warning -- threading_cleanup() failed to cleanup 0 threads (count: 0, dangling: 3)
Warning -- Dangling thread: <_MainThread(MainThread, started 139673296918336)>
Warning -- Dangling thread: <Thread(QueueFeederThread, stopped daemon 139673045362432)>
Warning -- Dangling thread: <_ExecutorManagerThread(Thread-145, stopped 139673053914880)>
/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
Warning -- threading._dangling was modified by test_concurrent_futures
test_cancel (test.test_concurrent_futures.FutureTests) ... ok
test_cancelled (test.test_concurrent_futures.FutureTests) ... ok
test_done (test.test_concurrent_futures.FutureTests) ... ok
(...)
test_first_exception_some_already_complete (test.test_concurrent_futures.ThreadPoolWaitTests) ... 1.60s ok
test_pending_calls_race (test.test_concurrent_futures.ThreadPoolWaitTests) ... 0.11s ok
test_timeout (test.test_concurrent_futures.ThreadPoolWaitTests) ... 6.11s ok
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory

======================================================================
ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/test/test_concurrent_futures.py", line 130, in tearDown
    self.executor.shutdown(wait=True)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 724, in shutdown
    self._executor_manager_thread_wakeup.wakeup()
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

----------------------------------------------------------------------

Ran 193 tests in 159.805s

FAILED (errors=1, skipped=6)
  Before: set()
  After:  {<weakref at 0x7f0828b17bd0; to 'SpawnProcess' at 0x7f082ace7400>, <weakref at 0x7f0828b17b80; to 'SpawnProcess' at 0x7f082ace7880>, <weakref at 0x7f0828b17b30; to 'SpawnProcess' at 0x7f0828bbc0a0>, <weakref at 0x7f0828b17720; to 'SpawnProcess' at 0x7f082ace72e0>, <weakref at 0x7f0828b17a90; to 'SpawnProcess' at 0x7f082ace7910>} 
  Before: {<weakref at 0x7f082b5ad400; to '_MainThread' at 0x7f082bcd8850>}
  After:  {<weakref at 0x7f0828b4d900; to 'Thread' at 0x7f0828bbc4c0>, <weakref at 0x7f0828b4d0e0; to '_MainThread' at 0x7f082bcd8850>, <weakref at 0x7f0828b17a40; to '_ExecutorManagerThread' at 0x7f0828bbc8e0>} 
test test_concurrent_futures failed

== Tests result: FAILURE ==
msg367455 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 20:34
x86 Gentoo Installed with X 3.x:
https://buildbot.python.org/all/#/builders/128/builds/726

test_del_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ...

Warning -- Unraisable exception
Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0xb5067898>
Traceback (most recent call last):
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 281, in weakref_cb
    thread_wakeup.wakeup()
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/concurrent/futures/process.py", line 80, in wakeup
    self._writer.send_bytes(b"")
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

0.04s ok
msg367458 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 21:15
ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)
  (...)
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

It seems like Connection.close() was called while Connection._send() was called. I added debug logs:

* self._handle was equal to 4 at the function entry
* self._handle was equal to None when write() was called
msg367460 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 21:26
> It seems like Connection.close() was called while Connection._send() was called. I added debug logs:

The connection was closed by terminate_broken() called by _ExecutorManagerThread.run() thread:

test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... close handle 4
  File "/home/vstinner/python/master/Lib/threading.py", line 908, in _bootstrap
    self._bootstrap_inner()
  File "/home/vstinner/python/master/Lib/threading.py", line 950, in _bootstrap_inner
    self.run()
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 313, in run
    self.terminate_broken(cause)
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 456, in terminate_broken
    self.join_executor_internals()
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 503, in join_executor_internals
    self.thread_wakeup.close()
  File "/home/vstinner/python/master/Lib/concurrent/futures/process.py", line 75, in close
    self._writer.close()
msg367462 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 21:46
terminate_broken() method was added by:

commit 0e89076247580ba0e570c4816f0e5628a7e36e83
Author: Thomas Moreau <thomas.moreau.2010@gmail.com>
Date:   Sun Mar 1 21:49:14 2020 +0100

    bpo-39678: refactor queue manager thread (GH-18551)
msg367463 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 21:48
> ERROR: test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest)

The patch below makes this test failure more likely:

diff --git a/Lib/multiprocessing/connection.py b/Lib/multiprocessing/connection.py
index 510e4b5aba..63518e55d9 100644
--- a/Lib/multiprocessing/connection.py
+++ b/Lib/multiprocessing/connection.py
@@ -370,6 +370,7 @@ class Connection(_ConnectionBase):
     def _send(self, buf, write=_write):
         remaining = len(buf)
         while True:
+            time.sleep(0.050)
             n = write(self._handle, buf)
             remaining -= n
             if remaining == 0:
msg367465 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-27 22:05
It seems like test_killed_child() race condition was introduced by:

commit a5cbab552d294d99fde864306632d7e511a75d3c (refs/bisect/bad)
Author: Thomas Moreau <thomas.moreau.2010@gmail.com>
Date:   Sun Feb 16 19:09:26 2020 +0100

    bpo-39104: Fix hanging ProcessPoolExecutor on shutdown nowait with pickling failure (GH-17670)
msg367486 - (view) Author: Kyle Stanley (aeros) * Date: 2020-04-28 03:09
I can't be certain for the other failures, but I'm currently exploring a potential solution for addressing the `test_killed_child` failure. To me, it seems to be a race condition with attempting to call _ThreadWakeup.close() while there are still bytes being sent. IMO, we should wait until closing the pipe's reader and writer until it's finished sending or receiving bytes. Here's one way to implement that w/ threading.Event:

diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
index 8e9b69a8f0..9bf073fc34 100644
--- a/Lib/concurrent/futures/process.py
+++ b/Lib/concurrent/futures/process.py
@@ -68,21 +68,30 @@ class _ThreadWakeup:
     def __init__(self):
         self._closed = False
         self._reader, self._writer = mp.Pipe(duplex=False)
+        # Used to ensure pipe is not closed while sending or receiving bytes
+        self._not_running = threading.Event()
+        # Initialize event as True
+        self._not_running.set()
 
     def close(self):
         if not self._closed:
+            self._not_running.wait()
             self._closed = True
             self._writer.close()
             self._reader.close()
 
     def wakeup(self):
         if not self._closed:
+            self._not_running.clear()
             self._writer.send_bytes(b"")
+            self._not_running.set()
 
     def clear(self):
         if not self._closed:
+            self._not_running.clear()
             while self._reader.poll():
                 self._reader.recv_bytes()
+            self._not_running.set()


From using Victor's method of replicating the failure with inserting a `time.sleep(0.050)` in multiprocessing.Connection._send(), it appears to fix the failure in test_killed_child. I believe it would also fix the other failures since they appear to be caused by the same core issue, but I've been unable to replicate them locally so I'm not 100% certain.
msg367487 - (view) Author: Kyle Stanley (aeros) * Date: 2020-04-28 03:16
After writing the above out and a bit of further consideration, I think it might make more sense to wait for the event after setting `self._closed = True` so that it prevents future wakeup() and clear() calls from reading/writing to the pipe, while still allowing ones that are currently ongoing to finish.

Thoughts?
msg367503 - (view) Author: Thomas Moreau (tomMoral) * Date: 2020-04-28 07:05
Sorry I just saw this. It seems that I introduced this regression.

One of the goal of having a `ThreadWakeup` and not a `SimpleQueue` is to avoid using locks that can hinder the performance of the executor. I don't remember the exact details but I think I did some benchmark and it was giving lower performances for large set of small tasks (not so sure anymore). If a fully synchronous method is chosen, maybe it is safer to rely on a `SimpleQueue` as it will be lower maintenance. If the solution proposed by aeros is chosen, the event can probably be replaced by a lock no? It would be easier to understand I guess.

From the failures, it seems to be a race condition between `shutdown` and `terminate_broken`. The race condition should not be possible in `submit` as in principle, the `join_executor_internals` is only called when no new task can be submitted.
One solution would be to use the `self._shutdown_lock` from the executor to protect the call to `close` in `terminate_broken` and the call to `self._thread_wakeup.wakeup` in `shutdown`. That way, the lock is only acquired at critical points without being used all the time. This could also be done by adding `lock=True/False` to only lock the potentially dangerous calls.
msg367508 - (view) Author: Antoine Pitrou (pitrou) * Date: 2020-04-28 08:08
How about the following (untested):

diff --git a/Lib/concurrent/futures/process.py b/Lib/concurrent/futures/process.py
index 8e9b69a8f0..c0c2eb3032 100644
--- a/Lib/concurrent/futures/process.py
+++ b/Lib/concurrent/futures/process.py
@@ -66,23 +66,29 @@ _global_shutdown = False
 
 class _ThreadWakeup:
     def __init__(self):
-        self._closed = False
         self._reader, self._writer = mp.Pipe(duplex=False)
 
     def close(self):
-        if not self._closed:
-            self._closed = True
-            self._writer.close()
-            self._reader.close()
+        r, w = self._reader, self._writer
+        self._reader = self._writer = None
+        if r is not None:
+            r.close()
+            w.close()
 
     def wakeup(self):
-        if not self._closed:
+        try:
             self._writer.send_bytes(b"")
+        except AttributeError:
+            # Closed
+            pass
 
     def clear(self):
-        if not self._closed:
+        try:
             while self._reader.poll():
                 self._reader.recv_bytes()
+        except AttributeError:
+            # Closed
+            pass
 
 
 def _python_exit():
msg367512 - (view) Author: Kyle Stanley (aeros) * Date: 2020-04-28 09:10
Oops, it seems that I opened PR-19751 a bit preemptively. When I get the chance, I'll see if Antoine's implementation can address the failures and do some comparisons.
msg367513 - (view) Author: Kyle Stanley (aeros) * Date: 2020-04-28 09:21
I decided to close PR-19751. Both because it does not correctly address the race condition (due to an oversight on my part) and it would add substantial overhead to _ThreadWakeup. Instead, I agree that we should explore a non-locking solution.
msg367539 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-28 16:06
Antoine Pitrou: "How about the following (untested): (...)"

Using Antoine's patch, test_killed_child() still fails (I used my msg367463 patch to make the failure more likely).
msg367540 - (view) Author: Antoine Pitrou (pitrou) * Date: 2020-04-28 16:09
With the same traceback and error message?
msg367542 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-28 16:38
With my msg367463 patch (add sleep), test_cancel_futures() fails. Example:

======================================================================
FAIL: test_cancel_futures (test.test_concurrent_futures.ProcessPoolForkProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vstinner/python/master/Lib/test/test_concurrent_futures.py", line 353, in test_cancel_futures
    self.assertTrue(len(cancelled) >= 35, msg=f"{len(cancelled)=}")
AssertionError: False is not true : len(cancelled)=0
msg367543 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-28 16:43
Thomas Moreau: "One solution would be to use the `self._shutdown_lock` from the executor to protect the call to `close` in `terminate_broken` and the call to `self._thread_wakeup.wakeup` in `shutdown`. That way, the lock is only acquired at critical points without being used all the time. This could also be done by adding `lock=True/False` to only lock the potentially dangerous calls."

I wrote a conservative PR 19760 which always lock ProcessPoolExecutor._shutdown_lock while accessing _ThreadWakeup.

PR 19760 fix test_killed_child(): it doesn't fail anymore, even with my msg367463 patch (add sleep).
msg367544 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-28 17:00
> With my msg367463 patch (add sleep), test_cancel_futures() fails.

The test uses sleep() as a synchronization primitive:

   executor.submit(time.sleep, .1).

That's bad, but it doesn't *have to* be fixed now. My msg367463 patch adds an artifical sleep: the test looks fine in practice. I prefer to wait until it fails on a buildbot worker before spending time to make the test more reliable.
msg367605 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-29 01:32
New changeset a4dfe8ede5a37576e17035dccfe109ba7752237e by Victor Stinner in branch 'master':
bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760)
https://github.com/python/cpython/commit/a4dfe8ede5a37576e17035dccfe109ba7752237e
msg367608 - (view) Author: STINNER Victor (vstinner) * Date: 2020-04-29 01:34
I'm still getting more and more buildbot emails about test_concurrent_futures, so I merged my PR 19760 to fix buildbots.

Please revert or modify my PR 19760 if you have a better approach, but please check that test_killed_child() and ProcessPoolForkExecutorDeadlockTest tests don't fail with my msg367463 patch.

I would still appreciated a post-commit review of my change, since I don't know well concurrent.futures code :

bpo-39995: Fix concurrent.futures _ThreadWakeup (GH-19760)
https://github.com/python/cpython/commit/a4dfe8ede5a37576e17035dccfe109ba7752237e
msg367616 - (view) Author: Kyle Stanley (aeros) * Date: 2020-04-29 03:00
I am a bit concerned about the performance implication of accessing the thread wakeup through a lock in the call queue, but for now I think it's reasonable to address the buildbot failure with a locking solution while we try to find a better one.

I'm not certain if we'll be able to find one that correctly addresses the failures with zero additional locking, but I think we may be able to reduce the number of times we use the lock compared to the implementation in GH-19760. I'll spend some time exploring that as I find the time to, and report back with any significant findings.
msg367617 - (view) Author: Kyle Stanley (aeros) * Date: 2020-04-29 03:05
Also, thanks to Victor for coming up with an interim solution, and for the feedback from Antoine and Thomas. GH-19760 is a significant improvement from my initial proposal in GH-19751.
msg367631 - (view) Author: Antoine Pitrou (pitrou) * Date: 2020-04-29 08:02
I looked at the change and it seemed ok to me. Perhaps Thomas can give it a look too.
msg367632 - (view) Author: Thomas Moreau (tomMoral) * Date: 2020-04-29 08:20
I think this is a reasonable way to move on.Some of the locks can probably be removed but this needs careful investigation and in the mean time, it hinders everyone. Thanks victor for the fast fix up! 

To me, an interesting observation is that the failure seems to only happen when the executor is in broken state. If we can find a way to adapt the behavior to be more conservative on these states (which are not impacting perf) that would be nice.

I will try to look at it today and see if I can remove some of the locks while still not failing with Victor's patch. We can probably remove the lock on `clear` safely. For the others, it might be more complex.
msg367635 - (view) Author: Thomas Moreau (tomMoral) * Date: 2020-04-29 09:27
I did GH 19788 with a few modifications. There is only one lock that seems to mater for the perf, and I actually added one other (the one in _python_exit, which necessitate another bug fix for fork context).

I did not benchmark to see if it was worth it in term of perf.