100% CPU on all instances when http-fast-listen is enabled #188

Open
cillianderoiste opened this issue Jan 11, 2023 · 39 comments

@cillianderoiste

We've been troubled by an issue for the past few months where all instances on our machines get stuck at 100% CPU usage. A typical machine has 5 instances and zeo, 2 CPUs (2.4GHz) and 8G RAM. We could avoid the issue by stopping all instances and then starting each one slowly, waiting for the CPU load to go down before starting the next one. Restarting all instances via supervisor reliably reproduced the issue. As soon as one instance got into this state, starting or restarting additional instances would cause them to also use all available CPU. Today we tried adding http-fast-listen = off to the buildout configuration for the instances and now we can restart all instances without any problem.
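
For reference, the workaround is a one-line addition per instance part in buildout (a minimal sketch; the part name and port are placeholders):

[instance]
recipe = plone.recipe.zope2instance
http-address = 8080
http-fast-listen = off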

While investigating the issue we noticed that having a high CPU load when starting the instances made matters worse. For example, by running dd if=/dev/zero of=/dev/null three times in parallel, we could trigger the issue by starting 3 instances at the same time, rather than 5. We have also had this issue on a machine which runs 5 instances, each in a separate docker container.

I attempted to use the plone docker image to make a reproducible test case, by restricting the available CPU for the container and then creating a high load (with dd, as above) before starting the instance, but it was more difficult than I had hoped.

@davisagli

@cillianderoiste Try using py-spy's dump command to find out what the instances are doing when they are stuck.
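
For example (assuming py-spy is installed and run against one of the stuck processes; the PID is a placeholder):

py-spy dump --pid <instance-pid>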

@cillianderoiste

Thanks, here's what I see:

Thread 1539468 (idle): "MainThread"
    handle_write_event (waitress/wasyncore.py:514)
    write (waitress/wasyncore.py:113)
    poll (waitress/wasyncore.py:189)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:342)
Thread 1539618 (idle): "zeostorage zeo client networking thread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1823)
    run_forever (asyncio/base_events.py:570)
    run (ZEO/asyncio/client.py:892)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 1539787 (idle): "waitress-0"
    wait (threading.py:302)
    handler_thread (waitress/task.py:72)
    run (threading.py:870)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 1539788 (idle): "waitress-1"
    wait (threading.py:302)
    handler_thread (waitress/task.py:72)
    run (threading.py:870)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 1539789 (idle): "waitress-2"
    wait (threading.py:302)
    handler_thread (waitress/task.py:72)
    run (threading.py:870)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 1539791 (idle): "waitress-3"
    wait (threading.py:302)
    handler_thread (waitress/task.py:72)
    run (threading.py:870)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 1539806 (idle): "slowlog_monitor"
    wait (threading.py:302)
    get (queue.py:170)
    run (slowlog/monitor.py:85)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

We are using the latest version of waitress (2.1.2); this is where it seems to be stuck: https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/wasyncore.py#L514

@cillianderoiste
Copy link
Author

I don't know how to debug this further, but I did notice that if I pin waitress to 2.0.0 I can still reproduce the issue, but with 1.4.4 I cannot.
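
For anyone else reproducing this, pinning the version is a one-liner in the buildout versions section (a sketch):

[versions]
waitress = 1.4.4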

@digitalresistor

waitress.wasyncore is not public API.

@fredvd commented Jan 25, 2023

I haven't seen this issue at all with a few Plone 6.0.0.2 setups that are actively in use, but those setups are smaller and use only one zeoclient / zeoserver. One of those sites runs on a server with 3 other active Plone sites, with default recipe settings for threads and http-fast-listen. Could this hint at a race condition with multiple zeoclients?

@yurj commented Jan 26, 2023

@tschorr

@tschorr commented Jan 26, 2023

I haven't used waitress for a while now. Maybe in addition to the py-spy dump you could try strace -p <pid> on one of the hanging processes to get more information on what's going on.

@tschorr commented Jan 26, 2023

And by the way, what is the output of ulimit -a?

@yurj commented Jan 26, 2023

@tschorr:

I haven't used waitress for a while now. Maybe in addition to the py-spy dump you could try strace -p <pid> on one of the hanging processes to get more information on what's going on.

Pylons/waitress#396 (comment)

Here's an analysis of the code.

@tschorr commented Jan 26, 2023

@yurj I have read it :-)

@petschki

Coming from here, I'm still fighting with this even when http-fast-listen is turned off. My strace output looks like this:

select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
[... the same select() line repeats indefinitely ...]

and runs forever.

ulimit output:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256616
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256616
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I would be glad for hints.

@tschorr commented Jan 27, 2023

@petschki before anything else, did you ever try increasing the allowed number of open files? Sockets are files, and 1024 is not really a lot.
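
For example, a sketch of raising the limit (how depends on how the instances are launched; the number is a placeholder):

ulimit -n 65536        # in the shell that starts the instances
# or set LimitNOFILE=65536 in a systemd service unit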
The strace dump seems to confirm it is stuck in waitress, because asyncio (ZEO) is supposed to use poll. You could try toggling waitress's asyncore_use_poll parameter and see if the strace dump switches to poll.
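
If you're starting waitress programmatically rather than via the recipe, the same toggle can be passed as a keyword argument (a minimal sketch, assuming an existing WSGI application object named app):

from waitress import serve

# asyncore_use_poll=True makes waitress's event loop use poll() instead of select()
serve(app, listen="127.0.0.1:8080", asyncore_use_poll=True)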

@jensens commented Jan 27, 2023

For buildout-less Plone 6, I factored out this code snippet some time ago: https://github.com/plone/waitress_fastlisten/blob/main/waitress_fastlisten.py

The same problem will probably pop up there as well.

@petschki

@tschorr thanks for the pointers. Switching to asyncore_use_poll = True indeed changes the output to:

poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)
poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)
poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)
[... the same poll() line repeats ...]

The site is running well, but I will increase the number of open files and switch back to the default settings for now. Can this flag be set as an environment variable or config option somehow?

@tschorr commented Jan 27, 2023

@petschki this time it seems to be polling only for readable sockets.

Can this flag be set as an environment variable or config option somehow?

Isn't it possible to set this in the paste config (zope.ini / wsgi.ini)? But I haven't tried it myself.

@tschorr commented Jan 27, 2023

select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})

@petschki out of curiosity, if you see this again, could you inspect the one file descriptor that needs to become writable (14 in this case) with lsof -d 14? If there's a lot of output, the lines starting with python will do.

@petschki commented Jan 29, 2023

@tschorr this is the lsof -d 14 output for the current process, which got stuck in select():

root@fischereiverband:~# lsof -d 14 | grep python
python3.11 113676 fischereiverband   14u     IPv4            3906056      0t0        TCP localhost:8090->localhost:54788 (ESTABLISHED)
python3.11 144977 fischereiverband   14u     sock                0,9      0t0    4367294 protocol: TCP

process:

root@fischereiverband:~# ps uax | grep 144977
fischer+  144977 12.1  0.3 595628 229708 ?       Rl   01:11 126:03 /home/fischereiverband/.pyenv/versions/3.11.1/envs/python-3.11.1/bin/python3.11 /home/fischereiverband/plone6/parts/zeoclient4/bin/interpreter /home/fischereiverband/plone6/eggs/Zope-5.7.3-py3.11.egg/Zope2/Startup/serve.py /home/fischereiverband/plone6/parts/zeoclient4/etc/wsgi.ini

strace:

select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
[... the same select() line repeats ...]

@petschki commented Jan 29, 2023

Isn't it possible to set this in the paste config (zope.ini / wsgi.ini)? But I haven't tried it myself.

Yes, you can set asyncore_use_poll = true in the [server:main] section of wsgi.ini.
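
For illustration, a minimal [server:main] section with the stock waitress factory might look like this (a sketch; the recipe-generated file uses its own server factory lines, and host/port/threads are placeholders):

[server:main]
use = egg:waitress#main
listen = 127.0.0.1:8080
threads = 4
asyncore_use_poll = true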

@tschorr commented Jan 30, 2023

poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)

The site is running well

So assuming that changing the allowed number of open files didn't fix the issue (@petschki can you confirm?) but switching to asyncore_use_poll = true does, one possibility to fix this would be to add the setting in p.r.zope2instance, e.g. here. To me this looks like a reasonable default, but I might be missing something.

@tschorr this is the lsof -d 14 output for the current process, which got stuck in select()

python3.11 113676 fischereiverband 14u IPv4 3906056 0t0 TCP localhost:8090->localhost:54788 (ESTABLISHED)
python3.11 144977 fischereiverband 14u sock 0,9 0t0 4367294 protocol: TCP

It's waiting for a client connection to become writable and select is indeed reporting a writable connection. So why doesn't waitress finish writing the response and drop the connection? Maybe to further investigate this you could start adding log statements in waitress, e.g. here and/or here.

@petschki

I've added log statements at the recommended lines in waitress/wasyncore.py and will keep you informed.

@digitalresistor commented Jan 30, 2023

python3.11 144977 fischereiverband   14u     sock                0,9      0t0    4367294 protocol: TCP

Since your lsof output showed two python processes, each with a file descriptor 14, but you ran ps on 144977, I wanted to point out that this line shows a socket that is not connected and hasn't had listen(), bind() or connect() called on it.

Add debug statements that show which sockets are created in the fast-listen loop, then turned into a string, and then turned back into a list of sockets that are passed to waitress, so you know which sockets waitress is supposed to listen on.

Then I would recommend adding a debug statement to print the sockets that waitress has accepted here:

https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/server.py#L304

or here:

https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/channel.py#L56

That way you know which sockets waitress has accepted.

Next up, you might want to log what type self is in handle_write_event, since that will tell you whether it's a waitress channel or server.
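
A hypothetical sketch of such a debug line, to be dropped at the top of handle_write_event in waitress/wasyncore.py (the logger name and format are arbitrary; self is only meaningful inside that method):

import logging

# Inside wasyncore's handle_write_event; 'self' is whatever dispatcher
# the loop is servicing -- a server, channel, or trigger object.
logging.getLogger("waitress.debug").info(
    "handle_write_event: type=%s connected=%s self=%r",
    type(self).__name__, getattr(self, "connected", None), self,
)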

@digitalresistor

Let me simplify the code a bit; what fast-listen is doing here looks like this:

import socket


def prebind_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # We can skip calling bind() because we only care that this socket is in listening mode
    print(f"pre_bound socket is: {s}")
    s.listen(5)
    print(f"Called listen on pre bound socket")

    fileno = s.fileno()

    return str(fileno)


def app_startup():
    # We are now going to do app startup, which includes creating a new socket and
    # connecting to a remote host (we simulate that by calling socket + close) to
    # fetch some data

    fetch_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # fetch_socket.connect()
    print(f"Our fetch socket is: {fetch_socket}")
    fetch_socket.close()


prebound_socket_fileno = prebind_socket()

print(prebound_socket_fileno)

# Simulate app startup doing a whole bunch of socket goodness
app_startup()

# Take our pre-bound socket created as dangling and turn it back into a socket

prebound_socket = socket.fromfd(
    int(prebound_socket_fileno), socket.AF_INET, socket.SOCK_STREAM
)

print(f"Our socket we are listening on is: {prebound_socket}")

# This should block waiting for a connection to arrive
print("calling accept() on socket")
prebound_socket.accept()

We create the socket with dispatcher(), which is in scope for a little bit, turn it into a file descriptor number, and after that turn that into a string of numbers. That is prebind_socket above.

Then the next step is that we do app startup. The application startup does a bunch of socket stuff itself; it may make connections to the outside world, open files, all that fun stuff. During that time the garbage collector may or may not clean up the dispatcher() that was temporarily created, and along with it any related garbage (which includes the socket we created, since we don't store the actual socket anywhere else, just a stringified file descriptor number).

So in app_startup() I did a mock call to socket to create a new socket that would connect to a remote service to fetch some information. Then, after app_startup() is complete, I take the previous prebound_socket_fileno and turn it back into a socket, and call accept() on it, which should be ready since we already called .listen().

This is the output:

% python3.11 race-socket-cleanup.py
pre_bound socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Called listen on pre bound socket
3
Our fetch socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Traceback (most recent call last):
  File "/Users/xistence/Projects/temp/race-socket-cleanup.py", line 36, in <module>
    prebound_socket = socket.fromfd(
                      ^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socket.py", line 546, in fromfd
    nfd = dup(fd)
          ^^^^^^^
OSError: [Errno 9] Bad file descriptor

That's because, as you can see from the output, the prebound socket is on file descriptor 3, but since it goes out of scope and we return just the stringified number, Python garbage collection closes the socket. Then our fetch socket (in app_startup()) creates a new socket, which gets the exact same file descriptor. App startup completes and we attempt to take our prebound socket and re-use it, but we can't, since that file descriptor is no longer valid.

However, what happens if we happen to create a whole bunch of sockets/files first, so that the file descriptor is valid again?

prebound_socket_fileno = prebind_socket()

print(prebound_socket_fileno)

# Simulate app startup doing a whole bunch of socket goodness
app_startup()

other = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(f"Test socket: {other}")

# Take our pre-bound socket created as dangling and turn it back into a socket

prebound_socket = socket.fromfd(
    int(prebound_socket_fileno), socket.AF_INET, socket.SOCK_STREAM
)

print(f"Our socket we are listening on is: {prebound_socket}")

# This should block waiting for a connection to arrive
print("calling accept() on socket")
prebound_socket.accept()

Notice other, where we create a new socket that stays alive since it is in scope. Now this is the output from the script run:

% python3.11 race-socket-cleanup.py
pre_bound socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Called listen on pre bound socket
3
Our fetch socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Test socket: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Our socket we are listening on is: <socket.socket fd=4, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
calling accept() on socket
Traceback (most recent call last):
  File "/Users/xistence/Projects/temp/race-socket-cleanup.py", line 47, in <module>
    prebound_socket.accept()
  File "/opt/homebrew/Cellar/[email protected]/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socket.py", line 294, in accept
    fd, addr = self._accept()
               ^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

Calling .accept() failed since the socket is not the original listening socket; it merely has the same file descriptor, so socket.fromfd() succeeds because it can dup() the file descriptor.

Anyway, this should hopefully showcase why the current code is bad. You are basically racing the Python garbage collection system to try and have

socket.fromfd(
    int(prebound_socket_fileno), socket.AF_INET, socket.SOCK_STREAM
)

turn the file descriptor/socket created in dispatcher() into a valid socket that you can pass to waitress. And if you are late, or the file descriptor has been re-used in the meantime for another socket, it'll fail in fun and interesting ways, or may not even be a valid socket.

The surprise to me is that apparently you are winning the race often enough for this not to have been an issue yet, but the code as written right now is wrong and should be fixed. Create the sockets and hold on to them as a list of sockets; don't do the socket -> fd -> stringify -> fd -> socket dance, it is unnecessary.
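
A minimal sketch of that recommendation (the function name and addresses are hypothetical): create the listening sockets once, keep the socket objects referenced so garbage collection cannot close them, and hand them to waitress directly via its sockets option.

import socket

def prebind_sockets(addresses):
    """Create listening sockets and return the socket objects themselves."""
    sockets = []
    for host, port in addresses:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, port))
        s.listen(5)
        # Holding a reference to the socket keeps its file descriptor alive;
        # no socket -> fd -> string -> fd -> socket round trip is needed.
        sockets.append(s)
    return sockets

# e.g.: waitress.serve(app, sockets=prebind_sockets([("0.0.0.0", 8080)]))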

@digitalresistor

I still have no idea why or how .connected would be False, or why it would be stuck in a hard loop there since it should call handle_write, but I have no idea what the type of self is there.

@petschki commented Jan 31, 2023

@bertjwregeer thanks for your hints! I've put some logging statements at your suggested places and will try to help out here. But the original authors of the zeoclient control script might have more ideas about what's going wrong here (@ale-rt @tschorr?)

So I have this zeoclient, which says this after startup (lines 304 and 306 are in waitress/server.py, line 57 is in waitress/channel.py):

2023-01-31 07:49:05,281 INFO    [waitress:304][MainThread] self = <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>
2023-01-31 07:49:05,281 INFO    [waitress:306][MainThread] self.accept() = (<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 56994)>, ('127.0.0.1', 56994))
2023-01-31 07:49:05,281 INFO    [waitress:57][MainThread] server=<waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, sock=<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 56994)>, addr=('127.0.0.1', 56994), adj=<waitress.adjustments.Adjustments object at 0x7fdba1ca5bd0>, map={12: <waitress.trigger.trigger connected at 0x7fdba9ccf150>, 13: <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, 14: <waitress.channel.HTTPChannel 127.0.0.1:36024 at 0x7fdba9ccec10>}

with lsof -d 15 | grep python

python3.1 37728 fischereiverband   15u     IPv4             824934      0t0      TCP localhost:8090->localhost:36164 (ESTABLISHED)
python3.1 37729 fischereiverband   15u     sock                0,9      0t0   822209 protocol: TCP

Now I clicked through some folders, tried to delete one with 10k items, and made a second request on the same zeoclient; the log says:

2023-01-31 08:11:16,110 INFO    [waitress:304][MainThread] self = <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>
2023-01-31 08:11:16,110 INFO    [waitress:306][MainThread] self.accept() = (<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 49162)>, ('127.0.0.1', 49162))
2023-01-31 08:11:16,110 INFO    [waitress:57][MainThread] server=<waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, sock=<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 49162)>, addr=('127.0.0.1', 49162), adj=<waitress.adjustments.Adjustments object at 0x7fdba1ca5bd0>, map={12: <waitress.trigger.trigger connected at 0x7fdba9ccf150>, 13: <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, 14: <waitress.channel.HTTPChannel 127.0.0.1:36024 at 0x7fdba9ccec10>, 15: <waitress.channel.HTTPChannel connected 127.0.0.1:56994 at 0x7fdba82de650>}

but lsof -d 16 (<socket.socket fd=16 ...>?) shows no python process:

superviso  2387 fischereiverband   16r     FIFO               0,13      0t0   820045 pipe
varnishd  37733 fischereiverband   16r      DIR                9,4     4096 38821481 /home/fischereiverband/plone6/parts/varnish-build/var/varnish/fischereiverband/_.vsm_child
cache-mai 37746 fischereiverband   16r      DIR                9,4     4096 38821481 /home/fischereiverband/plone6/parts/varnish-build/var/varnish/fischereiverband/_.vsm_child

Maybe also interesting for you, the strace -p 37729 output:

select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2419173}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36310, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2566080}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b362fc, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2638657}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2750479}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2904310}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b362f8, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3053702}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3187325}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36310, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3334703}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b362fc, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3407330}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})

and lsof -d 14 | grep python which is obviously the zeoserver:

python3.1 37728 fischereiverband   14u     IPv4             824932      0t0        TCP localhost:8090->localhost:36154 (ESTABLISHED)
python3.1 37729 fischereiverband   14u     sock                0,9      0t0     816020 protocol: TCP
python3.1 37731 fischereiverband   14u     sock                0,9      0t0     826721 protocol: TCP

@tschorr commented Jan 31, 2023

But the original authors of the zeoclient control script might have more ideas about what's going wrong here (@ale-rt @tschorr?)

;-) nice try, but I really don't count myself as one of the original authors of the zeoclient control script. GitHub shows 20+ contributors, and that is only since 2007, when it supposedly moved there from svn.

@tschorr commented Jan 31, 2023

but lsof -d 16 (<socket.socket fd=16 ...>?) shows no python process:

This is the socket used to accept new connections. It is not yet connected, and lsof may not consider it an open file. But the map:

2023-01-31 08:11:16,110 INFO [waitress:57][MainThread] server=<waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, sock=<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 49162)>, addr=('127.0.0.1', 49162), adj=<waitress.adjustments.Adjustments object at 0x7fdba1ca5bd0>, map={12: <waitress.trigger.trigger connected at 0x7fdba9ccf150>, 13: <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, 14: <waitress.channel.HTTPChannel 127.0.0.1:36024 at 0x7fdba9ccec10>, 15: <waitress.channel.HTTPChannel connected 127.0.0.1:56994 at 0x7fdba82de650>}

shows file descriptors 12, 13, 14, 15 and you see those in the select line in the strace -p output. You should be able to find python processes for them.

and lsof -d 14 | grep python which is obviously the zeoserver:

python3.1 37728 fischereiverband 14u IPv4 824932 0t0 TCP localhost:8090->localhost:36154 (ESTABLISHED)
python3.1 37729 fischereiverband 14u sock 0,9 0t0 816020 protocol: TCP
python3.1 37731 fischereiverband 14u sock 0,9 0t0 826721 protocol: TCP

Yes, and the socket monitored by waitress is obviously one of the unconnected sockets as @bertjwregeer has pointed out.

@petschki

haha ... sorry @tschorr, I didn't want to blame anyone here. Here are the file descriptors:

root@fischereiverband:~# lsof -d 12,13,14,15 | grep python
python3.1 37728 fischereiverband   12r      REG                9,4 220346931   38801191 /home/fischereiverband/plone6/var/filestorage/Data.fs
python3.1 37728 fischereiverband   13u     IPv4             854933       0t0        TCP localhost:8090->localhost:49694 (ESTABLISHED)
python3.1 37728 fischereiverband   14u     IPv4             859839       0t0        TCP localhost:8090->localhost:49700 (ESTABLISHED)
python3.1 37728 fischereiverband   15u     IPv4             859843       0t0        TCP localhost:8090->localhost:49704 (ESTABLISHED)
python3.1 40446 fischereiverband   12r     FIFO               0,13       0t0     854935 pipe
python3.1 40446 fischereiverband   13u     IPv4             854936       0t0        TCP *:8082 (LISTEN)
python3.1 40447 fischereiverband   12r     FIFO               0,13       0t0     833188 pipe
python3.1 40447 fischereiverband   13u     IPv4             833189       0t0        TCP *:8083 (LISTEN)
python3.1 40448 fischereiverband   12r     FIFO               0,13       0t0     842172 pipe
python3.1 40448 fischereiverband   13u     IPv4             842173       0t0        TCP *:8084 (LISTEN)
python3.1 40450 fischereiverband   12r     FIFO               0,13       0t0     851867 pipe
python3.1 40450 fischereiverband   13u     IPv4             851868       0t0        TCP *:8085 (LISTEN)

@tschorr commented Jan 31, 2023

haha ... sorry @tschorr, I didn't want to blame anyone here.

No worries :-D and did I mention I don't use waitress?

@petschki:

Coming from here, I'm still fighting with this even when http-fast-listen is turned off.

@bertjwregeer:

Let me simplify the code a bit; what fast-listen is doing here

But the data provided was gathered with fast-listen turned off and the problem persists?

@petschki

and did I mention I don't use waitress?

no you didn't ... what do you use instead? bjoern is mentioned here https://zope.readthedocs.io/en/latest/operation.html#waitress-the-default-and-recommended-choice for example ...

@tschorr commented Jan 31, 2023

no you didn't ... what do you use instead?

I think I did. I'm using pyruvate (I'm the author). I'm not saying waitress is a bad choice though.

@digitalresistor

@tschorr if all the data was collected without fast-listen enabled, then you've got to show more information on how you are starting waitress, what parameters are provided, and more about what the code is doing.

Also, your lsof commands are just looking for a file descriptor; could you please dump by pid instead? That's going to give you far better information. Your latest lsof command shows 5 PIDs. If all of those are python processes, are you calling fork() somewhere in the code? Can you show your ps output? Can you show a pstree output as well, just in case there is a parent-child relationship? Each thread technically gets a PID in waitress, but they are all owned by the main thread and shared with the others.

If you aren't using fast-listen, drop using the recipe's server_factory entirely since it's not netting you anything, and instead just launch waitress directly.

The fast-listen code is still broken though, that doesn't change.

@tschorr commented Jan 31, 2023

@tschorr if all the data was collected without fast-listen enabled, then you've got to show more information

@bertjwregeer you are asking the wrong person - I'm not using waitress and I cannot collect any data on the issue. It's @petschki you need to ask.

@digitalresistor

Okay, but you claimed that fast-listen is turned off for this whole debug session, which would invalidate the title of this ticket and make the debugging session useless.

Still I urge you to fix the fast-listen code.

I am unsubscribing from this ticket now. Please don't use wasyncore from waitress, it is not and never will be public API, and I will be moving it to _wasyncore next release to make that even more clear.

@tschorr commented Jan 31, 2023

Okay, but you claimed that fast-listen is turned off for this whole debug session

@bertjwregeer no, I didn't claim this. Note the question mark in my post. Shooting the messenger will not help to resolve this. Reading other people's posts maybe would.

@petschki commented Jan 31, 2023

I fully agree that the title isn't suitable for my debugging session at all ... sorry for that - my mistake. My setup has had fast-listen turned off the whole time. I also do not have enough knowledge of the zeoclient control script or the underlying waitress technology to help debug further here.
Reading the last comment of @bertjwregeer, it seems the zope/plone community should think about refactoring the control scripts that use waitress, though.

@tschorr commented Feb 3, 2023

the zope/plone community should think about refactoring the control scripts that use waitress, though.

The fast-listen part that uses wasyncore.dispatcher was added when the asyncore module was still part of the Python stdlib (so it was definitely 'public' then). I think I remember it was originally meant to be used for development. The import statement switched to wasyncore after asyncore was removed from the stdlib and waitress 'vendored' it as wasyncore. Obviously this wasn't considered an issue, since fast-listen only ever worked with waitress anyway. Eventually it became the default setting (why?), and eventually people wanted more than one prebound socket (I'm not sure about the order of events, but it doesn't matter).

If waitress now chooses to declare a deprecated stdlib module its private API (module comments in version 2.1.2 don't give any hint of that), a plain socket could replace wasyncore.dispatcher, and of course the list of strings could become a list of sockets. Other refactoring options include changing the asyncore import statement again to stick with the dispatcher, turning fast-listen off by default, or removing the option entirely.

All this will most likely not help to explain why this happens with fast-listen turned off, why it doesn't happen with waitress 1.4.x, or why setting asyncore_use_poll = true also fixes the problem. But if nobody is interested in clarifying this (I'm not), then this is definitely an option.

Maybe it would be interesting to bypass zdaemon to see if this changes anything. You could use the runwsgi script provided with Zope to try this (or maybe use pserve) before starting to refactor this recipe.
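
A sketch of that, assuming a buildout-generated instance (the path is a placeholder):

bin/runwsgi parts/instance/etc/wsgi.ini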

@jensens commented Feb 3, 2023

@tschorr Does it work with pure WSGI, w/o buildout and w/o this recipe involved? I use https://pypi.org/project/waitress-fastlisten/, which is based on the code from here, and did not run into any problems.

# This file was generated by "cookiecutter-zope-instance"
[app:zope]
use = egg:Zope#main
zope_conf = %(here)s/zope.conf

[server:main]
paste.server_factory = waitress_fastlisten:main
use = egg:waitress_fastlisten#main
fast-listen = 0.0.0.0:8080
threads = 4
clear_untrusted_proxy_headers = false
max_request_body_size = 1073741824

[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False

[pipeline:main]
pipeline =
    translogger
    egg:Zope#httpexceptions
    zope

# ...

@tschorr commented Feb 3, 2023

@jensens you need to ask @petschki.

@petschki commented Feb 3, 2023

I'll try it out next week in Innsbruck.

ale-rt added a commit to euphorie/euphorie-project that referenced this issue Mar 27, 2024