100% CPU on all instances when http-fast-listen is enabled #188
@cillianderoiste Try using py-spy's dump command to find out what the instances are doing when they are stuck.
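If attaching py-spy from outside is awkward (e.g. inside a container), a stdlib fallback is to register a faulthandler dump at instance startup and trigger it with a signal once things get stuck. A hedged sketch, Unix only, not part of the recipe:

```python
# Register at startup; later run `kill -USR1 <pid>` against the stuck
# instance to dump the Python stack of every thread to stderr.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```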
Thanks, here's what I see:
We are using the latest version of waitress (2.1.2); this is where it seems to be stuck: https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/wasyncore.py#L514
I don't know how to debug this further, but I did notice that if I pin waitress to 2.0.0 I can still reproduce the issue; with 1.4.4 I cannot.
I haven't seen this issue at all with a few Plone 6.0.0.2 setups that are actively in use, but the setup is smaller and uses only one zeoclient / zeoserver. One of those sites runs on a server with 3 other active Plone sites. Default recipe settings for threads and http-fast-listen. Could this hint at a race condition with multiple zeoclients?
I haven't used waitress for a while now. Maybe in addition to the py-spy dump you could try
and btw, what is the output of
Here is an analysis of the code.
@yurj I have read it :-)
Coming from here, I'm still fighting with this even when
and runs forever. ulimit output:
I would be glad for hints.
@petschki before anything else, did you ever try increasing the allowed number of open files? Sockets are files, and 1024 is not really a lot.
For buildout-less Plone 6, I factored out this code snippet some time ago: https://github.com/plone/waitress_fastlisten/blob/main/waitress_fastlisten.py. The same problem could probably pop up there as well.
@tschorr thanks for the pointers. Switching
The site is running well, but I will increase the number of open files and switch back to the default settings for now. Can this flag be set as an environment variable or a config option somehow?
@petschki this time it seems to be polling only for readable sockets.
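For anyone following along in the strace output: wasyncore's loop boils down to a select()/poll() over its socket map, and a socket only lands in the writable set when its dispatcher has buffered output to flush. A minimal illustrative sketch of that distinction (this is not waitress's actual code):

```python
import select
import socket

# A listening socket is only ever polled for readability (a pending
# accept()); client channels are additionally polled for writability
# when they have output waiting to be written.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
listener.listen(5)

readable, writable, _ = select.select([listener], [], [], 0.1)
print("readable:", readable, "writable:", writable)
listener.close()
```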
Isn't it possible to set this in the paste config (
@petschki out of curiosity, if you see this again, could you inspect that one file descriptor that needs to become writable (14 in this case) with
@tschorr this is the
process:
strace:
Yes, you can set
So, assuming that changing the allowed number of open files didn't fix the issue (@petschki, can you confirm?), but switching to
It's waiting for a client connection to become writable and
I've added log statements at the recommended lines in waitress/wasyncore.py and will keep you informed.
Since your ... I'd add debug statements that show which sockets are being created in the fast-listen loop, then turned into a string, then turned back into a list of sockets that are passed to waitress, so you know which sockets waitress is supposed to listen on. Then I would recommend adding a debug statement to print the sockets waitress has accepted, here: https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/server.py#L304 or here: https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/channel.py#L56, so you know which sockets waitress has accepted. Next up, you might want to log what type is
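Something along these lines; the function and variable names here are hypothetical stand-ins for the recipe's actual code, not waitress API:

```python
import logging
import socket

log = logging.getLogger("fast-listen-debug")

def prebind_with_logging(host, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    log.warning("created %r fd=%d", s, s.fileno())
    fd_string = str(s.fileno())  # what gets handed around today
    log.warning("stringified to %r", fd_string)
    return s, fd_string          # keep `s` alive on purpose

def rebuild_with_logging(fd_string):
    rebuilt = socket.socket(fileno=int(fd_string))  # what waitress gets
    log.warning("rebuilt %r fd=%d type=%s",
                rebuilt, rebuilt.fileno(), type(rebuilt).__name__)
    return rebuilt  # note: this object shares the fd with the original
```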
Let me simplify the code down a bit, and what
We create the socket with ... Then the next step is that we do app startup. Now, application startup does a bunch of socket stuff itself: it may make connections to the outside world, open files, all that fun stuff. During that time the garbage collector may or may not clean up the ... So in ... this is the output:
That's because, as you can see from the output, the prebound socket is on file descriptor 3, but since it goes out of scope and we return just the stringified number, Python garbage collection closes the socket. Then our fetch socket (in ...) ... However, what happens if we happen to create a whole bunch of sockets/files first, so that those file descriptors are valid?
notice
Calling ... Anyway, this should hopefully showcase why the current code is bad. You are basically racing the Python garbage collector to try and have
turn the file descriptor/socket created in ... The surprise to me is that apparently you are winning the race often enough for this not to have been an issue yet, but the code as it is written right now is wrong and should be fixed. Create the sockets, hold on to them as a list of sockets, and don't do the socket -> fd -> stringify -> fd -> socket dance; it is unnecessary.
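A self-contained sketch of the race described above; the helper names are made up for illustration. On CPython the refcounter closes the abandoned socket immediately, which is exactly the failure mode:

```python
import socket

def broken_prebind():
    # Bind early, but return only the stringified fd; the socket
    # object itself goes out of scope here.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", 0))
    s.listen(1)
    return str(s.fileno())

def safe_prebind():
    # Bind early and hand back the socket object itself.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", 0))
    s.listen(1)
    return s

fd = int(broken_prebind())
# By now the refcounter has already closed the fd.  Rebuilding a
# socket from the stale number either raises EBADF or, if app startup
# opened an unrelated file that was assigned the same number in the
# meantime, silently adopts that file instead: the race in question.
try:
    socket.socket(fileno=fd)
except OSError as exc:
    print("stale fd:", exc)

sock = safe_prebind()
print("still listening on", sock.getsockname())
sock.close()
```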
I still have no idea why or how
@bertjwregeer thanks for your hints! I've put some logging statements at your suggested places and will try to help out here. But the original authors of the zeoclient control script might have more ideas about what's going wrong here (@ale-rt, @tschorr?). So I have this zeoclient, which says this after startup (lines 304 and 306 are in waitress/server.py, line 57 is in waitress/channel.py):
with
Now I click through some folders, try to delete one with 10k items, and make a second request on the same zeoclient; the log says:
but the
Maybe also interesting for you: the
and
;-) nice try, but I really don't count myself as one of the original authors of the zeoclient control script. GitHub shows 20+ contributors, and that is only since 2007, when it supposedly moved there from svn.
This is the socket used to accept new connections. It is not yet connected and
shows file descriptors 12, 13, 14 and 15, and you see those in the
Yes, and the socket monitored by waitress is obviously one of the unconnected sockets, as @bertjwregeer has pointed out.
Haha ... sorry @tschorr, I didn't want to blame anyone here. Here are the file descriptors:
No worries :-D and did I mention I don't use waitress?
But the data provided was gathered with
No you didn't ... what do you use instead?
I think I did. I'm using pyruvate (I'm the author). I'm not saying waitress is a bad choice though.
@tschorr if all the data is collected without fast-listen enabled, then you've gotta show more information on how you are starting waitress, what parameters are provided, and more about what the code is doing. Also, your lsof commands are just looking for file descriptors; could you please dump by pid instead? That's going to give you far better information. Your latest lsof command shows 5 PIDs. If all of those are Python processes, are you calling fork() somewhere in the code? Can you show your ps output? Can you show pstree output as well, just in case there is a parent/child relationship? Each thread technically gets a PID in waitress, but they are all owned by the main thread and shared with the others. If you aren't using fast-listen, drop the recipe's server_factory entirely, since it isn't netting you anything, and instead just launch waitress directly. The fast-listen code is still broken though; that doesn't change.
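If lsof isn't available inside the container, a hedged, Linux-only way to get similar data from inside the process is to read /proc/self/fd, e.g. from a debug hook in the zeoclient; sockets show up as socket:[inode]:

```python
import os

def dump_fds():
    # Enumerate this process's open file descriptors without lsof.
    fd_dir = "/proc/self/fd"
    for name in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd vanished between listdir() and readlink()
        print(f"fd {name}: {target}")

dump_fds()
```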
@bertjwregeer you are asking the wrong person - I'm not using waitress and I cannot collect any data on the issue. It's @petschki you need to ask.
Okay, but you claimed that fast-listen is turned off for this whole debug session, which would invalidate the title of this ticket and make the debugging session useless. Still, I urge you to fix the fast-listen code. I am unsubscribing from this ticket now. Please don't use
@bertjwregeer no, I didn't claim this. Note the question mark in my post. Shooting the messenger will not help to resolve this; reading other people's posts might.
I fully agree that the title isn't suitable for my debugging session at all ... sorry for that, my mistake. My setup has
The ... If waitress now chooses to declare a deprecated stdlib module their private API (the module comments in version 2.1.2 don't give any hint of that), a standard socket could replace ... All this will most likely not help to explain why this happens with ... Maybe it would be interesting to bypass
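One way to bypass the fd round trip entirely, if I read the waitress docs right: create_server()/serve() have accepted a sockets argument (a list of pre-bound socket objects) since waitress 1.3, so the control script could hold on to the socket objects and pass them straight through. A minimal sketch; the WSGI app, host and ports are made up:

```python
import socket
from waitress import serve

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

# Bind early (the whole point of fast-listen), then keep the *socket
# objects* alive instead of round-tripping through stringified fds.
pre_bound = []
for port in (8080, 8081):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    pre_bound.append(s)

# ... long application startup would happen here ...

serve(app, sockets=pre_bound)  # waitress listens and accepts on them
```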
@tschorr Does it work with pure WSGI, without buildout and without this recipe involved? I use https://pypi.org/project/waitress-fastlisten/, which is based on the code from here, and did not run into any problems.
I'll try it out next week in Innsbruck.
We've been troubled by an issue for the past few months where all instances on our machines get stuck at 100% CPU usage. A typical machine has 5 instances and zeo, 2 CPUs (2.4 GHz), and 8 GB of RAM. We could avoid the issue by stopping all instances and then starting each one slowly, waiting for the CPU load to go down before starting the next one. Restarting all instances via supervisor reliably reproduced the issue. As soon as one instance got into this state, starting or restarting additional instances would cause them to also use all available CPU. Today we tried adding
http-fast-listen = off
to the buildout configuration for the instances, and now we can restart all instances without any problem. While investigating the issue we noticed that a high CPU load when starting the instances made matters worse. For example, by running
dd if=/dev/zero of=/dev/null
three times in parallel, we could trigger the issue by starting 3 instances at the same time, rather than 5. We have also had this issue on a machine which runs 5 instances, each in a separate docker container. I attempted to use the plone docker image to make a reproducible test case, by restricting the available CPU for the container and then creating a high load (with dd, as above) before starting the instance, but it was more difficult than I had hoped.