
RUN_ONCE and many flows fails #291

Open
bagder opened this issue Feb 10, 2017 · 17 comments

@bagder
Member

bagder commented Feb 10, 2017

./client_http_run_once -n 80 bsd10.fh-muenster.de

and

./client_http_run_once -n 80 localhost

both show the same symptom (bsd10 is at a 42 ms RTT from my test machine). It quickly handles the first 55-56 flows and then gets stuck with 24 or 25 flows remaining, spinning at 100% CPU until, after some time, on_error is called and the application exits.

[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
on_error

Running on Linux kernel 4.9.2, no working IPv6, no SCTP.

@weinrank
Member

Hey Daniel,
I'm facing the same problem and will try to solve it.

@weinrank
Member

weinrank commented Feb 14, 2017

I found the problem, it was a bug in my application, not related to the library.

@weinrank
Member

Hey Daniel,

https://github.com/NEAT-project/neat/blob/master/examples/client_http_run_once.c#L110

Here you call neat_stop_event_loop - why?
Does this make sense if a ctx is used by several flows?
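To illustrate what I mean, here is a rough sketch of the pattern, assuming an on_close callback roughly like the ones in the examples (the callback signature, the ops->ctx field, the flows_left counter and the NEAT_OK return value are assumptions for the sake of illustration, not a quote of client_http_run_once.c):

/* Sketch only -- not the actual example code. */
static int flows_left; /* flows still open on this ctx */

static neat_error_code on_close(struct neat_flow_operations *ops)
{
    flows_left--;

    /* Problematic: stopping the loop unconditionally tears down the ctx
     * while other flows sharing it are still in flight. */
    /* neat_stop_event_loop(ops->ctx); */

    /* Safer: only stop the loop once the last flow on this ctx is gone. */
    if (flows_left == 0)
        neat_stop_event_loop(ops->ctx);

    return NEAT_OK;
}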

@bagder
Member Author

bagder commented Apr 19, 2017

I honestly can't remember any good reason for that. Possibly I added it during some debug session to see if it helped and forgot to remove it... !

@weinrank
Member

Should be fixed by #334

@bagder
Member Author

bagder commented Jul 5, 2017

It behaves differently now, but it still doesn't work correctly.

./client_http_run_once -n 80 bsd10.fh-muenster.de on my machine gets further than before, but it still gets stuck and never exits properly. The final lines of the output (before I pressed control-c) are below.

The cutoff point for me: at -n 65 it works, but at -n 66 it hangs. The last few lines in the hanging -n 66 case:

[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
[Flow 29 ended, 0 to go, 0 for ctx 9]
on_close - no flows left for ctx, stopping event loop
[   0.890973][WRN] neat_notify_close - flow already closed - skipping
Cleanup!

Running it with gdb and breaking at that "Cleanup!" point, the stack trace looks like this:

^C
Program received signal SIGINT, Interrupt.
syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38      ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff7970aea in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007ffff796ee34 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007ffff7960924 in uv_run () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x00007ffff7b813ad in neat_close_loop (nc=0x6276e0) at /home/daniel/src/neat/neat_core.c:281
#5  neat_core_cleanup (nc=0x6276e0) at /home/daniel/src/neat/neat_core.c:290
#6  neat_free_ctx (nc=<optimized out>) at /home/daniel/src/neat/neat_core.c:329
#7  0x0000000000401779 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/daniel/src/neat/examples/client_http_run_once.c:324
(gdb) 

@karlgrin
Contributor

@bagder @naeemk I am not able to recreate the problem. Actually, it seems to work for me. Has anyone checked this issue recently?

@weinrank
Member

@karlgrin
I have and I still see the problem.
I've used the following command on an Ubuntu 17.10 machine:
./client_http_run_once -n 80 bsd10.fh-muenster.de

@karlgrin
Contributor

karlgrin commented Jan 4, 2018

@weinrank @bagder
I have continued my investigation of issue #291. When I tried a timeout of 10 ms for the poll() in client_http_run_once, it seemed to work fine even for very large values. For example, I ran './client_http_run_once -n 200 bsd10.fh-muenster.de' and it took a bit less than 4 seconds to complete; the output is in the attached outp.txt. The bottom line is that I don't really see the problem.

@karlgrin
Contributor

karlgrin commented Jan 9, 2018

@weinrank do you have some time to start looking at this?

@weinrank
Member

weinrank commented Jan 9, 2018

How did you change the timeout to 10ms?

@karlgrin
Contributor

karlgrin commented Jan 9, 2018

@weinrank I hardcoded a 10 ms timeout in the poll() call at line 293 of client_http_run_once.c:
int rc = poll(fds, num_ctxs, 10);
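In context, the change amounts to something like the loop below (a sketch of how I understand the example's polling loop; the variable names and the NEAT_RUN_NOWAIT run mode are assumptions, not a quote of the file). The short timeout forces the NEAT event loop to be driven periodically even when no backend fd reports activity:

/* Sketch of the workaround, not the actual client_http_run_once.c code. */
while (flows_left > 0) {
    int rc = poll(fds, num_ctxs, 10);   /* wake up at least every 10 ms */
    if (rc < 0)
        break;

    for (int i = 0; i < num_ctxs; i++)
        neat_start_event_loop(ctx[i], NEAT_RUN_NOWAIT); /* non-blocking pass */
}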

@tuexen
Contributor

tuexen commented Jan 9, 2018

Just wondering: does this really fix a bug? Can you explain why it did not work without the 10 ms timeout but does work with it?

I just want to make sure that this is not a workaround papering over a race condition in some scenarios...

@karlgrin
Contributor

karlgrin commented Jan 9, 2018

@tuexen I don't consider this a fix for the bug, only an observation. Since I cannot explain the problem, I don't consider it fixed. I'll continue looking into this issue tomorrow. The difficulty is that I have not been able to replicate the problem myself. I'll set up a new machine with Ubuntu 17.10 tomorrow and see if I can replicate the problem there.

@tuexen
Contributor

tuexen commented Jan 10, 2018

@karlgrin OK, makes sense. Thanks for the clarification.

I was also able to reproduce the problem intermittently (running the program multiple times, it sometimes hangs and sometimes works) using FreeBSD head, so the problem is not platform dependent. The system I was using to test has both IPv4 and IPv6 addresses; possibly this is relevant.
When the program was hanging, I used sockstat to figure out which state the TCP connections were in. I saw roughly 78 of them in the state CLOSE_WAIT, which means that no close() call was ever issued for them. Using truss to see what the program was doing showed a tight loop of kevent() calls.

@perhurt
Contributor

perhurt commented Jan 26, 2018

The problem is related to the resolver.

When the resolver is avoided (i.e., by using an IP address directly), the problem goes away. For example (towards bsd10.fh-muenster.de):

./client_http_run_once -n 85 212.201.121.100

One of the problems when opening a large number of connections is that the per-process limit on open files is hit in neat_resolver.c:772. After raising this limit (in /etc/security/limits.conf on Linux) from 1024 to 100000, the problems went away.
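For reference, the kind of entry I used in /etc/security/limits.conf (the '*' domain is just an example; use the relevant user, and log in again for it to take effect):

*    soft    nofile    100000
*    hard    nofile    100000

Running ulimit -n in the shell afterwards shows the limit that is actually in effect.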

I'm not that familiar with the "resolver business" so I hope that someone else could take a look at this @naeemk !?

@kristrev
Contributor

This is not a resolver issue per se; it is an issue with the resolver in combination with operating-system limits. The resolver currently in NEAT is the first version and quite naive, which means there are roughly zero optimizations. I haven't looked at the code in detail for a long time, but I suspect that one resolver is started for every connection. With 85 connections and multiple interfaces, this exhausts the system's per-process file descriptor limit pretty fast.

There were plans to make the resolver smarter, for example by implementing some sort of caching or doing something smarter than attaching the resolver to a flow, but this never materialized. I unfortunately doubt that I will have time to do anything about the resolver before NEAT is over, but if anyone else wants to have a go, that would be great.

The best solution would probably be to have a small cache (say 10 entries with a short TTL), attach the resolver to the context, and change the resolver from being on-demand to processing a request queue (with a limit N on how many requests can run in parallel). We could also attach flows to DNS requests already in progress when there are multiple requests for the same domain. The first request for a given domain would then trigger a resolve, and later ones would either hit the cache or be attached to that in-flight request.
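Purely as an illustration of that idea, the data could be organized along these lines (none of these types or names exist in NEAT today; struct neat_flow is only forward-declared here):

/* Illustrative only -- one possible shape for a smarter resolver. */
#include <time.h>
#include <sys/socket.h>

struct neat_flow;                          /* flows waiting on a result */

#define RESOLVER_CACHE_SIZE   10           /* small cache with short TTL */
#define RESOLVER_MAX_PARALLEL 8            /* limit N on parallel resolves */

struct resolver_cache_entry {
    char                    domain[256];
    struct sockaddr_storage addrs[8];      /* resolved addresses */
    int                     num_addrs;
    time_t                  expires;       /* short TTL */
};

struct resolver_request {
    char                     domain[256];
    struct neat_flow        *waiting_flows[64]; /* flows attached to this request */
    int                      num_waiting;
    struct resolver_request *next;         /* FIFO of not-yet-started requests */
};

struct resolver {
    struct resolver_cache_entry cache[RESOLVER_CACHE_SIZE];
    struct resolver_request    *queue;     /* pending requests */
    int                         in_flight; /* running resolves, at most RESOLVER_MAX_PARALLEL */
};

/* Lookup path: return a fresh cache entry if one exists; otherwise attach
 * the flow to an existing request for the same domain; otherwise enqueue a
 * new request, started as soon as in_flight < RESOLVER_MAX_PARALLEL. */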
