Socket not closed when remote is "unplugged". #741

Open
stephan57160 opened this issue Aug 2, 2024 · 2 comments
Comments

@stephan57160
Contributor

As a preamble, I found one of our ZYRE servers with more than 700 sockets open since its last restart (a few weeks earlier).
Among them, more than 270 were established with the same remote (an Android device).

I then investigated further.

Finally, I managed to reproduce this with the C source code below and the basic scenario described later:

#include <zyre.h>

int main(void) {
	zyre_t *node = zyre_new("LINUX-SERVER");
	if (!node) {
		fprintf(stderr, "Error: Failed to create ZYRE node.\n");
		return -1;
	}

	// Both setters take the node as first argument and must be called before zyre_start().
	zyre_set_port(node, 15670);             // To not interfere with other nodes in lab.
	zyre_set_beacon_peer_port(node, 25670); // For practical reasons with NETSTAT output.
	zyre_start(node);
	zyre_join(node, "dummy-group");

	while (!zsys_interrupted) {
		// Receive ZYRE event
		zyre_event_t *event = zyre_event_new(node);
		if (event) {
			const char *event_type = zyre_event_type(event);
			const char *peer_id = zyre_event_peer_uuid(event);

			// Only peer arrivals and departures matter for this test.
			if (streq(event_type, "ENTER") || streq(event_type, "EXIT")) {
				printf("%s - %-10s\n", peer_id, event_type);
			}

			zyre_event_destroy(&event);
		} else {
			// No event --> wait a little bit.
			zclock_sleep(100);
		}
	}

	zyre_leave(node, "dummy-group");
	zyre_stop(node);
	zyre_destroy(&node);

	return 0;
}

Scenario to reproduce

  • Start this program on a Linux machine A
  • Start a similar one, on a different machine, named B.

On A, I observe 2 active TCP connections, plus the listening socket (output simplified):

tcp        0      0 192.168.57.130:25670    0.0.0.0:*               LISTEN      2196653/zre-server
tcp        0      0 192.168.57.130:39564    192.168.57.172:25670    ESTABLISHED 2196653/zre-server    A --> B
tcp        0      0 192.168.57.130:25670    192.168.57.172:36072    ESTABLISHED 2196653/zre-server    A <-- B

Now, unplug the Ethernet cable on B. After a few seconds, A shows an "EXIT" event from B
and 1 socket (A --> B) is automatically closed, but the 2nd socket (A <-- B) remains active:

tcp        0      0 192.168.57.130:25670    0.0.0.0:*               LISTEN      2196653/zre-server
tcp        0      0 192.168.57.130:25670    192.168.57.172:36072    ESTABLISHED 2196653/zre-server    A <-- B

If the cable is plugged back in, 2 new sockets are created, but the former A <-- B socket is still present:

tcp        0      0 192.168.57.130:25670    0.0.0.0:*               LISTEN      2196653/zre-server
tcp        0      0 192.168.57.130:41530    192.168.57.172:25670    ESTABLISHED 2196653/zre-server    A --> B
tcp        0      0 192.168.57.130:25670    192.168.57.172:36072    ESTABLISHED 2196653/zre-server    A <-- B (former)
tcp        0      0 192.168.57.130:25670    192.168.57.172:52104    ESTABLISHED 2196653/zre-server    A <-- B

Repeat the operation and more stale sockets accumulate.
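
To follow this accumulation over time, the ESTABLISHED lines for a given remote can be counted. The code below is only a rough sketch: the helper names `is_established_with` and `count_established_with` are made up for illustration, and it assumes IPv4 lines in the same column layout as the netstat output shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: returns 1 if `line` (netstat layout as shown above)
// is an ESTABLISHED IPv4 TCP connection whose remote address, without the
// port, equals `remote_ip`. Returns 0 otherwise.
int is_established_with(const char *line, const char *remote_ip)
{
    char proto[16], local[64], remote[64], state[32];

    assert(line != NULL && remote_ip != NULL);

    // Columns: proto, recv-q, send-q, local address, remote address, state.
    if (sscanf(line, "%15s %*s %*s %63s %63s %31s",
               proto, local, remote, state) != 4)
        return 0;
    if (strcmp(proto, "tcp") != 0 || strcmp(state, "ESTABLISHED") != 0)
        return 0;

    // Strip the ":port" suffix from the remote address.
    char *colon = strrchr(remote, ':');
    if (colon == NULL)
        return 0;
    *colon = '\0';

    return strcmp(remote, remote_ip) == 0;
}

// Hypothetical helper: counts the matching lines in an array of netstat lines.
int count_established_with(const char *lines[], int n, const char *remote_ip)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += is_established_with(lines[i], remote_ip);
    return count;
}
```

Fed with the last listing above, this would count three connections with 192.168.57.172, one of which is the stale A <-- B socket.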

I was hoping that if ZYRE (or the layers below) is able to close the A --> B socket, it could close the 2nd one as well.
At least, something is detected "correctly", as 2 new sockets are created when B comes back.

This becomes more problematic when B is a laptop (or an Android device) moving in and out of Wi-Fi coverage,
or when the laptop is closed (hibernating) but not shut down.

I tried to play with TCP_KEEPALIVE, but without any kind of success so far.

The issue below looks related, actually:

@stephan57160
Contributor Author

OK.
I dug deeper into LIBZMQ:

  • the remote connects to the local node with a standard TCP connection.
  • the local node calls accept() or accept4() in libzmq/src/tcp_listener.cpp.
  • according to traces I added, the new socket has TCP_KEEPALIVE disabled.
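
The kind of trace mentioned in the last bullet can be sketched with plain POSIX calls. This is an illustration only, not the code actually added to libzmq: `keepalive_is_enabled` is a hypothetical helper name, and it is assumed to be called on the descriptor returned by accept()/accept4().

```c
#include <assert.h>
#include <sys/socket.h>

// Hypothetical helper: returns 1 if SO_KEEPALIVE is enabled on fd,
// 0 if it is disabled, and -1 if getsockopt() fails.
int keepalive_is_enabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) != 0)
        return -1;
    return value != 0;
}
```

A freshly created TCP socket reports keepalive as disabled unless the option is set explicitly.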

As I found no way to reach this socket from ZYRE, I hard-coded a few setsockopt() calls after the accept()/accept4() call,
so that a dead connection is reset after 3 min 40 s.
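
For illustration, such setsockopt() calls could look like the sketch below. It is not the actual patch: the helper names are hypothetical, the TCP_KEEP* options are Linux-specific, and the three values are assumptions chosen only because they add up to 3 min 40 s (120 s idle + 10 probes every 10 s = 220 s). The values really used are not stated here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Assumed values, picked only so that the total is 220 s (3 min 40 s).
enum { KA_IDLE_S = 120, KA_INTERVAL_S = 10, KA_PROBES = 10 };

// Approximate delay before the kernel gives up on a silent peer:
// idle time before the first probe, plus one interval per unanswered probe.
int keepalive_timeout_s(int idle_s, int interval_s, int probes)
{
    return idle_s + interval_s * probes;
}

// Hypothetical helper: enables TCP keepalive on fd with the given timings.
// Returns 0 on success, -1 if any setsockopt() call fails.
int set_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0)
        return -1;
    return 0;
}
```

With these assumed values, `set_keepalive(fd, KA_IDLE_S, KA_INTERVAL_S, KA_PROBES)` on the accepted descriptor would let the kernel reset the stale A <-- B connection roughly 220 s after B stops answering.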

@sphaero
Contributor

sphaero commented Sep 12, 2024

Are you suggesting fixing this in libzmq, in zyre, or in czmq?
