9244 – async conns marked as connected too early

Issue 9244 - async conns marked as connected too early

Summary: async conns marked as connected too early

Status:	CONFIRMED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	libraries (show other issues)
Version:	2.4.49
Hardware:	All All

Importance:	--- normal
Target Milestone:	2.7.0
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:	10023
Blocks:
	Show dependency tree / graph

Reported:	2020-04-26 19:46 UTC by Ryan Tandy
Modified:	2023-11-07 16:57 UTC (History)
CC List:	0 users

See Also:	8957 8968 8980

Attachments
async connect test without TLS (1.02 KB, text/x-csrc) 2020-04-26 19:46 UTC, Ryan Tandy	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description Ryan Tandy 2020-04-26 19:46:34 UTC

Created attachment 721 [details]
async connect test without TLS

My understanding of LDAP_OPT_CONNECT_ASYNC is that the attached program should not block. If the connection does not establish fast enough, the bind call is supposed to return LDAP_X_CONNECTING.

(At least that's how I understand it, based on the original behaviour (circa 2.4.23 up to 2.4.40) as well as the bind loop in back-meta. On the other hand, the man page does "Subsequent calls to library routines will poll for completion of the connect before performing further operations" which might be interpreted as meaning they would block...)

In current releases it does block, as demonstrated by strace on Linux (latency added using 'tc qdisc'):

[...]
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("192.168.1.204")}, 16) = -1 EINPROGRESS (Operation now in progress)
write(3, "0\f\2\1\1`\7\2\1\3\4\0\200\0", 14) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLOUT|POLLERR|POLLHUP}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}])
write(3, "0\f\2\1\1`\7\2\1\3\4\0\200\0", 14) = 14
poll([{fd=3, events=POLLIN|POLLPRI}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "0\f\2\1\1a\7\n", 8)            = 8
read(3, "\1\0\4\0\4\0", 6)              = 6
write(2, "OK: ldap_simple_bind_returned 0 "..., 42OK: ldap_simple_bind_returned 0 (Success)
) = 42
[...]

As discussed in IRC, I believe I bisected this down to commit ae6347bac, from bug 8022. The reasoning is sound, but ldap_int_open_connection does not actually return -2, only -1 or 0.

The patch is simple enough, but I'm also looking at some later commits that were probably done to work around this, and might not be needed now (bug 8957, bug 8968, bug 8980). Also need to test all setups thoroughly (ldap, ldaps, STARTTLS, not to mention back-meta/asyncmeta).

I also notice that LDAP_OPT_CONNECT_ASYNC is not effective unless LDAP_OPT_NETWORK_TIMEOUT is also set. It might be intentional, but the man page doesn't mention this specifically, and I don't see why it would be necessary...

Comment 1 Howard Chu 2020-04-27 00:12:06 UTC

(In reply to Ryan Tandy from comment #0)
> Created attachment 721 [details]
> async connect test without TLS
> 
> My understanding of LDAP_OPT_CONNECT_ASYNC is that the attached program
> should not block. If the connection does not establish fast enough, the bind
> call is supposed to return LDAP_X_CONNECTING.

The Bind should not block for longer than the connect timeout.

> I also notice that LDAP_OPT_CONNECT_ASYNC is not effective unless
> LDAP_OPT_NETWORK_TIMEOUT is also set. It might be intentional, but the man
> page doesn't mention this specifically, and I don't see why it would be
> necessary...

That is the design. If you don't specify a connect timeout, then
it doesn't know how long to wait for in that polling loop.

Comment 2 Ryan Tandy 2020-04-27 03:22:45 UTC

(In reply to Howard Chu from comment #1)
> The Bind should not block for longer than the connect timeout.

Just to confirm: I'm talking about an initial request with implicit connect. Not an explicit ldap_connect() or ldap_open_defconn() followed by a real request. Are you saying the implicit connect inside ldap_simple_bind_s() should block for up to the network timeout?

I don't think it has ever worked that way. The calls to ldap_int_poll() in ldap_send_initial_request() (inside ldap_int_check_async_open()) and ldap_send_server_request() both use a timeout of 0.

In ldap_send_server_request(), if the network timeout is exceeded, then it falls through and returns LDAP_SERVER_DOWN. If it's meant to block during the connect timeout, when should I actually expect to see LDAP_X_CONNECTING?

In any case, right now it never reaches that code at all, so the network timeout isn't even respected with async. (With a timeout, but not async, it works fine: blocks for the network timeout and then returns LDAP_SERVER_DOWN.)

Comment 3 Ryan Tandy 2020-05-10 18:12:44 UTC

A few notes from investigation so far.

50b33cc6 is wrong and masked some issues. ENOTCONN is telling us that we're trying to send on the socket before it finished connecting - meaning we actually have a bug somewhere earlier. Solaris 10 is actually a better platform for testing this, because Linux just returns EAGAIN in this case. (Also noticed by others, e.g. <https://lists.freebsd.org/pipermail/freebsd-bugs/2008-November/033159.html>.)

Other patches by the same author are similar workarounds to issues that shouldn't happen - need to identify and fix the root cause. The changes to avoid sockets being set to blocking should all be unnecessary, because none of them should be reached until after the connection is established. (For example, I think ldap_int_start_tls shouldn't need to consider the async setting at all, only the timeout.)

ldap_int_check_async_open doesn't check the network timeout. The code that _does_ check it can't be reached before the connection is established, so as far as I can tell, the connect timeout is currently just ignored when async.

ldap_send_initial_request calls ldap_int_check_async_open(ld_defconn). But when chasing a referral, that's a different connection! Again, Linux masks this bug by returning EAGAIN, but on Solaris 10 I can see that the referral connection sends its Bind too early and gets ENOTCONN. That one should be an easy fix.