9210 – Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

Issue 9210 - Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

Summary: Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

Status:	VERIFIED DUPLICATE of issue 8650

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	libraries (show other issues)
Version:	2.4.47
Hardware:	All All

Importance:	--- normal
Target Milestone:	2.4.50
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-04-12 15:29 UTC by lukas.juhrich@agdsn.de
Modified:	2020-04-13 14:45 UTC (History)
CC List:	0 users

See Also:	8650

Attachments
Patch adding errno resets (2.04 KB, patch) 2020-04-12 15:29 UTC, lukas.juhrich@agdsn.de	Details
test program (1.94 KB, text/x-csrc) 2020-04-12 20:55 UTC, Ryan Tandy	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description lukas.juhrich@agdsn.de 2020-04-12 15:29:23 UTC

Created attachment 706 [details]
Patch adding errno resets

*tl;dr* single-stepping revealed a missing `errno` reset in `ber_int_sb_write`s retry loop.

An sssd-setup of ours, which we use for basic-auth on one of our services, issues ldap calls.  When under load, i.e. when many `ldap_search_ext` calls had to be issued due to many requests, we observed that the corresponding process/thread went up to 100% CPU usage and stayed there.

- This was the [flamegraph](https://helios.wh2.tu-dresden.de/~shreyder/sssd_be%20--domain%20dom-http-wiki.svg), where you can see that it was stuck below `ber_int_sb_write`.
- Single-Stepping with GDB revealed that we are stuck in the `for(;;)`-Retry-loop.  Indeed, we could observe that the `sbi_write` was successful, but the `errno` continued to be `EINTR` every time I hit that breakpoint.
- Patching `sockbuf.c` as attached and rebuilding resolved the issue.

I also noticed similar sections with such a loop in `sockbuf.c` and added `errno = 0;` at the beginning of each iteration.  In principle, they should suffer from the same problem.

The reasoning for why this happened under load is that with many requests being issued, the probability that the write happens when the process gets an _actual_ interrupt is much higher, and once that happens, we're stuck in the infinite loop.

Comment 1 Howard Chu 2020-04-12 16:14:44 UTC

Sorry, your explanation for this patch makes no sense. If the sbi_write() actually succeeds, then ret will be >= 0. And the loop condition is
   if ((ret<0) && (errno==EINTR)) continue;

So if the write succeeds, errno is irrelevant. If the write actually fails,
errno will be set to an actual value. As such, a forced reset of errno is
not relevant.

Comment 2 lukas.juhrich@agdsn.de 2020-04-12 16:28:52 UTC

If you wish, I can try to reproduce this again.
Unfortunately, this has been two months ago and I've only come around to the bugreporting now, so I will need to downgrade and set some things up again.

The basic observation still stands though, adding the errno fixes the symptom.

Comment 3 Howard Chu 2020-04-12 16:56:35 UTC

(In reply to lukas.juhrich@agdsn.de from comment #2)
> If you wish, I can try to reproduce this again.
> Unfortunately, this has been two months ago and I've only come around to the
> bugreporting now, so I will need to downgrade and set some things up again.
> 
> The basic observation still stands though, adding the errno fixes the
> symptom.

Yes, please try to reproduce it.

But you understand, the behavior you've reported is not how errno is documented to work. Your flamegraph shows that gnutls is underneath, perhaps it is mis-handling its reporting of success/failure. In any case, when an underlying call succeeds, its return value must be >= 0, and the value of errno is ignored. The value of errno is only significant if the return value is < 0.

Comment 4 lukas.juhrich@agdsn.de 2020-04-12 17:13:29 UTC

> Yes, please try to reproduce it.

I'm on it, after downgrading I can observe the behavior again, setting gdb up.

Thanks for your quick response so far and sorry for my sloppy analysis.

Comment 5 lukas.juhrich@agdsn.de 2020-04-12 17:31:13 UTC

Quick update, I found a document from myself from the last debugging.

The return value in the loop is always `-408`, which is `GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE`.

Comment 6 Howard Chu 2020-04-12 17:52:19 UTC

(In reply to lukas.juhrich@agdsn.de from comment #5)
> Quick update, I found a document from myself from the last debugging.
> 
> The return value in the loop is always `-408`, which is
> `GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE`.

Which is quite different from your original report saying there was no error.

This still sounds like a GnuTLS bug. What version of that are you using? Can you reproduce this workload using OpenSSL instead?

Comment 7 Ryan Tandy 2020-04-12 17:58:48 UTC

Hi Lukas, can you check whether gnutls_handshake() returns GNUTLS_E_AGAIN in your case? If so then this is simply #8650 again. Also, what version of GnuTLS are you on, please?

Comment 8 lukas.juhrich@agdsn.de 2020-04-12 18:30:40 UTC

Gnutls version (debian packages): libgnutls28-dev=3.6.7-4+deb10u3, libgnutls30=3.6.7-4+deb10u2

Unfortunately, my reproduction stopped working, so I cannot give any more details right now.  I'm trying to find an alternative.

Comment 9 lukas.juhrich@agdsn.de 2020-04-12 18:33:11 UTC

Clarification: By „My reproduction stopped working“, I mean

1. It worked a few minutes ago, and I had a gdb session with precisely the same situation (errno == EINTR, ret == GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE). No, it was not EAGAIN.  Yes, it was the same stacktrace as in the flamegraph.
2. Then, it stopped working, and the CPU didn't go to 100% anymore.

Comment 10 Ryan Tandy 2020-04-12 18:58:42 UTC

I just noticed your original report did mention sssd. So yes, this is exactly bug 8650. This case was previously documented as <https://bugs.debian.org/849756> and <https://github.com/SSSD/sssd/pull/67>. The "fix" for that issue was to change the socket to blocking which is more like a workaround for the libldap bug. It's the same bug and happens equally if libldap sets the socket non-blocking itself (for example when doing an async connect). The GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE error code was introduced in GnuTLS 3.6.0 <https://gitlab.com/gnutls/gnutls/issues/158> exactly to highlight users like libldap doing the wrong thing with gnutls_handshake.

*** This issue has been marked as a duplicate of issue 8650 ***

Comment 11 lukas.juhrich@agdsn.de 2020-04-12 19:28:02 UTC

Thanks Ryan, your explanation makes sense (although I can't verify this is what actually happens due to lack of reproduction).

However, that sounds to me like a separate phenomenon from the missing `errno` reset:  Shouldn't the retry condition be `( ret < 0 ) || ( errno == EINTR )`, so that in any of both scenarios (stuck in handshake or similar vs. interrupt during request) the retry is triggered?

Because if I would change the condition that way, my patch would instead _not_ fix my symptoms, which would be more consistent with your explanation (given that #8650 is still open).

Sorry if I misunderstand something, I'm not a C programmer and this whole library-ecosystem is quite foreign to me.

Comment 12 Howard Chu 2020-04-12 19:30:14 UTC

(In reply to lukas.juhrich@agdsn.de from comment #11)

> However, that sounds to me like a separate phenomenon from the missing
> `errno` reset:  Shouldn't the retry condition be `( ret < 0 ) || ( errno ==
> EINTR )`, so that in any of both scenarios (stuck in handshake or similar
> vs. interrupt during request) the retry is triggered?

No.

Comment 13 Ryan Tandy 2020-04-12 20:46:44 UTC


*** This issue has been marked as a duplicate of issue 8650 ***

Comment 14 Ryan Tandy 2020-04-12 20:55:38 UTC

Created attachment 707 [details]
test program

Here's a test program that exercises connecting with a non-blocking socket, like sssd does. Currently it fails on 2.4 with LDAP_SERVER_DOWN and on 2.5 with LDAP_TIMEOUT, but succeeds if you comment out the fcntl().

This only tests the non-blocking scenario. Any patch should also be tested with a blocking socket per the scenario from bug 8650.