
Re: Trouble with “Connection Refused” errors and timeouts on server with high Thread Backload



Hi OpenLdap folks,

I ran into an issue with OpenLDAP 2.4.44 whose root cause I am having trouble finding.


I run OpenLDAP in syncrepl mode. I have one machine which serves as the write endpoint (let’s call it the master node), and many machines which sync from it and serve as read-replicas.


To ensure that they are in sync with the Master, each read-replica runs ldapsearch against the Master node every minute. It looks at the entryCSN values for a set of objects on the Master and compares them against the entryCSNs of its own copies of these objects. The search covers many different objects and takes about 3 seconds in total per read-replica (I have duration logging on LDAP operations enabled by merging in this patch: http://www.openldap.org/its/index.cgi/Software%20Enhancements?id=8054;page=9). About 20 MB are transferred to each read-replica each time it runs this script.
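
A minimal sketch of this kind of per-object comparison, assuming the entryCSN values have already been fetched (for example with ldapsearch) into two mappings of DN to entryCSN string. The function name and the sample DNs and CSN values are hypothetical and only illustrate the idea; this is not the actual script.

```python
def compare_entry_csns(master, replica):
    """Classify DNs by comparing entryCSN strings from master and replica.

    Both arguments map DN -> entryCSN string. A DN is in sync when the
    two strings are identical, out of sync when they differ, and
    missing when the replica has no copy of the entry at all.
    """
    result = {"in_sync": [], "out_of_sync": [], "missing": []}
    for dn, master_csn in master.items():
        replica_csn = replica.get(dn)
        if replica_csn is None:
            result["missing"].append(dn)
        elif replica_csn == master_csn:
            result["in_sync"].append(dn)
        else:
            result["out_of_sync"].append(dn)
    for dns in result.values():
        dns.sort()
    return result


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real directory.
    master = {
        "uid=a,dc=example,dc=com": "20190101000000.000000Z#000000#001#000000",
        "uid=b,dc=example,dc=com": "20190102000000.000000Z#000000#001#000000",
    }
    replica = {
        "uid=a,dc=example,dc=com": "20190101000000.000000Z#000000#001#000000",
        "uid=b,dc=example,dc=com": "20190101120000.000000Z#000000#001#000000",
    }
    print(compare_entry_csns(master, replica))
```

Comparing per-object values like this is what makes it possible to report exactly which objects are out of sync, at the cost of transferring one entryCSN per object on every run.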


NOTE: I prefer not to use the contextCSN for this sync because I only care about certain objects of the database being in-sync, and I need to know specifically which objects are in-sync vs out-of-sync.


I doubled the number of times this script runs per read-replica, so instead of running it once per minute, each read-replica was running it twice per minute.


Shortly thereafter, I started getting reports from someone who writes to the LDAP Master regularly that they were seeing a high number of write operations failing with timeouts and “Connection Refused” errors. I reduced the frequency of the script back to once per minute, and the writer reported that they were no longer seeing these errors.


I assumed that this Connection Refused error was due to the fact that OpenLDAP 2.4 uses a single thread for incoming connections (sources: https://lwn.net/Articles/755207/, https://www.openldap.org/pub/slim/OpenLDAP_Conn_Mgmt.pdf (section 3)), so that the pending connection backlog on the socket grew too high and the syscall returned Connection Refused. This may be similar to the frontend contention issue described in this post: http://www.openldap.org/lists/openldap-devel/201308/msg00003.html.


I noticed that the values for cn=Backload,cn=Threads,cn=Monitor as well as cn=Pending,cn=Threads,cn=Monitor got very high when the read-replicas were running the script twice as often. For example, Pending usually sits around 5-6, but during the period of high read traffic I saw the Pending count increase to over 1000 times that (my graph looks very spiky, with pending threads shooting up to 1000x, then down to 10x or 100x the next minute, then back up, etc.). I understand that cn=Backload is simply Active + Pending Threads, and interestingly Active threads stayed at normal levels. I am wondering what Pending threads means exactly, and how Pending Threads differs from Read/Write Waiters. (Interestingly, Read/Write Waiters stayed at normal levels.)
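
As a rough sketch of how these counters can be pulled out of ldapsearch output for graphing, the snippet below parses LDIF text for the entries under cn=Threads,cn=Monitor. It assumes each counter's value is exposed in the monitoredInfo attribute of its entry; the function name and the sample values are hypothetical and not taken from the affected server.

```python
def parse_thread_counters(ldif_text):
    """Extract numeric thread counters from cn=Threads,cn=Monitor LDIF.

    Returns a dict such as {"Active": 2, "Pending": 5}, keyed by the
    first RDN value of each entry. Assumes (not verified here) that the
    counter lives in the monitoredInfo attribute.
    """
    counters = {}
    dn = None
    for raw in ldif_text.splitlines():
        line = raw.strip()
        if not line:
            dn = None  # a blank line ends the current LDIF entry
        elif line.lower().startswith("dn:"):
            dn = line[3:].strip()
        elif dn and line.lower().startswith("monitoredinfo:"):
            value = line.split(":", 1)[1].strip()
            name = dn.split(",", 1)[0].partition("=")[2]
            if name and value.isdigit():
                counters[name] = int(value)
    return counters


# Hypothetical sample output, shaped like an ldapsearch result.
SAMPLE_LDIF = """\
dn: cn=Active,cn=Threads,cn=Monitor
monitoredInfo: 2

dn: cn=Pending,cn=Threads,cn=Monitor
monitoredInfo: 5

dn: cn=Backload,cn=Threads,cn=Monitor
monitoredInfo: 7
"""

if __name__ == "__main__":
    counters = parse_thread_counters(SAMPLE_LDIF)
    # Check the relationship described above: Backload = Active + Pending.
    print(counters, counters["Backload"] == counters["Active"] + counters["Pending"])
```

Because the three entries are read at slightly different moments on a live server, the Backload = Active + Pending relationship may not hold exactly for any single sample.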


I attempted to reproduce this issue by running the script concurrently from a few different clients; however, I was unable to get the Pending/Backload Threads up to similar levels (this value hovered around 16, which seems healthy, and I did not see it spike to similarly high levels). I observed that the latency of the Master from the read-replica’s perspective increased quite a bit during this test, but I was unable to observe any Connection Refused errors.


Is my assumption about the cause of this issue (single thread for incoming connections) on the right track? Is this behavior (high Pending/Backload Threads, Connection Refused errors) a known occurrence? Are there any other metrics that I can observe which would indicate the cause of the Connection Refused errors? Is there a reliable way to reproduce this issue (without doubling the frequency of the read-replica script)?


NOTE: I have the following settings configured, which I suspect may be relevant:
olcConcurrency: 0
olcConnMaxPending: 100
olcConnMaxPendingAuth: 1000
olcGentleHUP: FALSE
olcIdleTimeout: 60
olcIndexSubstrIfMaxLen: 4
olcIndexSubstrIfMinLen: 2
olcIndexSubstrAnyLen: 4
olcIndexSubstrAnyStep: 2
olcIndexIntLen: 4
olcListenerThreads: 1
olcLocalSSF: 71
olcLogLevel: Stats
olcLogLevel: Sync
olcSizeLimit: unlimited
olcSockbufMaxIncoming: 262143
olcSockbufMaxIncomingAuth: 16777215
olcThreads: 16
olcToolThreads: 1
olcWriteTimeout: 0


Thanks,


