
Re: ldap deadlock?

Ok, I think my eyes just popped open a bit here. I was assuming something could not, or would not, happen.

Let's say I have a pristine db with no outstanding locks. I start up slapd; it runs fine, never dies, is never stopped and restarted, and the db is not accessed by anything but slapd, slapcat, and db_checkpoint (each always successfully) while it is running. Can a lock go stale in that environment?

My gut feeling here is that the answer is going to be yes, which would make it the root cause of these occurrences. In my ideal world I would never expect that to happen...

Furthermore, I see these on occasion:

connection_read(62): no connection!

Sometimes this occurs right after the ACCEPT, without a corresponding op= entry for that fd. Other times I see it after one or more op= entries. It looks to me like the client is not disconnecting gracefully. The "no connection" message can be time-stamped in the same second as the ACCEPT, so I know it's not due to an idle timeout. I'm pretty sure the culprit behind most of these "no connection" messages is sendmail on our MTA doing lookups.

Could instances of these be causing stale locks??
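One way to put numbers on this is to scan the log for "no connection!" messages that share a timestamp with the most recent ACCEPT on the same fd. The following is only a rough sketch in Python, not a tested tool. It assumes syslog-style lines with a 15-character timestamp prefix and ACCEPT lines of the form "conn=N fd=N ACCEPT"; the sample lines, host name, and the function name same_second_drops are made up for illustration.

```python
import re

# Assumed line shapes; adjust the patterns to match the real log.
ACCEPT_RE = re.compile(r"conn=(\d+) fd=(\d+) ACCEPT")
NOCONN_RE = re.compile(r"connection_read\((\d+)\): no connection!")


def same_second_drops(lines):
    """Return (timestamp, fd) pairs where a 'no connection!' message
    carries the same timestamp as the latest ACCEPT seen on that fd."""
    last_accept = {}  # fd (as string) -> timestamp of most recent ACCEPT
    hits = []
    for line in lines:
        stamp = line[:15]  # assumed syslog prefix, e.g. "Nov  7 12:00:01"
        accept = ACCEPT_RE.search(line)
        if accept:
            last_accept[accept.group(2)] = stamp
            continue
        noconn = NOCONN_RE.search(line)
        if noconn and last_accept.get(noconn.group(1)) == stamp:
            hits.append((stamp, int(noconn.group(1))))
    return hits


# Hypothetical sample lines, not taken from a real server.
SAMPLE = [
    "Nov  7 12:00:01 ldap1 slapd[999]: conn=10 fd=62 ACCEPT from IP=10.0.0.5:40000 (IP=0.0.0.0:389)",
    "Nov  7 12:00:01 ldap1 slapd[999]: connection_read(62): no connection!",
    "Nov  7 12:00:02 ldap1 slapd[999]: conn=11 fd=63 ACCEPT from IP=10.0.0.6:40001 (IP=0.0.0.0:389)",
    "Nov  7 12:00:09 ldap1 slapd[999]: connection_read(63): no connection!",
]

for stamp, fd in same_second_drops(SAMPLE):
    print(f"{stamp}: fd {fd} dropped in the same second as its ACCEPT")
```

Only the first pair in the sample is reported, because the second "no connection!" arrives seven seconds after its ACCEPT. Matching the client IP on the ACCEPT line against the MTA's address would be the obvious next step for confirming the sendmail theory.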

Curt Blank wrote:

Howard Chu wrote:

Curt Blank wrote:

I'm looking for ideas here. ldap seems to deadlock once in a while: it continues to accept connections, as noted in the log file, but it does not return anything to the query; the query just hangs.

It's OpenLDAP 2.2.28 using Berkeley DB 4.2.52 as the backend on a SuSE 9.3 platform. All patches are up to snuff on the OS side.

I'm hoping for pointers to help see what might be going on.

As of today I started running db_deadlock in the background with the -a y option to see if that helps.

This deadlocking is getting people up in arms here because it is disrupting authentication for the whole campus and I guess I can't blame them.

There have been no deadlocks reported in OpenLDAP 2.2 after 2.2.20. More likely you had an unclean shutdown and restarted without running db_recover, so you have stale locks in the environment. You should upgrade to 2.3 which does recovery automatically.

No, I know that wasn't the case. I manually ran db_recover with the -v option ~16 hours before the last occurrence of this, and in between the server was not shut down, slapd did not die, and it wasn't stopped and restarted. This last time (last Friday) our backup started 12 minutes after slapd began only accepting connections and not responding with data, and that really compounded the problem. The backup does a db_checkpoint, which hung, and stopping the slapd daemon did not correct the problem. slapd stopped cleanly, but when restarted it just sat there and would not even accept connections. The db_checkpoint would not complete and was killed after about 10 minutes. I know, I know, not the best thing to do, but when you have people on campus pissed because they can't log in, time is one luxury that we do not have. And yes, db_recover was successfully run again before slapd was started. But I'm a bit leery of it right now....

One thing I failed to mention is that it appeared that a slurpd replication to this slave server started at the time slapd began only accepting connections and not responding with data. That's a write, and that is what got me thinking about a deadlock situation.