
Re: LMDB killed process and LOCK_MUTEX_W()



Dimitrios Apostolou wrote:
Hello,

in my program using LMDB, I've experienced rare deadlocks in highly
concurrent mixed (read/write/cursor iteration) workloads. The end result
is that hundreds of threads are hanging waiting on LOCK_MUTEX_W().
Unfortunately I'm not quite sure why this happens.

If my understanding is correct, this mutex is held from the beginning of
a write transaction until the commit/abort, effectively serialising writers.
So I assume that somehow a writer dies or is violently killed, so that it
cannot run its atexit() cleanups, and this shared mutex remains
locked forever.

What would you suggest for such a situation? I'm thinking of patching LMDB
to lock with pthread_mutex_timedlock() and periodically check whether the
PID that took the mutex is still alive. Is the writer PID stored somewhere,
or would a change of format be needed? Any other ideas are welcome!

We have a patch to use robust mutexes. They're a few percent slower but will allow recovery from this situation.

But aside from that, either your software has a bug, or someone is messing with your system, and you need to find out what's going on and stop that.

Thanks in advance,
Dimitris




--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/