
(ITS#8102) syncrepl_entry: be_modify failed (16)



Hi,

Following up on this ITS, which I opened a while back.

With multi-master, normal syncrepl, I would sometimes receive:

slapd: null_callback : error code 0x10
slapd: syncrepl_entry: rid=106 be_modify failed (16)

This triggered a syncrepl connection drop/retry. It happened while the
sessionlog was being replayed, after a server with multiple providers
was started.

I am now testing with 2.4.44 and have had a chance to look at this
annoying, but seemingly non-destructive, issue in more detail.

As I partially described previously, this occurs within
syncrepl_entry. For modifications, a diff of old_entry against
new_entry is performed; then, if changes are needed, a be_modify is
performed. There is, however, no locking that prevents two or more
threads from performing these diffs, and then the mods, in an
interleaved fashion within this function.
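The interleaving described above can be illustrated with a toy model.
This is only a sketch, not slapd code: the toy_* names, the
single-valued entry, and the "delete old value, add new value" shape
of the mod are all assumptions made for illustration. The one fact
taken from LDAP itself is that result code 16 (0x10) is
noSuchAttribute. The sketch replays the two orderings by hand, without
threads, so the outcome is deterministic.

```c
#include <assert.h>
#include <string.h>

#define TOY_SUCCESS            0
#define TOY_NO_SUCH_ATTRIBUTE 16   /* same number as LDAP noSuchAttribute (0x10) */

/* Hypothetical toy entry: one attribute holding one value. */
typedef struct {
    char value[32];
} toy_entry;

/* Hypothetical toy modification: delete one specific value, add another. */
typedef struct {
    int  needed;          /* 0 when the diff found nothing to change */
    char del_value[32];
    char add_value[32];
} toy_mod;

/* Compare the stored entry with the desired value and build a mod. */
toy_mod toy_diff(const toy_entry *old, const char *desired)
{
    toy_mod m;
    memset(&m, 0, sizeof m);
    if (strcmp(old->value, desired) != 0) {
        m.needed = 1;
        strncpy(m.del_value, old->value, sizeof m.del_value - 1);
        strncpy(m.add_value, desired, sizeof m.add_value - 1);
    }
    return m;
}

/* Apply a mod; deleting a value that is no longer stored fails with 16. */
int toy_modify(toy_entry *e, const toy_mod *m)
{
    assert(e != NULL && m != NULL);
    if (!m->needed)
        return TOY_SUCCESS;
    if (strcmp(e->value, m->del_value) != 0)
        return TOY_NO_SUCH_ATTRIBUTE;
    strncpy(e->value, m->add_value, sizeof e->value - 1);
    return TOY_SUCCESS;
}

/* Two consumers diff before either modifies: the second mod is stale. */
int toy_interleaved(void)
{
    toy_entry e = { "old" };
    toy_mod a = toy_diff(&e, "new");
    toy_mod b = toy_diff(&e, "new");
    (void)toy_modify(&e, &a);
    return toy_modify(&e, &b);
}

/* Each consumer completes diff + modify before the next one starts. */
int toy_serialized(void)
{
    toy_entry e = { "old" };
    toy_mod a = toy_diff(&e, "new");
    (void)toy_modify(&e, &a);
    toy_mod b = toy_diff(&e, "new");
    return toy_modify(&e, &b);
}
```

In this model toy_interleaved() returns the failure code for the
second modify, while toy_serialized() returns success because the
second diff finds nothing left to change. Whether the real failure
takes exactly this shape inside slapd is an assumption.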

Looking in do_syncrep2, if the cookie tag is present the cs_pmutex is
acquired and held for the duration of modifications. This mutex
protects from syncrepl_entry race conditions and serializes
modifications.

I have also noticed this issue during normal operations (i.e. all
syncrepl in persist) when out-of-order writes are occurring on a
master, which is relatively easy to reproduce on a server with an hdb
backend.

When a cookie is not sent with an entry the cs_pmutex is not acquired.
Without having some protection, non-cookie modifications will race
each other between syncrepl threads.

So, I am testing a change that surrounds the syncrepl_entry "if" block
(line 1036) with a cs_pmutex lock/release (when punlock < 0), to
serialize non-cookie mods just like the cookie ones.
So far, in my tests I haven't seen the null_callback issue, either
when catching up from the session log or when running with ongoing
out-of-order writes being replicated (running alongside unmodified
2.4.44 to compare differences).

When acquiring the cs_pmutex I have used the same logic as at line 958
(using trylock, with a shutdown check). I wonder whether it is safe to
acquire the mutex with a standard ldap_pvt_thread_mutex_lock at this
point, without spinning.
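For readers unfamiliar with the pattern, the trylock-with-shutdown-check
approach can be sketched with plain POSIX threads. This is not the
slapd code at line 958: pthread_mutex_trylock and sched_yield stand in
for the ldap_pvt_thread_* calls, and toy_shutdown, toy_pmutex,
toy_lock_or_shutdown, toy_serialized_apply and the already_held flag
are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the server's shutdown flag. */
volatile sig_atomic_t toy_shutdown = 0;

/* Hypothetical stand-in for the per-context mutex (cs_pmutex in the post). */
pthread_mutex_t toy_pmutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Spin on trylock, yielding between attempts, and give up if shutdown
 * has been requested. Returns 0 with the mutex held, -1 without it.
 */
int toy_lock_or_shutdown(pthread_mutex_t *m)
{
    while (pthread_mutex_trylock(m) != 0) {
        if (toy_shutdown)
            return -1;
        sched_yield();
    }
    return 0;
}

/*
 * Run work(arg) under the mutex unless the caller already holds it.
 * Returns -1 if shutdown interrupted the wait, otherwise work's result.
 */
int toy_serialized_apply(int already_held, int (*work)(void *), void *arg)
{
    int rc;

    assert(work != NULL);
    if (!already_held && toy_lock_or_shutdown(&toy_pmutex) != 0)
        return -1;
    rc = work(arg);
    if (!already_held)
        pthread_mutex_unlock(&toy_pmutex);
    return rc;
}

/* Example unit of work: bump a counter and report success. */
int toy_bump(void *arg)
{
    ++*(int *)arg;
    return 0;
}
```

The difference from a plain blocking lock is that a thread waiting in
the trylock loop notices the shutdown flag and backs out, whereas a
thread blocked in a standard lock call only continues once the holder
releases the mutex.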

Line numbers are from RELENG_2_4 (721a038b7bc9732f52eeef5324c180c4f137cd75).

Thanks

Tom