Full_Name: Tom Pressnell Version: 2.4.40 OS: debian wheezy URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (5.63.147.4) Hi, In testing i have encountered the following when restoring replication from multiple providers. slapd: slapd starting slapd: null_callback : error code 0x10 slapd: syncrepl_entry: rid=106 be_modify failed (16) AsAslapd: do_syncrepl: rid=106 rc 16 retrying (9 retries left) Running 2.4.40 on debian wheezy amd64. This occurs with both the mdb and hdb backends. While the consumer was offline i removed all values of an indexed attribute from hundreds of entries, all providers are synced. Using standard syncrepl (not delta), issue occurs at process startup where the consumer has been offline for a short while, i.e. replaying the sessionlog not falling back to present. Having a read of syncrepl.c it would seem that syncrepl_diff_entry is called for a entry received from a provider in a race with the same entry from another provider in a different thread. Both diff calls result in an attribute delete operation, when be_modify is hit for the second time it fails with no such attribute as the first thread has already removed it (uses DELETE, not SOFTDEL). Thanks Tom
--On Friday, April 10, 2015 5:12 PM +0000 tpretz@gmail.com wrote: > Full_Name: Tom Pressnell > Version: 2.4.40 > OS: debian wheezy > URL: ftp://ftp.openldap.org/incoming/ > Submission from: (NULL) (5.63.147.4) > > > Hi, > > In testing i have encountered the following when restoring replication > from multiple providers. > > slapd: slapd starting > slapd: null_callback : error code 0x10 > slapd: syncrepl_entry: rid=106 be_modify failed (16)AsAslapd: do_syncrepl: > rid=106 rc 16 retrying (9 retries left) > > Running 2.4.40 on debian wheezy amd64. > This occurs with both the mdb and hdb backends. > > While the consumer was offline i removed all values of an indexed > attribute from hundreds of entries, all providers are synced. > > Using standard syncrepl (not delta), issue occurs at process startup > where the consumer has been offline for a short while, i.e. replaying the > sessionlog not falling back to present. > > Having a read of syncrepl.c it would seem that syncrepl_diff_entry is > called for a entry received from a provider in a race with the same entry > from another provider in a different thread. > Both diff calls result in an attribute delete operation, when be_modify > is hit for the second time it fails with no such attribute as the first > thread has already removed it (uses DELETE, not SOFTDEL). Please test against current RE24, as there have been a number of fixes to syncrepl. Thanks! --Quanah -- Quanah Gibson-Mount Platform Architect Zimbra, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration
Now running RE24 head (commit 1101d4b7dbdb2453076c1075d1ae2375c36f06f9) Turned on a bit of logging (sync trace args) and grepped out some lines of interest. @(#) $OpenLDAP: slapd 2.4.X (Apr 10 2015 20:07:53) $ ... skipping ... syncrepl_message_to_entry: rid=100 DN: cn=3,ou=stest,dc=example,dc=com, UUID: 0f89951e-94fe-445b-91f8-9e8589c4d1e3 mdb_modify_internal: 0x00ebd167: cn=2,ou=stest,dc=example,dc=com mdb_modify_internal: replace AttrComment mdb_modify_internal: replace entryCSN mdb_modify_internal: replace modifiersName mdb_modify_internal: replace modifyTimestamp syncrepl_entry: rid=100 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) syncrepl_entry: rid=100 inserted UUID 0f89951e-94fe-445b-91f8-9e8589c4d1e3 syncrepl_entry: rid=100 be_search (0) syncrepl_entry: rid=100 cn=3,ou=stest,dc=example,dc=com mdb_modify: cn=3,ou=stest,dc=example,dc=com mdb_modify: updated id=00ebd167 dn="cn=2,ou=stest,dc=example,dc=com" syncrepl_entry: rid=101 be_modify cn=2,ou=stest,dc=example,dc=com (0) syncrepl_message_to_entry: rid=101 DN: cn=3,ou=stest,dc=example,dc=com, UUID: 0f89951e-94fe-445b-91f8-9e8589c4d1e3 mdb_modify_internal: 0x00ebd168: cn=3,ou=stest,dc=example,dc=com mdb_modify_internal: delete AttrComment mdb_modify_internal: replace entryCSN mdb_modify_internal: replace modifiersName mdb_modify_internal: replace modifyTimestamp syncrepl_entry: rid=101 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) syncrepl_entry: rid=101 inserted UUID 0f89951e-94fe-445b-91f8-9e8589c4d1e3 syncrepl_entry: rid=101 be_search (0) syncrepl_entry: rid=101 cn=3,ou=stest,dc=example,dc=com mdb_modify: cn=3,ou=stest,dc=example,dc=com mdb_modify: updated id=00ebd168 dn="cn=3,ou=stest,dc=example,dc=com" syncrepl_entry: rid=100 be_modify cn=3,ou=stest,dc=example,dc=com (0) syncrepl_message_to_entry: rid=100 DN: cn=4,ou=stest,dc=example,dc=com, UUID: 25f4703a-b73e-4d7c-81b8-9f139e267023 mdb_modify_internal: 0x00ebd168: cn=3,ou=stest,dc=example,dc=com mdb_modify_internal: delete AttrComment mdb_modify_internal: 16 modify/delete: AttrComment: no such attribute mdb_modify: modify failed (16) null_callback : error code 0x10 syncrepl_entry: rid=101 be_modify cn=3,ou=stest,dc=example,dc=com (16) syncrepl_entry: rid=101 be_modify failed (16) Thanks Tom
Hi, Following up on this ITS i opened a while back. With multi-master, normal syncrepl, i would sometimes receive: slapd: null_callback : error code 0x10 slapd: syncrepl_entry: rid=106 be_modify failed (16) Triggering a syncrepl connection drop/retry, whilst playing the sessionlog when a server with multiple providers was started. I am now testing with 2.4.44 and have had a chance to look at this annoying, but seemingly not destructive issue in some more detail. As i partially referenced previously, this occurs within syncrepl_entry, for modifications, a diff of old_entry to new_entry is performed. Then if changes are needed a be_modify is performed. There is however, no locking which prevents two, or more, threads from performing these diffs, and then mods, in an interleaved fashion within this function itself. Looking in do_syncrep2, if the cookie tag is present the cs_pmutex is acquired and held for the duration of modifications. This mutex protects from syncrepl_entry race conditions and serializes modifications. I have also noticed this issue during normal operations (ie all syncrepl in persist) when out of order writes are occurring on a master which are relatively easy to reproduce on an hdb backend server. When a cookie is not sent with an entry the cs_pmutex is not acquired. Without having some protection, non-cookie modifications will race each other between syncrepl threads. So, i am testing surrounding the syncrepl_entry "if" block (line 1036) with a cs_pmutex lock/release (when punlock < 0) to serialize non_cookie mods just like the cookie ones. So far this is running tests and i haven't seen the null_callback issue, either when catching up from the session log, or running with ongoing out of order writes being replicated (running alongside unmodified 2.4.44 to compare differences). When acquiring the cs_pmutex i have used the same logic as at line 958 (using trylock, with a shutdown check). I wonder if it is safe to acquire the mutex with a standard ldap_pvt_thread_mutex_lock at this point without spinning. line numbers from RELENG_2_4 (721a038b7bc9732f52eeef5324c180c4f137cd75) Thanks Tom
Can you test again with current RE24? Massive changes have been made to (delta-)syncrepl, particularly around MRR since this was filed.
(In reply to tpretz@gmail.com from comment #3) > When a cookie is not sent with an entry the cs_pmutex is not acquired. > Without having some protection, non-cookie modifications will race > each other between syncrepl threads. > > So, i am testing surrounding the syncrepl_entry "if" block (line 1036) > with a cs_pmutex lock/release (when punlock < 0) to serialize > non_cookie mods just like the cookie ones. > So far this is running tests and i haven't seen the null_callback > issue, either when catching up from the session log, or running with > ongoing out of order writes being replicated (running alongside > unmodified 2.4.44 to compare differences). > > When acquiring the cs_pmutex i have used the same logic as at line 958 > (using trylock, with a shutdown check). I wonder if it is safe to > acquire the mutex with a standard ldap_pvt_thread_mutex_lock at this > point without spinning. Sounds like you've done things correctly. It's not safe to do a normal mutex_lock here because the wait time could be quite long, and interfere with other the pool pause or shutdown operations. We'll be adding the same code now. > > line numbers from RELENG_2_4 (721a038b7bc9732f52eeef5324c180c4f137cd75) > > Thanks > > Tom
Commits: • e02b1d94 by Howard Chu at 2020-09-09T15:35:59+00:00 ITS#8102 serialize plain syncrepl Using cs_pmutex. Reverts the addition of cs_modmutex in ITS#9330, use cs_pmutex for both delta and plain writes. Note that plain syncrepl already used cs_pmutex when a cookie CSN was present in the op. Now it is used for all writes, regardless of presence of cookie.
RE24: • 178ca23e by Howard Chu at 2020-09-09T16:39:15+00:00 ITS#8102 serialize plain syncrepl Using cs_pmutex. Reverts the addition of cs_modmutex in ITS#9330, use cs_pmutex for both delta and plain writes. Note that plain syncrepl already used cs_pmutex when a cookie CSN was present in the op. Now it is used for all writes, regardless of presence of cookie.
Commits: • 62ecd38b by Howard Chu at 2020-09-22T21:27:15+01:00 ITS#8102 syncrepl: only use trylock on the cn=config DB
RE24: Commits: • 584858eb by Howard Chu at 2020-09-22T21:37:09+00:00 ITS#8102 syncrepl: only use trylock on the cn=config DB
head: • 868309c9 by Ondřej Kuzník at 2023-01-30T12:06:24+00:00 ITS#8102 Do not continue if deconfigured during pause RE26: • 0b2f5ad7 by Ondřej Kuzník at 2023-01-30T19:01:00+00:00 ITS#8102 Do not continue if deconfigured during pause RE25: • 6733fe4d by Ondřej Kuzník at 2023-01-30T19:02:48+00:00 ITS#8102 Do not continue if deconfigured during pause