8102 – syncrepl_entry: be_modify failed (16)

Issue 8102 - syncrepl_entry: be_modify failed (16)

Summary: syncrepl_entry: be_modify failed (16)

Status:	VERIFIED FIXED

Alias:	None

Product:	OpenLDAP
Classification:	Unclassified
Component:	slapd (show other issues)
Version:	2.4.40
Hardware:	All All

Importance:	--- normal
Target Milestone:	2.4.54
Assignee:	OpenLDAP project

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-04-10 16:12 UTC by tpretz@gmail.com
Modified:	2023-01-30 19:30 UTC (History)
CC List:	0 users

See Also:	9584

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description tpretz@gmail.com 2015-04-10 16:12:57 UTC

Full_Name: Tom Pressnell
Version: 2.4.40
OS: debian wheezy
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (5.63.147.4)


Hi,

In testing i have encountered the following when restoring replication from
multiple providers.

slapd: slapd starting
slapd: null_callback : error code 0x10
slapd: syncrepl_entry: rid=106 be_modify failed (16)
AsAslapd: do_syncrepl:
rid=106 rc 16 retrying (9 retries left)

Running 2.4.40 on debian wheezy amd64.
This occurs with both the mdb and hdb backends.

While the consumer was offline i removed all values of an indexed attribute from
hundreds of entries, all providers are synced.

Using standard syncrepl (not delta), issue occurs at process startup where the
consumer has been offline for a short while, i.e. replaying the sessionlog not
falling back to present.

Having a read of syncrepl.c it would seem that syncrepl_diff_entry is called for
a entry received from a provider in a race with the same entry from another
provider in a different thread.
Both diff calls result in an attribute delete operation, when be_modify is hit
for the second time it fails with no such attribute as the first thread has
already removed it (uses DELETE, not SOFTDEL).


Thanks

Tom

Comment 1 Quanah Gibson-Mount 2015-04-10 16:44:11 UTC

--On Friday, April 10, 2015 5:12 PM +0000 tpretz@gmail.com wrote:

> Full_Name: Tom Pressnell
> Version: 2.4.40
> OS: debian wheezy
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (5.63.147.4)
>
>
> Hi,
>
> In testing i have encountered the following when restoring replication
> from multiple providers.
>
> slapd: slapd starting
> slapd: null_callback : error code 0x10
> slapd: syncrepl_entry: rid=106 be_modify failed (16)AsAslapd: do_syncrepl:
> rid=106 rc 16 retrying (9 retries left)
>
> Running 2.4.40 on debian wheezy amd64.
> This occurs with both the mdb and hdb backends.
>
> While the consumer was offline i removed all values of an indexed
> attribute from hundreds of entries, all providers are synced.
>
> Using standard syncrepl (not delta), issue occurs at process startup
> where the consumer has been offline for a short while, i.e. replaying the
> sessionlog not falling back to present.
>
> Having a read of syncrepl.c it would seem that syncrepl_diff_entry is
> called for a entry received from a provider in a race with the same entry
> from another provider in a different thread.
> Both diff calls result in an attribute delete operation, when be_modify
> is hit for the second time it fails with no such attribute as the first
> thread has already removed it (uses DELETE, not SOFTDEL).

Please test against current RE24, as there have been a number of fixes to 
syncrepl.  Thanks!

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 2 tpretz@gmail.com 2015-04-11 12:44:40 UTC

Now running RE24 head (commit 1101d4b7dbdb2453076c1075d1ae2375c36f06f9)

Turned on a bit of logging (sync trace args) and grepped out some
lines of interest.

@(#) $OpenLDAP: slapd 2.4.X (Apr 10 2015 20:07:53) $
... skipping ...
syncrepl_message_to_entry: rid=100 DN:
cn=3,ou=stest,dc=example,dc=com, UUID:
0f89951e-94fe-445b-91f8-9e8589c4d1e3
mdb_modify_internal: 0x00ebd167: cn=2,ou=stest,dc=example,dc=com
mdb_modify_internal: replace AttrComment
mdb_modify_internal: replace entryCSN
mdb_modify_internal: replace modifiersName
mdb_modify_internal: replace modifyTimestamp
syncrepl_entry: rid=100 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
syncrepl_entry: rid=100 inserted UUID 0f89951e-94fe-445b-91f8-9e8589c4d1e3
syncrepl_entry: rid=100 be_search (0)
syncrepl_entry: rid=100 cn=3,ou=stest,dc=example,dc=com
mdb_modify: cn=3,ou=stest,dc=example,dc=com
mdb_modify: updated id=00ebd167 dn="cn=2,ou=stest,dc=example,dc=com"
syncrepl_entry: rid=101 be_modify cn=2,ou=stest,dc=example,dc=com (0)
syncrepl_message_to_entry: rid=101 DN:
cn=3,ou=stest,dc=example,dc=com, UUID:
0f89951e-94fe-445b-91f8-9e8589c4d1e3
mdb_modify_internal: 0x00ebd168: cn=3,ou=stest,dc=example,dc=com
mdb_modify_internal: delete AttrComment
mdb_modify_internal: replace entryCSN
mdb_modify_internal: replace modifiersName
mdb_modify_internal: replace modifyTimestamp
syncrepl_entry: rid=101 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
syncrepl_entry: rid=101 inserted UUID 0f89951e-94fe-445b-91f8-9e8589c4d1e3
syncrepl_entry: rid=101 be_search (0)
syncrepl_entry: rid=101 cn=3,ou=stest,dc=example,dc=com
mdb_modify: cn=3,ou=stest,dc=example,dc=com
mdb_modify: updated id=00ebd168 dn="cn=3,ou=stest,dc=example,dc=com"
syncrepl_entry: rid=100 be_modify cn=3,ou=stest,dc=example,dc=com (0)
syncrepl_message_to_entry: rid=100 DN:
cn=4,ou=stest,dc=example,dc=com, UUID:
25f4703a-b73e-4d7c-81b8-9f139e267023
mdb_modify_internal: 0x00ebd168: cn=3,ou=stest,dc=example,dc=com
mdb_modify_internal: delete AttrComment
mdb_modify_internal: 16 modify/delete: AttrComment: no such attribute
mdb_modify: modify failed (16)
null_callback : error code 0x10
syncrepl_entry: rid=101 be_modify cn=3,ou=stest,dc=example,dc=com (16)
syncrepl_entry: rid=101 be_modify failed (16)

Thanks

Tom

Comment 3 tpretz@gmail.com 2016-05-28 18:00:45 UTC

Hi,

Following up on this ITS i opened a while back.

With multi-master, normal syncrepl, i would sometimes receive:

slapd: null_callback : error code 0x10
slapd: syncrepl_entry: rid=106 be_modify failed (16)

Triggering a syncrepl connection drop/retry, whilst playing the
sessionlog when a server with multiple providers was started.

I am now testing with 2.4.44 and have had a chance to look at this
annoying, but seemingly not destructive issue in some more detail.

As i partially referenced previously, this occurs within
syncrepl_entry, for modifications, a diff of old_entry to new_entry is
performed. Then if changes are needed a be_modify is performed.
There is however, no locking which prevents two, or more, threads from
performing these diffs, and then mods, in an interleaved fashion
within this function itself.

Looking in do_syncrep2, if the cookie tag is present the cs_pmutex is
acquired and held for the duration of modifications. This mutex
protects from syncrepl_entry race conditions and serializes
modifications.

I have also noticed this issue during normal operations (ie all
syncrepl in persist) when out of order writes are occurring on a
master which are relatively easy to reproduce on an hdb backend
server.

When a cookie is not sent with an entry the cs_pmutex is not acquired.
Without having some protection, non-cookie modifications will race
each other between syncrepl threads.

So, i am testing surrounding the syncrepl_entry "if" block (line 1036)
with a cs_pmutex lock/release (when punlock < 0) to serialize
non_cookie mods just like the cookie ones.
So far this is running tests and i haven't seen the null_callback
issue, either when catching up from the session log, or running with
ongoing out of order writes being replicated (running alongside
unmodified 2.4.44 to compare differences).

When acquiring the cs_pmutex i have used the same logic as at line 958
(using trylock, with a shutdown check). I wonder if it is safe to
acquire the mutex with a standard ldap_pvt_thread_mutex_lock at this
point without spinning.

line numbers from RELENG_2_4 (721a038b7bc9732f52eeef5324c180c4f137cd75)

Thanks

Tom

Comment 4 Quanah Gibson-Mount 2020-03-20 22:41:59 UTC

Can you test again with current RE24?  Massive changes have been made to (delta-)syncrepl, particularly around MRR since this was filed.

Comment 5 Howard Chu 2020-09-09 10:18:18 UTC

(In reply to tpretz@gmail.com from comment #3)

> When a cookie is not sent with an entry the cs_pmutex is not acquired.
> Without having some protection, non-cookie modifications will race
> each other between syncrepl threads.
> 
> So, i am testing surrounding the syncrepl_entry "if" block (line 1036)
> with a cs_pmutex lock/release (when punlock < 0) to serialize
> non_cookie mods just like the cookie ones.
> So far this is running tests and i haven't seen the null_callback
> issue, either when catching up from the session log, or running with
> ongoing out of order writes being replicated (running alongside
> unmodified 2.4.44 to compare differences).
> 
> When acquiring the cs_pmutex i have used the same logic as at line 958
> (using trylock, with a shutdown check). I wonder if it is safe to
> acquire the mutex with a standard ldap_pvt_thread_mutex_lock at this
> point without spinning.

Sounds like you've done things correctly. It's not safe to do a normal
mutex_lock here because the wait time could be quite long, and interfere
with other the pool pause or shutdown operations.

We'll be adding the same code now.
> 
> line numbers from RELENG_2_4 (721a038b7bc9732f52eeef5324c180c4f137cd75)
> 
> Thanks
> 
> Tom

Comment 6 Quanah Gibson-Mount 2020-09-09 16:17:35 UTC

Commits: 
  • e02b1d94 
by Howard Chu at 2020-09-09T15:35:59+00:00 
ITS#8102 serialize plain syncrepl

Using cs_pmutex. Reverts the addition of cs_modmutex in ITS#9330,
use cs_pmutex for both delta and plain writes.

Note that plain syncrepl already used cs_pmutex when a cookie CSN
was present in the op. Now it is used for all writes, regardless
of presence of cookie.

Comment 7 Quanah Gibson-Mount 2020-09-09 16:50:50 UTC

RE24:

  • 178ca23e 
by Howard Chu at 2020-09-09T16:39:15+00:00 
ITS#8102 serialize plain syncrepl

Using cs_pmutex. Reverts the addition of cs_modmutex in ITS#9330,
use cs_pmutex for both delta and plain writes.

Note that plain syncrepl already used cs_pmutex when a cookie CSN
was present in the op. Now it is used for all writes, regardless
of presence of cookie.

Comment 8 Quanah Gibson-Mount 2020-09-22 21:36:22 UTC

Commits: 
  • 62ecd38b 
by Howard Chu at 2020-09-22T21:27:15+01:00 
ITS#8102 syncrepl: only use trylock on the cn=config DB

Comment 9 Quanah Gibson-Mount 2020-09-22 21:37:44 UTC

RE24:

Commits: 
  • 584858eb 
by Howard Chu at 2020-09-22T21:37:09+00:00 
ITS#8102 syncrepl: only use trylock on the cn=config DB

Comment 10 Quanah Gibson-Mount 2023-01-30 19:30:21 UTC

head:

  • 868309c9 
by Ondřej Kuzník at 2023-01-30T12:06:24+00:00 
ITS#8102 Do not continue if deconfigured during pause


RE26:

  • 0b2f5ad7 
by Ondřej Kuzník at 2023-01-30T19:01:00+00:00 
ITS#8102 Do not continue if deconfigured during pause


RE25:

  • 6733fe4d 
by Ondřej Kuzník at 2023-01-30T19:02:48+00:00 
ITS#8102 Do not continue if deconfigured during pause