Issue 4589 - syncrepl DELs sometimes don't propagate
Summary: syncrepl DELs sometimes don't propagate
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: unspecified
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-06-13 19:15 UTC by richton@nbcs.rutgers.edu
Modified: 2014-08-01 21:06 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description richton@nbcs.rutgers.edu 2006-06-13 19:15:33 UTC
Full_Name: Aaron Richton
Version: 2.3.24
OS: Solaris 9
URL: 
Submission from: (NULL) (128.6.31.135)


Something not quite right with session log or nonpresent list or.......

Pretty much the same as the last time I played the syncrepl game, except this
time instead of entries missing from the slave, there are excessive entries on
the slave. Here's an example, although it's hardly illuminating:

[...]
Jun  8 15:18:09 master.rutgers.edu slapd[29239]: [ID 379477 local4.debug]
conn=361133 op=2 DEL dn="uid=VICTIM,ou=People,dc=rci,dc=rutgers,dc=edu"
[...]
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 848112 local4.debug]
conn=361315 fd=100 ACCEPT from IP=192.168.4.67:53492 (IP=0.0.0.0:389)
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 560212 local4.debug]
conn=361315 op=0 STARTTLS
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 875301 local4.debug]
conn=361315 op=0 RESULT oid= err=0 text=
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 105384 local4.debug]
conn=361315 fd=100 TLS established tls_ssf=256 ssf=256
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 215403 local4.debug]
conn=361315 op=1 BIND dn="cn=readEverything,dc=rci,dc=rutgers,dc=edu"
method=128
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 600343 local4.debug]
conn=361315 op=1 BIND dn="cn=readEverything,dc=rci,dc=rutgers,dc=edu"
mech=SIMPLE ssf=0
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 588225 local4.debug]
conn=361315 op=1 RESULT tag=97 err=0 text=
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 469902 local4.debug]
conn=361315 op=2 SRCH base="dc=rci,dc=rutgers,dc=edu" scope=2 deref=0
filter="(objectClass=*)"
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 744844 local4.debug]
conn=361315 op=2 SRCH attr=* +
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 545199 local4.debug] Entry
uid=BYSTANDER,ou=People,dc=rci,dc=rutgers,dc=edu CSN
20060608190309Z#000000#00#000000 matches ctx 20060608190309Z#000000#00#000000
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 167594 local4.debug]
conn=361315 op=2 SEARCH RESULT tag=101 err=0 nentries=0 text=
Jun  8 15:22:44 slave5.rutgers.edu slapd[12966]: [ID 496601 local4.debug]
do_syncrep2: LDAP_RES_SEARCH_RESULT
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 218904 local4.debug]
conn=361315 op=3 UNBIND
Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 952275 local4.debug]
conn=361315 fd=100 closed

and at the end of this, slave5 still has uid=VICTIM, which is undesirable.
slave5 is on "stats sync" debug. I think I should see a del_nonpresent in there,
no? So I'd assume the master is telling it something wrong. Not sure what else I
can do. I have:

-d -1 from slave5 (HUGE, running for weeks. I can try to slim it down.)
"stats sync" syslogs from all slaves
"stats sync" syslogs from the master

Everything's still running as it was/is, so if you want to dbx some syncprov
struct or something like that, I can do that. Of course it might not be in scope
still, but I can always delete more! (Although I'm fairly certain this isn't
100% of the time. I'll look harder. Might be like last time, need a certain
number of ops per sync or similar.)

Comment 1 Justin Pratt 2006-06-14 17:58:15 UTC
richton@nbcs.rutgers.edu wrote:

> Something not quite right with session log or nonpresent list or.......
> 
> Pretty much the same as the last time I played the syncrepl game, except this
> time instead of entries missing from the slave, there are excessive entries on
> the slave.

Yes, I can reproduce this problem as well. At a cursory glance it
appears to be a sessionlog issue as the problem disappears if I disable
the sessionlog. Can provide additional info if needed.

Comment 2 Howard Chu 2006-06-14 22:14:40 UTC
richton@nbcs.rutgers.edu wrote:
> Full_Name: Aaron Richton
> Version: 2.3.24
> OS: Solaris 9
> URL: 
> Submission from: (NULL) (128.6.31.135)
>
>
> Something not quite right with session log or nonpresent list or.......
>
> Pretty much the same as the last time I played the syncrepl game, except this
> time instead of entries missing from the slave, there are excessive entries on
> the slave. Here's an example, although it's hardly illuminating:
>
> [...]
> Jun  8 15:18:09 master.rutgers.edu slapd[29239]: [ID 379477 local4.debug]
> conn=361133 op=2 DEL dn="uid=VICTIM,ou=People,dc=rci,dc=rutgers,dc=edu"
> [...]
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 848112 local4.debug]
> conn=361315 fd=100 ACCEPT from IP=192.168.4.67:53492 (IP=0.0.0.0:389)
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 560212 local4.debug]
> conn=361315 op=0 STARTTLS
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 875301 local4.debug]
> conn=361315 op=0 RESULT oid= err=0 text=
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 105384 local4.debug]
> conn=361315 fd=100 TLS established tls_ssf=256 ssf=256
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 215403 local4.debug]
> conn=361315 op=1 BIND dn="cn=readEverything,dc=rci,dc=rutgers,dc=edu"
> method=128
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 600343 local4.debug]
> conn=361315 op=1 BIND dn="cn=readEverything,dc=rci,dc=rutgers,dc=edu"
> mech=SIMPLE ssf=0
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 588225 local4.debug]
> conn=361315 op=1 RESULT tag=97 err=0 text=
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 469902 local4.debug]
> conn=361315 op=2 SRCH base="dc=rci,dc=rutgers,dc=edu" scope=2 deref=0
> filter="(objectClass=*)"
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 744844 local4.debug]
> conn=361315 op=2 SRCH attr=* +
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 545199 local4.debug] Entry
> uid=BYSTANDER,ou=People,dc=rci,dc=rutgers,dc=edu CSN
> 20060608190309Z#000000#00#000000 matches ctx 20060608190309Z#000000#00#000000
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 167594 local4.debug]
> conn=361315 op=2 SEARCH RESULT tag=101 err=0 nentries=0 text=
> Jun  8 15:22:44 slave5.rutgers.edu slapd[12966]: [ID 496601 local4.debug]
> do_syncrep2: LDAP_RES_SEARCH_RESULT
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 218904 local4.debug]
> conn=361315 op=3 UNBIND
> Jun  8 15:22:44 master.rutgers.edu slapd[29239]: [ID 952275 local4.debug]
> conn=361315 fd=100 closed
>
> and at the end of this, slave5 still has uid=VICTIM, which is undesirable.
> slave5 is on "stats sync" debug. I think I should see a del_nonpresent in there,
> no? So I'd assume the master is telling it something wrong. Not sure what else I
> can do. I have:
>
> -d -1 from slave5 (HUGE, running for weeks. I can try to slim it down.)
> "stats sync" syslogs from all slaves
> "stats sync" syslogs from the master
>
> Everything's still running as it was/is, so if you want to dbx some syncprov
> struct or something like that, I can do that. Of course it might not be in scope
> still, but I can always delete more! (Although I'm fairly certain this isn't
> 100% of the time. I'll look harder. Might be like last time, need a certain
> number of ops per sync or similar.)
>
>
>
>
>   
It would probably help to just look at the logs for all of the write 
operations that occurred since the last refresh, in the refresh pass 
that is missing the Delete. If re-issuing the same sequence of 
operations always reproduces the bug, then we'll have something to go on.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/

Comment 3 Howard Chu 2006-06-15 20:38:12 UTC
pratt@cs.utk.edu wrote:
> richton@nbcs.rutgers.edu wrote:
>
>   
>> Something not quite right with session log or nonpresent list or.......
>>
>> Pretty much the same as the last time I played the syncrepl game, except this
>> time instead of entries missing from the slave, there are excessive entries on
>> the slave.
>>     
>
> Yes, I can reproduce this problem as well. At a cursory glance it
> appears to be a sessionlog issue as the problem disappears if I disable
> the sessionlog. Can provide additional info if needed.
>
>   
I've been able to reproduce this problem with an optimized build, 
setting syncprov-sessionlog 2. In this case, single deletes after the 
first don't get sent by syncprov. However, with a non-optimized debug 
build, the problem doesn't occur. I may have a buggy optimizer here (gcc 
4.0.2). What compiler versions are you using, and can you try with a 
non-optimized syncprov overlay?

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/

Comment 4 Howard Chu 2006-06-15 21:59:04 UTC
hyc@symas.com wrote:
> pratt@cs.utk.edu wrote:
>   
>> richton@nbcs.rutgers.edu wrote:
>>
>>   
>>     
>>> Something not quite right with session log or nonpresent list or.......
>>>
>>> Pretty much the same as the last time I played the syncrepl game, except this
>>> time instead of entries missing from the slave, there are excessive entries on
>>> the slave.
>>>     
>>>       
>> Yes, I can reproduce this problem as well. At a cursory glance it
>> appears to be a sessionlog issue as the problem disappears if I disable
>> the sessionlog. Can provide additional info if needed.
>>
>>   
>>     
> I've been able to reproduce this problem with an optimized build, 
> setting syncprov-sessionlog 2. In this case, single deletes after the 
> first don't get sent by syncprov. However, with a non-optimized debug 
> build, the problem doesn't occur. I may have a buggy optimizer here (gcc 
> 4.0.2). What compiler versions are you using, and can you try with a 
> non-optimized syncprov overlay?
>
>   
This is now fixed in HEAD, slapd/ldapsync.c  rev 1.34


-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/

Comment 5 Howard Chu 2006-06-15 21:59:59 UTC
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
Comment 6 richton@nbcs.rutgers.edu 2006-07-25 18:01:59 UTC
I'm still seeing this. I just noticed the

                          delcsn.bv_val[delcsn.bv_len] = '\0';

in syncprov.c 1.152. Any chance that might be manifested with these 
symptoms? (I'll put it in production anyway, looks like a good idea.)

Comment 7 Howard Chu 2006-07-25 18:27:37 UTC
richton@nbcs.rutgers.edu wrote:
> I'm still seeing this. I just noticed the
>
>                           delcsn.bv_val[delcsn.bv_len] = '\0';
>
> in syncprov.c 1.152. Any chance that might be manifested with these 
> symptoms? (I'll put it in production anyway, looks like a good idea.)
>
>   
It's a good idea anyway, but the patch to ldapsync.c already addressed 
any problem this may have caused.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/

Comment 8 richton@nbcs.rutgers.edu 2006-07-25 22:09:51 UTC
> It's a good idea anyway, but the patch to ldapsync.c already addressed any 
> problem this may have caused.

Hmm, that's not good. Things are very similar to the way MODs were before: 
an individual delete won't go through; I see a batch of 15 DELs, and ops 
2-15 go through. So the first one is either getting chopped off or undone.

Comment 9 Justin Pratt 2006-07-26 17:49:42 UTC
OK, I'm still seeing this problem as well with the updated ldapsync.c  
from HEAD.

On Jul 25, 2006, at 6:02 PM, richton@nbcs.rutgers.edu wrote:

> I'm still seeing this. I just noticed the
>
>                           delcsn.bv_val[delcsn.bv_len] = '\0';
>
> in syncprov.c 1.152. Any chance that might be manifested with these
> symptoms? (I'll put it in production anyway, looks like a good idea.)
>
>

Comment 10 Kurt Zeilenga 2006-07-28 13:12:42 UTC
changed notes
changed state Test to Release
Comment 11 Buchan Milne 2006-08-01 16:46:20 UTC
On Wednesday 26 July 2006 19:49, pratt@cs.utk.edu wrote:
> OK, I'm still seeing this problem as well with the updated ldapsync.c
> from HEAD.
>
> On Jul 25, 2006, at 6:02 PM, richton@nbcs.rutgers.edu wrote:
> > I'm still seeing this. I just noticed the
> >
> >                           delcsn.bv_val[delcsn.bv_len] = '\0';
> >
> > in syncprov.c 1.152. Any chance that might be manifested with these
> > symptoms? (I'll put it in production anyway, looks like a good idea.)

I have a master (using glue, with syncprov on the parent and subordinate) with 
4 slaves which run with the patch to ldapsync.c, and another 2 slaves which 
run my previous set of packages (only difference is not having the patch).

I sent through a set of approx 400 MODRDNs.

The 2 slaves without the patch received the set of MODRDNs, the 4 slaves with 
the patch did not receive the MODRDNs.

I won't have time to look into this further today, but will try and reproduce 
it tomorrow.

-- 
Buchan Milne
ISP Systems Specialist
B.Eng,RHCE(803004789010797),LPIC-2(LPI000074592)
Comment 12 Kurt Zeilenga 2006-08-01 17:28:06 UTC
changed state Release to Closed
Comment 13 Howard Chu 2009-02-17 05:18:47 UTC
moved from Software Bugs to Archive.Software Bugs
Comment 14 OpenLDAP project 2014-08-01 21:06:44 UTC
fixed in HEAD/re23