
Re: syncRepl master loses syncrepl entry (ITS#2928)




--On Thursday, January 22, 2004 11:34 AM -0500 Jong <jongchoi@OpenLDAP.org> 
wrote:

> 	
>
>> I found that after repeated use and lockups of the slaves (which
>> required them
>
> Is this the same one as reported in ITS#2910 ?
> Is the refreshAndPersist mode used ?
> and the size of the replication content ?

Yes, this locking issue is the same as #2910.  I see it is noted as fixed. 
I would be happy to pull down the patch and test that it works, if someone 
can point me to the files that changed.

I am using refreshAndPersist.

The replication content size was fairly small (about 5 new entries between 
the master and the replicas).

>
>> to be killed, and db_recover run on their systems), that the master
>> eventually lost its syncrepl entry (????), making it so that it was
>> impossible for them to
>
> From the above context, isn't it the slave (consumer) who is supposed to
> lose its syncrepl entry ?

Yeah, I think I searched on the wrong thing -- I looked for *repl in the 
database dump (not *sync).

>
>> query the master for further updates, and resulting in the following
>> errors in the ldap log on the replica:
>>
>> Jan 21 21:36:19 ldap-dev1.Stanford.EDU slapd[13222]: [ID 100111
>> local4.debug] slapd starting
>> Jan 21 21:36:20 ldap-dev1.Stanford.EDU slapd[13222]: [ID 166296
>> local4.debug] null_callback : error code 0x42
>> Jan 21 21:36:20 ldap-dev1.Stanford.EDU slapd[13222]: [ID 749340
>> local4.debug] syncrepl_entry : be_search failed (66)
>> Jan 21 21:36:23 ldap-dev1.Stanford.EDU slapd[13222]: [ID 166296
>> local4.debug] null_callback : error code 0x42
>> Jan 21 21:36:23 ldap-dev1.Stanford.EDU slapd[13222]: [ID 749340
>> local4.debug] syncrepl_entry : be_search failed (66)
>> Jan 21 21:36:23 ldap-dev1.Stanford.EDU slapd[13222]: [ID 166296
>> local4.debug] null_callback : error code 0x42
>> Jan 21 21:36:23 ldap-dev1.Stanford.EDU slapd[13222]: [ID 749340
>> local4.debug] syncrepl_entry : be_search failed (66)
>> Jan 21 21:36:23 ldap-dev1.Stanford.EDU slapd[13222]: [ID 166296
>> local4.debug] null_callback : error code 0x42
>> Jan 21 21:36:23 ldap-dev1.Stanford.EDU slapd[13222]: [ID 749340
>> local4.debug] syncrepl_entry : be_search failed (66)
>
> This sequence of error revealed a need to update the present mode
> processing routine to make it delete the non-present entries from the
> leaf entries first. A direct approach would be to remove the leaf entries
> first by examining the hasSubordinates operational attributes. The
> disadvantage of this approach is that it is required to repeat the
> non-present entry deletion process when all to-be-deleted entries become
> leaf entries.
> Another approach would be to sort the entries in the createTimestamp
> order. Because server side sorting is not supported, the syncrepl engine
> needs to make si_nonpresentlist be sorted in the createTimestamp order.
> I'm more inclined to the latter. Any comments or other possibilities ?

Okay, I think this is the real problem then -- basically, after the lockup 
of syncRepl on the slaves, I've had to run db_recover.  Whatever changes 
it dumped from the DB probably caused this scenario.  I think the second 
solution sounds a little better, since it avoids repeating the 
non-present entry deletion pass.
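
For what it's worth, here is a rough sketch of how I read the second 
approach.  Result code 66 (0x42) is notAllowedOnNonLeaf, so the consumer 
is evidently trying to delete a parent before its children.  The sketch 
below is Python purely for illustration (the real syncrepl engine is C), 
and the names deletion_order and dn_depth and the sample entries are 
hypothetical, not anything in slapd.  It assumes a child is never created 
before its parent, and it breaks createTimestamp ties by DN depth, which 
is my own addition and not part of the proposal.

```python
from typing import List, Tuple

# (dn, createTimestamp) pairs standing in for si_nonpresentlist.
# Timestamps are GeneralizedTime strings in one consistent format,
# so plain string comparison orders them chronologically.
NonPresent = Tuple[str, str]


def dn_depth(dn: str) -> int:
    """Rough RDN count; a naive split that ignores escaped commas."""
    return len(dn.split(","))


def deletion_order(entries: List[NonPresent]) -> List[str]:
    """Return DNs newest-first, so children come before their parents.

    Ties on createTimestamp are broken by DN depth (deeper first).
    """
    ordered = sorted(
        entries,
        key=lambda e: (e[1], dn_depth(e[0])),
        reverse=True,
    )
    return [dn for dn, _ in ordered]


# Hypothetical sample data: a parent and two children.
nonpresent = [
    ("ou=people,dc=example,dc=com", "20040120101500Z"),
    ("uid=a,ou=people,dc=example,dc=com", "20040121213000Z"),
    ("uid=b,ou=people,dc=example,dc=com", "20040120101500Z"),
]

for dn in deletion_order(nonpresent):
    print(dn)
```

With that ordering, both uid entries are deleted before ou=people, so a 
single pass over the list should be enough as long as the timestamp 
assumption holds.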

--Quanah

--
Quanah Gibson-Mount
Principal Software Developer
ITSS/TSS/Computing Systems
ITSS/TSS/Infrastructure Operations
Stanford University
GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html