Issue 5470 - Sporadic failures with RE24
Summary: Sporadic failures with RE24
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd (show other issues)
Version: unspecified
Hardware: All All
: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-16 06:53 UTC by Michael Ströder
Modified: 2020-07-08 13:05 UTC (History)
0 users

See Also:


Attachments
sporadic_failures_re24.tar.gz (37.94 KB, application/x-gzip)
2008-04-16 06:55 UTC, Michael Ströder
Details
bt.txt (17.24 KB, text/plain)
2008-04-30 12:47 UTC, Luca Scamoni
Details
slapd.1.log.gz (21.21 KB, application/x-gzip)
2008-04-17 11:59 UTC, Raphael Ouazana
Details
slapd.2.log.gz (20.92 KB, application/x-gzip)
2008-04-17 11:59 UTC, Raphael Ouazana
Details
slapd.3.log.gz (21.96 KB, application/x-gzip)
2008-04-17 11:59 UTC, Raphael Ouazana
Details
test.out.gz (376 bytes, application/x-gzip)
2008-04-17 11:59 UTC, Raphael Ouazana
Details
test.out.gz (420 bytes, application/x-gzip)
2008-04-16 12:49 UTC, Raphael Ouazana
Details
slapd.1.log.gz (14.34 KB, application/x-gzip)
2008-04-16 12:49 UTC, Raphael Ouazana
Details
slapd.2.log.gz (10.95 KB, application/x-gzip)
2008-04-16 12:49 UTC, Raphael Ouazana
Details
slapd.3.log.gz (7.52 KB, application/x-gzip)
2008-04-16 12:49 UTC, Raphael Ouazana
Details

Note You need to log in before you can comment on or make changes to this issue.
Description Michael Ströder 2008-04-16 06:53:04 UTC
Full_Name: Michael Str�der
Version: RE24
OS: OpenSUSE Linux 10.3
URL: 
Submission from: (NULL) (84.163.88.4)


HI!

I already reported different test scripts failing every now and then.
I ran 'make test' a couple of times and finally it failed once in test050 (see
below). Running test050 again succeeds.

I've also attached the debug logs.

Ciao, Michael.

>>>>> Starting test050-syncrepl-multimaster ...
running defines.sh
Initializing server configurations...
Starting producer slapd on TCP/IP port 9011...
Using ldapsearch to check that producer slapd is running...
Inserting syncprov overlay on producer...
Starting consumer slapd on TCP/IP port 9012...
Using ldapsearch to check that consumer slapd is running...
Configuring syncrepl on consumer...
Starting consumer2 slapd on TCP/IP port 9013...
Using ldapsearch to check that consumer2 slapd is running...
Configuring syncrepl on consumer2...
Adding schema and databases on producer...
Using ldapadd to populate producer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapadd to populate consumer...
ldapadd failed for consumer database (32)!
>>>>> ./scripts/test050-syncrepl-multimaster failed (exit 32)
make[2]: *** [hdb-yes] Error 32
make[2]: Leaving directory
`/home/michael/src/openldap/OPENLDAP_REL_ENG_2_4/openldap/tests'
make[1]: *** [test] Error 2
make[1]: Leaving directory
`/home/michael/src/openldap/OPENLDAP_REL_ENG_2_4/openldap/tests'
make: *** [test] Error 2

--------------------------- end of test.out ----------------------------
ldap_add: No such object (32)
        matched DN: ou=People,dc=example,dc=com
adding new entry "cn=James A Jones 2,ou=Alumni
Association,ou=People,dc=example,dc=com"

Comment 1 Michael Ströder 2008-04-16 06:55:26 UTC
See slapd's debug logs attached as tar.gz
Comment 2 Howard Chu 2008-04-16 08:43:12 UTC
michael@stroeder.com wrote:
> Full_Name: Michael Ströder
> Version: RE24
> OS: OpenSUSE Linux 10.3
> URL:
> Submission from: (NULL) (84.163.88.4)
>
>
> HI!
>
> I already reported different test scripts failing every now and then.
> I ran 'make test' a couple of times and finally it failed once in test050 (see
> below). Running test050 again succeeds.
>
> I've also attached the debug logs.

As noted in the other email thread, I'm pretty sure this is the same as
http://www.openldap.org/lists/openldap-devel/200804/msg00067.html

I have committed a patch to syncprov.c for this, but I haven't reproduced the 
original problem so I can't verify that it is fixed. Running test050 by itself 
20 times never showed any problem for me. (Dual core OpenSUSE 10.2)

If you can get it to fail more regularly, let me know...

> Ciao, Michael.
>
>>>>>> Starting test050-syncrepl-multimaster ...
> running defines.sh
> Initializing server configurations...
> Starting producer slapd on TCP/IP port 9011...
> Using ldapsearch to check that producer slapd is running...
> Inserting syncprov overlay on producer...
> Starting consumer slapd on TCP/IP port 9012...
> Using ldapsearch to check that consumer slapd is running...
> Configuring syncrepl on consumer...
> Starting consumer2 slapd on TCP/IP port 9013...
> Using ldapsearch to check that consumer2 slapd is running...
> Configuring syncrepl on consumer2...
> Adding schema and databases on producer...
> Using ldapadd to populate producer...
> Waiting 20 seconds for syncrepl to receive changes...
> Using ldapadd to populate consumer...
> ldapadd failed for consumer database (32)!
>>>>>> ./scripts/test050-syncrepl-multimaster failed (exit 32)
> make[2]: *** [hdb-yes] Error 32
> make[2]: Leaving directory
> `/home/michael/src/openldap/OPENLDAP_REL_ENG_2_4/openldap/tests'
> make[1]: *** [test] Error 2
> make[1]: Leaving directory
> `/home/michael/src/openldap/OPENLDAP_REL_ENG_2_4/openldap/tests'
> make: *** [test] Error 2
>
> --------------------------- end of test.out ----------------------------
> ldap_add: No such object (32)
>          matched DN: ou=People,dc=example,dc=com
> adding new entry "cn=James A Jones 2,ou=Alumni
> Association,ou=People,dc=example,dc=com"
>
>
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 3 Raphael Ouazana 2008-04-16 09:55:22 UTC
Le Mer 16 avril 2008 10:41, hyc@symas.com a écrit :
> michael@stroeder.com wrote:
>> Full_Name: Michael Ströder
>> Version: RE24
>> OS: OpenSUSE Linux 10.3
>> URL:
>> Submission from: (NULL) (84.163.88.4)
>>
>>
>> HI!
>>
>> I already reported different test scripts failing every now and then.
>> I ran 'make test' a couple of times and finally it failed once in
>> test050 (see
>> below). Running test050 again succeeds.
>>
>> I've also attached the debug logs.
>
> As noted in the other email thread, I'm pretty sure this is the same as
> http://www.openldap.org/lists/openldap-devel/200804/msg00067.html
>
> I have committed a patch to syncprov.c for this, but I haven't reproduced
> the
> original problem so I can't verify that it is fixed. Running test050 by
> itself
> 20 times never showed any problem for me. (Dual core OpenSUSE 10.2)
>
> If you can get it to fail more regularly, let me know...

All tests passed now twice here, plus 5 times for test050/hdb only.
Ubuntu Hardy/i386/dual core.

Regards,
Raphaël Ouazana.

Comment 4 Michael Ströder 2008-04-16 10:57:13 UTC
Howard Chu wrote:
> 
> I have committed a patch to syncprov.c for this, but I haven't 
> reproduced the original problem so I can't verify that it is fixed. 
> Running test050 by itself 20 times never showed any problem for me. 
> (Dual core OpenSUSE 10.2)

Note that I reported it especially for RE24 since I've only tested it 
there. I will check HEAD more thoroughly with your patch in HEAD. Any 
chance to see exactly this patch in RE24 any time soon so that I can 
test with this particular code-base? Yes, I could port your patch to 
RE24 myself but I'd prefer if we have exactly the same code-base.

Ciao, Michael.

Comment 5 Raphael Ouazana 2008-04-16 12:49:59 UTC
Re,

Le Mer 16 avril 2008 10:41, hyc@symas.com a écrit :
> michael@stroeder.com wrote:
>> Full_Name: Michael Ströder
>> Version: RE24
>> OS: OpenSUSE Linux 10.3
>> URL:
>> Submission from: (NULL) (84.163.88.4)
>>
>>
>> HI!
>>
>> I already reported different test scripts failing every now and then.
>> I ran 'make test' a couple of times and finally it failed once in
>> test050 (see
>> below). Running test050 again succeeds.
>>
>> I've also attached the debug logs.
>
> As noted in the other email thread, I'm pretty sure this is the same as
> http://www.openldap.org/lists/openldap-devel/200804/msg00067.html
>
> I have committed a patch to syncprov.c for this, but I haven't reproduced
> the
> original problem so I can't verify that it is fixed. Running test050 by
> itself
> 20 times never showed any problem for me. (Dual core OpenSUSE 10.2)
>
> If you can get it to fail more regularly, let me know...
.
In fact test050 fails with or without your patch on syncprov.c.

Logs attached (RE24+last syncprov patch).

Regards,
Raphaël Ouazana.

Cleaning up test run directory leftover from previous run.
Running ./scripts/test050-syncrepl-multimaster...
running defines.sh
Initializing server configurations...
Starting producer slapd on TCP/IP port 9011...
Using ldapsearch to check that producer slapd is running...
Inserting syncprov overlay on producer...
Starting consumer slapd on TCP/IP port 9012...
Using ldapsearch to check that consumer slapd is running...
Configuring syncrepl on consumer...
Starting consumer2 slapd on TCP/IP port 9013...
Using ldapsearch to check that consumer2 slapd is running...
Configuring syncrepl on consumer2...
Adding schema and databases on producer...
Using ldapadd to populate producer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapadd to populate consumer...
ldapadd failed for consumer database (32)!
Comment 6 Howard Chu 2008-04-16 15:14:33 UTC
Raphaël Ouazana-Sustowski wrote:
> Re,
>
> Le Mer 16 avril 2008 10:41, hyc@symas.com a écrit :

>> I have committed a patch to syncprov.c for this, but I haven't reproduced
>> the
>> original problem so I can't verify that it is fixed. Running test050 by
>> itself
>> 20 times never showed any problem for me. (Dual core OpenSUSE 10.2)
>>
>> If you can get it to fail more regularly, let me know...
> .
> In fact test050 fails with or without your patch on syncprov.c.

OK, thanks. I actually reproduced the failure once as well. Will need a 
different workaround.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 7 Howard Chu 2008-04-16 23:51:54 UTC
changed notes
changed state Open to Test
moved from Incoming to Software Bugs
Comment 8 Howard Chu 2008-04-16 23:59:55 UTC
Raphaël Ouazana-Sustowski wrote:
> Re,
>
> Le Mer 16 avril 2008 10:41, hyc@symas.com a écrit :
>> If you can get it to fail more regularly, let me know...
> .
> In fact test050 fails with or without your patch on syncprov.c.
>
> Logs attached (RE24+last syncprov patch).

OK, please try again using the latest syncrepl.c patch. The syncprov.c patch 
has been reverted since it did nothing helpful.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 9 Raphael Ouazana 2008-04-17 11:59:52 UTC
Hi,

Le Jeu 17 avril 2008 01:59, Howard Chu a écrit :
> Raphaël Ouazana-Sustowski wrote:
>> Re,
>>
>> Le Mer 16 avril 2008 10:41, hyc@symas.com a écrit :
>>> If you can get it to fail more regularly, let me know...
>> .
>> In fact test050 fails with or without your patch on syncprov.c.
>>
>> Logs attached (RE24+last syncprov patch).
>
> OK, please try again using the latest syncrepl.c patch. The syncprov.c
> patch
> has been reverted since it did nothing helpful.

Failed after 68 iterations:
Cleaning up test run directory leftover from previous run.
Running ./scripts/test050-syncrepl-multimaster...
running defines.sh
Initializing server configurations...
Starting producer slapd on TCP/IP port 9011...
Using ldapsearch to check that producer slapd is running...
Inserting syncprov overlay on producer...
Starting consumer slapd on TCP/IP port 9012...
Using ldapsearch to check that consumer slapd is running...
Configuring syncrepl on consumer...
Starting consumer2 slapd on TCP/IP port 9013...
Using ldapsearch to check that consumer2 slapd is running...
Configuring syncrepl on consumer2...
Adding schema and databases on producer...
Using ldapadd to populate producer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapadd to populate consumer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapsearch to check that syncrepl received database changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
ldapsearch failed (32)!
Comment 10 Phil Cogbill III 2008-04-21 17:28:40 UTC
I ran into this same problem and was able to fix it.  I had been using 
the yum version of Berkeley DB 4.5 and updated to Oracle Berkeley 4.6 
and then it worked fine.  Just make sure that when you run configure 
that it includes the correct version of db.h


Comment 11 Phil Cogbill III 2008-04-21 17:31:18 UTC
scratch that.. it happened again. :\  fooie

Comment 12 Quanah Gibson-Mount 2008-04-21 17:44:12 UTC
I updated RE24 this morning with a couple of fixes from Howard.

See if you can still repo, thanks.

--Quanah

--On April 21, 2008 5:31:48 PM +0000 phil@cogbill.net wrote:

> scratch that.. it happened again. :\  fooie
>
>



--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 13 Raphael Ouazana 2008-04-24 16:16:29 UTC
Hi,

Le Lun 21 avril 2008 19:44, quanah@zimbra.com a écrit :
> I updated RE24 this morning with a couple of fixes from Howard.
>
> See if you can still repo, thanks.

It still seems to fail, after more than 100 runs:
Cleaning up test run directory leftover from previous run.
Running ./scripts/test050-syncrepl-multimaster...
running defines.sh
Initializing server configurations...
Starting producer slapd on TCP/IP port 9011...
Using ldapsearch to check that producer slapd is running...
Inserting syncprov overlay on producer...
Starting consumer slapd on TCP/IP port 9012...
Using ldapsearch to check that consumer slapd is running...
Configuring syncrepl on consumer...
Starting consumer2 slapd on TCP/IP port 9013...
Using ldapsearch to check that consumer2 slapd is running...
Configuring syncrepl on consumer2...
Adding schema and databases on producer...
Using ldapadd to populate producer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapadd to populate consumer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapsearch to check that syncrepl received database changes...
Using ldapsearch to check that syncrepl received database changes on
consumer2...
Using ldapsearch to read config from the producer...
Using ldapsearch to read config from the consumer...
Using ldapsearch to read config from consumer2...
Filtering producer results...
Filtering consumer results...
Filtering consumer2 results...
Comparing retrieved configs from producer and consumer...
Comparing retrieved configs from producer and consumer2...
Using ldapsearch to read all the entries from the producer...
Using ldapsearch to read all the entries from the consumer...
Using ldapsearch to read all the entries from the consumer2...
Filtering producer results...
Filtering consumer results...
Filtering consumer2 results...
Comparing retrieved entries from producer and consumer...
Comparing retrieved entries from producer and consumer2...
test failed - producer and consumer2 databases differ

I sent the testrun directory through FTP:
ftp://ftp.openldap.org/incoming/raphael-ouazana-testrun-20080424.tgz

Regards,
Raphaël Ouazana.

Comment 14 ando@openldap.org 2008-04-26 10:28:04 UTC
raphael.ouazana@linagora.com wrote:
> Hi,
> 
> Le Lun 21 avril 2008 19:44, quanah@zimbra.com a écrit :
>> I updated RE24 this morning with a couple of fixes from Howard.
>>
>> See if you can still repo, thanks.
> 
> It still seems to fail, after more than 100 runs:

It appears from your logs that the entry

	"cn=ITD Staff,ou=Groups,dc=example,dc=com"

is being added first, and then removed from server 3, probably related 
to some issue in sync replication.  Can you enable "sync" log when 
running the tests?

SLAPD_DEBUG="sync,stats,args,trace" ./run test050

p.



Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Email:   pierangelo.masarati@sys-net.it
---------------------------------------


Comment 15 Howard Chu 2008-04-28 20:05:23 UTC
ando@sys-net.it wrote:
> raphael.ouazana@linagora.com wrote:
>> Hi,
>>
>> Le Lun 21 avril 2008 19:44, quanah@zimbra.com a écrit :
>>> I updated RE24 this morning with a couple of fixes from Howard.
>>>
>>> See if you can still repo, thanks.
>> It still seems to fail, after more than 100 runs:
>
> It appears from your logs that the entry
>
> 	"cn=ITD Staff,ou=Groups,dc=example,dc=com"
>
> is being added first, and then removed from server 3, probably related
> to some issue in sync replication.

Seems like the consumer has received a cookie from one provider, while 
receiving a newer entry from the other provider. During the nonpresent search 
the new entry is found and removed. So we need to restrict the filter in the 
nonpresent search to only return entries older than the cookie context.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 16 Luca Scamoni 2008-04-30 09:23:24 UTC
Here it is again. HEAD code checked out this morning.
I'm able to reproduce the crash fairly often.
testrun directory uploaded at

ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430.tgz

the core file is in testrun/con



Comment 17 Luca Scamoni 2008-04-30 12:47:39 UTC
Howard Chu wrote:
> Thanks. Please try HEAD again.
No way.
new testrun directory in
ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430-new.tgz

backtrace attached


Comment 18 Luca Scamoni 2008-05-02 08:36:58 UTC
luca@OpenLDAP.org wrote:
> This is a multi-part message in MIME format.
> --------------080809000906010300090306
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> Content-Transfer-Encoding: 7bit
>
> Howard Chu wrote:
>   
>> Thanks. Please try HEAD again.
>>     
> No way.
> new testrun directory in
> ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430-new.tgz
>
> backtrace attached
>   
recent commits seem to have fixed it (at least, right now I'm not able 
to reproduce it anymore...)


Comment 19 Howard Chu 2008-05-02 08:58:58 UTC
luca@OpenLDAP.org wrote:
> luca@OpenLDAP.org wrote:
>> This is a multi-part message in MIME format.
>> --------------080809000906010300090306
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>> Content-Transfer-Encoding: 7bit
>>
>> Howard Chu wrote:
>>
>>> Thanks. Please try HEAD again.
>>>
>> No way.
>> new testrun directory in
>> ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430-new.tgz
>>
>> backtrace attached
>>
> recent commits seem to have fixed it (at least, right now I'm not able
> to reproduce it anymore...)

Right. Confirmed here too; I (temporarily) added an assert(0) to the offending 
branch of code to make sure the patch was actually getting hit. It takes a 
very particular timing to trigger that code path.

I'm not sure how we can reliably test for this down the road. Perhaps we 
should add a "disabled" config keyword for backends and syncrepl consumers, so 
that we can start up the individual servers, (which takes an unpredictable 
amount of time for each) and then enable various parts in a fixed sequence 
(e.g. 1 second sleeps between ldapmodify/enable requests). Even that's hit or 
miss, because our test database is so small it's unlikely that we can hit the 
window of time on demand.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 20 Raphael Ouazana 2008-05-05 14:06:35 UTC
Hi,

Le Ven 2 mai 2008 11:01, hyc@symas.com a écrit :
> luca@OpenLDAP.org wrote:
>> luca@OpenLDAP.org wrote:
>>> This is a multi-part message in MIME format.
>>> --------------080809000906010300090306
>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>> Content-Transfer-Encoding: 7bit
>>>
>>> Howard Chu wrote:
>>>
>>>> Thanks. Please try HEAD again.
>>>>
>>> No way.
>>> new testrun directory in
>>> ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430-new.tgz
>>>
>>> backtrace attached
>>>
>> recent commits seem to have fixed it (at least, right now I'm not able
>> to reproduce it anymore...)
>
> Right. Confirmed here too; I (temporarily) added an assert(0) to the
> offending
> branch of code to make sure the patch was actually getting hit. It takes a
> very particular timing to trigger that code path.
>
> I'm not sure how we can reliably test for this down the road. Perhaps we
> should add a "disabled" config keyword for backends and syncrepl
> consumers, so
> that we can start up the individual servers, (which takes an unpredictable
> amount of time for each) and then enable various parts in a fixed sequence
> (e.g. 1 second sleeps between ldapmodify/enable requests). Even that's hit
> or
> miss, because our test database is so small it's unlikely that we can hit
> the
> window of time on demand.

I'm testing the last RE24 tag. After 201 successful runs of test050, I got
a failure :/
Cleaning up test run directory leftover from previous run.
Running ./scripts/test050-syncrepl-multimaster...
running defines.sh
Initializing server configurations...
Starting producer slapd on TCP/IP port 9011...
Using ldapsearch to check that producer slapd is running...
Inserting syncprov overlay on producer...
Starting consumer slapd on TCP/IP port 9012...
Using ldapsearch to check that consumer slapd is running...
Configuring syncrepl on consumer...
Starting consumer2 slapd on TCP/IP port 9013...
Using ldapsearch to check that consumer2 slapd is running...
Configuring syncrepl on consumer2...
Adding schema and databases on producer...
Using ldapadd to populate producer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapadd to populate consumer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapsearch to check that syncrepl received database changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
ldapsearch failed (32)!

testrun uploaded in
ftp://ftp.openldap.org/incoming/raphael-ouazana-testrun-080505.tgz

Regards,
Raphaël Ouazana.

Comment 21 Howard Chu 2008-05-05 21:08:19 UTC
Raphaël Ouazana-Sustowski wrote:
> Hi,
>
> Le Ven 2 mai 2008 11:01, hyc@symas.com a écrit :
>> luca@OpenLDAP.org wrote:
>>> luca@OpenLDAP.org wrote:
>>>> This is a multi-part message in MIME format.
>>>> --------------080809000906010300090306
>>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>> Content-Transfer-Encoding: 7bit
>>>>
>>>> Howard Chu wrote:
>>>>
>>>>> Thanks. Please try HEAD again.
>>>>>
>>>> No way.
>>>> new testrun directory in
>>>> ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430-new.tgz
>>>>
>>>> backtrace attached
>>>>
>>> recent commits seem to have fixed it (at least, right now I'm not able
>>> to reproduce it anymore...)
>> Right. Confirmed here too; I (temporarily) added an assert(0) to the
>> offending
>> branch of code to make sure the patch was actually getting hit. It takes a
>> very particular timing to trigger that code path.
>>
>> I'm not sure how we can reliably test for this down the road. Perhaps we
>> should add a "disabled" config keyword for backends and syncrepl
>> consumers, so
>> that we can start up the individual servers, (which takes an unpredictable
>> amount of time for each) and then enable various parts in a fixed sequence
>> (e.g. 1 second sleeps between ldapmodify/enable requests). Even that's hit
>> or
>> miss, because our test database is so small it's unlikely that we can hit
>> the
>> window of time on demand.
>
> I'm testing the last RE24 tag. After 201 successful runs of test050, I got
> a failure :/
> Cleaning up test run directory leftover from previous run.
> Running ./scripts/test050-syncrepl-multimaster...
> running defines.sh
> Initializing server configurations...
> Starting producer slapd on TCP/IP port 9011...
> Using ldapsearch to check that producer slapd is running...
> Inserting syncprov overlay on producer...
> Starting consumer slapd on TCP/IP port 9012...
> Using ldapsearch to check that consumer slapd is running...
> Configuring syncrepl on consumer...
> Starting consumer2 slapd on TCP/IP port 9013...
> Using ldapsearch to check that consumer2 slapd is running...
> Configuring syncrepl on consumer2...
> Adding schema and databases on producer...
> Using ldapadd to populate producer...
> Waiting 20 seconds for syncrepl to receive changes...
> Using ldapadd to populate consumer...
> Waiting 20 seconds for syncrepl to receive changes...
> Using ldapsearch to check that syncrepl received database changes...
> Waiting 5 seconds for syncrepl to receive changes...
> Waiting 5 seconds for syncrepl to receive changes...
> Waiting 5 seconds for syncrepl to receive changes...
> Waiting 5 seconds for syncrepl to receive changes...
> Waiting 5 seconds for syncrepl to receive changes...
> Waiting 5 seconds for syncrepl to receive changes...
> ldapsearch failed (32)!
>
> testrun uploaded in
> ftp://ftp.openldap.org/incoming/raphael-ouazana-testrun-080505.tgz

The logs show that the syncrepl consumers all timed out periodically, when 
trying to bind to a provider. It seems that using a 1 second timeout in the 
syncrepl configs is too short, or your test machine was too slow during that run.

Probably we should remove that timeout now, since the cn=config/thread pause 
issue has already been resolved.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 22 Quanah Gibson-Mount 2008-05-05 23:55:15 UTC
changed notes
changed state Test to Release
Comment 23 Raphael Ouazana 2008-05-07 12:45:50 UTC
Hi,

Le Lun 5 mai 2008 23:08, Howard Chu a écrit :
> The logs show that the syncrepl consumers all timed out periodically, when
> trying to bind to a provider. It seems that using a 1 second timeout in
> the
> syncrepl configs is too short, or your test machine was too slow during
> that run.
>
> Probably we should remove that timeout now, since the cn=config/thread
> pause
> issue has already been resolved.


Currently testing new RE24.  I have now a slapd process stuck at 100% :
\_ /bin/sh ./scripts/test050-syncrepl-multimaster hdb
    \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F slapd.d -h
ldap://localhost:9011/ -d 261
    \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F ./slapd.d -h
ldap://localhost:9012/ -d 261
    \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F ./slapd.d -h
ldap://localhost:9013/ -d 261
    \_ /bin/sh ./scripts/test050-syncrepl-multimaster hdb
        \_ ../clients/tools/ldapsearch -P 3 -x -LLL -H
ldap://localhost:9013/ -s base -b cn=Ursula Hampster,ou=Alumni
Association,ou=People,dc=example,dc=com (objectClass=*)
        \_ awk /^dn:/ {print "OK"}

Do you need some debugging info (strace, gdb, log, or other) ?

Regards,
Raphael Ouazana.

Comment 24 Quanah Gibson-Mount 2008-05-07 14:11:52 UTC
--On Wednesday, May 07, 2008 12:44 PM +0000 raphael.ouazana@linagora.com 
wrote:

> Currently testing new RE24.  I have now a slapd process stuck at 100% :
> \_ /bin/sh ./scripts/test050-syncrepl-multimaster hdb
>     \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F slapd.d -h
> ldap://localhost:9011/ -d 261
>     \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F ./slapd.d -h
> ldap://localhost:9012/ -d 261
>     \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F ./slapd.d -h
> ldap://localhost:9013/ -d 261
>     \_ /bin/sh ./scripts/test050-syncrepl-multimaster hdb
>         \_ ../clients/tools/ldapsearch -P 3 -x -LLL -H
> ldap://localhost:9013/ -s base -b cn=Ursula Hampster,ou=Alumni
> Association,ou=People,dc=example,dc=com (objectClass=*)
>         \_ awk /^dn:/ {print "OK"}
>
> Do you need some debugging info (strace, gdb, log, or other) ?

Please provide a gdb backtrace of the hung process.

--Quanah



--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 25 Raphael Ouazana 2008-05-07 14:26:26 UTC
Hi,

Le Mer 7 mai 2008 16:11, Quanah Gibson-Mount a écrit :
> --On Wednesday, May 07, 2008 12:44 PM +0000 raphael.ouazana@linagora.com
> wrote:
>
>> Currently testing new RE24.  I have now a slapd process stuck at 100% :
>> \_ /bin/sh ./scripts/test050-syncrepl-multimaster hdb
>>     \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F slapd.d -h
>> ldap://localhost:9011/ -d 261
>>     \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F ./slapd.d -h
>> ldap://localhost:9012/ -d 261
>>     \_ /tmp/openldap/tests/../servers/slapd/slapd -s0 -F ./slapd.d -h
>> ldap://localhost:9013/ -d 261
>>     \_ /bin/sh ./scripts/test050-syncrepl-multimaster hdb
>>         \_ ../clients/tools/ldapsearch -P 3 -x -LLL -H
>> ldap://localhost:9013/ -s base -b cn=Ursula Hampster,ou=Alumni
>> Association,ou=People,dc=example,dc=com (objectClass=*)
>>         \_ awk /^dn:/ {print "OK"}
>>
>> Do you need some debugging info (strace, gdb, log, or other) ?
>
> Please provide a gdb backtrace of the hung process.

It is not really hung, it is more in an infinite loop.

Here is the backtrace:
(gdb) thread apply all bt

Thread 5 (Thread 0xb7ad5b90 (LWP 6028)):
#0  0xb7eef410 in __kernel_vsyscall ()
#1  0xb7d14676 in epoll_wait () from /lib/tls/i686/cmov/libc.so.6
#2  0x080600bb in slapd_daemon_task (ptr=0x0) at daemon.c:2281
#3  0xb7da54fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#4  0xb7d13e5e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 4 (Thread 0xb76d4b90 (LWP 6030)):
#0  0xb7eef410 in __kernel_vsyscall ()
#1  0xb7da9aa5 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/tls/i686/cmov/libpthread.so.0
#2  0x0807302b in send_ldap_ber (conn=0xb7ad6b28, ber=0xb76d2e98) at
result.c:188
#3  0x08075256 in slap_send_search_entry (op=0x8252af0, rs=0xb76d4178) at
result.c:1200
#4  0x08056418 in config_send (op=0x8252af0, rs=0xb76d4178, ce=0x828bf48,
depth=0) at bconfig.c:3507
#5  0x080563e4 in config_send (op=0x8252af0, rs=0xb76d4178, ce=0x828bf48,
depth=1) at bconfig.c:3516
#6  0x080563be in config_send (op=0x8252af0, rs=0xb76d4178, ce=0x82527c0,
depth=0) at bconfig.c:3511
#7  0x080563e4 in config_send (op=0x8252af0, rs=0xb76d4178, ce=0x82527c0,
depth=1) at bconfig.c:3516
#8  0x080563be in config_send (op=0x8252af0, rs=0xb76d4178, ce=0x824fc80,
depth=0) at bconfig.c:3511
#9  0x080564e4 in config_back_search (op=0x8252af0, rs=0xb76d4178) at
bconfig.c:5261
#10 0x08065283 in fe_op_search (op=0x8252af0, rs=0xb76d4178) at search.c:366
#11 0x08065af7 in do_search (op=0x8252af0, rs=0xb76d4178) at search.c:217
#12 0x0806312f in connection_operation (ctx=0xb76d4248, arg_v=0x8252af0)
at connection.c:1084
#13 0x080639bf in connection_read_thread (ctx=0xb76d4248, argv=0x8) at
connection.c:1211
#14 0x0812fb48 in ldap_int_thread_pool_wrapper (xpool=0x822cae8) at
tpool.c:663
#15 0xb7da54fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#16 0xb7d13e5e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 3 (Thread 0xb71d2b90 (LWP 6034)):
#0  0xb7eef410 in __kernel_vsyscall ()
#1  0xb7da9aa5 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/tls/i686/cmov/libpthread.so.0
#2  0x0812fbb5 in ldap_int_thread_pool_wrapper (xpool=0x822cae8) at
tpool.c:654
#3  0xb7da54fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#4  0xb7d13e5e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 2 (Thread 0xb6dd1b90 (LWP 6035)):
#0  0xb7eef410 in __kernel_vsyscall ()
#1  0xb7da9aa5 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/tls/i686/cmov/libpthread.so.0
#2  0x0812f4c2 in handle_pause (tpool=<value optimized out>, do_pause=1)
at tpool.c:738
#3  0x08054abf in config_back_add (op=0xb6dd0dd4, rs=0xb6dd0728) at
bconfig.c:4548
---Type <return> to continue, or q <return> to quit---
#4  0x080bb70f in syncrepl_entry (si=0x8255888, op=0xb6dd0dd4,
entry=0x823ffec, modlist=0xb6dd0b44, syncstate=1, syncUUID=0xb6dd0b0c,
    syncCSN=0x0) at syncrepl.c:2108
#5  0x080be0d9 in do_syncrep2 (op=0xb6dd0dd4, si=0x8255888) at syncrepl.c:862
#6  0x080bfec4 in do_syncrepl (ctx=0xb6dd1248, arg=0x8255ac8) at
syncrepl.c:1276
#7  0x08063a06 in connection_read_thread (ctx=0xb6dd1248, argv=0x9) at
connection.c:1213
#8  0x0812fb48 in ldap_int_thread_pool_wrapper (xpool=0x822cae8) at
tpool.c:663
#9  0xb7da54fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#10 0xb7d13e5e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 1 (Thread 0xb7c3c6b0 (LWP 6024)):
#0  0xb7eef410 in __kernel_vsyscall ()
#1  0xb7da6775 in pthread_join () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x0805d633 in slapd_daemon () at daemon.c:2644
#3  0x0804c597 in main (argc=8, argv=0xbfaefcc4) at main.c:946
#0  0xb7eef410 in __kernel_vsyscall ()
(gdb)

Regards,
Raphaël Ouazana.

Comment 26 Howard Chu 2008-05-07 22:03:37 UTC
raphael.ouazana@linagora.com wrote:
> It is not really hung, it is more in an infinite loop.

Nothing in this trace indicates a loop; all of the threads are in wait states. 
Is there anything in the logs that indicates a loop?

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 27 Raphael Ouazana 2008-05-08 01:28:11 UTC
Le Jeu 8 mai 2008 00:03, Howard Chu a écrit :
> raphael.ouazana@linagora.com wrote:
>> It is not really hung, it is more in an infinite loop.
>
> Nothing in this trace indicates a loop; all of the threads are in wait
> states.
> Is there anything in the logs that indicates a loop?

Yes, at least two things:
- one CPU is about at 100%
- logs from slapd.1.log and slapd.2.log are slowly increasing

It seems to be an infinite loop between the first slapd and the second
slapd, where the second is always active.

I'm sorry I won't be able to access the logs until Tuesday, and all the
processes are now killed.

Regards,
Raphaël Ouazana.

Comment 28 Quanah Gibson-Mount 2008-05-13 03:22:38 UTC
changed notes
changed state Release to Closed
Comment 29 OpenLDAP project 2014-08-01 21:04:14 UTC
syncrepl fixed in HEAD
fixed in RE24