Issue 6158 - syncprov: assert causing slapd to core dump
Summary: syncprov: assert causing slapd to core dump
Status: VERIFIED FIXED
Alias: None
Product: OpenLDAP
Classification: Unclassified
Component: slapd
Version: unspecified
Hardware: All All
Importance: --- normal
Target Milestone: ---
Assignee: OpenLDAP project
Reported: 2009-06-02 10:28 UTC by Jonathan
Modified: 2014-08-01 21:03 UTC
Attachments
patch-syncprov-20090602.patch (624 bytes, patch)
2009-06-02 16:11 UTC, Jonathan

Description Jonathan 2009-06-02 10:28:05 UTC
Full_Name: Jonathan Clarke
Version: 2.3.43
OS: Solaris
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (213.41.243.192)


Hi,

I have slapd 2.3.43 running on a Solaris SPARC server. It crashes occasionally,
once every week or two, always during the night. At that time of night a large
number of operations are performed, including mass deletes and adds. I haven't
been able to reproduce the bug on demand; I can only watch it happen on the
production server every now and again.

I managed to obtain a coredump, and a backtrace (at the end of this message). I
realize this isn't much to go on, but I'm rather unfamiliar with this part of
the code, so I wondered if anyone has an idea what's going on here?

FWIW, the dynlist and chain overlays are in use on the server, and the database
is bdb, with a syncrepl consumer as well as syncprov overlay.


Backtrace follows:
8<-------------------------------------------------------------
Thread 1 (process 1054014):
#0  0xfee4aa58 in _lwp_kill () from /lib/libc.so.1
#1  0xfede5a64 in raise () from /lib/libc.so.1
#2  0xfedc1954 in abort () from /lib/libc.so.1
#3  0xfedc1b90 in _assert () from /lib/libc.so.1
#4  0xff30ef44 in ldap_pvt_runqueue_resched (rq=0x16c630, entry=0xee6c0a0, defer=0) at rq.c:165
#5  0xfe7f4a94 in syncprov_qstart (so=0x10acb540) at syncprov.c:933
#6  0xfe7f4d6c in syncprov_qresp (opc=0x1b1bfaf8, so=0x10acb540, mode=2) at syncprov.c:982
#7  0xfe7f5aa4 in syncprov_matchops (op=0xf6bffa50, opc=0x1b1bfaf8, saveit=0) at syncprov.c:1175
#8  0xfe7f7490 in syncprov_op_response (op=0xf6bffa50, rs=0xf6bff644) at syncprov.c:1561
#9  0x000575cc in ?? ()
#10 0x000575cc in ?? ()
8<-------------------------------------------------------------

Thanks in advance for any pointers!

Regards,
Jonathan
Comment 1 Jonathan 2009-06-02 16:11:39 UTC

OK, I've spent some more time trying to understand this part of
syncprov.c. From what I understand:

- the assert failure in ldap_pvt_runqueue_resched occurs because
syncprov_qstart is trying to "reschedule" a task that is no longer in
the task_list
- the only time the task is removed from the task_list (via
ldap_pvt_runqueue_remove) is while the task is running, in
syncprov_qtask, if syncprov_qplay returns != 0
- the next time syncprov_qstart is called, it finds that "so->s_qtask" is
not NULL and tries to reschedule the task, but the task is no longer in
the task_list.

I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask,
just after removing the task from the task_list, so that when
syncprov_qstart is called again it takes the ldap_pvt_runqueue_insert
path instead of trying to reschedule. The patch is attached.
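
To make the suspected sequence concrete, here is a minimal sketch in C using
a toy run queue. It is not the attached patch and does not use the real
ldap_pvt_runqueue_* functions: every name in it (struct task, struct
runqueue, struct syncop, rq_insert, rq_remove, rq_contains, qstart,
qtask_done, simulate) is a hypothetical stand-in, and the only things taken
from the analysis above are the cached "s_qtask" pointer and the order of
events. Where the real reschedule would hit its assert, the sketch returns 0
instead of aborting.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the run queue; hypothetical names and layout,
 * not the real OpenLDAP structures. */
struct task { struct task *next; };
struct runqueue { struct task *task_list; };

/* Subscription state: s_qtask caches the queued task, as in the report. */
struct syncop { struct task *s_qtask; struct task storage; };

/* Return 1 if t is currently linked into rq->task_list, else 0. */
static int rq_contains(const struct runqueue *rq, const struct task *t) {
    for (const struct task *e = rq->task_list; e != NULL; e = e->next)
        if (e == t) return 1;
    return 0;
}

/* Push t onto the front of the task list. */
static void rq_insert(struct runqueue *rq, struct task *t) {
    t->next = rq->task_list;
    rq->task_list = t;
}

/* Unlink t from the task list if it is present. */
static void rq_remove(struct runqueue *rq, struct task *t) {
    for (struct task **pp = &rq->task_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == t) { *pp = t->next; t->next = NULL; return; }
    }
}

/* Models the start step: returns 1 on success, 0 where a real
 * reschedule would fail its "task is in task_list" assertion. */
static int qstart(struct runqueue *rq, struct syncop *so) {
    assert(rq != NULL && so != NULL);
    if (so->s_qtask != NULL)
        return rq_contains(rq, so->s_qtask);   /* "reschedule" path */
    rq_insert(rq, &so->storage);               /* "insert" path */
    so->s_qtask = &so->storage;
    return 1;
}

/* Models the task removing itself from the queue while it runs.
 * clear_ptr selects whether the cached pointer is reset afterwards. */
static void qtask_done(struct runqueue *rq, struct syncop *so, int clear_ptr) {
    rq_remove(rq, so->s_qtask);
    if (clear_ptr)
        so->s_qtask = NULL;                    /* the proposed change */
}

/* Run start, self-removal, start again; report whether the second
 * start succeeds. */
int simulate(int apply_fix) {
    struct runqueue rq = { NULL };
    struct syncop so = { NULL, { NULL } };
    if (!qstart(&rq, &so)) return 0;
    qtask_done(&rq, &so, apply_fix);
    return qstart(&rq, &so);
}
```

In this model, simulate(0) leaves the stale pointer in place, so the second
start takes the reschedule path for a task that is no longer queued and
returns 0. simulate(1) clears the pointer, so the second start re-inserts
the task and returns 1. Whether the real code behaves the same way depends
on details of syncprov.c and rq.c that the toy does not capture.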

Unfortunately, I can't confirm it fixes the bug since I can't reproduce 
it... For those who understand the logic behind this, does this make any 
sense? :)

Regards,
Jonathan

-- 
--------------------------------------------------------------
Jonathan Clarke - jonathan@phillipoux.net
--------------------------------------------------------------
Ldap Synchronization Connector (LSC) - http://lsc-project.org
--------------------------------------------------------------

Comment 2 Howard Chu 2009-06-02 16:23:19 UTC

Ah, you want rev 1.249 of syncprov.c. Closing this as a dup of ITS#5776.

Of course, all of this code has been removed from RE24 as of 1.265.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Comment 3 Howard Chu 2009-06-02 16:23:45 UTC
changed notes
changed state Open to Closed
Comment 4 Jonathan 2009-06-03 07:22:09 UTC
On 02.06.2009 18:23, Howard Chu wrote:
> Ah, you want rev 1.249 of syncprov.c. Closing this as a dup of ITS#5776.

Indeed, that's great. Thanks a lot!

> Of course, all of this code has been removed from RE24 as of 1.265.

Will this patch make it into RE23 for a possible maintenance release of 2.3?

Regards,
Jonathan

Comment 5 Quanah Gibson-Mount 2009-06-03 17:00:38 UTC
--On Wednesday, June 03, 2009 7:22 AM +0000 jonathan@phillipoux.net wrote:


> Will this patch make it into RE23 for a possible maintenance release of
> 2.3?

There will be no further releases of OpenLDAP 2.3.  You should at this 
point be working on migrating to the OpenLDAP 2.4 release.

--Quanah


--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra ::  the leader in open source messaging and collaboration

Comment 6 OpenLDAP project 2014-08-01 21:03:38 UTC
dup #5776