Re: (ITS#6158) syncprov: assert causing slapd to core dump
On 02.06.2009 12:28, jonathan@phillipoux.net wrote:
> Full_Name: Jonathan Clarke
> Version: 2.3.43
> OS: Solaris
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (213.41.243.192)
>
>
> Hi,
>
> I have a 2.3.43 running on a Solaris Sparc server, which crashes occasionally -
> once every week or two, always during the night. At this particular time a large
> number of operations are performed, including mass deletes and adds. I haven't
> been able to reproduce this bug, just watch it happen on the production server
> every now and again...
>
> I managed to obtain a coredump, and a backtrace (at the end of this message). I
> realize this isn't much to go on, but I'm rather unfamiliar with this part of
> the code, so I wondered if anyone has an idea what's going on here?
>
> FWIW, the dynlist and chain overlays are in use on the server, and the database
> is bdb, with a syncrepl consumer as well as syncprov overlay.
>
>
> Backtrace follows:
> 8<-------------------------------------------------------------
> Thread 1 (process 1054014 ):
> #0 0xfee4aa58 in _lwp_kill () from /lib/libc.so.1
> #1 0xfede5a64 in raise () from /lib/libc.so.1
> #2 0xfedc1954 in abort () from /lib/libc.so.1
> #3 0xfedc1b90 in _assert () from /lib/libc.so.1
> #4 0xff30ef44 in ldap_pvt_runqueue_resched (rq=0x16c630, entry=0xee6c0a0,
> defer=0) at rq.c:165
> #5 0xfe7f4a94 in syncprov_qstart (so=0x10acb540) at syncprov.c:933
> #6 0xfe7f4d6c in syncprov_qresp (opc=0x1b1bfaf8, so=0x10acb540, mode=2) at
> syncprov.c:982
> #7 0xfe7f5aa4 in syncprov_matchops (op=0xf6bffa50, opc=0x1b1bfaf8, saveit=0) at
> syncprov.c:1175
> #8 0xfe7f7490 in syncprov_op_response (op=0xf6bffa50, rs=0xf6bff644) at
> syncprov.c:1561
> #9 0x000575cc in ?? ()
> #10 0x000575cc in ?? ()
> 8<-------------------------------------------------------------
>
> Thanks in advance for any pointers!
>
OK, I've spent some more time trying to understand this part of
syncprov.c. As far as I can tell:
- the assert failure in ldap_pvt_runqueue_resched happens because
syncprov_qstart tries to "reschedule" a task that is no longer in
the task_list
- the only place the task is removed from the task_list (via
ldap_pvt_runqueue_remove) is while the task is running, in
syncprov_qtask, when syncprov_qplay returns a non-zero value
- the next time syncprov_qstart is called, it finds that "so->s_qtask"
is still non-NULL and tries to reschedule the task, even though it is
no longer in the task_list.
I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask,
just after removing the task from the task_list. That way, the next time
syncprov_qstart is called, it goes through ldap_pvt_runqueue_insert
instead of trying to reschedule. The patch is attached.
Unfortunately, I can't confirm that it fixes the bug, since I can't
reproduce it. For those who understand the logic behind this, does this
make any sense? :)
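To make the suspected sequence concrete, here is a small stand-alone toy
model in C. It is only a sketch of my reading of the code: the structs and
the helpers (rq_insert, rq_remove, rq_contains, qstart, qtask_bail_out,
simulate) are hypothetical stand-ins, not the real runqueue or syncprov
code, and the model reports the condition that the assert would check
rather than aborting.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical), NOT the real OpenLDAP structures. */
struct task { struct task *next; };
struct runqueue { struct task *task_list; };
struct syncops { struct task *s_qtask; };

/* Return 1 if task t is currently linked into the queue's task list. */
static int rq_contains(const struct runqueue *rq, const struct task *t)
{
    for (const struct task *p = rq->task_list; p != NULL; p = p->next)
        if (p == t)
            return 1;
    return 0;
}

/* Link task t at the head of the task list. */
static void rq_insert(struct runqueue *rq, struct task *t)
{
    assert(t != NULL);
    t->next = rq->task_list;
    rq->task_list = t;
}

/* Unlink task t from the task list, if present. */
static void rq_remove(struct runqueue *rq, struct task *t)
{
    for (struct task **pp = &rq->task_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == t) {
            *pp = t->next;
            t->next = NULL;
            return;
        }
    }
}

/* Models the start path: reschedule if a task pointer is remembered,
 * otherwise insert a new one. Returns 0 where a reschedule would find
 * the task missing from the list (the assert condition), 1 otherwise. */
static int qstart(struct runqueue *rq, struct syncops *so, struct task *t)
{
    if (so->s_qtask != NULL)
        return rq_contains(rq, so->s_qtask);
    rq_insert(rq, t);
    so->s_qtask = t;
    return 1;
}

/* Models the "bail out on any error" branch of the task function.
 * clear_on_remove = 1 corresponds to the proposed one-line change. */
static void qtask_bail_out(struct runqueue *rq, struct syncops *so,
                           int clear_on_remove)
{
    rq_remove(rq, so->s_qtask);
    if (clear_on_remove)
        so->s_qtask = NULL;
}

/* Run the sequence start -> bail out -> start again.
 * Returns the result of the second start. */
int simulate(int clear_on_remove)
{
    struct runqueue rq = { NULL };
    struct syncops so = { NULL };
    struct task t = { NULL };

    if (!qstart(&rq, &so, &t))
        return 0;
    qtask_bail_out(&rq, &so, clear_on_remove);
    return qstart(&rq, &so, &t);
}
```

In this model, simulate(0) returns 0 (the stale pointer leads to a
reschedule of a task that is not in the list), while simulate(1) returns 1
(the pointer was cleared, so the second start inserts the task again).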
Regards,
Jonathan
--
--------------------------------------------------------------
Jonathan Clarke - jonathan@phillipoux.net
--------------------------------------------------------------
Ldap Synchronization Connector (LSC) - http://lsc-project.org
--------------------------------------------------------------
Attached patch: patch-syncprov-20090602.patch
Index: servers/slapd/overlays/syncprov.c
===================================================================
RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/overlays/syncprov.c,v
retrieving revision 1.56.2.51
diff -u -p -r1.56.2.51 syncprov.c
--- servers/slapd/overlays/syncprov.c 9 Jul 2008 20:53:13 -0000 1.56.2.51
+++ servers/slapd/overlays/syncprov.c 2 Jun 2009 15:57:21 -0000
@@ -908,6 +908,7 @@ syncprov_qtask( void *ctx, void *arg )
} else {
/* bail out on any error */
ldap_pvt_runqueue_remove( &slapd_rq, rtask );
+ if ( so ) so->s_qtask = NULL;
}
ldap_pvt_thread_mutex_unlock( &slapd_rq.rq_mutex );