Full_Name: Jonathan Clarke
Version: 2.3.43
OS: Solaris
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (213.41.243.192)

Hi,

I have a 2.3.43 running on a Solaris Sparc server, which crashes occasionally - once every week or two, always during the night. At this particular time a large number of operations are performed, including mass deletes and adds. I haven't been able to reproduce this bug, just watch it happen on the production server every now and again...

I managed to obtain a coredump, and a backtrace (at the end of this message). I realize this isn't much to go on, but I'm rather unfamiliar with this part of the code, so I wondered if anyone has an idea what's going on here?

FWIW, the dynlist and chain overlays are in use on the server, and the database is bdb, with a syncrepl consumer as well as syncprov overlay.

Backtrace follows:
8<-------------------------------------------------------------
Thread 1 (process 1054014 ):
#0 0xfee4aa58 in _lwp_kill () from /lib/libc.so.1
#1 0xfede5a64 in raise () from /lib/libc.so.1
#2 0xfedc1954 in abort () from /lib/libc.so.1
#3 0xfedc1b90 in _assert () from /lib/libc.so.1
#4 0xff30ef44 in ldap_pvt_runqueue_resched (rq=0x16c630, entry=0xee6c0a0, defer=0) at rq.c:165
#5 0xfe7f4a94 in syncprov_qstart (so=0x10acb540) at syncprov.c:933
#6 0xfe7f4d6c in syncprov_qresp (opc=0x1b1bfaf8, so=0x10acb540, mode=2) at syncprov.c:982
#7 0xfe7f5aa4 in syncprov_matchops (op=0xf6bffa50, opc=0x1b1bfaf8, saveit=0) at syncprov.c:1175
#8 0xfe7f7490 in syncprov_op_response (op=0xf6bffa50, rs=0xf6bff644) at syncprov.c:1561
#9 0x000575cc in ?? ()
#10 0x000575cc in ?? ()
8<-------------------------------------------------------------

Thanks in advance for any pointers!

Regards,
Jonathan
On 02.06.2009 12:28, jonathan@phillipoux.net wrote:
> [...]
> #4 0xff30ef44 in ldap_pvt_runqueue_resched (rq=0x16c630, entry=0xee6c0a0, defer=0) at rq.c:165
> #5 0xfe7f4a94 in syncprov_qstart (so=0x10acb540) at syncprov.c:933
> [...]

OK, I've spent some more time trying to understand this part of syncprov.c.
From what I understand:

- the assert failure in ldap_pvt_runqueue_resched is caused by the fact that syncprov_qstart is trying to "reschedule" a task that is no longer in the task_list
- the only time the task is removed from the task_list (via ldap_pvt_runqueue_remove) is when the task is being run, in syncprov_qtask, if syncprov_qplay returns !=0
- the next time syncprov_qstart is called, it finds that "so->s_qtask" is not NULL and tries to reschedule the task, but the task is no longer in the task_list.

I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask, just after removing the task from the task_list, so that when syncprov_qstart is called again, it goes into ldap_pvt_runqueue_insert... The patch is attached.

Unfortunately, I can't confirm it fixes the bug since I can't reproduce it... For those who understand the logic behind this, does this make any sense? :)

Regards,
Jonathan

--
--------------------------------------------------------------
Jonathan Clarke - jonathan@phillipoux.net
--------------------------------------------------------------
Ldap Synchronization Connector (LSC) - http://lsc-project.org
--------------------------------------------------------------
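The failure mode described in the message above can be sketched with a small stand-alone model in C. This is not the OpenLDAP source and not the attached patch: every type and function name below (rq_task, runqueue, sync_op, rq_insert, rq_remove, rq_resched, qstart, qtask_done) is a hypothetical stand-in, and rq_resched returns -1 at the point where, per the backtrace, the real ldap_pvt_runqueue_resched trips an assert. The model only assumes what the message states: a task is removed from the task list while a cached pointer to it is left set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins. These are NOT the OpenLDAP types. */
typedef struct rq_task { struct rq_task *next; } rq_task;
typedef struct runqueue { rq_task *task_list; } runqueue;
typedef struct sync_op { rq_task *s_qtask; } sync_op; /* models so->s_qtask only */

/* Return 1 if task t is linked into the queue's task list, else 0. */
int rq_contains(const runqueue *rq, const rq_task *t)
{
    const rq_task *e;
    for (e = rq->task_list; e != NULL; e = e->next)
        if (e == t)
            return 1;
    return 0;
}

/* Link a caller-owned task at the head of the task list. */
void rq_insert(runqueue *rq, rq_task *t)
{
    assert(t != NULL);
    t->next = rq->task_list;
    rq->task_list = t;
}

/* Unlink a task from the task list; the model never frees memory. */
void rq_remove(runqueue *rq, rq_task *t)
{
    rq_task **p;
    for (p = &rq->task_list; *p != NULL; p = &(*p)->next) {
        if (*p == t) {
            *p = t->next;
            t->next = NULL;
            return;
        }
    }
}

/* Model of "reschedule": only valid for a task still in the list.
 * Returns -1 where the real code in the backtrace would assert. */
int rq_resched(runqueue *rq, rq_task *t)
{
    return rq_contains(rq, t) ? 0 : -1;
}

/* Model of the start path: reschedule if a task pointer is cached,
 * otherwise insert a new task and cache its pointer. */
int qstart(runqueue *rq, sync_op *so, rq_task *storage)
{
    if (so->s_qtask != NULL)
        return rq_resched(rq, so->s_qtask);
    rq_insert(rq, storage);
    so->s_qtask = storage;
    return 0;
}

/* Model of the task finishing: remove it from the list and, if
 * clear_pointer is set (the proposed change), reset the cached pointer. */
void qtask_done(runqueue *rq, sync_op *so, int clear_pointer)
{
    rq_remove(rq, so->s_qtask);
    if (clear_pointer)
        so->s_qtask = NULL;
}
```

In this model, calling qstart, then qtask_done with clear_pointer set to 0, then qstart again yields -1 from the second qstart, because the cached pointer refers to a task that is no longer in the list. With clear_pointer set to 1, the second qstart takes the insert path and returns 0.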
jonathan@phillipoux.net wrote:
> [...]
> I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask,
> just after removing the task from the task_list. So that when
> syncprov_qstart is called again, it goes into
> ldap_pvt_runqueue_insert... The patch is attached.
>
> Unfortunately, I can't confirm it fixes the bug since I can't reproduce
> it... For those who understand the logic behind this, does this make any
> sense? :)

Ah, you want rev 1.249 of syncprov.c. Closing this as a dup of ITS#5776.

Of course, all of this code has been removed from RE24 as of 1.265.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
changed notes
changed state Open to Closed
On 02.06.2009 18:23, Howard Chu wrote:
> jonathan@phillipoux.net wrote:
>> [...]
>
> Ah, you want rev 1.249 of syncprov.c. Closing this as a dup of ITS#5776.

Indeed, that's great. Thanks a lot!

> Of course, all of this code has been removed from RE24 as of 1.265.

Will this patch make it into RE23 for a possible maintenance release of 2.3?

Regards,
Jonathan
--On Wednesday, June 03, 2009 7:22 AM +0000 jonathan@phillipoux.net wrote:

> Will this patch make it into RE23 for a possible maintenance release of
> 2.3?

There will be no further releases of OpenLDAP 2.3. You should at this point be working on migrating to the OpenLDAP 2.4 release.

--Quanah

--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
dup #5776