slurpd segmentation fault - a desperate call for help



I'm sitting here with my web front-end all built, my schema designed, my
database populated and everything "slapd" ready to go.  I ran the system
in test mode for weeks and everything worked flawlessly.  But now that
I've put the system in a quasi-production mode, with lots of data being
updated and replicated, I've hit a brick wall.  I can't put OpenLDAP in
full production because I can't figure out why slurpd keeps dying from a
segmentation fault.

On our system, slurpd should replicate to two slaves.  If I run slurpd
in debug mode, the segmentation fault usually happens just after the
debug statement "re-write on-disk replication log" is printed to the
screen.  In the source file 'rq.c', this statement comes just before
Rq_write tries to write the contents of the replication queue to a file.

I'm pulling my hair out on this one because I've tried everything I can
think of and still can't determine what is causing this problem.  Worse,
slurpd is hardly replicating anything now because of the frequency of
the segmentation faults.  

If I clear out the /usr/local/var/openldap-slurpd/replica/slurpd.replog
file, things seem to work better for a while.  But as that file gets
larger, the segmentation faults happen more and more frequently and
updates to the slaves stop altogether.  Interestingly, once things reach
this state, the failed replications do not show up in the
172.16.41.23.rej file, but they do show up in the slurpd.replog file:
not as failures but, I assume from the very fact that they are in that
file, as successes.

I shall be forever grateful to anyone who can help solve this dilemma
since I can't go into full production until it is solved.  Meanwhile, if
anyone knows the answer to any of the following, it may help me find the
solution:

1) Is slurpd.replog the file that rq.c is trying to write the contents
of the replication queue to?

2) Is the "replogfile" referenced in slapd.conf the replication queue
whose contents rq.c is trying to write to that file?

3) Why does slurpd continually read the contents of the slurpd.replog
file?  And what does it mean when it reads an entry and says, in the
debug output:

    "Replica 172.16.44.23:389, skip repl record for uid=sam,ou=people,dc=crazy,dc=com (not mine)"

What does "(not mine)" mean?  Sometimes, instead of "(not mine)", it
says other things like "(old)".  What is slurpd doing with this file
besides just writing replicated logs to it?  And should I be concerned
about all the skipped records?

4) Some strings that slurpd displays in debug mode don't appear anywhere
in the source files.  One such string is "ldap_msgfree".  There are
times when this is the last message output from debug before a
segmentation fault.  Does anyone know anything about this string and
where it comes from?

5) Is there anywhere I can get other help interpreting the debug output
from slurpd?  Is there anywhere (besides reading the code) that I can
get details about the operation of slurpd?

Thanks for any help anyone can offer,

Mike