
Re: null_callbacks after initial sync



Howard Chu wrote:
> Nick Geron wrote:
>> Howard Chu wrote:
>>> Nick Geron wrote:
>>>> We're now thinking some of our issues may be attributable to time
>>>> granularity issues.  We're seeing missing information on the
>>>> consumer if
>>>> multiple successive writes are attempted via a script.  If we slow
>>>> down
>>>> to human speed or insert sleeps in our test code, this gets a little
>>>> better.  I see that A.2.4 N-Way MultiMaster Replication notes that
>>>> entryCSNs now record with microseconds, but does this apply to mirrors
>>>> as well?
>>> CSNs were extended to microsecond resolution only for the benefit of
>>> conflict resolution. For all other purposes, the changecount field
>>> ensures sufficient granularity.
>> In that case, why do we see any difference in propagation between
>> scripted (quick) updates and hand/command line (slow) modifications?  Or
>> are you simply saying time is not the issue?
>
> Timestamps are not the issue for propagation.
>
>> For example - manipulating one particular entry:
>>
>> 1) update server 1 adding 1 attribute = propagates to second server
>> * wait a few seconds
>> 2) update server 1 adding 4 attributes = first of four propagates to
>> second server
>>
>> After waiting a second or so, another successful operation on the
>> 'write' server will propagate all modifications over to the second
>> server as expected.  This behavior is why we suspected a time
>> granularity issue.  It should be noted that this doesn't work for us
>> (and others, I would expect), as there is no guarantee that another
>> operation on the 'write' server will occur, thereby propagating the
>> current entry.
>
> OK, this sounds like the background thread to propagate updates isn't
> getting scheduled when it should. That could be a bug in the syncprov
> overlay.

Should I file a report, and if so, what information is required from
this end?
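In the meantime, here's roughly the script that triggers it for us
(hosts, DNs, and credentials below are placeholders, not our real
values):

#!/bin/sh
# Sketch of the propagation test: one write, a short pause, then a
# burst of writes with no pauses.
H1=ldap://server1.example.com
H2=ldap://server2.example.com
BINDDN="cn=admin,dc=example,dc=com"
DN="uid=test,ou=people,dc=example,dc=com"

# 1) single update -- this one shows up on the second server
ldapmodify -x -H "$H1" -D "$BINDDN" -w secret <<EOF
dn: $DN
changetype: modify
add: description
description: first
EOF

sleep 3

# 2) four updates back-to-back -- only the first reliably appears on
#    the second server until some later write comes along
for i in 1 2 3 4; do
ldapmodify -x -H "$H1" -D "$BINDDN" -w secret <<EOF
dn: $DN
changetype: modify
add: description
description: batch-$i
EOF
done

# compare the entry on both servers
ldapsearch -x -H "$H1" -b "$DN" -s base description
ldapsearch -x -H "$H2" -b "$DN" -s base description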
>
>>>> Can I setup a two node N-Way?
>>> "2" is certainly a valid value of "N".
>> Well, there's that developer 'charm' I've been reading throughout years
>> of archives.  Since the admin doc makes a distinction between the 'hybrid
>> configuration' of MirrorMode and N-Way Multi-Master, I was looking for
>> clarification on the differences between the two implementations.
>
> Then that is what you should have asked. "Looking for clarification
> between the implementation of MirrorMode and Multi-Master" is a much
> clearer question than "Can I setup a two node N-Way", and there is no
> way one could logically get from the latter to the former, based on
> the context of your email. If you don't ask useful questions, you have
> only yourself to blame when you don't get useful answers.
>
> There is no difference now between the MirrorMode and Multi-Master
> code. The only difference is purely a matter of usage. In a MirrorMode
> setup you use an external frontend that guarantees that writes are
> only directed to one server. As long as that guarantee is kept, your
> servers will have perfect data consistency. In a Multi-Master setup,
> you allow writes to any server, and the data consistency is not
> guaranteed. In that case the CSNs are used for conflict resolution;
> when competing writes are made to the same entries the last writer
> wins. (Note - the servers will all eventually converge on a consistent
> view of the data, the issue is that the resulting data may not
> resemble what you expected. If your servers' clocks are not tightly
> synchronized, it's pretty certain to be different from what you
> expected.)
>

I would disagree with the assumption that the thread itself did not
provide enough context, but it is silly to belabor this point.  Thank
you for clearing that up for me. 
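For the archives, our mirror pair is set up along these lines (the
serverID, hosts, suffix, and credentials are placeholders; each node
points its syncrepl consumer at the other):

# slapd.conf fragment, node 1 (node 2 uses serverID 2 and points
# provider= back at server1)
serverID  1

database  hdb
suffix    "dc=example,dc=com"
directory /var/lib/ldap

syncrepl  rid=001
          provider=ldap://server2.example.com
          type=refreshAndPersist
          searchbase="dc=example,dc=com"
          bindmethod=simple
          binddn="uid=syncrepl,ou=ldap,dc=example,dc=com"
          credentials=secret
          retry="5 +"

mirrormode on

overlay   syncprov

In our case an external frontend directs all writes to one node, which
is what keeps the single-writer guarantee described above.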
>>> Syncrepl doesn't write session logs. Read RFC4533.
>>>
>> I'll look into it.  Thanks.
>
>> Switching gears, what would the devs say are the achievable operations
>> per second with 2.4.7?
>
> I've recently run back-hdb with a 5GB database: 20,000
> indexed searches/second concurrent with 13,000 modifies/second on an 8
> core Opteron server (1.9GHz cores). This was tested using slamd and
> ~80 client threads, sustained over a 2 hour run.
>
Thanks for the info.  That would certainly point to a problem with my
build environment, or a new bug.
>> I'm seeing a number of aborts when
>> testing under high load.  The latest came from running scripted
>> ldapsearches and ldapmodifies which resulted in a mutex error (or so I
>> am told by one of our developers).
>>
>> Specifically:
>>
>> 1) adding about 100 attributes to an entry
>> 2) diffing the output of ldapsearch between the two nodes in loop
>> 3) once synced, grabbing the attributes, shoving them in a temp file
>> with delete instructions and using that with ldapmodify.
>>
>> I compiled with debugging on, which results in an abort with
>> "connection.c: 676: connection_state_closing: Assertion 'c_struct_state
>> == 0x02' failed" logged.
>
> Interesting. It would be useful to get a gdb stack trace from that
> situation.
>
Yesterday I was able to reproduce this behavior at least
three times.  This morning, I was able to reproduce it with the above
steps yet again.  From a gdb session, no backtrace was available,
however.  I then recompiled with debugging enabled and was unable to
reproduce the bug until I added '-d 7' to the run arguments.  It should
be noted that before recompiling, I was able to reproduce the behavior
with and without the command line debug argument.
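For reference, the test loop is essentially this (hosts, DNs, and
credentials are placeholders; step 3 is simplified here -- our real
script writes the fetched values into a temp LDIF with delete
instructions, as described above):

#!/bin/bash
# Sketch of the abort reproduction: bulk add, poll until both nodes
# agree, then delete everything again.
H1=ldap://server1.example.com
H2=ldap://server2.example.com
BINDDN="cn=admin,dc=example,dc=com"
DN="uid=test,ou=people,dc=example,dc=com"

# 1) add ~100 values to the entry in one modify
( echo "dn: $DN"
  echo "changetype: modify"
  echo "add: description"
  for i in $(seq 1 100); do echo "description: value-$i"; done
) | ldapmodify -x -H "$H1" -D "$BINDDN" -w secret

# 2) diff ldapsearch output between the two nodes until they match
until diff <(ldapsearch -x -LLL -H "$H1" -b "$DN" -s base) \
           <(ldapsearch -x -LLL -H "$H2" -b "$DN" -s base) >/dev/null
do
  sleep 1
done

# 3) once synced, delete the attribute again via ldapmodify
ldapmodify -x -H "$H1" -D "$BINDDN" -w secret <<EOF
dn: $DN
changetype: modify
delete: description
EOF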

Here's the stack trace from a gdb session with arguments -h 'ldap:///
ldaps:///' -d 7:

=> acl_string_expand: expanded:
uid=[^,]+,ou=employees,ou=people,dc=example,dc=com
=> regex_matches: string:        uid=syncrepl,ou=ldap,dc=example,dc=com
=> regex_matches: rc: 1 no matches
slapd: connection.c:676: connection_state_closing: Assertion
`c->c_struct_state == 0x02' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 1124096320 (LWP 7301)]
0x0000003918230055 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003918230055 in raise () from /lib64/libc.so.6
#1  0x0000003918231af0 in abort () from /lib64/libc.so.6
#2  0x0000003918229756 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000042d345 in connection_state_closing ()
#4  0x000000000043d43b in slap_freeself_cb ()
#5  0x000000000043ef81 in slap_send_search_entry ()
#6  0x00000000004c88c4 in syncprov_initialize ()
#7  0x00002aaaaaabc1c7 in ldap_int_thread_pool_wrapper
(xpool=0x1a0b06d0) at tpool.c:625
#8  0x00000039196062f7 in start_thread () from /lib64/libpthread.so.0
#9  0x00000039182ce85d in clone () from /lib64/libc.so.6
(gdb)
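If more detail would help, I can re-run it under gdb and capture all
threads the next time it aborts, along the lines of:

(gdb) info threads
(gdb) thread apply all bt full

Just say the word and I'll attach the full output.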

Environment: CentOS 5, updated to whatever RH thinks is current as of
last week.  Oracle Berkeley DB 4.5.20 and OpenLDAP 2.4.7 compiled by hand.

Please let me know if any further information would be helpful.

-Nick