Re: (ITS#6275) syncrepl taking long(not sync) when consumer not connect for a moment


Please see my comments inline in your previous e-mail below.



Quanah Gibson-Mount wrote:
> --On Thursday, August 27, 2009 6:39 AM -0700 Rodrigo Costa 
> <rlvcosta@yahoo.com> wrote:
>> Quanah,
>> Please see my answers in your previous e-mail below.
>> I'm also attaching the information I could collect, since it is a
>> small file (5KB).
>> The behavior that appears strange, and that could indicate a problem, is
>> that even when the consumer is stopped the provider keeps doing
>> something for a long time. This doesn't appear to be correct.
>> Another strange behavior is that when the system enters this state, one
>> provider CPU stays at around 100% usage. I made a jmeter script
>> to test individual bind/search operations (no ldapsearch *), and even with
>> some load (like 200 simultaneous queries) I do not see the CPU at 100%.
>> Something doesn't appear to be OK, since I do not see why the CPU should
>> stay at 100% permanently.
> I explained to you previously why this would be.  Other comments inline.
>>> Why are you stopping the provider to do a slapcat?
>> [Rodrigo] Faster dump of data. And in any case, if another situation
>> such as a problem occurs, the secondary system could stay disconnected
>> for other reasons.
> [Rodrigo] I have 2 reasons:
1) Since the backup takes some time and the DB has multiple branches for the 
same record, the only way to have a consistent backup is to run a cold backup;
2) slapcat against a stopped slapd could run faster and also fulfills item 1 
above (cold backup).

> Do you have any evidence that an offline slapcat is faster than one 
> while slapd is running?  I don't understand what you mean in the rest 
> of that sentence.
> [Rodrigo] I didn't try it under traffic load, but it seems reasonable 
> that a cold backup is faster and cleaner than a hot backup.
>>>> Even if only a small number of entries are different when the consumer
>>>> in Provider 2
>>>> connects to Provider 1, syncrepl enters the full DB search as
>>>> expected.
>>> What is your sessionlog setting on each provider for the syncprov
>>> overlay?
>> [Rodrigo]
>> syncprov-checkpoint 10000 120
>> syncprov-sessionlog 100000
> Hm, I would probably checkpoint the cookie a lot more frequently than 
> you have it set to.  The sessionlog setting seems fine to me.
[Rodrigo] Ok
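As a sketch only, a more frequent checkpoint could be configured as below. 
The values 100 operations / 10 minutes are an assumption chosen for 
illustration and were not suggested in this thread; the sessionlog line is 
the one already posted above:

    overlay syncprov
    # checkpoint the contextCSN after 100 write operations or 10 minutes,
    # whichever comes first (illustrative values, assumed)
    syncprov-checkpoint 100 10
    # unchanged
    syncprov-sessionlog 100000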
>> Same configuration in both systems.
>>>> For context, I have some memory limitations, so I need to
>>>> limit dncachesize to around 80% of the DB entries.
>>> We already went through other things you could do to reduce your
>>> memory footprint in other ways.  You've completely ignored that
>>> advice.  As long as your dncachesize is in this state, I don't expect
>>> things to behave normally.
>> [Rodrigo] I implemented what was possible. The end result is this cache
>> config, which is what the memory constraints allow:
>> # Cache values
>> # cachesize       10000
>> cachesize       20000
>> dncachesize     3000000
>> # dncachesize    400000
>> # idlcachesize    10000
>> idlcachesize    30000
>> # cachefree       10
>> cachefree       100
> You don't say anything in here about your DB_CONFIG settings, which is 
> where you could stand to gain the most memory back.  I do 
> see you're definitely running a very restricted 
> cachesize/idlcachesize. ;)
> [Rodrigo] DB_CONFIG uses only 100MB of memory and sets DB_LOG_AUTOREMOVE.
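For reference, a DB_CONFIG matching that description might look like the 
sketch below. It assumes the 100MB refers to the BDB cache (set_cachesize) 
held in a single segment; the actual file is not shown in this thread:

    # DB_CONFIG (sketch, values assumed)
    # 0 GB + 104857600 bytes (100MB) of BDB cache, in 1 segment
    set_cachesize 0 104857600 1
    # automatically remove transaction log files that are no longer needed
    set_flags DB_LOG_AUTOREMOVE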
>>> What value did you set for "cachefree"?
>> [Rodrigo] cachefree       100
> [Rodrigo] I made the proposed change and tested it. The behavior was 
> really better, since after dncachesize was filled the issue did not 
> repeat as before.
BUT it just took more time until the behavior repeated. After some more 
time, just after dncachesize reached around 3 million, the behavior 
returned. What happens is:
1-> Provider 1 CPU starts to consume around 100%;
2-> Consumer 2 CPU goes to 0% consumption (before, it was around 10% while 
replication was in place);
3-> Replication never ends (I cannot see the data in Provider 2), and even 
if I stop Consumer 2 (or slapd), the CPU in Provider 1 remains at 100% for days.

It looks like the code enters a dead loop, and I could not identify the 
condition that triggers it or what is required to avoid it. I generated some 
GDB traces and, as soon as possible (once there is space), I will put them 
on the ftp.
> This value is likely substantially too low for your system 
> configuration.  This is how many entries get freed from any of the 
> caches. With your dncachesize being 3,000,000, removing 100 entries 
> from it will do hardly anything, and may be part of the issue.  If it 
> wasn't for the major imbalance between your entry, idl, and 
> dncachesizes, I would suggest a fairly high value like 100,000.  But 
> given your entry cache is 20,000, you'll probably have to limit the 
> cachefree to 5000-10000.  But it is going to need to be higher than 100.
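As a sketch, the cache section with cachefree raised into the suggested 
range would look like this. The value 5000 is simply the low end of the 
5000-10000 range above, picked for illustration; the other values are the 
ones already posted:

    cachesize       20000
    dncachesize     3000000
    idlcachesize    30000
    # was 100; illustrative value from the suggested 5000-10000 range
    cachefree       5000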
> --Quanah
> -- 
> Quanah Gibson-Mount
> Principal Software Engineer
> Zimbra, Inc
> --------------------
> Zimbra ::  the leader in open source messaging and collaboration