
(ITS#6275) syncrepl taking long(not sync) when consumer not connect for a moment



Full_Name: Rodrigo Luiz Vargas Costa
Version: 2.4.17
OS: CentOS release 5.2 (Final)
URL: ftp://ftp.openldap.org/incoming/<TBD>
Submission from: (NULL) (135.245.8.5)


OpenLDAP developers,

I have been exchanging information on the OpenLDAP lists, where it looks like
some replication improvements are being made for release 2.4.18.

The architecture I'm running has 2 machines in MirrorMode on the same subnet (on
the same switch). These systems are part of an HA system sharing a VIP: both
machines run slapd simultaneously (bound to any local interface), and only the
VIP is moved between them for HA purposes.

The issue I'm facing, from a general user's point of view, occurs when I stop
the secondary Provider2 (master 2) for backup purposes using slapcat.
Provider1 (master 1) continues to provide LDAP service, so some entries can be
created while the backup is running (no consumer connection from Provider 2).

Even if only a small number of entries differ, when the consumer in Provider 2
reconnects to Provider 1, syncrepl starts a full DB search, as expected.

For the record, I have some memory limitations, so I need to limit dncachesize
to around 80% of the DB entries.
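As a rough illustration of that sizing rule, the following Python sketch derives
a dncachesize value from an entry count. The entry count used here (5,000,000)
is a hypothetical figure, not the size of the database in this report, and the
helper names are made up for the example.

```python
def dncachesize_for(total_entries: int, percent: int = 80) -> int:
    """Return a DN cache limit covering `percent` of the database entries."""
    if total_entries < 0 or not 0 <= percent <= 100:
        raise ValueError("total_entries must be >= 0 and percent within 0..100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_entries * percent // 100


def slapd_conf_line(total_entries: int, percent: int = 80) -> str:
    """Format the result as a back-bdb dncachesize directive."""
    return f"dncachesize {dncachesize_for(total_entries, percent)}"


if __name__ == "__main__":
    # Hypothetical entry count, for illustration only.
    print(slapd_conf_line(5_000_000))  # prints "dncachesize 4000000"
```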

From a user perspective, I see that after the cache is filled the system enters
a state where synchronization no longer happens. For full reference (config,
gdb, etc.), please see the file attached in FTP.

I then see 2 issues:

1) For the consumer on Provider2, even after days have passed and with only a
small number of differences created for test purposes (no traffic), syncrepl
never finishes and no replication takes place (Provider 1 stays continuously at
100% CPU);
2) Even if I stop Provider2 (and therefore its consumer), I do not see any
change in Provider 1's activity. The CPU stays at 100% even after days, which
suggests a hang in a thread or in the logic.

I compiled OpenLDAP with GDB symbols and then ran some traces on the threads
while in the state reported in issue 2 above. It looks like it keeps looping
forever, stuck on some thread lock.

I also noticed that in this situation the monitor shows the cache changing, at
a very slow pace, by a single entry. More specifically:

dn: cn=Database 1,cn=Databases,cn=Monitor
structuralObjectClass: monitoredObject
creatorsName:
modifiersName:
createTimestamp: 20090821145848Z
modifyTimestamp: 20090821145848Z
monitoredInfo: bdb
monitorIsShadow: TRUE
namingContexts: ou=CONTENT,o=domain,c=fr
readOnly: FALSE
monitorOverlay: syncprov
olmBDBEntryCache: 19920
olmBDBDNCache: 3896287
olmBDBIDLCache: 2
olmDbDirectory: /var/openldap-data/bdb1/
entryDN: cn=Database 1,cn=Databases,cn=Monitor
subschemaSubentry: cn=Subschema
hasSubordinates: TRUE

The DN cache count (olmBDBDNCache) keeps alternating between the values 3896287
and 3896288. It looks like cache memory is being reused after too short a time,
causing locks that take a long time and prevent synchronization.
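One way to watch this behaviour is to compare two samples of the monitor entry.
The Python sketch below parses the olmBDB*Cache attributes from entry text and
reports how much each counter moved. How the text is captured (for example, by
searching cn=Monitor) is not shown, the function names are hypothetical, and the
second sample assumes that only the DN cache value changed.

```python
def parse_cache_counters(ldif_text: str) -> dict:
    """Extract the integer olmBDB*Cache attributes from monitor entry text."""
    counters = {}
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.startswith("olmBDB") and name.endswith("Cache"):
            counters[name] = int(value.strip())
    return counters


def cache_deltas(before: dict, after: dict) -> dict:
    """Return after - before for every counter present in both samples."""
    return {name: after[name] - before[name] for name in before if name in after}


if __name__ == "__main__":
    # First sample: values taken from the monitor entry quoted above.
    sample_a = (
        "olmBDBEntryCache: 19920\n"
        "olmBDBDNCache: 3896287\n"
        "olmBDBIDLCache: 2\n"
    )
    # Second sample: assumed for illustration, only the DN cache differs.
    sample_b = sample_a.replace("3896287", "3896288")
    print(cache_deltas(parse_cache_counters(sample_a),
                       parse_cache_counters(sample_b)))
```

A DN cache delta that stays at +1 or -1 across many samples, while the other
counters do not move, would match the behaviour described here.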

I made several GDB traces under different conditions. Please see the FTP
attachment for details.

Thanks,

Rodrigo.

PS: I could not upload the file to the OpenLDAP FTP server; it reports that the
device is full. Please let me know how I can send this file.