
Re: issue with bad data ? In MMR setup



Daniel Jung wrote:
hi folks,

Ran into the following on the slaves while replicating:
  mdb_id2entry_put: mdb_put failed: MDB_PAGE_FULL: Internal error - page has
no more space(-30786)
null_callback : error code 0x50
syncrepl_entry: rid=407 be_modify failed (80)

This should never happen. Unfortunately some earlier LMDB releases had bugs related to delete operations that might trigger this.

The issue I posted previously happened while replicating the same entry that
triggers this new problem.

I don't think it is a coincidence that sync replication failed while modifying
a specific DN.
This issue was visible on only some of the slaves, not all of them.

Any idea how I could go about troubleshooting this?  I made manual
changes to this specific DN and replication works without issue.


On Sat, Jun 14, 2014 at 3:56 AM, Daniel Jung <mimianddaniel@gmail.com> wrote:

    Hi,

    The LDAP daemon was being restarted every few minutes. All the consumers
    were out of sync and had to be re-synced. The master in question, part of
    an MMR setup, was restored from another master and the issue went away.
    We are running 2.4.37 on CentOS 6 with the hdb backend on the masters and
    lmdb on the consumers.

    Searching through the list shows a lot of hits for "too old". AFAIK the
    clocks are kept closely in sync with NTP. serverID "000" no longer exists,
    as it was decommissioned last year, hence its contextCSN is really old.
    Not sure if that played a role in this havoc or not.  Could you tell me
    what "srs" and "log" mean in the context below?

"srs csn" is the CSN from a consumer's cookie. "log csn" is a CSN from the syncprov session log. If serverID 000 has been decommissioned, you probably should delete its CSN from your contextCSN attribute on both consumer and provider. Since syncprov always tries to send changes to a consumer based on the oldest CSN, you're always going to be plowing through a lot of old updates with this.
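
To spot which contextCSN values belong to a retired serverID, a small script can split each CSN into its fields. The following is only a rough sketch: the helper names (parse_csn, csns_for_sid, oldest_csn) are made up for illustration, and the two sample values are copied from the log excerpt in this thread rather than from a real contextCSN attribute, so substitute the values read from your own suffix entry.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime  # UTC time of the change
    count: int           # change counter within that timestamp
    sid: int             # serverID of the server that originated the change
    mod: int             # modification number


def parse_csn(text: str) -> CSN:
    """Split 'YYYYmmddHHMMSS.ffffffZ#count#sid#mod' into its four fields."""
    stamp, count, sid, mod = text.strip().split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    # The three fields after the timestamp are hexadecimal.
    return CSN(when, int(count, 16), int(sid, 16), int(mod, 16))


def csns_for_sid(values, sid):
    """Return the CSN strings that carry the given serverID."""
    return [v for v in values if parse_csn(v).sid == sid]


def oldest_csn(values):
    """Return the CSN string with the earliest timestamp."""
    return min(values, key=lambda v: parse_csn(v).timestamp)


# Sample values taken from the log lines in this thread (illustrative only).
sample = [
    "20131226183611.000000Z#000000#000#000000",
    "20140613220035.981531Z#000000#001#000000",
]

stale = csns_for_sid(sample, 0)   # values left behind by serverID 000
print("candidates for removal:", stale)
print("oldest CSN:", oldest_csn(sample))
```

With these two values, the serverID 000 entry is both the only removal candidate and the oldest CSN, which matches the situation described above. The script only reports candidates; actually removing the value from contextCSN is a separate step on each server.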



    Following is what I found in the log, and there were a lot of these which
    probably contributed to restart of the daemon:

    Jun 14 00:05:21 name of the server  slapd[16745]: srs csn
    20131226183611.000000Z#000000#000#000000
    Jun 14 00:05:21 name of the server  slapd[16745]: log csn
    20131206192447.000000Z#000000#000#000000
    Jun 14 00:05:21 name of the server  slapd[16745]: cmp -2, too old
    Jun 14 00:05:21 name of the server  slapd[16745]: log csn
    20131206193513.000000Z#000000#000#000000
    Jun 14 00:05:21 name of the server slapd[16745]: cmp -2, too old
    </snip>
    Jun 14 00:05:59 name of the server slapd[16745]: do_syncrep2: rid=0
    01 (-1) Can't contact LDAP server
    </snip>
    Jun 14 00:06:15 name of the server slapd[16745]: log csn
    20131229125124.532456Z#000000#001#000000
    Jun 14 00:06:15 name of the server slapd[16745]: cmp -256, too old
    Jun 14 00:06:15 name of the server  slapd[16745]: log csn
    20131229125143.680121Z#000000#001#000000
    Jun 14 00:06:15 name of the server slapd[16745]: cmp -256, too old
    Jun 14 00:06:15 name of the server slapd[16745]: log csn 2013122913
    </snip>
    Jun 14 00:06:59 name of the server  slapd[31392]: do_syncrep2: rid=000
    LDAP_RES_INTERMEDIATE - SYNC_ID_SET
    Jun 14 00:06:59 name of the server slapd[31392]: do_syncrep2: rid=000
    cookie=rid=000,sid=002,csn=20140613220035.981531Z#000000#001#000000
    Jun 14 00:06:59 name of the server slapd[31392]: do_syncrep2: rid=000
    LDAP_RES_INTERMEDIATE - REFRESH_DELETE

    </snip>

    thank you




--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/