[Date Prev][Date Next] [Chronological] [Thread] [Top]

RE: Syncrepl connections failing



Thanks for the response Andrew.  See comments below

> -----Original Message-----
> From: Andrew Findlay [mailto:andrew.findlay@skills-1st.co.uk]
> Sent: Friday, April 24, 2009 4:42 AM
> To: John Kane
> Cc: openldap-technical@openldap.org
> Subject: Re: Syncrepl connections failing
> 
> On Thu, Apr 23, 2009 at 05:41:09PM -0500, John Kane wrote:
> 
> > I am having a problem with what appears (to me) to be 'stale' TCP
> > connections for syncrepl between the master and a pair of slaves.
> After
> > restarting all, I see changes on the master replicated to both
> slaves.
> > BUT, if I wait about 30 minutes or more, then make a change, the
> > replication fails (most of the time).  netstat on the LDAP port show
> the
> > connections still established, but queued packets at the master
> server.
> > After about 15 minutes, the master server drops the connection.  An
> > overnight tcpdump on the master showed LDAP occasionally sending a
> > keep-alive, with 2hrs between the keep-alive messages (these keep-
> alives
> > are inconsistent, though, some nights I see none).
> 
> > Note:  The 2 slaves are running on blades in an IBM chassis, and the
> > master is on a 1U Linux server, just 'one-hop' away.   Prior to
this,
> > when I had a master/slave pair running on the blades, syncRepl was
> > working fine for several months.  It was not until I moved the
master
> to
> > the another server did the failures start.
> 
> Do you have a firewall or NAT configured on or between any of the
> boxes? This sort of problem with long-lived connections is often due
> to state being dropped from IP-level devices.

[JK] Spoke with the network admin, and he guarantees only a single
router exists between the master and slaves, and it is wide-open, i.e.
it is just doing pure routing, no NAT or firewall that would cause this.


To debug this, I am running a couple of crons on the master server.  One
does a netstat every 5 minutes, the other does an LDAP password change
on a test user every 30 minutes.  I restart all and watch the logs.  The
first password change may (or may not) get replicated to the slaves.
Subsequent changes do not.  netstat shows the message queued for about
10-15 minutes, then the connection drops.  Then I see an LDAP keep-alive
every two hours, and this will briefly bring the connections back up,
and it goes through the whole cycle again. 

Is there a configurable keep alive for syncRepl?  I've seen other
equipment, load balancers for example, that have TCP timers that will
drop an inactive TCP connection.  On one in particular, the Array load
balancer we were testing, the max of this configurable TCP timer was
less than 2 hours.


********* Here's an excerpt from the cron logs *********

Thu Apr 30 04:45:01 UTC 2009
<-------------------------------------------------------- BOTH DOWN
Thu Apr 30 04:50:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:44364
ESTABLISHED <-- BOTH RECOVERED BY KEEPALIVE
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:38348
ESTABLISHED
Thu Apr 30 04:55:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:44364
ESTABLISHED
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:38348
ESTABLISHED
Thu Apr 30 05:00:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:44364
ESTABLISHED
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:38348
ESTABLISHED
Thu Apr 30 05:01:01 UTC 2009
New password: sNM/tVks
Result: Success (0)
modifying entry
"uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"

Thu Apr 30 05:05:01 UTC 2009
tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:44364
ESTABLISHED <-- QUEUED ON BOTH
tcp        0   2402 172.24.1.191:ldap           slave-2.dist-one:38348
ESTABLISHED
Thu Apr 30 05:10:01 UTC 2009
tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:44364
ESTABLISHED
tcp        0   2402 172.24.1.191:ldap           slave-2.dist-one:38348
ESTABLISHED
Thu Apr 30 05:15:01 UTC 2009
tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:44364
ESTABLISHED
tcp        0   2402 172.24.1.191:ldap           slave-2.dist-one:38348
ESTABLISHED
Thu Apr 30 05:20:01 UTC 2009
<-------------------------------------------------------- DROPPED BOTH
Thu Apr 30 05:25:01 UTC 2009
Thu Apr 30 05:30:01 UTC 2009
Thu Apr 30 05:31:01 UTC 2009
New password: Ajae2z9/
Result: Success (0)
modifying entry
"uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"

Thu Apr 30 05:35:01 UTC 2009
Thu Apr 30 05:40:01 UTC 2009
Thu Apr 30 05:45:01 UTC 2009
Thu Apr 30 05:50:01 UTC 2009
Thu Apr 30 05:55:01 UTC 2009
Thu Apr 30 06:00:01 UTC 2009
Thu Apr 30 06:01:01 UTC 2009
New password: j.ec4OnZ
Result: Success (0)
modifying entry
"uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"

Thu Apr 30 06:05:01 UTC 2009
Thu Apr 30 06:10:01 UTC 2009
Thu Apr 30 06:15:01 UTC 2009
Thu Apr 30 06:20:01 UTC 2009
Thu Apr 30 06:25:01 UTC 2009
Thu Apr 30 06:30:01 UTC 2009
Thu Apr 30 06:31:01 UTC 2009
New password: NtUirUuh
Result: Success (0)
modifying entry
"uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"

Thu Apr 30 06:35:01 UTC 2009
Thu Apr 30 06:40:01 UTC 2009
Thu Apr 30 06:45:01 UTC 2009
Thu Apr 30 06:50:01 UTC 2009
Thu Apr 30 06:55:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:33271
ESTABLISHED <-- BOTH RECOVERED BY KEEPALIVE
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED
Thu Apr 30 07:00:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:33271
ESTABLISHED
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED
Thu Apr 30 07:01:01 UTC 2009
New password: r/Jn0AYk
Result: Success (0)
modifying entry
"uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"

Thu Apr 30 07:05:01 UTC 2009
tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:33271
ESTABLISHED <-- QUEUED ON ONE
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED <-- SUCCESSFUL ON THE OTHER
Thu Apr 30 07:10:01 UTC 2009
tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:33271
ESTABLISHED
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED
Thu Apr 30 07:15:01 UTC 2009
tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:33271
ESTABLISHED
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED
Thu Apr 30 07:20:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED <-- DROPPED THE QUEUED ONE
Thu Apr 30 07:25:01 UTC 2009
tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
ESTABLISHED


********* LDAP log on the 1st slave (unsuccessful update until the next
2hr keep alive) *********

Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search
(0)
Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004
cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_modify
(0)
Apr 30 04:49:53 slave-1 slapd[27904]: do_syncrep2: rid 004
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Apr 30 06:49:52 slave-1 slapd[27904]: do_syncrep2: rid 004 Can't contact
LDAP server
Apr 30 06:49:52 slave-1 slapd[27904]: do_syncrepl: rid 004 retrying (9
retries left)
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search
(0)
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
o=partner_x,dc=dist-one,dc=net
<snip>
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search
(0)
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_modify
(0)
Apr 30 06:50:22 slave-1 slapd[27904]: do_syncrep2: rid 004
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Apr 30 06:53:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 06:57:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 07:45:32 slave-1 slapd[27904]: connection_read(37): no
connection!
Apr 30 07:55:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 07:56:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 07:59:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 08:17:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 08:25:32 slave-1 slapd[27904]: connection_read(33): no
connection!
Apr 30 08:31:32 slave-1 slapd[27904]: connection_read(37): no
connection!
Apr 30 08:46:32 slave-1 slapd[27904]: connection_read(37): no
connection!
Apr 30 08:50:21 slave-1 slapd[27904]: do_syncrep2: rid 004 Can't contact
LDAP server
Apr 30 08:50:21 slave-1 slapd[27904]: do_syncrepl: rid 004 retrying (9
retries left)
Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search
(0)
Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004
o=partner_x,dc=dist-one,dc=net


********* On the 2nd slave (successful update) *********

Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
(0)
Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002
cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify
(0)
Apr 30 04:49:52 slave-2 slapd[6973]: do_syncrep2: rid 002
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Apr 30 06:49:51 slave-2 slapd[6973]: do_syncrep2: rid 002 Can't contact
LDAP server
Apr 30 06:49:51 slave-2 slapd[6973]: do_syncrepl: rid 002 retrying (9
retries left)
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
(0)
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
o=partner_x,dc=dist-one,dc=net
<snip>
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
(0)
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify
(0)
Apr 30 06:50:21 slave-2 slapd[6973]: do_syncrep2: rid 002
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)
Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
(0)
Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002
uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net
Apr 30 07:01:02 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify
(0)


Thanks again,
John



This message is confidential to Prodea Systems, Inc unless otherwise indicated 
or apparent from its nature. This message is directed to the intended recipient 
only, who may be readily determined by the sender of this message and its 
contents. If the reader of this message is not the intended recipient, or an 
employee or agent responsible for delivering this message to the intended 
recipient:(a)any dissemination or copying of this message is strictly 
prohibited; and(b)immediately notify the sender by return message and destroy 
any copies of this message in any form(electronic, paper or otherwise) that you 
have.The delivery of this message and its information is neither intended to be 
nor constitutes a disclosure or waiver of any trade secrets, intellectual 
property, attorney work product, or attorney-client communications. The 
authority of the individual sending this message to legally bind Prodea Systems  
is neither apparent nor implied,and must be independently verified.