[Date Prev][Date Next] [Chronological] [Thread] [Top]

SOLVED: RE: Syncrepl connections failing



So, it appears LDAP does do TCP keepalive, and, at least in Linux, its
interval is controlled by sysctl on the server. The initial value of the
interval was 2 hrs, as I mentioned previously.  I set it to 15 minutes,
and now all is good.


Here's the changes I made on my master server (beware, this is doing
kernel changes that are global):

# sudo /sbin/sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75

OR

# sudo cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
# sudo cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75


I modified my sysctl.conf, adding the following lines:

# Change the keepalive time from the default 7200s
net.ipv4.tcp_keepalive_time = 900

# Change the keepalive interval from the default 75s
net.ipv4.tcp_keepalive_intvl = 900


Then activated the changes using:

# sudo /sbin/sysctl -p

# sudo /sbin/sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_intvl = 900


I know this does not explain why my connection were getting 'stale'
after 30 minutes, but now LDAP sends a keepalive every 15 minutes, and
my connections never get 'stale'.  

There is an excellent "TCP Keepalive HOWTO" written by Fabio Busatto
explaining the TCP keepalive; how to implement it and manipulate its
values.  This doc can be found on many, many webstites.

Thanks,
John


 

> -----Original Message-----
> From: openldap-technical-
> bounces+john.kane=prodeasystems.com@OpenLDAP.org [mailto:openldap-
> technical-bounces+john.kane=prodeasystems.com@OpenLDAP.org] On Behalf
> Of John Kane
> Sent: Thursday, April 30, 2009 11:59 AM
> To: Andrew Findlay
> Cc: openldap-technical@openldap.org
> Subject: RE: Syncrepl connections failing
> 
> Thanks for the response Andrew.  See comments below
> 
> > -----Original Message-----
> > From: Andrew Findlay [mailto:andrew.findlay@skills-1st.co.uk]
> > Sent: Friday, April 24, 2009 4:42 AM
> > To: John Kane
> > Cc: openldap-technical@openldap.org
> > Subject: Re: Syncrepl connections failing
> >
> > On Thu, Apr 23, 2009 at 05:41:09PM -0500, John Kane wrote:
> >
> > > I am having a problem with what appears (to me) to be 'stale' TCP
> > > connections for syncrepl between the master and a pair of slaves.
> > After
> > > restarting all, I see changes on the master replicated to both
> > slaves.
> > > BUT, if I wait about 30 minutes or more, then make a change, the
> > > replication fails (most of the time).  netstat on the LDAP port
> show
> > the
> > > connections still established, but queued packets at the master
> > server.
> > > After about 15 minutes, the master server drops the connection.
An
> > > overnight tcpdump on the master showed LDAP occasionally sending a
> > > keep-alive, with 2hrs between the keep-alive messages (these keep-
> > alives
> > > are inconsistent, though, some nights I see none).
> >
> > > Note:  The 2 slaves are running on blades in an IBM chassis, and
> the
> > > master is on a 1U Linux server, just 'one-hop' away.   Prior to
> this,
> > > when I had a master/slave pair running on the blades, syncRepl was
> > > working fine for several months.  It was not until I moved the
> master
> > to
> > > the another server did the failures start.
> >
> > Do you have a firewall or NAT configured on or between any of the
> > boxes? This sort of problem with long-lived connections is often due
> > to state being dropped from IP-level devices.
> 
> [JK] Spoke with the network admin, and he guarantees only a single
> router exists between the master and slaves, and it is wide-open, i.e.
> it is just doing pure routing, no NAT or firewall that would cause
> this.
> 
> 
> To debug this, I am running a couple of crons on the master server.
> One
> does a netstat every 5 minutes, the other does an LDAP password change
> on a test user every 30 minutes.  I restart all and watch the logs.
> The
> first password change may (or may not) get replicated to the slaves.
> Subsequent changes do not.  netstat shows the message queued for about
> 10-15 minutes, then the connection drops.  Then I see an LDAP keep-
> alive
> every two hours, and this will briefly bring the connections back up,
> and it goes through the whole cycle again.
> 
> Is there a configurable keep alive for syncRepl?  I've seen other
> equipment, load balancers for example, that have TCP timers that will
> drop an inactive TCP connection.  On one in particular, the Array load
> balancer we were testing, the max of this configurable TCP timer was
> less than 2 hours.
> 
> 
> ********* Here's an excerpt from the cron logs *********
> 
> Thu Apr 30 04:45:01 UTC 2009
> <-------------------------------------------------------- BOTH DOWN
> Thu Apr 30 04:50:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:44364
> ESTABLISHED <-- BOTH RECOVERED BY KEEPALIVE
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:38348
> ESTABLISHED
> Thu Apr 30 04:55:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:44364
> ESTABLISHED
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:38348
> ESTABLISHED
> Thu Apr 30 05:00:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:44364
> ESTABLISHED
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:38348
> ESTABLISHED
> Thu Apr 30 05:01:01 UTC 2009
> New password: sNM/tVks
> Result: Success (0)
> modifying entry
> "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
> 
> Thu Apr 30 05:05:01 UTC 2009
> tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:44364
> ESTABLISHED <-- QUEUED ON BOTH
> tcp        0   2402 172.24.1.191:ldap           slave-2.dist-one:38348
> ESTABLISHED
> Thu Apr 30 05:10:01 UTC 2009
> tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:44364
> ESTABLISHED
> tcp        0   2402 172.24.1.191:ldap           slave-2.dist-one:38348
> ESTABLISHED
> Thu Apr 30 05:15:01 UTC 2009
> tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:44364
> ESTABLISHED
> tcp        0   2402 172.24.1.191:ldap           slave-2.dist-one:38348
> ESTABLISHED
> Thu Apr 30 05:20:01 UTC 2009
> <-------------------------------------------------------- DROPPED BOTH
> Thu Apr 30 05:25:01 UTC 2009
> Thu Apr 30 05:30:01 UTC 2009
> Thu Apr 30 05:31:01 UTC 2009
> New password: Ajae2z9/
> Result: Success (0)
> modifying entry
> "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
> 
> Thu Apr 30 05:35:01 UTC 2009
> Thu Apr 30 05:40:01 UTC 2009
> Thu Apr 30 05:45:01 UTC 2009
> Thu Apr 30 05:50:01 UTC 2009
> Thu Apr 30 05:55:01 UTC 2009
> Thu Apr 30 06:00:01 UTC 2009
> Thu Apr 30 06:01:01 UTC 2009
> New password: j.ec4OnZ
> Result: Success (0)
> modifying entry
> "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
> 
> Thu Apr 30 06:05:01 UTC 2009
> Thu Apr 30 06:10:01 UTC 2009
> Thu Apr 30 06:15:01 UTC 2009
> Thu Apr 30 06:20:01 UTC 2009
> Thu Apr 30 06:25:01 UTC 2009
> Thu Apr 30 06:30:01 UTC 2009
> Thu Apr 30 06:31:01 UTC 2009
> New password: NtUirUuh
> Result: Success (0)
> modifying entry
> "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
> 
> Thu Apr 30 06:35:01 UTC 2009
> Thu Apr 30 06:40:01 UTC 2009
> Thu Apr 30 06:45:01 UTC 2009
> Thu Apr 30 06:50:01 UTC 2009
> Thu Apr 30 06:55:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:33271
> ESTABLISHED <-- BOTH RECOVERED BY KEEPALIVE
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED
> Thu Apr 30 07:00:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-1.dist-one:33271
> ESTABLISHED
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED
> Thu Apr 30 07:01:01 UTC 2009
> New password: r/Jn0AYk
> Result: Success (0)
> modifying entry
> "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
> 
> Thu Apr 30 07:05:01 UTC 2009
> tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:33271
> ESTABLISHED <-- QUEUED ON ONE
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED <-- SUCCESSFUL ON THE OTHER
> Thu Apr 30 07:10:01 UTC 2009
> tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:33271
> ESTABLISHED
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED
> Thu Apr 30 07:15:01 UTC 2009
> tcp        0   2402 172.24.1.191:ldap           slave-1.dist-one:33271
> ESTABLISHED
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED
> Thu Apr 30 07:20:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED <-- DROPPED THE QUEUED ONE
> Thu Apr 30 07:25:01 UTC 2009
> tcp        0      0 172.24.1.191:ldap           slave-2.dist-one:42721
> ESTABLISHED
> 
> 
> ********* LDAP log on the 1st slave (unsuccessful update until the
next
> 2hr keep alive) *********
> 
> Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004
be_search
> (0)
> Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004
> cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
> Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004
be_modify
> (0)
> Apr 30 04:49:53 slave-1 slapd[27904]: do_syncrep2: rid 004
> LDAP_RES_INTERMEDIATE - REFRESH_DELETE
> Apr 30 06:49:52 slave-1 slapd[27904]: do_syncrep2: rid 004 Can't
> contact
> LDAP server
> Apr 30 06:49:52 slave-1 slapd[27904]: do_syncrepl: rid 004 retrying (9
> retries left)
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
be_search
> (0)
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
> o=partner_x,dc=dist-one,dc=net
> <snip>
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
be_search
> (0)
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
> cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004
be_modify
> (0)
> Apr 30 06:50:22 slave-1 slapd[27904]: do_syncrep2: rid 004
> LDAP_RES_INTERMEDIATE - REFRESH_DELETE
> Apr 30 06:53:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 06:57:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 07:45:32 slave-1 slapd[27904]: connection_read(37): no
> connection!
> Apr 30 07:55:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 07:56:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 07:59:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 08:17:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 08:25:32 slave-1 slapd[27904]: connection_read(33): no
> connection!
> Apr 30 08:31:32 slave-1 slapd[27904]: connection_read(37): no
> connection!
> Apr 30 08:46:32 slave-1 slapd[27904]: connection_read(37): no
> connection!
> Apr 30 08:50:21 slave-1 slapd[27904]: do_syncrep2: rid 004 Can't
> contact
> LDAP server
> Apr 30 08:50:21 slave-1 slapd[27904]: do_syncrepl: rid 004 retrying (9
> retries left)
> Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004
be_search
> (0)
> Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004
> o=partner_x,dc=dist-one,dc=net
> 
> 
> ********* On the 2nd slave (successful update) *********
> 
> Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
> (0)
> Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002
> cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
> Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify
> (0)
> Apr 30 04:49:52 slave-2 slapd[6973]: do_syncrep2: rid 002
> LDAP_RES_INTERMEDIATE - REFRESH_DELETE
> Apr 30 06:49:51 slave-2 slapd[6973]: do_syncrep2: rid 002 Can't
contact
> LDAP server
> Apr 30 06:49:51 slave-2 slapd[6973]: do_syncrepl: rid 002 retrying (9
> retries left)
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
> (0)
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
> o=partner_x,dc=dist-one,dc=net
> <snip>
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
> (0)
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002
> cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net
> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify
> (0)
> Apr 30 06:50:21 slave-2 slapd[6973]: do_syncrep2: rid 002
> LDAP_RES_INTERMEDIATE - REFRESH_DELETE
> Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)
> Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search
> (0)
> Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002
> uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net
> Apr 30 07:01:02 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify
> (0)
> 
> 
> Thanks again,
> John
> 
> 
> 
> This message is confidential to Prodea Systems, Inc unless otherwise
> indicated
> or apparent from its nature. This message is directed to the intended
> recipient
> only, who may be readily determined by the sender of this message and
> its
> contents. If the reader of this message is not the intended recipient,
> or an
> employee or agent responsible for delivering this message to the
> intended
> recipient:(a)any dissemination or copying of this message is strictly
> prohibited; and(b)immediately notify the sender by return message and
> destroy
> any copies of this message in any form(electronic, paper or otherwise)
> that you
> have.The delivery of this message and its information is neither
> intended to be
> nor constitutes a disclosure or waiver of any trade secrets,
> intellectual
> property, attorney work product, or attorney-client communications.
The
> authority of the individual sending this message to legally bind
Prodea
> Systems
> is neither apparent nor implied,and must be independently verified.



This message is confidential to Prodea Systems, Inc unless otherwise indicated 
or apparent from its nature. This message is directed to the intended recipient 
only, who may be readily determined by the sender of this message and its 
contents. If the reader of this message is not the intended recipient, or an 
employee or agent responsible for delivering this message to the intended 
recipient:(a)any dissemination or copying of this message is strictly 
prohibited; and(b)immediately notify the sender by return message and destroy 
any copies of this message in any form(electronic, paper or otherwise) that you 
have.The delivery of this message and its information is neither intended to be 
nor constitutes a disclosure or waiver of any trade secrets, intellectual 
property, attorney work product, or attorney-client communications. The 
authority of the individual sending this message to legally bind Prodea Systems  
is neither apparent nor implied,and must be independently verified.