Re: Replication speed and data sizing (mdb)

To: openldap-technical@openldap.org

Subject: Re: Replication speed and data sizing (mdb)

From: Brian Wright <brianw@marketo.com>

Date: Mon, 10 Aug 2015 17:19:22 -0700

In-reply-to: <55AEB15F.5020909@marketo.com>

Organization: Marketo.com

References: <55A9C667.6030603@marketo.com> <8C69D93CA09141B8EDF6478A@quanah-mac.local> <55AEB15F.5020909@marketo.com>

User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.7.0


Hi Quanah or anyone with experience,

I have upgraded to 2.4.41 on a two-node cluster and still see slow replication. I inserted 300k user records into an LMDB database; the resulting data.mdb is 2 GB in size. The insertion took 3 hours to complete (likely mostly due to ldapadd). I then enabled two-way replication to a second node with the following syncrepl statement (similar on both nodes):
syncrepl rid=1
         provider=ldap://ldap1
         type=refreshAndPersist
         retry="5 5 300 +"
         searchbase="dc=marketo,dc=com"
         attrs="*,+"
         bindmethod=simple
         binddn="cn=admin,dc=marketo,dc=com"
         credentials=<redacted>

I started this replication on Friday, and by Monday it was only 28% complete (around 90k records transferred; data.mdb = 571M). The servers have full-speed network connections between them, so I don't understand why the protocol is so slow. Is replication not intended for this amount of transfer load? Are we expected to recover the node by a separate method (i.e., slapcat / slapadd) and start replication only after the data has been loaded?
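
For a rough sense of scale, the figures above can be turned into rates. This is only a back-of-the-envelope sketch: the record counts and the 3-hour ldapadd time are from this message, while ELAPSED_HOURS is an assumption (Friday to Monday taken as roughly 72 hours).

```shell
#!/bin/sh
# Figures from the post; ELAPSED_HOURS is an assumption (Friday to Monday taken as ~72 h).
TOTAL_RECORDS=300000
REPLICATED_RECORDS=90000
ELAPSED_HOURS=72
LDAPADD_HOURS=3

# Observed syncrepl rate and the time left at that rate (integer arithmetic).
repl_rate=$((REPLICATED_RECORDS / ELAPSED_HOURS))   # records per hour
remaining=$((TOTAL_RECORDS - REPLICATED_RECORDS))
hours_left=$((remaining / repl_rate))

# Rate of the original ldapadd load, for comparison.
ldapadd_rate=$((TOTAL_RECORDS / LDAPADD_HOURS))
slowdown=$((ldapadd_rate / repl_rate))

echo "syncrepl: ${repl_rate} records/hour, about ${hours_left} hours remaining"
echo "ldapadd:  ${ldapadd_rate} records/hour (${slowdown}x faster)"
```

Under that assumption, replication is running at about 1250 records per hour against roughly 100000 per hour for the initial ldapadd load, and the remaining 210k records would need about another week.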

Additionally, when I restarted a partially replicated node far into this process (at about 90k records), replication stopped and did not resume after the restart. I do not have journaling enabled, but because these are all new, full records, I would not expect it to improve performance much; at most it might help with replication recovery.

My questions include...

Is syncrepl configured optimally?
Will journaling help with replication recovery?

We're trying to solve the problem of how to recover/replace a failed node in a system containing a very large number of records and bring it back into the cluster as quickly as possible. We're also trying to resolve how to ensure that replication works consistently on restart.
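
The slapcat / slapadd alternative mentioned above might look roughly like the sketch below. It is a dry run that only prints the commands, and it is not a tested procedure: the suffix is taken from the syncrepl statement, while the LDIF path and the configuration directory are hypothetical placeholders (a slapd.conf setup would use -f instead of -F).

```shell
#!/bin/sh
# Dry-run sketch: prints the seeding steps instead of executing them.
SUFFIX="dc=marketo,dc=com"        # from the syncrepl statement in this post
LDIF="/tmp/seed.ldif"             # hypothetical export path
CONFDIR="/etc/openldap/slapd.d"   # hypothetical config dir; use -f slapd.conf if applicable

seed_plan() {
    # Step 1, on the healthy node: dump the database to LDIF.
    echo "slapcat -F $CONFDIR -b $SUFFIX -l $LDIF"
    # Step 2, on the replacement node, with slapd stopped and an empty
    # database directory: bulk-load the LDIF. -q is slapadd's quick mode.
    echo "slapadd -q -F $CONFDIR -b $SUFFIX -l $LDIF"
    # Step 3: start slapd on the replacement node so that syncrepl only
    # has to pick up changes made since the dump (not shown here).
}

seed_plan
```

The intent of such a procedure is that the bulk of the data arrives through an offline load rather than over the replication protocol; whether it resolves the restart behaviour described above is a separate question.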

Please let me know.

Thanks.

On 7/21/15 1:53 PM, Brian Wright wrote:

Hi Quanah,

I will upgrade to 2.4.41 and re-run my testing.

Thanks.

On 7/21/15 12:16 PM, Quanah Gibson-Mount wrote:

--On July 17, 2015 at 8:22:15 PM -0700 Brian Wright <brianw@marketo.com> wrote:

We are using 2.4.39. I realize there are newer versions available, but at the time when we started our LDAP project, this was the version available.

There were several significant changes made to 2.4.41 to attempt to address a number of the issues you are reporting. I would suggest upgrading to 2.4.41 and see if you find any significant improvements.

--Quanah
--

Brian Wright
Sr. UNIX Systems Engineer
901 Mariners Island Blvd Suite 200
San Mateo, CA 94404 USA
Email brianw@marketo.com
Phone +1.650.539.3530
www.marketo.com


Follow-Ups:

Re: Replication speed and data sizing (mdb)
- From: Aaron Richton <richton@nbcs.rutgers.edu>
Re: Replication speed and data sizing (mdb)
- From: Andrew Findlay <andrew.findlay@skills-1st.co.uk>