Full_Name: Quanah Gibson-Mount
Version: RE24 Sept 11, 2015
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.52.177)

Seeing a scenario where if slapd is stopped on a new MMR node while a full REFRESH is occurring, the state of that refresh is not tracked, and the wrong CSN value is stored.

This dataset has 15,000 users. We see it get up to user 625:

Oct 20 16:13:09 q2 slapd[18724]: syncrepl_entry: rid=100 be_search (0)
Oct 20 16:13:09 q2 slapd[18724]: syncrepl_entry: rid=100 uid=user625,ou=people,dc=q1,dc=aon,dc=zimbraview,dc=com
Oct 20 16:13:09 q2 slapd[18724]: slap_queue_csn: queueing 0x44c7e30 20151020185526.862768Z#000000#000#000000
Oct 20 16:13:09 q2 slapd[18724]: slap_graduate_commit_csn: removing 0x44c87c0 20151020185526.862768Z#000000#000#000000
Oct 20 16:13:09 q2 slapd[18724]: syncrepl_entry: rid=100 be_add uid=user625,ou=people,dc=q2C2Cdc=aon,dc=zimbraview,dc=com (0)
Oct 20 16:13:09 q2 slapd[18724]: slapd stopped.

Then when slapd is restarted:

Oct 20 16:13:16 q2 slapd[18970]: do_syncrep2: rid=100 cookie=rid=100,sid=001,csn=20151020201231.263989Z#000000#001#000000
Oct 20 16:13:16 q2 slapd[18970]: sp_p_queue_csn: queueing 0x309dfd8 20151020201231.263989Z#000000#001#000000
Oct 20 16:13:16 q2 slapd[18970]: slap_queue_csn: queueing 0x5054008 20151020201231.263989Z#000000#001#000000
Oct 20 16:13:16 q2 slapd[18970]: slap_graduate_commit_csn: removing 0x49353c0 20151020201231.263989Z#000000#001#000000
Oct 20 16:13:16 q2 slapd[18970]: slap_graduate_commit_csn: removing 0x4935060 20151020201231.263989Z#000000#001#000000
Oct 20 16:13:16 q2 slapd[18970]: syncrepl_message_to_op: rid=100 be_add cn=q2.aon.zimbraview.com,cn=servers,cn=zimbra (0)

which causes it to skip the other 14,000+ users.

quanah@openldap.org wrote:
> Seeing a scenario where if slapd is stopped on a new MMR node while a full
> REFRESH is occurring, the state of that refresh is not tracked, and the wrong
> CSN value is stored.
> [...]
> which causes it to skip the other 14,000+ users.

After investigating the server setup, there are a few problems here.

The new server was being configured with sid=001, which was already assigned to the original master. That's clearly going to screw things up.

Aside from that, the new server was converted to MMR using dynamic config and we have a sequencing problem - it adds the syncprov overlay first, and then adds the syncrepl config. This is actually the only safe order, since the consumer will start as soon as it's added and syncprov will already be in place, ready to propagate changes as needed.

But syncprov does a check in syncprov_db_open() to decide whether it should generate an initial contextCSN on a new DB. This step is skipped if the backend is configured for MMR (and must be skipped). The problem is that this node *will be* configured for MMR, but it isn't yet, because the consumer hasn't been dynamically configured yet. So syncprov generates its own contextCSN, which is checkpointed on shutdown. The real contextCSN from the master hasn't been received yet, since the refresh is still in progress when the server is stopped. On the next restart this consumer therefore presents its generated contextCSN, which is newer than the original master's, and so it doesn't resume refreshing from where it left off.

Generating a new contextCSN at startup is of questionable worth. We discussed this a bit way back in 2004:
http://www.openldap.org/lists/openldap-devel/200408/msg00035.html

Perhaps we should just not do it; if a single-master provider starts up empty and a consumer tries to talk to it and both have an empty cookie, the provider should just respond "you're up to date".

--
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
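
To illustrate why the locally generated contextCSN hides the unfinished refresh, here is a minimal Python sketch of the comparison, using two CSN strings copied from the logs in the report. The helper names `parse_csn` and `consumer_looks_current` are hypothetical and are not slapd functions. The real syncprov logic compares CSNs per server ID and is more involved; this sketch assumes a single simplified "newer or equal" check. The master's actual contextCSN does not appear in the logs, so the CSN of the last replicated entry is used as a stand-in.

```python
from typing import NamedTuple


class CSN(NamedTuple):
    """The four '#'-separated fields of a CSN string."""
    timestamp: str  # fixed-width UTC timestamp, so string order is time order
    count: int      # change counter (hex in the string)
    sid: int        # server ID (hex in the string)
    mod: int        # modification counter (hex in the string)


def parse_csn(text: str) -> CSN:
    """Split a CSN such as '2015...Z#000000#001#000000' into its fields."""
    timestamp, count, sid, mod = text.split("#")
    return CSN(timestamp, int(count, 16), int(sid, 16), int(mod, 16))


def consumer_looks_current(cookie_csn: str, provider_csn: str) -> bool:
    """Simplified assumption: a consumer whose cookie CSN is at least as new
    as the provider's CSN is treated as having nothing left to fetch."""
    return parse_csn(cookie_csn) >= parse_csn(provider_csn)


# CSN the restarted consumer presented in its cookie (generated locally).
generated = "20151020201231.263989Z#000000#001#000000"
# CSN of the last entry received before shutdown (stand-in for the master's state).
from_master = "20151020185526.862768Z#000000#000#000000"

print(consumer_looks_current(generated, from_master))
```

Under these assumptions the check succeeds, which matches the observed behaviour: the consumer appears to be ahead of the master even though only a fraction of the entries were transferred.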

hyc@symas.com wrote:
> Generating a new contextCSN at startup is of questionable worth. We discussed
> this a bit 'way back in 2004
> http://www.openldap.org/lists/openldap-devel/200408/msg00035.html Perhaps we
> should just not do it;

+1

> if a single-master provider starts up empty and a
> consumer tries to talk to it and both have an empty cookie, the provider
> should just respond "you're up to date".

Why not return an error to the consumer?

Does the provider know whether it's running as single-master?

Ciao, Michael.

Michael Ströder wrote:
> hyc@symas.com wrote:
>> if a single-master provider starts up empty and a
>> consumer tries to talk to it and both have an empty cookie, the provider
>> should just respond "you're up to date".
>
> Why not return an error to the consumer?

Typically if a consumer receives an error it will disconnect and retry later. There's not much point making the consumer reconnect - which may be costly for a TCP session. If it's a refreshAndPersist consumer, it just needs to hang on and wait for some real data to arrive.

> Does the provider know whether it's running as single-master?

Generally yes. A single-master setup has serverID=0.
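
The two provider responses being weighed here can be laid side by side in a small Python sketch. Everything in it is hypothetical: `Reply`, `reply_to_consumer`, `consumer_action` and the `error_when_empty` switch are illustrative names, not part of slapd, and the sketch only models the single case discussed in this thread (provider and consumer both without any CSN).

```python
from enum import Enum
from typing import Sequence


class Reply(Enum):
    UP_TO_DATE = "up to date"
    ERROR = "error"
    COMPARE_CSNS = "compare CSNs as usual"


def reply_to_consumer(provider_csns: Sequence[str],
                      consumer_csns: Sequence[str],
                      error_when_empty: bool) -> Reply:
    """Pick the provider's answer for the empty-provider, empty-cookie case.

    error_when_empty=False models the "you're up to date" proposal;
    error_when_empty=True models the "return an error" proposal.
    """
    if not provider_csns and not consumer_csns:
        return Reply.ERROR if error_when_empty else Reply.UP_TO_DATE
    # Any other combination is outside the scope of this sketch.
    return Reply.COMPARE_CSNS


def consumer_action(reply: Reply) -> str:
    """Consumer behaviour as described in the thread for a
    refreshAndPersist consumer."""
    if reply is Reply.ERROR:
        return "disconnect and retry later"
    if reply is Reply.UP_TO_DATE:
        return "stay connected and wait for real data"
    return "proceed with normal refresh"


for flag in (False, True):
    reply = reply_to_consumer([], [], error_when_empty=flag)
    print(flag, reply.value, "->", consumer_action(reply))
```

The trade-off in the discussion is visible in the two branches: the error path costs a reconnect, while the "up to date" path keeps the session open but needs the provider to treat the empty state as a valid one.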

hyc@symas.com wrote:
> Typically if a consumer receives an error it will disconnect and retry later.
> There's not much point making the consumer reconnect - which may be costly for
> a TCP session. If it's a refreshAndPersist consumer, it just needs to hang on
> and wait for some real data to arrive.

Is the cost really that high compared to the rest of the initialization?

>> Does the provider know whether it's running as single-master?
>
> Generally yes. A single-master setup has serverID=0.

Hmm, this introduces more semantics on serverID. I have some doubts about corner cases. Maybe I misunderstood, but IMO the issue was about changing a provider to an MMR replica, which would need serverID!=0 anyway.

Ciao, Michael.

Michael Ströder wrote:
> hyc@symas.com wrote:
>> There's not much point making the consumer reconnect - which may be costly for
>> a TCP session.
>
> Is the cost really that high compared to the rest of the initialization?

I meant "TLS" there.

hyc@symas.com wrote:
> I meant "TLS" there.

As I'm solely using TLS-secured LDAP connections *everywhere*, I also implied TLS. Still, I assume reopening the syncrepl connection a few times is nothing compared to the majority of LDAP clients opening a connection for every single LDAP simple bind request. So if it simplifies error handling, which likely results in more robustness, I'd strongly prefer that.

Ciao, Michael.

Michael Ströder wrote:
> So if it simplifies error handling which
> likely results in more robustness, I'd strongly prefer that.

As it turns out, it greatly simplified things to handle this condition as you suggest. Fixed now in git master.

changed notes
changed state Open to Test
moved from Incoming to Software Bugs

changed notes
changed state Test to Release

fixed in master
fixed in RE25
fixed in RE24 (2.4.43)

changed notes
changed state Release to Closed