Full_Name: Maxime Besson Version: 2.4.48 OS: Amazon Linux URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (92.184.104.114) During a cloud provider migration, I have migrated an OpenLDAP instance using MDB + Syncrepl (in mirror mode, with session log) from Ubuntu to Amazon Linux. When starting a server with no data, syncrepl kicks in and starts replicating the other server's database (12G, 7M entries). In Ubuntu this phase takes about 25 minutes (with dbnosync). In Amazon Linux, with the exact same cn=config and daemon options, it took 10 days on first try, at 100% CPU consumption the whole time. I observed that the start of the import is pretty fast (data.mdb grows by 10M/s) but import speed slows down rapidly to a crawl. After some investigation, I discovered that disabling the sessionlog brings the import times to normal. Digging further, I found that most of the CPU time is spent in syncprov_add_slog, and particularly this line: /* Keep the list in csn order. */ ... for ( sep = &sl->sl_head; *sep; sep = &(*sep)->se_next ) { >>>>>>> if ( ber_bvcmp( &se->se_csn, &(*sep)->se_csn ) < 0 ) { se->se_next = *sep; *sep = se; break; } } The sessionlog appears to be growing endlessly: (gdb) print sl->sl_num $1 = 681896 (gdb) print sl->sl_size $3 = 100 Digging around the source, it seems that the session log is supposed to be cleaned up at the end of syncprov_add_slog, if it's not playing: if (!sl->sl_playing) { while ( sl->sl_num > sl->sl_size ) { ... However: (gdb) print sl->sl_playing $4 = 543388517 This looked like an uninitialized value to me, and indeed, sp_cf_gen doesn't seem to initialize this field: if ( !sl ) { sl = ch_malloc( sizeof( sessionlog )); sl->sl_mincsn = NULL; sl->sl_sids = NULL; sl->sl_num = 0; sl->sl_numcsns = 0; sl->sl_head = sl->sl_tail = NULL; ldap_pvt_thread_mutex_init( &sl->sl_mutex ); si->si_logs = sl; } I tried the following patch: --- openldap-2.4.48.orig/servers/slapd/overlays/syncprov.c 2019-07-23 16:46:22.000000000 +0200 +++ openldap-2.4.48/servers/slapd/overlays/syncprov.c 2020-01-08 11:33:16.770110282 +0100 @@ -3082,6 +3082,7 @@ sl = ch_malloc( sizeof( sessionlog )); sl->sl_mincsn = NULL; sl->sl_sids = NULL; + sl->sl_playing = 0; sl->sl_num = 0; sl->sl_numcsns = 0; sl->sl_head = sl->sl_tail = NULL; And it got rid of the issue. My guess is that most of the time, sl_playing happens to be 0, but on Amazon Linux, for some reason (patched glibc?), the sessionlog lands in the wrong chunk of uninitilized memory Additional info: # rpm -q glibc glibc-2.26-32.amzn2.0.2.x86_64 # uname -a Linux 4.14.152-127.182.amzn2.x86_64 #1 SMP Thu Nov 14 17:32:43 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux # /usr/local/openldap/libexec/slapd -V @(#) $OpenLDAP: slapd 2.4.48 (Jan 8 2020 11:39:38) $ (LTB build)
maxime.besson@worteks.com wrote: > Full_Name: Maxime Besson > Version: 2.4.48 > OS: Amazon Linux > URL: ftp://ftp.openldap.org/incoming/ > Submission from: (NULL) (92.184.104.114) Thanks for the report, fixed now in git master. > > > During a cloud provider migration, I have migrated an OpenLDAP instance using > MDB + Syncrepl (in mirror mode, with session log) from Ubuntu to Amazon Linux. > > When starting a server with no data, syncrepl kicks in and starts replicating > the other server's database (12G, 7M entries). > > In Ubuntu this phase takes about 25 minutes (with dbnosync). In Amazon Linux, > with the exact same cn=config and daemon options, it took 10 days on first try, > at 100% CPU consumption the whole time. > > I observed that the start of the import is pretty fast (data.mdb grows by > 10M/s) but import speed slows down rapidly to a crawl. > > After some investigation, I discovered that disabling the sessionlog brings the > import times to normal. > > Digging further, I found that most of the CPU time is spent in > syncprov_add_slog, and particularly this line: > > > /* Keep the list in csn order. */ > ... > for ( sep = &sl->sl_head; *sep; sep = &(*sep)->se_next ) { >>>>>>>> if ( ber_bvcmp( &se->se_csn, &(*sep)->se_csn ) < 0 ) { > se->se_next = *sep; > *sep = se; > break; > } > } > > The sessionlog appears to be growing endlessly: > > (gdb) print sl->sl_num > $1 = 681896 > (gdb) print sl->sl_size > $3 = 100 > > Digging around the source, it seems that the session log is supposed to be > cleaned up at the end of syncprov_add_slog, if it's not playing: > > if (!sl->sl_playing) { > while ( sl->sl_num > sl->sl_size ) { > ... > > However: > > (gdb) print sl->sl_playing > $4 = 543388517 > > This looked like an uninitialized value to me, and indeed, sp_cf_gen > doesn't seem to initialize this field: > > if ( !sl ) { > sl = ch_malloc( sizeof( sessionlog )); > sl->sl_mincsn = NULL; > sl->sl_sids = NULL; > sl->sl_num = 0; > sl->sl_numcsns = 0; > sl->sl_head = sl->sl_tail = NULL; > ldap_pvt_thread_mutex_init( &sl->sl_mutex ); > si->si_logs = sl; > } > > > I tried the following patch: > > --- openldap-2.4.48.orig/servers/slapd/overlays/syncprov.c 2019-07-23 > 16:46:22.000000000 +0200 > +++ openldap-2.4.48/servers/slapd/overlays/syncprov.c 2020-01-08 > 11:33:16.770110282 +0100 > @@ -3082,6 +3082,7 @@ > sl = ch_malloc( sizeof( sessionlog )); > sl->sl_mincsn = NULL; > sl->sl_sids = NULL; > + sl->sl_playing = 0; > sl->sl_num = 0; > sl->sl_numcsns = 0; > sl->sl_head = sl->sl_tail = NULL; > > > And it got rid of the issue. My guess is that most of the time, sl_playing > happens to be 0, but on Amazon Linux, for some reason (patched glibc?), the > sessionlog lands in the wrong chunk of uninitilized memory > > Additional info: > > # rpm -q glibc > glibc-2.26-32.amzn2.0.2.x86_64 > > # uname -a > Linux 4.14.152-127.182.amzn2.x86_64 #1 SMP Thu Nov 14 17:32:43 UTC 2019 x86_64 > x86_64 x86_64 GNU/Linux > > # /usr/local/openldap/libexec/slapd -V > @(#) $OpenLDAP: slapd 2.4.48 (Jan 8 2020 11:39:38) $ > > (LTB build) > > > > > -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
changed notes changed state Open to Release moved from Incoming to Software Bugs
Fixed in master (57dbe995b8f7c80421acf82ad187a4f281e9a4d5) Fixed in RE24 (2.4.49)
changed notes changed state Release to Closed