[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: filter preprocessing for performance improvement

To: man@mentata.com
Subject: Re: filter preprocessing for performance improvement
From: Pierangelo Masarati <ando@sys-net.it>
Date: Sat, 03 Mar 2007 15:08:35 +0100
Cc: Francis Swasey <Frank.Swasey@uvm.edu>, openldap-devel@openldap.org
In-reply-to: <45E956F5.6070409@jonanddeb.net>
References: <Pine.GSO.4.63.0703021142370.29106@grapevine.cc.utexas.edu> <45E874F5.2090500@uvm.edu> <45E956F5.6070409@jonanddeb.net>
User-agent: Mail/News 1.5.0.8 (X11/20061210)

Jon Roberts wrote:

> I was just today thinking about something along the lines of filter
> preprocessing (at the client level actually) that prevented say a
> contains search like (telephonenumber=*67530*) on an attribute that the
> directory has not indexed for substring searches (case of
> telephonenumber). Something at the server level would be better of course.

Something like that was discussed long time ago when I proposed the
"limits" feature (which eventually got into slapd in its current form).
 It's hard to tell what such constraint would mean.  However, if one
only looks at the presence of a substrings filter in a search,
unexpected results may occur; for example:

	(telephonenumber=*67530*)		=> reject

but what about

	(!(telephonenumber=*67530*))		=> ?

or

	(&(uid=foo)(telephonenumber=*67530*))	=> ?

A better approach, which we recently developed for a customer, would be
to define what filter is to be considered acceptable and what is not,
and then analyze the logic of the filter to see if it matches that of
the requirement.  For example, logic analysis could allow to determine
if a filter is surely acceptable, surely unacceptable, or "grey"; then,
decision making could determine what to do in the "grey" cases.

If what you want to control is searches resulting in large candidate
sets, you need to define what may potentially lead to large candidate
sets.  So you need to define what's "large", and what simple filters
could lead to large candidates sets.

For example: (uid=foo) is likely to lead to a unique candidate (or at
most to a few, unless the "unique" overlay is in use), but
(objectClass=person) is likely to lead to more than one candidate, and
in some case to nearly all the entries in the database.  The
administrator should define what's the expected behavior of exact
matching on a given attribute.

Then, filter analysis would lead to an answer in terms of likelihood of
resulting in large candidate sets.

The filter analyzer would need to use rules like:

	- presence	=> large set
	- equality	=> small or large set, depending on attribute
	- appprox	=> small or large set, depending on attribute
	- substrings	=> large
	- ordering	=> large

	- not		=> small to large, large to large
	- and		=> small if at least one small, large otherwise
	- or		=> large if at least one is large

If the rule is: "reject large candidate lists", filter analysis should
walk the filter to terminals, stopping as soon as the rule says that the
candidate set could be large.  If a small set is expected after the
filter tree has been entirely traversed, the operation can continue.

It is not simple, and it needs some configuration from the
administrator, to guess what equality rules can be assumed to return
"small" candidate sets, but it should work.

To make it more general, attributes could be given a weight, or a
probability: some equalities may be believed to return small sets,
others to return larger sets, so that

	(sn=Smith)

may return a considerable set (=> weight 0.1) but (!(cn=Smith)) could
return an even larger set (=> weight 1-0.1 = 0.9).

Then, using P() as a "probability operator" [*]:

	P(x=v)		=>	admin defined
	P(x~=v)		=>	admin defined
	P(x=*v*)	=>	admin defined (based on subsrt length?)
				but in any case > P(x=v); 0.5 should be
				reasonable, so that P(!(x=*v*)) is 0.5)
	P(x>v), P(x<v)	=>	admin defined (based on v?)
				but in any case > P(x=v)
	P(not(f))	=>	1-P(f)
	P(and(f1,f2))	=>	min(P(f1),P(f2))
	P(or(f1,f2))	=>	max(P(f1),P(f2))

and the requirement could be to have P(filter) < threshold.

If anyone wants to implement anything like that, I suggest to either
write an overlay that intercepts search operations and analyzes the
filter (that's the approach we followed for very specific filter
analysis rules), or to extend limits so that they accept run-time
loadable rules.

p.

[*] my statistics are quite naive; one could object that the probability
of P(and(x,y)) should rather be P(x)*P(y) and so, but I'm just trying to
define a set of rules that make sense.  Think of them as "fuzzy" rules
rather than consistent maths.

Ing. Pierangelo Masarati
OpenLDAP Core Team

SysNet s.n.c.
Via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
------------------------------------------
Office:   +39.02.23998309
Mobile:   +39.333.4963172
Email:    pierangelo.masarati@sys-net.it
------------------------------------------

Follow-Ups:
- Re: filter preprocessing for performance improvement
  - From: Howard Chu <hyc@symas.com>

References:
- filter preprocessing for performance improvement
  - From: Eric Irrgang <erici@motown.cc.utexas.edu>
- Re: filter preprocessing for performance improvement
  - From: Francis Swasey <Frank.Swasey@uvm.edu>
- Re: filter preprocessing for performance improvement
  - From: Jon Roberts <jon@jonanddeb.net>

Prev by Date: Re: filter preprocessing for performance improvement
Next by Date: Re: filter preprocessing for performance improvement
Index(es):
- Chronological
- Thread