Query Optimization

Intractable Problems” is a column designed to challenge and stimulate your thinking skills. In this issue I want to challenge you to look at some of the difficulties inherent in accessing the data in our computer systems.

Now, I know that each and every one of you always gets all the information out of the computer that you need, without difficulty. Why, we all use SQL, the structured query language designed for easy access into relational databases!

Let’s take a close look at SQL[1]. Did you know that SQL statements are relational calculus expressions? I am not making this up. Relational calculus is a way of expressing what our result is to be, without figuring out how to get it. Of course, we’re all calculus experts, right? Yeah, sure!

Guess what happens if we want some data which must be selected on more than one condition. For example, we would like to find all 65 year old males in the province of New Brunswick.

Now, logically, we know that there are two ways to find the answer. First, we could find all the males in New Brunswick and then look to see which of them are 65 years old. Or, we could find all the 65 year old people in New Brunswick and then see which of them are male. Does it make a difference which way we use?

Yes, it does. We would like to know the selectivity of each condition. Selectivity is the ratio between the amount of data that satisfies the condition to the total amount of data that must be searched. The selectivity is the probability that a specific data item is found. If the selectivity is small, then only a few data items are selected by the condition. We want to use this condition first in any compound expression so that we minimize the amount of work necessary to satisfy the next condition. In general, for multiple conjunctive conditions, the best order of application is from smallest selectivity to largest.

In many cases neither you nor I nor the computer know the precise selectivities. So, the best assumption we can make is that the data is uniformly distributed across its range. For example, assuming uniform distributions, if we are looking for 65 year old males in a population then we can predict that a gender selectivity is 0.5 and an age selectivity is 0.01.  This assumes that gender is either male or female and age ranges across integer values from 1 to 100.  With these selectivities we will get a 50 fold reduction in search work if we select on age before gender. In other words, to find out how many 65 year old people are male, first find those who are 65 years old and then find out who are male. The other way is far too difficult.

Now, let’s raise the stakes. What happens if the distribution is not uniform? Let’s introduce a new area attribute and consider the case where 95% of the people in our population live in the province of New Brunswick and the remaining 5% live in 199 different areas of the world. In this case there are 200 different area values. The area selectivity, assuming a uniform distribution, is 1/200 or 0.005.  This is much smaller than the age selectivity shown above. According to our rule, the area attribute should be accessed first given any query with a conjunctive clause relating area and age and gender.

But  this is a very poor plan if we wanted a New Brunswick resident because 95% of people in our population will be chosen first.  Our distribution was not very uniform, was it? How is a person to know this, without understanding the data and applying some calculus?

SQL, a language founded on relational calculus, does not help us choose the most effective query formulation. SQL is a language to specify what we want, but not how to do it.  The query optimization task is often an intractable problem, frequently left to some foolish computer to figure out.

So, the next time the computer is slow, ask yourself if there was something that you could have done to formulate your request in a better way.  I don’t think you expect the computer to know how to solve problems that you yourself can’t do. Or do we expect computers to do our thinking for us?  Now that is an interesting problem!

Next issue, I want to take a critical look at the serious problems involved in processing information. Do you think that there are any limits as to what we can do?





Footnotes

1. Codd, E.F. “A Data Base Sublanguage Founded on Relational Calculus”, Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access and Control.