Sunday, January 27, 2013

...But Mike, Can I Trust the Query Optimizer?

This is usually the 2nd question I hear most often, right after:

"...but Mike, will SQL perform as well as my RPG/COBOL program"?

Before I provide an answer (and it's not exactly what you think), let's explore a few things.

According to merriam-webster.com

TRUST is defined as:

- assured reliance on the character, ability, strength, or truth of someone or something

- one in which confidence is placed

And CONFIDENCE is defined as:

- a feeling or consciousness of one's powers or of reliance on one's circumstances

- faith or belief that one will act in a right, proper, or effective way

So, if you don't trust the query optimizer, what you are saying is that you don't have confidence in it's ability. If you don't have confidence in the query optimizer, you are really stating that it will not act in the right, proper or effective way. And to make matters worse, you find yourself stuck between a rock and a hard place. Because how can you make use of the latest and greatest DB2 features available only via SQL, if you don't have faith in the SQL query optimizer?

Hold on a minute...  how do YOU know what the right, proper or effective way really is?

Are you as omnipresent as the query optimizer?

By the way, Merriam Webster tells us OMNIPRESENT means: present in all places at all times.


Query Optimization Defined


The query optimizer's job is to produce a plan. Producing a plan is the result of many factors. Factors that will limit or expand what is possible with regards to the strategies and methods of execution to produce the correct result in a timely manner.

Knowing and understanding these factors are key to trusting the query optimizer. Experience with how these factors affect the final plan also goes a long way in building confidence that the query optimizer has provided the best plan for your query, your data and your environment.

What is the "best" plan anyway?

Most folks don't appreciate that the query optimizer has two equally important jobs.

Job 1: build a plan that produces the CORRECT output

Job 2: build a plan that is ESTIMATED to be the fastest possible given all the factors present and accounted for

We all forget (and thus have full trust and confidence in) the optimizer's first job. We all assume the plan for our query will produce the correct output. Don't believe me? When was the last time you actually validated your query's output was correct?  HA!  You DO trust the optimizer don't you.

What most users focus on is Job 2.  In other words, we all keep our eyes on the query run time. This is where we begin to loose faith. This is where we loose confidence with the programmer in the box.


Ah, The Programmer in the Box


Or is it The Ghost in the Machine

If we think of the query plan as a program - a program executed by the database engine and managed by the operating system, then there must be a programmer behind the scenes to design and code that program. In essence, this is exactly what the query optimizer is. So, getting back to the original question on top, you are really asking: "can I trust the programmer in the box". Let's open the box and peer inside the machine...

The programmer is highly trained, highly experienced, up-to-date on all things computer science past and present. The programmer never sleeps, is always ready.  The programmer is fast - capable of designing, writing and compiling programs in milliseconds.  The programmer is vigilant - watching, learning, adapting to the changing environment.


The Process


What does the programmer have to work with? What are the requirements and the inputs?

As with any programming endeavor, query optimization starts with a request, and continues with identifying the attributes of the underlying data and data structures, as well as the computing and I/O resources available.

Most users realize that the SQL statement is the request. This request, can include items such as:
  • Column list projected or returned
  • Source or target of query
  • Constraining or identify criteria
  • Grouping or aggregation criteria
  • Constraining criteria on groups or aggregates
  • Ordering criteria for final result set
  • Constraining number of  results
 Syntactically, these are known to SQL coders respectively as:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
FETCH FIRST

Most SQL users DO NOT fully understand or appreciate the data profile, the data structures, and the system's computing and I/O capabilities (speeds and feeds if you will). Yet, the programmer in the box certainly does.

Imagine a programmer that not only has all the information, but also has access to all the structures above and below the machine interface. Imagine having access to the latest computational techniques materialized in the IBM i microcode. This would be a true advantage, and hopefully would in turn inspire confidence and trust in the user.


The Key Factors


From my perspective, the 3.5 most important factors that positively or negatively affect the programmer in the box are:

   1. Indexing strategy

   2. Fair share of memory

   3. Optimization goal

   3.5. Reusable open data path (ODP)

Without an adequate indexing strategy, the query optimizer is limited in the techniques available.  Not unlike the RPG or COBOL programmer who wants desperately to access one row by key (think CHAIN or random read), but no one provided the correct keyed logical file that enables the technique.

Without an adequate fair share of memory, the database engine is limited in how aggressive it can be with asynchronously reading data, and keeping that data in memory to minimize wait time.  Not unlike the RPG or COBOL programmer who wants desperately to use an internal array or table, but can't due to size limitations.

Without properly communicating the optimization goal, the query optimizer is unaware of the application's fetch behavior. This can result in a plan that is best for returning all of the query results instead of a plan that returns the first subset of results. Not unlike the RPG or COBOL programmer who reads all the data and loads the entire subfile, instead of filling the first page and returning control to the user sooner than later.

Without proper data base design, good set-at-a-time coding techniques and data centric processing, the query optimizer is forced to revisit the query plan over and over. To do this, the cursor is (hard) closed and the query optimizer is called upon to ensure that the current plan is doing the right things with the right objects. Not unlike the RPG or COBOL programmer who is frequently and repeatedly tapped to end their program, revisit the design, objects and techniques - all while the user patiently waits for the program to come back on line.


Criticize Things You Don't Know About


In my experience, trust and confidence come from KNOWLEDGE, EXPERIENCE and TIME. In other words, the more knowledge you have, the more experience you have, and the more time you've spent, the more comfortable you become. This is the same with query optimization. You become wiser to the ways and means of the programmer in the box.

Those of us who have worked extensively with the DB2 for i Query Optimizer (and the fantastic programmers of the box), never rarely question the plan produced. We first look around at all of the inputs, and seek to see the problem/solution from the optimizer's point of view. More often than not, an epiphany occurs.


Get Some


It just so happens you can acquire the knowledge and initial experience with SQL query optimization by attending the very workshop dedicated to it. This class is the very best way, short of a master/apprentice relationship to get more confidence and trust in DB2 for i.

More information on the workshop can be found here.

_______

...But Mike, Can I Trust the Query Optimizer?

My knowledge and experience tells me, "yes, you can".

3 comments:

  1. i still don't trust the query optimizer. you can't easily reason about the query optimizer. it's an extra layer where everything is hidden from the programmer.

    sure you can talk about how it's necessary to learn it's behavior. but at that point it starts to get in the way (this is the sign!). when you look at imperative programming languages, there is no optimizer. sure the compiler will make optimizations but the effect is much less noticable than in sql (it's really not the same).

    how would you like it if an extra layers was added to imperative programming languages. and this layer could cause a program to sometimes take a few milliseconds and other times 1 hour to execute. and you can't directly control this!

    i would rather learn the effect that writing statements a certain way has. and learn how to write efficient queries. resorting to using temporary tables, just to avoid the optimizer is also not my idea of writing good queries. i really don't like "magic"

    ReplyDelete
  2. A few reminders:

    1) if you use SQL, you cannot avoid the optimizer - it's part of the path

    2) it is not up to you to write the "program" that fulfills the (SQL) request, that is the job of the optimizer

    3) a good optimizer will handle standard SQL and apply rewrites and sophisticated strategies to arrive at a proper plan

    4) DB2 for i has a good optimizer

    ReplyDelete
  3. The other thing is that the DB2 optimizer has gotten a LOT better over the years. I think a lot of people have experiences from the V3 or V4 days when they tried using SQL (or worse, SQL over ODBC) and found it to be rather slow.

    Try it today.

    ReplyDelete