Tuesday, June 19, 2012

Doing More, with More

In my previous post, where I connected the dots on some of the latest DB2 for i enhancements, the concept of using parallelism was called out. In this entry, I want to further illuminate the basis for building and running a parallel INSERT + SELECT, and explain why you have to get involved.

While DB2 for i has long possessed the ability to use parallel I/O and parallel processing via the optional IBM i feature DB2 Symmetric Multiprocessing (SMP), only certain database operations are parallel enabled.

While just about any query can qualify for parallelism, the SQL operations of INSERT, UPDATE and DELETE do not. That means when the database engine is inserting, updating or deleting rows, the operation is occurring in a single job, thread or task. If you want to break up the work and get more jobs or threads involved, then YOU must design and construct the logic, as well as instigate and coordinate the process.

In the case of the SQL statement INSERT + SELECT, two operations are involved: an insert and a query. The query side is eligible to make use of parallelism (assuming SMP is installed and enabled), but the INSERT side will not make use of parallelism. If there is a large set of rows being returned from the query, it might be advantageous to have multiple concurrent “inserters,” where each process takes a distinct and separate subset of the results and inserts it into the table. And in the case of the new INSERT plus remote SELECT, we also need multiple connections to the remote system.
__________


Reviewing the new 7.1 support via the following diagram, we can see a source database (RDBaaa) and a target database (RDBbbb). The INSERT + SELECT statement is initiated on the target system. A connection is made from the target system to the source system. The SELECT (i.e. query) is executed on the source system and the results are fetched by the target system.

[Diagram: INSERT + remote SELECT flow between the target system (RDBbbb) and the source system (RDBaaa)]

Taking the example further, let’s say the source database contains a set of orders that need to be extracted, transported and loaded into an equivalent ORDERS table in the target database. Simply use the INSERT + remote SELECT statement to do this:

INSERT INTO SchemaB.OrdersB
      (SELECT * FROM RDBaaa.SchemaA.OrdersA a)

 
With the relatively large amount of data that needs to be inserted, we should consider the use of parallelism to speed up the process. To do this we need to have multiple concurrent insert operations being fed by multiple remote queries.

Given that the INSERT operation is not parallel enabled, we’ll have to issue separate INSERT + SELECT statements via multiple jobs on the target side. By running the jobs concurrently, more work will be done in the same unit of time, thus increasing throughput. The result, assuming enough computing and I/O resources are available, will be a faster completion time.

Remember, with parallelism we are trading resources for time; as in more resources for less time.

 

To ensure correct results, we need to specify which rows should be identified and selected on the source FOR EACH QUERY. In a perfect world, each query would access and select the same amount of data, allowing for an even distribution of work across the jobs.

In our example, we can use the year as the distribution mechanism. Given three years of data, it’s relatively easy to use a parallel degree of three (i.e. 3 concurrent jobs). We could just as easily have used 12 (local selection on month) or 36 (local selection on year and month).

To pull this off, we need each of the three queries to target a specific year. This is accomplished by specifying a WHERE clause for each SELECT statement. This piece is important if you want to avoid duplicate data being inserted.
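
For illustration, assume the three years are 2010 through 2012 and the orders table has a date column named ORDER_DATE; both are assumptions for this sketch, not part of the original example. The three statements, one per job, might look like this:

-- The years and the ORDER_DATE column are illustrative assumptions
INSERT INTO SchemaB.OrdersB
      (SELECT * FROM RDBaaa.SchemaA.OrdersA a
        WHERE YEAR(a.ORDER_DATE) = 2010)

INSERT INTO SchemaB.OrdersB
      (SELECT * FROM RDBaaa.SchemaA.OrdersA a
        WHERE YEAR(a.ORDER_DATE) = 2011)

INSERT INTO SchemaB.OrdersB
      (SELECT * FROM RDBaaa.SchemaA.OrdersA a
        WHERE YEAR(a.ORDER_DATE) = 2012)

Because the three predicates are mutually exclusive, each job loads a distinct slice of the data and no row is inserted twice.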

 

There are many techniques to initiate and coordinate the concurrent jobs. One of the simpler ones is to use a CL program to systematically submit the three jobs, whereby each job runs a CL program containing the new RUNSQL command that specifies an INSERT + remote SELECT statement. This will result in three separate connections to the remote database, each running a query.
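
As a rough sketch of that technique, here is a minimal coordinating CL program; for brevity it submits the RUNSQL command directly rather than wrapping it in a per-job CL program, and the job names, years, ORDER_DATE column, commitment control and naming options are all assumptions to be adjusted for your environment:

PGM
/* Submit one job per year; each runs its own INSERT + remote     */
/* SELECT via RUNSQL and makes its own connection to RDBaaa.      */
/* Job names, years and ORDER_DATE are illustrative assumptions.  */
SBMJOB     CMD(RUNSQL SQL('INSERT INTO SCHEMAB.ORDERSB +
             (SELECT * FROM RDBAAA.SCHEMAA.ORDERSA A +
             WHERE YEAR(A.ORDER_DATE) = 2010)') +
             COMMIT(*NONE) NAMING(*SQL)) JOB(ORDERS2010)
SBMJOB     CMD(RUNSQL SQL('INSERT INTO SCHEMAB.ORDERSB +
             (SELECT * FROM RDBAAA.SCHEMAA.ORDERSA A +
             WHERE YEAR(A.ORDER_DATE) = 2011)') +
             COMMIT(*NONE) NAMING(*SQL)) JOB(ORDERS2011)
SBMJOB     CMD(RUNSQL SQL('INSERT INTO SCHEMAB.ORDERSB +
             (SELECT * FROM RDBAAA.SCHEMAA.ORDERSA A +
             WHERE YEAR(A.ORDER_DATE) = 2012)') +
             COMMIT(*NONE) NAMING(*SQL)) JOB(ORDERS2012)
ENDPGM

Each SBMJOB returns immediately, so the three inserts run side by side in their own jobs, each with its own connection to the source database.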

 

Acknowledging the risk of a “slippery slope”, this basic approach to initiate and coordinate parallelism can be expanded to include sophisticated logic to determine the appropriate degree, build up the corresponding SQL statement(s) and start the correct number of concurrently running jobs or threads. Of course, before long you’re right in the realm of real ETL tools and thus better off acquiring one instead of building one.

To complete our example, we can optionally initiate two more jobs that will pick up any orders that lie outside of the three years we were told about (better safe than sorry).
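
Continuing with the illustrative years and ORDER_DATE column from the sketches above, those two catch-all jobs could run statements such as:

-- Catch any orders earlier or later than the three expected years
INSERT INTO SchemaB.OrdersB
      (SELECT * FROM RDBaaa.SchemaA.OrdersA a
        WHERE YEAR(a.ORDER_DATE) < 2010)

INSERT INTO SchemaB.OrdersB
      (SELECT * FROM RDBaaa.SchemaA.OrdersA a
        WHERE YEAR(a.ORDER_DATE) > 2012)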

 


__________
 


What if you don’t have a “year” column?
Use a date and extract the year:   …WHERE YEAR(date_column) = 2011

What if you don’t have a real date column?
Hmmm, it might be time to re-evaluate your data model and consider some enhancements, some modernization, and some re-engineering. Just saying.

So, there you have it – simple parallel ETL via the latest DB2 for i features and functions.

If you need assistance with getting the most out of your database please let us know. We’re here to help.


