While
DB2 for i has long possessed the ability to use parallel I/O and parallel
processing via the optional IBM i feature of DB2 Symmetrical Multiprocessing ,
only certain database operations are parallel enabled.
While just about any
query can qualify for parallelism, the SQL operations of INSERT, UPDATE and
DELETE do not. That means when the database engine is inserting, updating or
deleting rows, the operation is occurring in a single job, thread or
task. If you want to break up the work and get more jobs or threads involved,
then YOU must design and construct the logic, as well as instigate and coordinate
the process.
In the case
of the SQL statement INSERT + SELECT, two operations are involved: an insert
and a query. The query side is eligible to make use of parallelism (assuming
SMP is installed and enabled), but the INSERT side will not make use of
parallelism. If there is a large set of rows being returned from the query, it
might be advantageous to have multiple concurrent “inserters”; where each
process takes a distinct and separate set of results and inserts them into the
table. And in the case of the new INSERT plus remote SELECT, we also
need multiple connections to the remote system.
__________
Reviewing
the new 7.1 support via the following diagram, we can see a source database
(RDBaaa) and a target database (RDBbbb). The INSERT + SELECT statement is initiated
on the target system. A connection is
made from the target system to the source system. The SELECT (i.e.
query) is executed on the source system and the results are fetched by the
target system.
Taking the
example further, let’s say the source database contains a set of orders that
need to be extracted, transported and loaded into the same ORDERS table in the
target database. Simply use the INSERT + remote SELECT statement to do this:
INSERT INTO SchemaB.OrdersB
(SELECT * FROM RDBaaa.SchemaA.OrdersA a)
With
the relatively large amount of data that needs to be inserted, we should
consider the use of parallelism to speed up the process. To do this we need to
have multiple concurrent insert operations being fed by multiple remote
queries.
Given
that INSERT operation is not parallel enabled, we’ll have to issue separate
INSERT + SELECT statements via multiple jobs on the target side. By running the
jobs concurrently, more work will be done in the same unit of time, thus
increasing throughput. The result – assuming
enough computing and I/O resources are available – will be a faster
completion time.
Remember, with parallelism we are trading resources for time;
as in more resources for less time.
To ensure
correct results, we need to specify which rows should be identified and
selected on the source FOR EACH QUERY. In a perfect world, each query would access
and select the same amount of data allowing for an even distribution of work
across the jobs.
In our example, we can use year as the distribution mechanism.
Given three years of data, it’s relatively easy to use a parallel degree of
three (i.e. 3 concurrent jobs). We could just as easily used 12 (local
selection on month) or 36 (local selection on year and month).
To pull this off, we need each of the three
queries to target a specific year. This is accomplished by specifying a WHERE
clause for each SELECT statement. This piece is important if you want to avoid
duplicate data being inserted.
There are many techniques to initiate and coordinate the concurrent jobs. One of the more simple ones is to use a CL program to systematically submit the three jobs, where by each job runs a CL program containing the new RUNSQL command that specifies an INSERT + remote SELECT statement. This will result in three separate connections to the remote database, each running a query.
Acknowledging
the risk of a “slippery slope”, this basic approach to initiate and coordinate parallelism
can be expanded to include sophisticated logic to determine the appropriate degree,
build up the corresponding SQL statement(s) and start the correct number of
concurrently running jobs or threads. Of course, before long you’re right in
the realm of real ETL tools and thus better off acquiring one instead of
building one.
To complete our example, we can optionally
initiate two more jobs that will pick up any orders that lie outside of the
three years we were told about (better safe than sorry).
__________
What if you
don’t have a “year” column?
Use a date
and extract the year: …WHERE YEAR(date_column) = 2011
What if you
don’t have a real date column?
Hmmm, it
might be time to re-evaluate your data model and consider some enhancements, some
modernization, and some re-engineering. Just saying.
So, there
you have it – simple parallel ETL via the latest DB2 for i features and
functions.
If you need assistance with getting the most out
of your database please let us know. We’re here to help.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.