Friday, June 1, 2012

Connecting a few Dots on Recent Enhancements

In Kent's entry highlighting the recent DB2 for i enhancements delivered via 7.1 TR4 and the latest DB group PTF he mentions the new RUNSQL command as well as the new INSERT + remote SELECT statement. Today I would like to connect the dots and illuminate a relatively easy way to accomplish some simple ETL processing.

ETL stands for "extract / transform / load", and it is typically used in the context of grabbing data from one database and loading it into another database in a BI environment. Quite often some transformation and/or mapping occurs between the source and target data elements. The transformation can be very complex and/or based on all kinds of business rules. The mapping can be all over board, or simple and straightforward; such as casting a numeric column value to a date column value, or creating descriptive values from existing codes.

Most realistic ETL scenarios have the source database and target database residing on different systems. This means there must be a database connection to both the source and the target system, as well as transportation of data from one system to the other. I usually refer to ETL as "ETTL", meaning: extract / transport / transform / load.

To be sure, there are a number of very capable ETL tools available on the market today. These tools will virtually meet all requirements for identifying, capturing, transporting, transform and loading data; and do it with high performance and scalability. To meet the requirements, these tools sport very robust techniques driven by sophisticated application engines.

But what if you need to do some simple ETL, or your list of requirements can be satisfied by single, standard SQL statements such as INSERT and SELECT?

The ability to perform INSERT + SELECT in a single statement has been available for a long time. This works great if both the source and the target tables are in the same system (i.e. same local database).

INSERT INTO Schema1.Table1 (SELECT * FROM Schema2.Table2);

But what if Schema2.Table2 resides on a different system?

To handle a request like this, we can turn to a couple DB2 for i 7.1 features.  Specifically, three part naming, and the INSERT + remote SELECT statement.

Three part naming allows the user to specify not only the schema and table name, but also the name of the relational database. And the relational database can be on a remote system. The local or remote connection to the specified RDB will be handled auto-magically by DB2.

SELECT * FROM RDB2.Schema2.Table2;

When this technique is used in the special case of INSERT + SELECT, DB2 for i will make a connection to the local 7.1 database for the INSERT, and make a connection to the remote database for the SELECT (i.e. query). Sweet!

Please note: this capability is only available when the INSERT statement is issued on the 7.1 system. The target of the SELECT can be any database you have configured through your relational database directory - this includes IBM i 6.1 and V5R4 targets, as well other DB2 family members.

Now we have the basis for the simple ETL.  And given the power of a single SQL query, processing of local selection, joins, grouping, and transformation can be included. Do keep in mind that all the query processing will be accomplished on the source side of the equation, not the target.  Proper query tuning is a critical success factor, as is adequate network speed and bandwidth to fetch the results from the source.

For returning and inserting very large data sets, it will be advantageous to design and build your own parallel processing. This is where the new RUNSQL command can help.

The RUNSQL command can be embedded in a CL program.  Multiple CL programs can be run simultaneously to allow concurrent INSERT + remote SELECT statements to read and write data in parallel through multiple connections.  A main driver CL program can submit and coordinate multiple jobs, each running their own respective SQL statements. Be sure to have each SELECT statement target a specific set of rows on the source side. In other words, each SQL statement will need to specify local selection predicate(s) (i.e. WHERE clause) that references a set of rows separate and distinct from the other statements. This will avoid extracting and loading duplicate data - unless that's your goal.

Connecting Yet One More Dot

In DB2 for i 6.1 there is support for database maintained timestamp columns (i.e. row change timestamp). This means that DB2 will set and modify the timestamp column during INSERT and/or UPDATE operations. This column can also be hidden from SELECT * statements.

If the timestamp column and the associated attributes are employed, the simple ETL process can make use of them. The remote SELECT can target rows that meet only certain date time criteria. For example, only select rows where the date portion of the row change timestamp value is equal to the current date. DB2 does all the heavy lifting of maintaining and using the timestamps.

Moving up and getting current with DB2 for i will bring many advantages to your environment through new data-centric techniques. If you need assistance with migrating to IBM i 7.1 and/or taking advantage of DB2 for i capabilities, please let us know.

No comments:

Post a Comment