Monday, September 24, 2012


It's everywhere.

A couple of weeks ago I had the privilege of speaking to a group of UK clients visiting IBM Rochester. This is a trip they make every year or so - coming to the home of their beloved platform to learn more.

As you might expect, I was on the agenda to talk about DB2 for i and related topics. One of the topics was a simple question: What is Big Data?

I thought, this is a great question...

and here's a brief answer.

The notion of big data comes as a result of the three Vs:

  • Variety
  • Volume
  • Velocity

As in: a variety of data sources, insanely high volumes of data, all arriving at an ever increasing velocity.

The value proposition of big data is all about analyzing and extracting useful information from this tidal wave of words, sounds and images. It is much like business intelligence and the concept of data warehousing - except different.

As my colleague Tom McKinley says, "bugs don't scale". In other words, everything changes when the data gets large. Don't assume the system/solution/application behavior will be the same once the data grows to huge proportions. Consuming big data is like this too. The ways in which you gain insight and obtain valuable information have to be different when facing the three Vs.

Big data involves doing analysis and gaining insight from structured AND unstructured data.

Structured data is familiar, representing the artifacts of business transactions stored in a relational database management system or file system. Rows and columns if you will.

Unstructured data can be almost anything (call logs, web logs, email, sensor output, streaming telemetry, audio feeds, video feeds, etc.).

Unstructured data comes from anywhere and everywhere (Facebook, Twitter, Pinterest, blogs, SMS texts, smart phones, dumb phones, vehicles, roads, bridges, etc.).

Big data involves doing analysis and gaining insight from data at rest (think cars in a parking lot) and from data in motion (think cars passing a toll booth).
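To make the parking lot and toll booth picture a little more concrete, here is a minimal Python sketch. The function names and the sample car data are made up for illustration; this is not how BigInsights or Streams work internally, just the difference in shape between the two kinds of analysis.

```python
from collections import Counter
from typing import Iterable, Iterator


def count_at_rest(parked_cars: list[str]) -> Counter:
    """Data at rest: the whole data set is already stored, so count it in one pass."""
    return Counter(parked_cars)


def count_in_motion(passing_cars: Iterable[str]) -> Iterator[Counter]:
    """Data in motion: update a running tally as each record arrives."""
    running: Counter = Counter()
    for make in passing_cars:
        running[make] += 1
        # Yield a snapshot so the caller can act before the stream ends.
        yield Counter(running)


if __name__ == "__main__":
    cars = ["ford", "fiat", "ford"]  # hypothetical sample data
    print(count_at_rest(cars))
    for snapshot in count_in_motion(iter(cars)):
        print(snapshot)
```

With data at rest you get one answer after looking at everything. With data in motion you get an answer after every arrival, which is what lets you act "just in time".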

Big data requires new and special techniques that make use of large distributed systems and parallel processing at almost all levels to achieve results in a timely manner.

The technology that leads the pack for big data analysis is Hadoop. In a nutshell, the Apache Hadoop project develops open-source software for reliable, scalable and distributed computing. Hadoop is the software that enables the distributed processing of large data sets across clusters of servers. In other words, it is the infrastructure needed to handle the acquisition and analysis chores of big data.
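The idea of spreading work across many servers can be sketched in a few lines of Python. This is only an illustration of the split, process and merge pattern that Hadoop's MapReduce model is known for; it does not use any Hadoop API. The function names are hypothetical, and a local thread pool stands in for the cluster of servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines: list[str]) -> Counter:
    """'Map' step: each worker counts the words in its own slice of the data."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left: Counter, right: Counter) -> Counter:
    """'Reduce' step: combine two partial results into one."""
    return left + right


def word_count(lines: list[str], workers: int = 4) -> Counter:
    # Split the input into one slice per worker (assumption: a real cluster
    # would split by file blocks stored on different servers instead).
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = ["big data", "Big insight", "data in motion"]  # hypothetical input
    print(word_count(sample, workers=2))
```

The point of the design is that no single worker ever needs to see all of the data; each produces a small partial answer, and only those partial answers are brought together.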

IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise (for data at rest).

IBM InfoSphere Streams allows you to capture and act on all of your business data... all of the time... just in time (for data in motion).

Whether you represent a small, medium, or large enterprise, what can you learn from this big explosion of data?

It depends on whether you are ready to:

  • Identify and exploit new data sources
  • Turn all types of data into information
  • Use information for new and different insight
  • Use insight to become better

Your business has an ever expanding array of data passing through it and/or near it. The question is, are you taking full advantage? If not, it's time you did.

Don't be overwhelmed by this phenomenon known as big data. Go ahead, start small, get moving.
