
The following column highlights the new Data Warehousing tools, and explains
what a Data Warehouse is, how it is different from and interacts with a
LIMS, and what it takes to build a Data Warehouse.
The use of laboratory data for organizational decision making will soon
become common practice at many companies due to the latest crop of business
intelligence tools that fall under the Data Warehousing banner. Data Warehousing
tools take data from a variety of sources, clean it, integrate it, store
it in a single repository, and allow senior finance, business analysts,
and field sales staff to access and analyze the data to make strategic decisions.
What is a Data Warehouse?
A Data Warehouse is a physically separate store of data transformed from
the application data found in operational environments. In almost every
case, the data entering the Data Warehouse comes from an operational database.
What makes a Data Warehouse different is that it is subject-oriented, integrated,
time-variant, and nonvolatile. Data Warehouses should not, however, be confused
with operational databases.
Whereas the laboratory information in the LIMS, or operational environment
of the lab, consists of samples, tests, and results, the Data Warehouse
is organized around subjects, such as customer, vendor, product, and activity.
This difference is particularly apparent when examining the content of the
data.
Another key difference is that data within the Data Warehouse is, without
exception, integrated. This integration shows up in many different ways
-- in consistent naming conventions, in consistent measurement of variables,
in consistent encoding instructions, in consistent physical attributes of
data, etc.
All data in the Data Warehouse is accurate as of some moment in time, not
as of the moment of access, and may not be updated once recorded, thus it
is time variant.
The final defining characteristic of a Data Warehouse is that it is nonvolatile,
in other words, there are only two kinds of operations that occur: the initial
loading of data and the access of data. Inserts, deletions, and changes
-- normal activities in the operational environment -- do not occur in a
Data Warehouse.
Because the source of nearly all Data Warehouse data is the operational
environment, it is tempting to think that there is massive redundancy of
data between the two environments. This is not the case, since only that
data which is needed for organizational decision making and strategic support
is placed into the Data Warehouse. In addition, data in the operational
environment is very fresh, whereas data in the warehouse is much older and
contains summary data that is never found in the operational environment.
Operational versus Strategic
The Data Warehouse takes the operational LIMS data and examines it against
strategic criteria. This includes both project analysis in the form of studies,
samples, and results, and product analysis by lot, batch, and results. Such
in-depth data manipulation and examination have not typically been handled
by the LIMS in any case, but by third-party software. Data Warehouse tools,
however, go beyond the typical spreadsheet or database tools because they
can handle complex queries, not just simple queries.
An operational database can process most simple queries. However, complex
queries running against that operational database frequently present problems
in terms of performance. When complex queries must be supported, organizations
have a greater incentive to offload operational data to a Data Warehouse.
Building a Data Warehouse
Building a Data Warehouse involves extracting data from various operational
databases and populating a specialized Data Warehouse database which users
can then access without impacting the operational systems. An extensive
effort is required to select, map and transform the data that goes into
the warehouse, as well as a powerful front-end tool that allows users to
easily retrieve and analyze the newly available information.
The first step is to access the appropriate operational data from the variety
of legacy databases that exist across the corporation's diverse computing
environments. Once accessed, extraction tools transform the data into a
consistent, integrated form. This process involves cleaning, reconciling,
de-normalizing, and summarizing data, and then loading the data into logical
views that can be surfaced into a variety of analytical and reporting applications.
Once this data is in the warehouse, business analysts can use business intelligence
tools to exploit the data for effective decision-making.
Many organizations opt to set up a framework that provides an enterprise-wide
solution using components such as extraction/transformation tools, a relational
database, data access and reporting tools, OLAP EIS (Executive Information
System) tools, the Internet, data mining tools, and development tools and
utilities. When the Data Warehouse becomes too complex, there are middleware
products which help to make the Data Warehouse easy to use and manage.
Harness Enterprise-Wide Data
Building powerful and flexible data warehousing solutions is a key element
in harnessing the power of enterprise-wide data. Indeed, Data Warehouses
represent the next step in the evolution of decision support tools and herald
a move away from off-the-shelf database software toward client/server applications
that can accommodate the specific requirements of each organization.
There are several trends driving and affecting Data Warehouse capabilities.
Database size will naturally increase. More sophisticated data analysis
tools -- such as OLAP, data visualization, and data mining -- will gain
greater market acceptance and use. Internet web access to the Data Warehouse
will become standard. Analysts predict a proliferation of Data Warehouses
across the enterprise, effectively producing a data mart within the organization.
In addition, there will be an increasing use of metadata and non-textual
data.
There are a number of reasons why implementing Data Warehouse solutions
makes good business sense. When there is a mixture of proprietary systems
and legacy systems in an organization, such a solution can resolve access
to disparate data. By organizing that data in a single repository, better
forecasting and sales analysis can be performed. Knowledge of corporate
strengths and deficiencies can also lead to cost savings and productivity
gains. The lab will naturally be part of this information access revolution,
particularly for businesses whose products are dependent upon laboratory
data.