Data Warehousing for Business Intelligence

Abstract
This report describes a suitable methodology for the design, construction and testing of a commercial business intelligence data warehouse (BIDW) project. The reports assesses the changes that have occurred since the start of data warehousing and their impact on the hardware and software that would be employed for the project over the last 40 years.
Introduction
A data warehouse is defined as a subject-oriented, incorporated, time-variant and non-volatile assembly of data in support of an organization’s management decision making process. Thus, data warehouse as a business intelligence concept is currently widespread through the deployment of data warehousing in organizations (Simon & Shaffer 2001, p.12). Business intelligence refers to the information available for the organization for its decision-making. As such, a data warehousing system refers to the backend or infrastructural component for attaining business intelligence. Producing a commercial Business Intelligence Data Warehouse (BIDW) is indeed a complex task, necessitating application of a less complex yet relevant methodology (Taniar 2009, p.8).
Until the 1960s, early computer applications were built devoid of an overt information systems development methodology. This is mainly because systems developers were only technically trained programmers. There was no prerequisite for good communication skills that would enable the programmers and end users to deliberate on the needs or features of any new application (Reeves 2009, p.48). The result was often production of database applications that did not quite meet user requirements. In the absence of a formal development framework, it was typically a trial and error sort of project timetabling. However, this would soon be untenable with the growing use of computers together with increasing demand for better control mechanisms by management (Hughes 2008, p.32). Consequently, methodologies were devised for the development and better control of computer-based information systems.
The idea of data warehousing has its roots in the late 1980s with the activities of IBM researchers Barry Devlin and Paul Murphy who developed the first “business data warehouse” (Taniar 2009, p.36). In principle, the data warehousing concept was meant to deliver an architectural framework for the movement of data from the source operational systems to final decision support environments. The IBM data warehouse tried to address a number of issues related to this flow, especially high costs. Due to the lack of a data warehousing architecture, a massive amount of redundancy was needed to support the multiple decision-support environments (Simon & Shaffer 2001, p.45). In big corporations it was commonplace to find multiple decision support environments that operated independently. While each environment served particular users, they usually needed much of the same stored data. In this regard, whole process of gathering, sorting and incorporating data from different sources, often from long-term prevailing operational systems (legacy systems), was normally in part replicated for each of the environment for use by both management and other business professionals in their activities of data mining and online analytical processing as well as market research and decision support (Reeves 2009, p.67). Furthermore, the operational systems were regularly reexamined with the continued emergence of new decision support requirements. Typically, new requirements made it necessary to gather, clean and integrate new data from the “data marts” which were customized for ready access by the users. Nonetheless, the mechanism of retrieving and analyzing the data, extracting, transforming and loading data as well as managing the data dictionary were also regarded essential constituents of a data warehousing system (Whitehorn & Whitehorn 1999, p.28).
The major developments during the early years of data warehousing included the following:
1960s — in a combined research project, General Mills and Dartmouth College, create the terms dimensions and facts.
1970s — ACNielsen and IRI offer dimensional data warehouse for retail sales.
1970s — Bill Inmon commences defining and discussing the term “Data Warehouse”
1975 — Sperry Univac presents a database management and reporting system called MAPPER (MAintain, Prepare, and Produce Executive Reports), which included the world’s first 4GL. It introduced the first platform tailored for constructing Information Centers, which was in it’s a forerunner in modern Enterprise Data Warehousing platforms).
1983 — Teradata presents a database management system tailored for decision support.
1983 — Sperry Corporation Martyn Richard Jones outlines the Sperry Information Center approach, which while not a true DW from the the Inmon perspective, it had most of the features of DW structures and process previously by Inmon, and later on by Devlin. The TSB England & Wales used it first.
1984 — Metaphor Computer Systems, established by David Liddle and Don Massaro, produces Data Interpretation System (DIS), a hardware and software package and GUI designed for business users for developing a database management along with analytic system (Gonzales 2003, p.86).
1988 — Barry Devlin and Paul Murphy jointly publish an article titled An architecture for a business and information system in the IBM Systems Journal, introducing the would-be popular term “business data warehouse”(Reeves 2009, p.84).
1990 — Red Brick Systems, established by Ralph Kimball, presents Red Brick Warehouse, which was a database management system definitely for data warehousing.
1991 — Prism Solutions, established by Bill Inmon, presents Prism Warehouse Manager, a software for creating a data warehouse.
1992 — Bill Inmon publishes a book titled Building the Data Warehouse.
1995 — A for-organization, The Data Warehousing Institute, is founded to promote data warehousing (Teorey 2011, p.57).
1996 — Ralph Kimball publishes a book titled The Data Warehouse Toolkit.
2000 — Daniel Linstedt launches the Data Vault, which enables real time auditing of Data Warehouses.
This translates that there are two main approaches to store data in a data warehouse: (i) the dimensional approach; and (ii) the normalized approach. The first, the dimensional approach supported by the “Kimballites”, follows Ralph Kimball’s methodology that models data warehouse using a dimensional model/star schema (Whitehorn & Whitehorn 1999, p.32). The second, the normalized approach, also known as the 3NF model, supported by “Inmonites”, follow Bill Inmon’s methodology that models the data warehouse using an E-R model/normalized model (Castellanos 2011, p.66). One key differentiating element between the two types of system is the fact that data warehouses are often not in third normal form (3NF), which is a type of data normalization mainly present in OLTP environments. The illustration bellow shows the differences between data warehouse system and OLTP.

Figure 1 – Data warehouses and OLTP systems

(Whitehorn & Whitehorn 1999, p.89)
The differences between typical data warehouses and OLTP systems are exhibited in the workload, data modifications, schema design, normal operations, and historical data.
An expanded composition of data warehousing entails business intelligence tools, extraction tools, transformation and loading data into repository, together with tools for managing and retrieving metadata (Prabhu 2004, p.89).
For better management of projects, methodologies were established for the development of computer based information systems. A methodology helps understand the problem domain of a given system as well as in developing a solution to the problem (Ponniah 2013, p.54). In general, therefore, methodologies often include several models together with a set of guidelines pertaining to the model usage. The models are meant to provide a formal understanding of the problem domain in question in need of a solution (Heaton 2012, p.30).
Successful implementation of a data warehouse project requires it to be divided into three key areas:
1. Tools:
The selecting of business intelligence tools and the data warehousing team encompass the typical development for a data warehousing project, beginning with requirement gathering to query optimization and finally production rollout (Nagabhushana 2006, p.40). The tools included are:
1a. Database, Hardware
During selection of the database/hardware platform, a number of items need to be fully considered e.g. scalability. For instance, the development team needs to consider how the system can be developed to match the growing data storage needs. Also, the developers need to consider a parallel processing support such as the fading out of supercomputers with a single CPU which are being replaced with more powerful computers using multiple CPUs, in which each processor can execute a part of the task simultaneously (Reeves 2009, p.78).
Some of the popular Relational Databases include: Oracle, Microsoft SQL Server, IBM DB2, Teradata, Sybase, MySQL. On the other hand popular OS Platforms include Linux, FreeBSD, and Microsoft.
1b. ETL (Extraction, Transformation, and Loading)
ETL tool selection determination mainly depends on three factors:
i) Complexity of data transformation – The more sophisticated the data transformation is, the greater the need to purchase an ETL tool.
ii) Data cleansing needs – consideration as to whether the data requires a thorough cleansing exercise prior to storage in the data warehouse (Kimball et al. 2011, p.104). This need will necessitate the purchasing of a tool that has strong data cleansing functionalities. Alternatively, the ETL routine should be built from scratch.
iii) Data volume – commercially available tools often have features with the capability to speed up data movement. As such, purchasing a commercial product would better where the volume of data to be transferred is large.
So while selecting a database together with a hardware platform is mandatory, selecting an ETL (third-party tool) tool is highly recommendable, though not a must. In this regard, it would be a good idea to assess ETL tools to figure out the attributes most suitable. For instance, in metadata support, the ETL functions by mapping out the source data to its destination, which is a crucial element of the metadata (Mohanty 2008, p.64). This is why it is important to go for an ETL tool with the ability to work well with the overall metadata strategy.
Popular Tools
– IBM WebSphere Information Integration (Ascential DataStage)
– Ab Initio
-Informatica
-Talend
1c. OLAP
OLAP tools serve to slice and dice of the data. Therefore, they need a strong metadata layer and front-end flexibility. These are often rare features for any home-built systems to attain. Comparatively, OLTP systems are the best for meeting the operational data needs of an organization. Nonetheless, they are not best suited to support decision-support queries and business questions which managers usually need to address (Whitehorn & Whitehorn 1999, p.45). It is on this ground that the project team employs OLAP for analytics as well as dill down of data.
Two main types of OLAP tools exist: ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP). MOLAP is collected from the relational data source. When a user creates a report application, the MOLAP tool generates and then creates it quickly due to the ready availability of all relevant data which has been pre-aggregated within the cube (Whitehorn & Whitehorn 1999, p.78). Conversely, in Rolap, the engine basically serves as a smart SQL generator, in which the data warehouse administrator specifies the relationship between the relational tables and how dimensions, attributes as well as hierarchies map to the primary database tables.
Popular Tools
-Business Objects
-Cognos
-Hyperion
-Microsoft Analysis Services
-MicroStrategy
-Pentaho
-Palo OLAP Server

1d. Reporting
They offer some flexibility in the sense of the capability for each user to undertake creation, scheduling, and running of their own reports (Rainardi 2008, p.73). The most popular applications for reporting tool are business operations reporting and dashboard. There are many reporting requirements, and thus the decision by a business either purchase or create a reporting tool for its business intelligence is significantly informed by the type of requirements (Wrembel & Koncilia 2007, p.75). Ordinarily, it depends on the function it will be required for. Consequently, the presentation layer is very important because many OLAP vendors are often already supplied with a front-end presentation layer which enables users to sanction pre-defined reports or build ad hoc reports (Mohanty 2008, p.51). Nonetheless, it is crucial when assessing reporting tools, e.g. security features, allowing users access what they are meant to access only. Also customization, an easy way of pre-setting the reports to resemble the corporate standard, renders the entire analysts tasks much easier besides saving time substantially (Jarke 2003, p.62).
Popular Tools
-SAP Business Objects
– MicroStrategy
-IBM Cognos
-Actuate
-Jaspersoft
-Pentaho

1e. Metadata
Metadata serves a highly significant fuction in the implementation of a successful data warehousing. However, this does not necessarily translate that a tool is required to carry “data about data” (Mohanty 2008, p.23). Although having a formidable metadata foundation is indeed one of the success factors of a data warehousing project. Hence, while a metadata tool may not be opted for at the start of the project, it is important to have a metadata strategy in place addressing the storage of the metadata in the data warehousing system (Mohanty 2008, p.91).
1f. Excel
Excel is a relatively cheap, regularly used tool users employ to display data and for such business operations as reporting and goal tracking.
1g. Data mining tool
Data mining tools are often only utilize by very specialized users. In organizations, even large ones, data mining tools are often used by a handful of users in finding relationships among different factors (Mohanty 2008, p.85).

Bibliography
CASTELLANOS, M. (2011). Enabling real-time business intelligence: 4th international workshop, BIRTE 2010, held at the 36th International Conference on Very Large Databases, VLDB 2010, Singapore September 13, 2010 ; revised selected papers. Berlin u.a, Springer.
GONZALES, M. L. (2003). IBM data warehousing with IBM business intelligence tools. [Indianapolis], Wiley.
HEATON, J. (2012). Business intelligence cookbook a project lifecycle approach using Oracle technology. Birmingham, Packt Pub.
HUGHES, R. (2008). Agile data warehousing: delivering world-class business intelligence systems using Scrum and XP. New York, iUniverse.
JARKE, M. (2003). Fundamentals of data warehouses. Berlin [u.a.], Springer.
KIMBALL, R., ROSS, M., THORNTHWAITE, W., MUNDY, J., & BECKER, B. (2011). The Data Warehouse Lifecycle Toolkit. Hoboken, John Wiley & Sons.
MOHANTY, S. (2008). Data warehousing: design, development and best practices. New Delhi, Tata McGraw-Hill Pub. Co.
MOHANTY, S. (2008). Data warehousing: design, development and best practices. New Delhi, Tata McGraw-Hill Pub. Co.
NAGABHUSHANA, S. (2006). Data warehousing: OLAP and data mining. New Delhi, India, New Age International.
PONNIAH, P. (2013). Data warehousing fundamentals for it professionals. Hoboken, N.J., Wiley.
PRABHU, C. (2004). Data warehousing: concepts, techniques, products and applications. New Delhi, Prentice Hall of India.
RAINARDI, V. (2008). Building a data warehouse with examples in SQL Server. Berkeley, CA, Apress.
REEVES, L. L. (2009). A manager’s guide to data warehousing. Indianapolis, IN, Wiley Pub.
SIMON, A. R., & SHAFFER, S. L. (2001). Data warehousing and business intelligence for e-commerce. San Francisco, Morgan Kaufmann.
TANIAR, D. (2009). Progressive methods in data warehousing and business intelligence: concepts and competitive analytics. Hershey, PA, Information Science Reference.
TEOREY, T. J. (2011). Database modeling and design logical design. Amsterdam, Morgan Kaufmann Publishers.
WHITEHORN, M., & WHITEHORN, M. (1999). Business intelligence: the IBM solution ; [Datawarehousing and OLAP]. London [u.a.], Springer.
WREMBEL, R., & KONCILIA, C. (2007). Data warehouses and OLAP: concepts, architectures, and solutions. Hershey, IRM Press.

Latest Assignments