Datawarehousing
What exactly is a Data Warehousing?
Data Warehousing is a means of getting one or many disparate Data sources into a central Database in a cleansed, granular and uniform format, which allows for rapid reporting of key information to benefit the business. There are two key areas to consider:Data Cleansing and ETL.
Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata
-
Advantages of data warehousing
- A data warehouse provides a common data model for all data of interest
regardless of the data's source. This makes it easier to report and analyze
information than it would be if multiple data models were used to retrieve
information such as sales invoices, order receipts, general ledger charges,
etc. - Prior to loading data into the data warehouse, inconsistencies are
identified and resolved. This greatly simplifies reporting and analysis. - Information in the data warehouse is under the control of data warehouse
users so that, even if the source system data is purged over time, the
information in the warehouse can be stored safely for extended periods of
time. - Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down
operational systems. - Data warehouses can work in conjunction with and, hence, enhance the value
of operational business applications, notably customer relationship management (CRM) systems. - Data warehouses facilitate decision support system applications such as
trend reports (e.g., the items with the most sales in a particular area
within the last two years), exception reports, and reports that show actual
performance versus goals. - Data warehouses are not the optimal environment for unstructured data.
- Because data must be extracted, transformed and loaded into the warehouse,
there is an element of latency in data warehouse data. - Over their life, data warehouses can have high costs. The data warehouse
is usually not static. Maintenance costs are high. - Data warehouses can get outdated relatively quickly. There is a cost of
delivering suboptimal information to the organization. - There is often a fine line between data warehouses and operational
systems. Duplicate, expensive functionality may be developed. Or,
functionality may be developed in the data warehouse that, in retrospect,
should have been developed in the operational systems and vice versa.
Disadvantages of data warehouses
Data Cleansing
It’s one thing getting data into a central system, its quite another to get good data. The old adage “rubbish in , rubbish out”applys.
An important stage for any warehouse is getting clean and uniform data. Overlooked by many, it is vital that you have good data. A review of disparate source systems will almost always produce anomalies. These should be identified and a set of rules defined for the treatment of these. These rules can then be implemented and the data cleaned during the ETL process.
Extract Transform & Load
ETL is concerned with the implementation of an automated means of extracting, transforming and loading the source data into target tables. Again there are a wide range of ETL tools available for this purpose. ETL provides the means of treating the data