Datawarehousing

What exactly is a Data Warehousing?

Data Warehousing is a means of getting one or many disparate Data sources into a central Database in a cleansed, granular and uniform format, which allows for rapid reporting of key information to benefit the business. There are two key areas to consider:Data Cleansing and ETL.

Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata

    Advantages of data warehousing
  • A data warehouse provides a common data model for all data of interest
    regardless of the data's source. This makes it easier to report and analyze
    information than it would be if multiple data models were used to retrieve
    information such as sales invoices, order receipts, general ledger charges,
    etc.
  • Prior to loading data into the data warehouse, inconsistencies are
    identified and resolved. This greatly simplifies reporting and analysis.
  • Information in the data warehouse is under the control of data warehouse
    users so that, even if the source system data is purged over time, the
    information in the warehouse can be stored safely for extended periods of
    time.
  • Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down
    operational systems.
  • Data warehouses can work in conjunction with and, hence, enhance the value
    of operational business applications, notably customer relationship management (CRM) systems.
  • Data warehouses facilitate decision support system applications such as
    trend reports (e.g., the items with the most sales in a particular area
    within the last two years), exception reports, and reports that show actual
    performance versus goals.
  • Disadvantages of data warehouses

  • Data warehouses are not the optimal environment for unstructured data.
  • Because data must be extracted, transformed and loaded into the warehouse,
    there is an element of latency in data warehouse data.
  • Over their life, data warehouses can have high costs. The data warehouse
    is usually not static. Maintenance costs are high.
  • Data warehouses can get outdated relatively quickly. There is a cost of
    delivering suboptimal information to the organization.
  • There is often a fine line between data warehouses and operational
    systems. Duplicate, expensive functionality may be developed. Or,
    functionality may be developed in the data warehouse that, in retrospect,
    should have been developed in the operational systems and vice versa.


Data Cleansing

It’s one thing getting data into a central system, its quite another to get good data. The old adage “rubbish in , rubbish out”applys.

An important stage for any warehouse is getting clean and uniform data. Overlooked by many, it is vital that you have good data. A review of disparate source systems will almost always produce anomalies. These should be identified and a set of rules defined for the treatment of these. These rules can then be implemented and the data cleaned during the ETL process.

Extract Transform & Load

ETL is concerned with the implementation of an automated means of extracting, transforming and loading the source data into target tables. Again there are a wide range of ETL tools available for this purpose. ETL provides the means of treating the data