Persistent Staging: benefits and costs February 25, 2008
Posted by TMVilla in BI, Business Intelligence, DW, Data Mart, Data Warehousing, Datamart, ETL, INFA, IT, Informatica, Oracle, Technical Architecture, Technology, data warehouse.trackback
How many times have we had to reload data because of business changes, bug fixes, etc? It can be quite challenging. With the proper planning and resource planning, data reloads and management can be done fairly quickly and with little grief.
A persistent staging area is the foundation that provides this functionality. When bugs are found in the load process or the business wants to change the rules, you can design a systematic approach to removing, reloading and tracking data loads as it propagates throughout the warehouse.
You will want to define a couple of parameters in which to make the process easier to manage, allocate resources and predict the ability to reload:
-
The time range needed to be available as defined by the business. This could be one year, 18 months, or five years. You will need to check the data density and storage costs as part of your feasibility study.
-
The location of the stage history. If this is to be offline, then a reload area will need to be available. Also remember that the data usually needs to be on disk to be available for reloading. A contiguous set of stage tables partitioned by year and month usually work best.
-
Architecture and design considerations that will allow for a flexible load process. Such features include the ability to “override” the load dates and systematically reload data from stage. Design the load into a batch process that is stamped, loaded and tracked by a batchid. This allows for tracking when and from what group the warehouse data is loaded.
There are several instances in which a persistent staging area is not necessary:
-
All source data is timestamped with the creation and update dates and history is available on the source systems. You will need to compare the source retention period to the business rules defined for the warehouse.
-
All source data is timestamped and available in a persistent area external to the source systems and available for consumption by the warehouse. An enterprise application integration (EAI) data store could provide information, assuming it keeps history of it’s various transactions and they are denormalized and useful to the warehouse.
-
Separation in history is desired. The user community needs to see the data in the warehouse in its original form and will not ever need to change it per new rules and the data is reviewed before the stage data is eliminated.
Comments»
No comments yet — be the first.