Imagine you have bought a new wardrobe, It is a huge one that can fit in your clothes, cosmetics, accessories and also has space for decor. Your first step could be to bring in all the things from your existing wardrobes that you want to arrange in one place (Extract). You can also clean, filter, remove old stuff before you arrange them in the wardrobe(Transform). Finally, you will arrange them, stack the items in the right cabinets, LOAD all the necessary items in the wardrobe.
Likewise, in ETL, the last step of the ETL process is the LOAD step where data that is collected from disparate sources and undergoes transformation and is stored in a database or data warehouse or datamart.
The LOAD step of ETL refers to the physical process of loading the transformed data into the target system.
What is Loading in ETL?
Loading is the last step and an equally important step in the entire ETL process as it sets the stage for more magic. The data that gets loaded into the target system is further used for analysis, used for insights, used for arriving at enterprise-level decisions. The use cases leveraging the data here is enormous and range from making dashboards to building machine learning algorithms on top of this clean loaded data.
Since the process predominantly involves large data movement, it is of utmost importance to make sure that efficient performance techniques and steps are involved in the loading process which considers the data volume, data structure and the type of load.
Some ETL teams also architect a staging database where all the transformed data is loaded first followed by a variety of validations, audits, checks on regulatory compliance and any other issues are identified and corrected.
After all the necessary checks and assurance that data has no missing values, no duplicates and only useful, clean data is available to be loaded, the data is published into the ETL Datawarehouse or database and loaded into target tables. Yet another important step here is making sure ETL testers do a thorough check on the data residing in the target system. This includes the verification of the referential integrity between dimension and fact tables to ensure that all records relating to appropriate records are in other tables. DBA should verify that every record in a fact table relates to a record in each dimension table that will be used with that fact table.
Types of Loading
Depending on the volume, usage, time of load, and performance, data engineers can deploy one of the following ways to load data,
- Initial Load: As the name says, this involves loading all the target table data for the very first time.
- Full Load: This method involves dumping the entire data on the data warehouse. It could either be the first time, where new data is loaded or it could be a result of deleting the existing data and replacing the new data. For example, Imagine a restaurant that uploads the sales data at the end of each day. If there were 10 records for Day 1, 5 records on Day 2, the full load will load 10 records on Day 1 and on Day 2, it would load the 15 records altogether again. Due to this, full load eats more time though the complexity is low.
- Incremental Load: Delta or only the difference between the state of records in source and target records are loaded through the incremental process. By appending the new records only, the time taken for the incremental process is less compared to the Full load. The limitation is the complexity and the effort needed to make sure the order of records is correct and nothing gets changed in the process else the DW would be prone to error. For the example used above, On Day 1, 10 records would be uploaded and on Day 2, only the new 5 records will be appended.
Based on the volume, velocity and demand for data, the incremental load can be either of these,
- Batch Loading: Data accumulated by the data source is loaded in “chunks” which is defined by a timeframe or windows in hours or days.
- Stream Loading: Pipelines for real-time applications like Netflix or high demand data applications use stream loading where data is loaded in real-time as it becomes available.
- Micro-batch Loading: This involves loading small chunks of recent data that are needed by some near real-time processes.
Based on time,
- Scheduled loading: Depending on the time of activity on the application, data loading tasks can be scheduled for a definite period. There are many tools that are available to make sure the jobs run as per schedule.
- On-demand loading: This type of loading is done based on a set trigger defined by the data engineer. It could be when the user requests data, or when an event occurs or data has reached a specific size.
To ensure better performance and faster loading of data, data can be partitioned and loaded into the destination parallel, this is called Parallel Loading.
Loading data plays an equally important role compared to the other steps of ETL. Without having the right data loaded into the right target tables, the right(almost) use of data cannot be demonstrated. As much as work is involved in E and T of the process, a good amount of planning, contemplation and thorough execution is needed for the complete process to be called a success!