The T of ETL: Transform
Data has always been the hero! Not just the hero or the heroine, I'd say; everything revolves around data. Agree? Data has been generated from time immemorial, and storing it is nothing new either. Data is stored in books, receipts, papers, ledgers, databases, data warehouses, and many other forms and storage devices. But until recent years, this stored data stayed behind the curtains; it did not come into the limelight.
Now think of your city library; there are umpteen books. The librarian has all the data regarding them: the author, the publisher, the day each book was brought to the library, the records of everyone who has borrowed it, and so on. Every day, many books are borrowed and many are returned, so every day many rows of data are added and perhaps modified too. Month on month, year on year, the database keeps growing.
One day, you (an avid book reader) ask the librarian for the list of all the books you have borrowed. What is to be done? Filter by your name. You also ask for the books you borrowed month-wise for last year, but the dates are stored in mm/dd/yyyy format, so she needs a separate month and year for each record. The need to add new columns arises: month and year, the classic derived columns. These are just trivial examples of how stored data can be put to use. The ultimate purpose of transforming stored data is to make it more meaningful and easier to make sense of.
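To picture those derived columns in code, here is a minimal pandas sketch. The DataFrame, its column names, and the sample rows are all made up for illustration; the same idea applies to whatever the library's real schema looks like.

```python
import pandas as pd

# Made-up loan records; every column name here is an assumption.
loans = pd.DataFrame({
    "borrower": ["You", "John Doe", "You"],
    "book": ["Dracula", "It", "The Shining"],
    "issue_date": ["01/15/2024", "03/02/2024", "11/20/2024"],  # mm/dd/yyyy strings
})

# Parse the mm/dd/yyyy strings into proper dates.
loans["issue_date"] = pd.to_datetime(loans["issue_date"], format="%m/%d/%Y")

# Derived columns: month and year pulled out of the date.
loans["month"] = loans["issue_date"].dt.month
loans["year"] = loans["issue_date"].dt.year

# Filter by your name and list the borrowed books month-wise.
yours = loans[loans["borrower"] == "You"]
print(yours.groupby(["year", "month"])["book"].apply(list))
```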
So What is Transformation?
Data transformation is the process of changing the format, structure, or values of data. In practice, it covers the cleansing, reshaping, and aggregation that may need to happen before the data is ready for analysis.
Data from source systems is raw: it may or may not be clean, and it might have discrepancies or null values. These deficiencies should be resolved before the data can be used for analysis, experimentation, or prediction. For analytics, the transformation can happen at two different points:
- ETL (Extract, Transform, Load): the transformation is the middle step. All the source data is first extracted into a staging area; after staging, the data is transformed and then loaded into a data warehouse.
- ELT (Extract, Load, Transform): the raw data is loaded into a data lake first and transformed later, at query time or whenever data scientists want to tailor it. A minimal sketch of the ETL flavour follows below.
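To make the ETL flavour concrete, here is a minimal sketch. It assumes the source is a small in-memory table, the staging area is a local CSV file, and SQLite stands in for the warehouse; all of those are stand-ins chosen for illustration, not a recommended stack.

```python
import sqlite3
import pandas as pd

# Extract: in a real pipeline this would come from a source system;
# a small in-memory frame stands in for the raw extract here.
raw = pd.DataFrame({
    "borrower": ["John Doe", "John Doe", None],
    "book": ["Dracula", "Dracula", "It"],
    "issue_date": ["01/15/2024", "01/15/2024", "03/02/2024"],
})

# Stage: persist an untouched copy of the extract (file name is made up).
raw.to_csv("staging_loans.csv", index=False)

# Transform: clean and reshape the staged data before loading.
staged = pd.read_csv("staging_loans.csv")
transformed = staged.dropna(subset=["borrower"]).drop_duplicates()
transformed["issue_date"] = pd.to_datetime(transformed["issue_date"], format="%m/%d/%Y")

# Load: write the transformed data into the "warehouse" (SQLite stands in for it).
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("loans", conn, if_exists="replace", index=False)
```

In the ELT variant, the raw rows would be loaded first and the same cleanup would happen later, typically as SQL run inside the lake or warehouse at query time.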
Other processes like data migration, wrangling, warehousing, and integration may also involve data transformation. The end goal is always the same: to make the data understandable, useful, and ready for the limelight!
What are the steps in data transformation?
Step 1: Data Discovery. Understanding the data in its native form is the first step of the process. This is where data engineers and analysts profile the data and explore the hidden gems of information in it.
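In practice, this first look often starts with a few profiling calls. A tiny sketch, again on a made-up loans table, might be nothing more than:

```python
import pandas as pd

# A made-up slice of the library's loan records.
loans = pd.DataFrame({
    "borrower": ["John Doe", None, "Jane Roe"],
    "book": ["Dracula", "It", "Emma"],
    "issue_date": ["01/15/2024", "03/02/2024", None],
})

print(loans.head())                    # what do the rows actually look like?
print(loans.dtypes)                    # which types did we really get?
print(loans.isna().sum())              # how many nulls per column?
print(loans.describe(include="all"))   # quick summary statistics
```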
Step 2: Data Mapping. This is a crucial part of the data transformation process: mapping the fields from one database to another. Any oversight in this step can make the entire data transformation effort go in vain.
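One lightweight way to record such a mapping is a plain dictionary from source field names to target field names. The cryptic source names below are invented for illustration; a real mapping document would also track type conversions and any unmapped fields.

```python
import pandas as pd

# Raw extract with the source system's cryptic column names (invented).
source_df = pd.DataFrame({
    "brwr_nm": ["John Doe"],
    "bk_title": ["Dracula"],
    "iss_dt": ["01/15/2024"],
})

# The mapping itself: source field -> target field.
field_map = {
    "brwr_nm": "borrower_name",
    "bk_title": "book_title",
    "iss_dt": "issue_date",
}

# Applying the mapping is then just a rename on the extracted frame.
target_df = source_df.rename(columns=field_map)
```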
Step 3: Code Generation. The code that performs the transformation is produced in this step. Data engineers can write the scripts by hand, or organizations can use tools that generate them.
Step 4: Code Execution. As the name suggests, the generated code is executed to run the transformation tasks.
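Whether the code is handwritten or produced by a tool, Steps 3 and 4 usually boil down to a transformation you can call as a single unit. A hand-rolled sketch, with column names and rules that are assumptions rather than a standard recipe, could look like this:

```python
import pandas as pd

def transform(loans: pd.DataFrame) -> pd.DataFrame:
    """A hand-written transformation: deduplicate, clean, standardise, derive."""
    out = loans.copy()
    out = out.drop_duplicates()                      # remove duplicate records
    out = out.dropna(subset=["borrower_name"])       # basic cleaning
    out["issue_date"] = pd.to_datetime(              # standardise the date format
        out["issue_date"], format="%m/%d/%Y"
    )
    out["issue_year"] = out["issue_date"].dt.year    # derived column
    return out

# Step 4 is simply executing that code against the staged data.
staged = pd.DataFrame({
    "borrower_name": ["John Doe", "John Doe", None],
    "issue_date": ["01/15/2024", "01/15/2024", "03/02/2024"],
})
clean = transform(staged)
```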
Step 5: Verification. The transformed output is checked for accuracy, either through scripts or through automated tools that validate the correctness of the transformed data.
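Continuing from the sketch above, verification can be as simple as a handful of assertions over the transformed output; many teams use dedicated data-quality tools for the same kind of checks.

```python
# Minimal checks over the `clean` DataFrame produced by the previous sketch.
assert clean["borrower_name"].notna().all(), "null borrowers slipped through"
assert not clean.duplicated().any(), "duplicate rows slipped through"
assert str(clean["issue_date"].dtype) == "datetime64[ns]", "dates were not parsed"
assert clean["issue_year"].between(1900, 2100).all(), "implausible issue years"
print("verification passed")
```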
Along with these steps, data may also undergo filtering, column splits, merging, enrichment, and the removal of unneeded records. It is important to note that not all data can be transformed, and not all data needs to be: if the data coming from a source system is already usable, it can be consumed as is.
What are the data transformation types?
Transformation is what brings out the hero and all the other characters hiding in the data, but how? The common types are listed below, and a small pandas sketch after the list shows several of them in action.
- Remove duplicates: Duplicate records are removed. Let's say there are two librarians; one adds a book to the catalogue without checking for an existing record, and now there are two records for the same book. That's not good, is it? Such records must be identified and removed, keeping only one copy.
- Cleaning: Handling nulls, NAs, and other messy values falls under cleaning. For example, some old books might not have a published date or a date of issue, so those values would be null and need to be handled; you could either keep them as is or fill them with a sentinel old date.
- Data Standardisation: Making date formats consistent, changing "Male" to "M", removing prefixes such as Ms/Mrs, converting units of measurement, converting character sets, and so on.
- Derivation/Attribute Construction: Creating new columns or calculated values from existing data. One example is deriving a year column from the date of issue; another is a "number of days since last issued" column, computed as the difference between that date and today's date.
- Filtering: Selecting only the required rows and columns for analysis. If you are like me, you are only interested in horror stories, so you can ask the librarian to filter by horror and give you that list. In an ETL process, you might filter by certain years and load only those into the data warehouse so you can comfortably analyze just that slice of data.
- Splitting: Dividing a single column into multiple columns. Suppose the head of the library decides that borrowers' names should be split; their full-name column is then divided into first-name and last-name columns.
- Merging: Combining records that belong together. Let's say John Doe borrows a book and returns it. After a considerable time, he comes back to borrow a new book and says he is new, so the library database now has two records for him. Not cool! These records have to be merged. Merging is not limited to duplicate member records; it also covers combining data from different tables, databases, and sources.
- Aggregation: Consolidating raw data to get aggregates like sums, averages, and counts; aggregations over time and space are possible as well. The librarian can get the average reading time from the number of days between the issue date and the return date.
- Summarization: Storing totals at different levels of detail. The librarian might need the total number of books borrowed and returned, the number of available books on mental health, and so on.
- Validation: Checking the data for correct and clean values, with more complex rules added as the data demands. A script can validate the correctness of the output, and every transformation listed above can be checked and tested this way.
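And here is the promised sketch: one small pandas script that runs through most of the list on a made-up loans table. Every column name, value, and rule in it is an assumption chosen for illustration, not a prescribed schema.

```python
import pandas as pd

# A made-up loans table; all names and values are assumptions.
loans = pd.DataFrame({
    "borrower": ["John Doe", "John Doe", "Jane Roe", "john doe"],
    "genre": ["Horror", "Horror", "Romance", "Horror"],
    "sex": ["Male", "Male", "Female", "Male"],
    "issue_date": ["01/15/2024", "01/15/2024", "02/01/2024", "03/02/2024"],
    "return_date": ["01/25/2024", "01/25/2024", None, "03/12/2024"],
})

# Remove duplicates: keep a single copy of repeated records.
loans = loans.drop_duplicates()

# Cleaning and standardisation: consistent dates (a missing return date stays
# as NaT, meaning the book is still out), "Male"/"Female" to "M"/"F",
# and consistent casing for names.
loans["issue_date"] = pd.to_datetime(loans["issue_date"], format="%m/%d/%Y")
loans["return_date"] = pd.to_datetime(loans["return_date"], format="%m/%d/%Y")
loans["sex"] = loans["sex"].map({"Male": "M", "Female": "F"})
loans["borrower"] = loans["borrower"].str.title()

# Derivation: new columns computed from existing ones.
loans["issue_year"] = loans["issue_date"].dt.year
loans["days_since_issue"] = (pd.Timestamp.today().normalize() - loans["issue_date"]).dt.days

# Filtering: only the horror loans, please.
horror = loans[loans["genre"] == "Horror"]

# Splitting: one full-name column into first and last names.
loans[["first_name", "last_name"]] = loans["borrower"].str.split(" ", n=1, expand=True)

# Merging: join a small member lookup so each borrower gets a single id.
members = pd.DataFrame({"borrower": ["John Doe", "Jane Roe"], "member_id": [1, 2]})
loans = loans.merge(members, on="borrower", how="left")

# Aggregation: average reading time (return date minus issue date) per borrower.
loans["reading_days"] = (loans["return_date"] - loans["issue_date"]).dt.days
avg_reading = loans.groupby("borrower")["reading_days"].mean()

# Summarization: totals at a chosen level, e.g. number of loans per genre.
loans_per_genre = loans.groupby("genre")["borrower"].count()

# Validation: simple sanity checks on the transformed data.
assert loans["member_id"].notna().all(), "borrower missing from member lookup"
assert (loans["reading_days"].dropna() >= 0).all(), "negative reading time found"
print(avg_reading, loans_per_genre, horror.shape, sep="\n")
```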
Data transformation dusts off the villains from the data and presents a hero (clean data) to work with. The other actors, such as data scientists, data analysts, business users, and operations teams, use this clean, usable, and improved data to gain insights, build charts, experiment, predict, and bring out the true value of data.