The T of ETL: Transform

ETL: Extract, Transform, Load

Data has always been the hero! Not just the hero, everything revolves around data. Agree? Data has been generated since time immemorial, and storing it is nothing new either. Data is stored in books, receipts, papers, ledgers, databases, data warehouses, and plenty of other forms and devices. But until recent years, this stored data stayed behind the curtains; it never came into the limelight.

Now think of your city library; it holds umpteen books. The librarian has all the data regarding those books: the author, the publisher, the day each book was brought to the library, records of all the people who have borrowed it, and so on. Every day, a lot of books are borrowed and a lot are returned, so every day many rows of data are added and perhaps modified too. Month on month, year on year, data just keeps getting added to the database.

One day, you (an avid book reader) ask the librarian for a list of all the books you have borrowed. What is to be done? Filter by your name. You also ask for the book names month-wise for the whole of last year. But the dates are stored in mm/dd/yyyy format, while you need them grouped by month for the previous year. The need for new columns arises: month and year, the derived columns. These are just trivial examples of how stored data can be used. The ultimate purpose of transforming stored data is to make it more meaningful and sensible.

So What is Transformation?

Data transformation is the process of changing the format, structure, or values of data. It covers the cleansing and aggregation that may need to happen to prepare data for analysis.

Transforming Data

Data from source systems is raw: it may or may not be clean and might have discrepancies or null values. These deficiencies should be resolved before the data can make sense for analysis, experimentation, or prediction. For analytics, data may be transformed in one of two patterns:

  • ETL: the transformation is the middle step. Transformation occurs after loading all the source data into a staging area; after staging, the data is transformed and loaded into a data warehouse.
  • ELT: the transformation takes place after pre-loading the data into a data lake; the data is then transformed at query time, or whenever data scientists want to tailor it. The sketch below contrasts the two orderings.
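
To make the contrast concrete, here is a minimal, self-contained Python sketch of the two orderings, using plain lists as stand-ins for a source system, staging area, data lake, and warehouse. All names in it are illustrative assumptions, not a real ETL framework.

```python
source_rows = [
    {"title": "Dracula", "issued": "10/31/2021"},
    {"title": "Dracula", "issued": "10/31/2021"},  # duplicate record
]

def transform(rows):
    """Cleanse: drop exact duplicate records, keeping the first copy."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            clean.append(row)
    return clean

# ETL: transform is the middle step; the warehouse only sees cleansed data.
staging_area = list(source_rows)      # extract -> staging area
warehouse = transform(staging_area)   # transform, then load to the warehouse

# ELT: raw data is pre-loaded untouched; transformation is deferred until
# query time (or until an analyst tailors it for their use case).
data_lake = list(source_rows)         # extract -> load raw into the lake
query_result = transform(data_lake)   # transform happens at query time

print(len(warehouse), len(query_result))  # 1 1
```

The data that lands in each place is the same in this toy example; the difference is purely where the transform step sits relative to the load.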

Other processes like data migration, data wrangling, data warehousing, and data integration may also involve data transformation. The end goal of data transformation is to ensure that the data is understandable, useful, and ready to shine!

What are the steps in Data transformation?

Step 1: Data Discovery. Understanding the data in its native form is the first step of the data transformation process. This is where data engineers and analysts take a look at the data and explore it to find the hidden gems of information.

Step 2: Data Mapping. This is a crucial part of the data transformation process, where fields are mapped from one database to another. Any oversight in this step would make the entire data transformation effort go in vain.
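
As a toy illustration, a mapping can be as simple as a dictionary from source field names to target field names. The column names below are hypothetical, invented to mirror the library example.

```python
# A hypothetical field mapping from a source library database to a warehouse
# schema. A real mapping comes out of the data-mapping exercise above.
FIELD_MAP = {
    "bk_title":  "book_title",
    "auth_nm":   "author_name",
    "dt_issued": "issue_date",   # stored as mm/dd/yyyy in the source
}

def apply_mapping(source_row: dict) -> dict:
    """Rename source fields to their target (warehouse) counterparts."""
    return {target: source_row[source] for source, target in FIELD_MAP.items()}

print(apply_mapping(
    {"bk_title": "Dracula", "auth_nm": "Bram Stoker", "dt_issued": "10/31/2021"}
))
```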

Step 3: Code Generation. In order to perform the transformation, code is generated in this step. Data engineers can write scripts for this, or organizations can use tools that simplify the task.

Step 4: Code Execution. As the name suggests, the generated code is executed to run the transformation tasks.

Step 5: Verification. The transformed output is checked for accuracy. This can be done through scripts or with automated tools that help check the correctness of the transformed data.
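
A minimal verification sketch, assuming the transformation was supposed to produce unique, non-empty titles and ISO-formatted dates; the rows and checks here are illustrative only.

```python
from datetime import datetime

# Toy output of a transformation run (illustrative data).
transformed = [
    {"book_title": "Dracula", "issue_date": "2021-10-31"},
    {"book_title": "Carmilla", "issue_date": "2022-01-15"},
]

def verify(rows):
    titles = [r["book_title"] for r in rows]
    assert len(titles) == len(set(titles)), "duplicates survived transformation"
    for r in rows:
        assert r["book_title"], "book_title must not be empty"
        # raises ValueError if date standardisation (to ISO) failed
        datetime.strptime(r["issue_date"], "%Y-%m-%d")

verify(transformed)
print("all checks passed")
```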

Along with these steps, data may also undergo filtering, column splits, merging, enriching, and removal. It is important to note that not all data can be transformed, and not all data needs to be transformed: data can be used as-is from the source systems if it is already usable.

What are the Data transformation types?

Transformation is an important step to bring out the hero and all the other characters of the data, but how? Here are the common types; a pandas sketch after this list demonstrates several of them.

  • Remove duplicates: Duplicate records are removed. Let’s say there are two librarians, and one of them adds a book to the record base without checking for an already available record. Now there are two records for the same book. That’s not good, is it? So such records are to be identified and removed, keeping only one copy.
  • Cleaning: Handling null values and making sure NAs are dealt with are some of the tasks under cleaning data. For example, there could be some old books that do not have a published date or a date of issue. Such values would be null and would need to be handled: you could either keep them as they are or update them with a placeholder old date.
  • Data Standardisation: Making date formats consistent, changing “Male” to ‘M’, removing prefixes like Ms/Mrs, changing units of measurement, character-set conversion, etc.
  • Derivation or Attribute Construction: Creating new columns, such as a year derived from a date, or new calculated values from existing data. The best example is deriving a new year column from the date of issue; another is a “number of days since last issued” column, calculated as the difference between the issue date and today’s date.
  • Filtering: Selecting only the required rows and columns for analysis. If you are like me, you are only interested in horror stories, so you can ask the librarian to filter by horror and hand you a list. In an ETL process, you might filter by certain years and load only those into the data warehouse, so you can comfortably analyze just those years’ data.
  • Splitting: Dividing a single column into multiple columns. Suppose the head of the library decides to split borrowers’ names: the full name column could be split into first name and last name.
  • Merging: Combining data from different tables, databases, or data sources. Let’s say we have a John Doe who borrows a book and returns it. After a considerable time, he comes back to borrow a new book and says he is new, so the library database now has two records of him. Not cool! These records have to be merged together. Merging isn’t only about duplicate records; it can also mean combining data across different tables and sources.
  • Aggregation: Consolidating raw data to get aggregates like sum, average, etc. Time and spatial aggregations can be done as well. The librarian can get the average reading time by checking the number of days between the issue date and the return date.
  • Summarization: Storing totals of columns at different levels. The librarian might need the total number of books borrowed, the total returned, the number of available books on mental health, etc.
  • Validation: Validating the data to check for correct and clean values. Complex validations can be added as well, based on the data. You can have a script to validate correctness; all of the above transformations can be checked and tested.
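
Here is a minimal pandas sketch that walks through several of these types against a toy library-loans table. The columns and values are made up to mirror the library example; this is an illustration under those assumptions, not a production pipeline.

```python
import pandas as pd

# Toy loans table (illustrative data, with a duplicate and a null).
loans = pd.DataFrame({
    "borrower":    ["John Doe", "John Doe", "Jane Roe", None],
    "book":        ["Dracula", "Dracula", "It", "Carrie"],
    "date_issued": ["10/31/2021", "10/31/2021", "01/15/2022", "03/02/2022"],
    "returned":    ["11/10/2021", "11/10/2021", "01/30/2022", None],
})

# Remove duplicates: keep one copy of the double-entered Dracula record.
loans = loans.drop_duplicates()

# Cleaning: handle the null borrower.
loans["borrower"] = loans["borrower"].fillna("Unknown")

# Standardisation: parse mm/dd/yyyy strings into proper dates.
loans["date_issued"] = pd.to_datetime(loans["date_issued"], format="%m/%d/%Y")
loans["returned"] = pd.to_datetime(loans["returned"], format="%m/%d/%Y")

# Derivation: new month/year columns, plus days since last issued.
loans["year"] = loans["date_issued"].dt.year
loans["month"] = loans["date_issued"].dt.month
loans["days_since_issued"] = (pd.Timestamp.today() - loans["date_issued"]).dt.days

# Splitting: full name into first and last name.
loans[["first_name", "last_name"]] = loans["borrower"].str.split(" ", n=1, expand=True)

# Filtering: only last year's loans (the month-wise list from the intro).
loans_2021 = loans[loans["year"] == 2021]

# Aggregation: average reading time in days, per borrower.
loans["reading_days"] = (loans["returned"] - loans["date_issued"]).dt.days
avg_reading = loans.groupby("borrower")["reading_days"].mean()

print(loans_2021[["month", "book"]])
print(avg_reading)
```

Validation of the result could then reuse assertion-style checks like the verification sketch shown earlier in the steps section.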

Data transformation dusts off the villains from the data and presents a hero (clean data) to work with further. Other actors, like data scientists, data analysts, business users, and operations teams, use this clean, usable, and improved data to gain insights, build charts, experiment, predict, and bring out the true value of data.
