The E of ETL: Extract
Today, data is everywhere! From the simplest spreadsheet of monthly budget calculations to enormous libraries of large video files, everything is data!
Data has an origin, falls into numerous types, can be added, modified, or deleted, resides in many places, and, most importantly, has meaning. It serves an organization's purpose by providing critical and relevant information. Stakeholders, business users, data scientists, and others rely on data for data-driven decisions, experimentation, and prediction.
This data, which resides in varied systems and forms, needs to be accumulated in one place before it can be transformed, and that is achieved by extracting it from all the disparate sources.
To make sense of the data, data engineering, and in particular the ETL process, is essential. The purpose of this blog is to explore the E of the ETL process.
What is Extraction of Data?
Extraction is the first and foremost step of the Extract, Transform, and Load (ETL) process; it remains the first step in the ELT process too. Before we can organize the data and make it clean and relevant, it needs to be extracted, or pulled, from the appropriate sources and compiled together. Just as gold is extracted from rock and honey from beehives, data is extracted from its sources in this step.
Data Extraction is the process of collecting or retrieving disparate types of data from a variety of sources.
Data extraction makes it possible to consolidate, process, and refine data by storing it in a centralized location where it can be transformed. This location may be on-site, cloud-based, or a hybrid of the two, and it is called a staging area.
Why Staging Area?
You might wonder — Why the staging of data? Why not just dump it directly into the data warehouse? Why not just work on the data in its native source system? Well,
- Need for Transformation: Data might be corrupted in its source system, and it is prone to missing values or duplicates. To work on clean, reliable data, it needs to be placed together in a staging area where transformations are done.
- Avoid Data corruption: Working directly on the source system data might corrupt the data.
- Decoupling: Keeping a constant connection to the source system until all transformations are done and the data is moved to a data warehouse is not ideal. Connecting to a data source, reading the data in one go, storing it in a staging area, disconnecting from the source, and then processing the data is the better approach.
- Delta Loading: Source systems often allow only a limited time window for extraction, so all the required data must be pulled into the staging area before the window closes. Data is read incrementally from the source.
- Time Dependency: Extraction time slots for different data sources may vary with time zones, operational hours, and traffic.
- Disaster Recovery: If the original data is changed or lost, having a local copy helps with recovery.
- Debugging: In case of issues, there is no need to go back to the source data for examination; the staging area data can be used instead.
In the case of ETL, a database is typically used as the staging area, whereas with ELT, a data lake serves that role.
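To make the staging idea concrete, here is a minimal sketch of the read-once, stage, and disconnect pattern using Python with sqlite3 and pandas. The database files and the orders table are illustrative placeholders, not a prescribed setup.

```python
# A minimal sketch of "read once, stage, disconnect" using SQLite and pandas.
# The names source.db, staging.db, and orders are made up for illustration.
import sqlite3
import pandas as pd

# Connect to the source system and read the data in one go
source = sqlite3.connect("source.db")
orders = pd.read_sql_query("SELECT * FROM orders", source)
source.close()  # disconnect from the source as early as possible

# Persist the raw copy in the staging area; transformations happen later,
# without touching the source system again
staging = sqlite3.connect("staging.db")
orders.to_sql("stg_orders", staging, if_exists="replace", index=False)
staging.close()
```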
Data Sources
Based on the source of origin, data can be divided into two types,
- Physical sources: Data present on physical objects such as brochures, phone directories, journals, newspapers, magazines, letters, and paper invoices is categorized as physical data. Extracting physical data involves manual effort and is tedious; Optical Character Recognition (OCR) is used to scan printed text and convert it to digital text (a small OCR sketch follows this list).
- Digital sources: Data present in digital sources such as databases, spreadsheets, webpages, PDFs, emails, messages, reviews, tweets, etc., constitutes digital data. Data from these sources may be directly available for extraction or might need web scraping or data scraping tools.
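For physical sources, the OCR step can be as small as the following sketch, assuming the Tesseract engine is installed locally and using the pytesseract and Pillow libraries; the file name is a placeholder.

```python
# A minimal OCR sketch: scan printed text from an image and convert it to
# digital text. Assumes Tesseract is installed; scanned_invoice.png is a
# placeholder file name.
from PIL import Image
import pytesseract

invoice_image = Image.open("scanned_invoice.png")
invoice_text = pytesseract.image_to_string(invoice_image)
print(invoice_text)
```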
Based on the structure, data can be divided into two types,
- Structured data: Highly organized data that can be arranged into rows and columns of tables. Data residing in Docs, spreadsheets, databases, etc., are structured data.
- Unstructured data: Data without a pre-defined structure and with various forms. Examples are images, videos, audio files, etc.
Usually, a group of databases is the source system, but these days data can be obtained from SaaS tools as well.
SaaS (Software as a Service) tools like Google Analytics, Facebook Ads, and HubSpot are used predominantly by businesses to run operations based on customer engagement and other key metrics. Extracting data from these tools is done by integrating with the APIs they provide.
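As an illustration of API-based extraction from a SaaS tool, here is a hedged sketch using Python's requests library. The endpoint, parameters, and token are hypothetical, since every tool (Google Analytics, HubSpot, etc.) defines its own API and authentication flow.

```python
# A sketch of pulling data from a SaaS reporting API with requests.
# The URL, query parameters, and token below are hypothetical placeholders.
import requests

API_TOKEN = "..."  # issued by the SaaS tool

response = requests.get(
    "https://api.example-saas.com/v1/reports/engagement",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"start_date": "2023-01-01", "end_date": "2023-01-31"},
    timeout=30,
)
response.raise_for_status()

records = response.json()  # typically a list of JSON records
print(len(records), "records extracted")
```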
Data Extraction Methods
Compiling data from relevant sources and organizing it to make it consistent is the key to the data extraction process. Based on the volume of data, the types of data sources, and the sources themselves, the methods of extraction can be,
- Update Notification: The easiest extraction method, it initiates extraction only when the source data system sends a notification of an update or append. This is also called partial extraction with an update notification.
- Incremental Extraction: When a system cannot send notifications of updates but can identify the records that were modified, incremental extraction (partial extraction without update notification) is used. Changes are tracked based on the timestamp of the last successful extraction, and then extracted and loaded incrementally (see the sketch after this list). Some implementations use Change Data Capture (CDC) to track the data that has changed so that action can be taken on it. CDC can run in real time or asynchronously against the source data, via log-based CDC, trigger-based CDC, or a CDC script.
- Full Extraction: The data is extracted completely from the source, and the last successful extraction copy can be compared against it to identify changes. This is usually done as a one-time exercise or when small tables are involved, because running a full extraction often would be very expensive. Full extraction should be used when changes since the last extraction do not need to be tracked.
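Below is a rough sketch of timestamp-based incremental extraction in Python. It assumes the source table carries an updated_at column and that the staging database stores the last successful watermark; all table and column names are illustrative.

```python
# Incremental extraction based on a stored watermark. Assumes a watermarks
# table in the staging database and an updated_at column in the source table;
# these names are illustrative.
import sqlite3
import pandas as pd

staging = sqlite3.connect("staging.db")
last_watermark = staging.execute(
    "SELECT value FROM watermarks WHERE name = 'orders'"
).fetchone()[0]

# Pull only the records modified since the last successful extraction
source = sqlite3.connect("source.db")
changed = pd.read_sql_query(
    "SELECT * FROM orders WHERE updated_at > ?",
    source,
    params=(last_watermark,),
)
source.close()

if not changed.empty:
    changed.to_sql("stg_orders", staging, if_exists="append", index=False)
    # Advance the watermark to the newest timestamp we just loaded
    staging.execute(
        "UPDATE watermarks SET value = ? WHERE name = 'orders'",
        (changed["updated_at"].max(),),
    )
    staging.commit()
staging.close()
```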
Types of Data Extraction tools
- Batch processing tools: Data is extracted from the source system in batches, usually outside of business hours. For legacy systems or data in obsolete formats, moving data in batches is the best solution. This method is used when the data is not too complex.
- Open source tools: Several open-source extraction tools are available when the budget is limited and the use case is not complex. Data engineers are equipped with the knowledge to use these tools to extract or replicate data.
- Cloud-based tools: These are the new generation of extraction tools and are part of the ETL process, focusing on real-time extraction of data. They are mostly no-code solutions, letting users collect and store data without writing the extraction logic themselves.
Data Extraction Process
Irrespective of the method or tool used for extraction, these are the steps involved,
- Check for structural changes to the data: addition, modification, or deletion of tables, columns, or raw data.
- Retrieve the objects that have been affected due to change.
- Extract the suitable data.
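Putting these three steps together, here is a rough Python sketch against a SQLite source: it compares the current schema with a saved snapshot to detect structural changes, identifies the affected tables, and extracts them into the staging area. The file and table names are illustrative.

```python
# Detect schema changes against a saved snapshot, then extract the affected
# tables into the staging area. File names are placeholders.
import json
import sqlite3
import pandas as pd

source = sqlite3.connect("source.db")

# Step 1: check for structural changes (tables and columns)
tables = [row[0] for row in source.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
current_schema = {
    t: [col[1] for col in source.execute(f"PRAGMA table_info({t})")]
    for t in tables
}
with open("schema_snapshot.json") as f:
    previous_schema = json.load(f)

# Step 2: retrieve the objects affected by the change
affected = [t for t, cols in current_schema.items()
            if previous_schema.get(t) != cols]

# Step 3: extract the suitable data into the staging area
staging = sqlite3.connect("staging.db")
for table in affected:
    frame = pd.read_sql_query(f"SELECT * FROM {table}", source)
    frame.to_sql(f"stg_{table}", staging, if_exists="replace", index=False)

source.close()
staging.close()
```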
Imagine a small business whose data lives in spreadsheets and PDFs, including digital and hand-written invoices. Extracting data from the digital sources is straightforward, but the physical sources involve manual effort: paper invoices are scanned and converted into digital data using OCR. This extracted data, along with the spreadsheet data, is stored in a staging area and can then be processed and used for analysis; Natural Language Processing can be used to gain further insights. Any modification or deletion must also be accounted for, and the change needs to be reflected in the accumulated data store.
Hence, to reap the benefits of analytics, BI programs, and prediction, understanding the data's universe is imperative. The origin, type, and destination of the data, as well as its extraction and storage, have to be chalked out. Before data can be processed and transformed to suit a use case, it needs to be extracted and stored so the business can benefit.