We live in a data-driven world where almost everything an individual does online becomes data. Be it simple click-throughs or complex transactions, the network keeps track of every action. This rapidly generated data is collected, processed into the desired formats, and stored in target repositories for future reference.
In this article, we’ll explore a few ETL techniques and take a detailed look at Batch ETL vs Streaming ETL.
What is Extract, Transform and Load?
The ETL process has been in existence since the 1970s. ETL stands for ‘Extract, Transform and Load’, the three steps of the technique.
- Extract: The extract step collects data of all types from various sources. The gathered data may be structured or unstructured: databases, CSV files, multimedia, numerical datasets, and more. This data is imported and then consolidated into a single repository.
- Transform: The transform step converts the data collected from numerous sources into a format suitable for further operations. This is the most important part of ETL, as transformation significantly improves data integrity; a simple example is removing duplicate records. After transformation, the data is consistent and ready to load.
- Load: The load step stores the converted data in a single database or data warehouse, making it easy to access for later analysis and predictions. There are two fundamental approaches to loading data: full load and incremental load.
In a full load, all transformed data is loaded into the destination, e.g. a database or data warehouse. This option has its limitations for growing data: with time, every run must reprocess an ever-larger dataset, which becomes difficult to handle.
An incremental load, on the other hand, is more feasible for day-to-day operations even with exponentially growing data: only the changed or newly added records are loaded into the destination.
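As a rough sketch, the two loading strategies can be compared in a few lines of Python. The in-memory list and the `created_at` watermark column are hypothetical stand-ins for a real warehouse table:

```python
from datetime import datetime

# In-memory "warehouse" standing in for a real database table.
warehouse = []

def full_load(source_rows):
    """Replace the warehouse contents with the entire source dataset."""
    warehouse.clear()
    warehouse.extend(source_rows)

def incremental_load(source_rows, last_loaded_at):
    """Load only rows created after the previous run's watermark."""
    new_rows = [r for r in source_rows if r["created_at"] > last_loaded_at]
    warehouse.extend(new_rows)
    # Return the new watermark for the next run.
    return max((r["created_at"] for r in new_rows), default=last_loaded_at)

rows = [
    {"id": 1, "created_at": datetime(2023, 1, 1)},
    {"id": 2, "created_at": datetime(2023, 1, 2)},
]
full_load(rows)                      # full load: everything, every run
watermark = incremental_load(        # incremental: only what is new
    rows + [{"id": 3, "created_at": datetime(2023, 1, 3)}],
    last_loaded_at=datetime(2023, 1, 2),
)
```

In practice the watermark would be persisted between runs so the next incremental load knows where to resume.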
Data migration and testing are primary applications of ETL, which moves data from one server location to another. Data integration and data warehousing also employ ETL to create a bridge between applications by consolidating their data. Business intelligence, testing, networking, and a few other domains rely heavily on ETL.
What is Batch ETL?
Data experts started working with batch ETL techniques in the early 1970s. Batch ETL is better suited for organizations with heavy data loads that don’t rely on accessing data in real time. A batch ETL process gathers data at regular intervals from all possible sources: applications, websites, or databases.
Data can be batched in numerous ways, e.g. hourly, daily, or weekly, depending on the business requirement. Each batch in this process holds a large volume of data. Before loading the data into the target data warehouse, the process transforms the collected data in a way suited to the business’s needs.
Batch ETL jobs are always based on a schedule or a trigger. For example, we can schedule batch ETL jobs to recur as needed using a scheduling tool, or they can be fired by an event, such as new data arriving in the source folder.
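A trigger-based batch run can be sketched with nothing but the standard library; the "source folder" and callback here are illustrative, not any specific tool's API:

```python
import os
import tempfile

def poll_once(source_dir, seen, on_new_files):
    """One polling cycle: fire the batch job when unseen files appear."""
    current = set(os.listdir(source_dir))
    new_files = current - seen
    if new_files:
        on_new_files(sorted(new_files))  # event trigger: new data arrived
    return current

# Demo: a temporary folder standing in for the real landing zone.
triggered = []
with tempfile.TemporaryDirectory() as src:
    open(os.path.join(src, "orders.csv"), "w").close()
    seen = poll_once(src, set(), triggered.extend)
# A real scheduler (cron, Airflow, etc.) would repeat this poll on a
# timer, e.g. hourly, which is the time-based variant of scheduling.
```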
Applications of Batch ETL
Industries like chemical production, textiles, banking, payroll processing, and hospitals use the batch ETL technique for use cases where updating data in real time is not necessary. Weekly reports, historical analyses, and yearly reviews are examples of data processing that don’t rely on real-time data access.
The Batch ETL process collects and loads data in increments into a data lake or data warehouse. The interval between batch loads may vary according to use-case requirements, workload, and the tool chosen for the job.
Advantages of Batch ETL
Simple to implement:
This system does not need to monitor newly generated data continuously, so the implementation process is simple. In most cases, batch processing is preferred for its simplicity. Process monitoring also becomes easy.
Cost-effective:
The cost of Batch ETL is reasonable, as it employs traditional methods and executes them repeatedly for every batch.
Compatible with traditional systems:
Some organizations still run legacy systems and software that are not compatible with more advanced ETL techniques. Batch ETL remains compatible with such systems.
Large Volumes of Data:
Batch ETL can be a viable option when we are dealing with huge volumes of data that do not need to be delivered in real-time.
Limitations of Batch ETL
- Because batch ETL works with huge amounts of data, a failure in one record can abort the processing of the whole batch. Without safeguards, one bad row among 100 can cause the remaining 99 to fail as well.
- If the system crashes near the end of a batch run, the entire batch may need to be reprocessed.
- Organizations with repetitive, predictable operations are the typical users of Batch ETL. If an unexpected new data type enters the pipeline, the system may not recognize it and will produce inaccurate results.
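The row-failure problem above is commonly mitigated with row-level error handling. This minimal Python sketch (field names are hypothetical) quarantines bad rows into a dead-letter list instead of failing the whole batch:

```python
def transform_row(row):
    """Hypothetical per-row transformation; raises on malformed input."""
    return {"id": int(row["id"]), "amount": float(row["amount"])}

def run_batch(rows):
    """Process a batch row by row, quarantining bad rows instead of
    letting one failure abort the entire batch."""
    loaded, rejected = [], []
    for row in rows:
        try:
            loaded.append(transform_row(row))
        except (KeyError, ValueError) as exc:
            rejected.append((row, str(exc)))  # send to a dead-letter store
    return loaded, rejected

good, bad = run_batch([
    {"id": "1", "amount": "9.99"},
    {"id": "2", "amount": "not-a-number"},  # would otherwise sink the batch
])
```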
Tools and Frameworks
- Alteryx, IBM InfoSphere DataStage, Microsoft SSIS, Talend DI, and Oracle Data Integrator are popular tools employed in Batch ETL.
- Google BigQuery, MapReduce, and Amazon Redshift are some of the frameworks that support the Batch ETL process.
What is Streaming ETL?
The streaming ETL process is preferred by industries that produce continuous, back-to-back data, much like a rushing river. It is a real-time process: jobs work on data with the most recent timestamps. This ETL structure neither stays idle for long nor waits for the data lake to fill. Instead, it starts executing the extract, transform, and load operations as soon as data streams in. The data may differ in size, flow rate, source, and type. More than 60% of industries currently utilize real-time data processing.
Streaming jobs run continuously, processing incoming data into the destination as it arrives. Compared with batch processing, each chunk of data is small; the real difference is that it is processed in real time.
So, we can say that data will always be up to date when we use streaming ETL. Business end-users get a clear picture of data demographics at any given moment, and any issues that arise can be handled and resolved quickly and efficiently.
As the name suggests, streaming data arrives from the source in continuous streams. Tools like Kafka, ActiveMQ, and RabbitMQ can be used to transport these real-time data streams.
Streaming ETL uses stream-based data pipelines to handle the continuous flow of data from source to destination. These pipelines pull data from various sources and load the processed data into cloud storage such as Amazon S3. Tools for performing streaming ETL, and for converting batch applications to streaming ones, are available on the market.
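Under the hood, a streaming pipeline is just a loop over an unbounded source. This stdlib-only sketch fakes the source with a generator; in production the events would come from something like a Kafka topic, and the sink would be cloud storage rather than a list:

```python
def event_stream():
    """Stand-in for a real streaming source such as a Kafka topic."""
    for i in range(5):
        yield {"user": f"u{i}", "clicks": i}

def streaming_etl(stream, sink):
    """Extract, transform, and load each event the moment it arrives."""
    for event in stream:              # extract: one event at a time
        if event["clicks"] == 0:      # transform: drop empty events
            continue
        event["clicks_doubled"] = event["clicks"] * 2
        sink.append(event)            # load: write immediately, no batching

sink = []
streaming_etl(event_stream(), sink)
```

Each event is visible in the sink as soon as it is processed, which is why the destination stays continuously up to date.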
Applications of Streaming ETL
Industries that depend heavily on live data choose Streaming ETL. Applications like weather forecasting, ticket booking, banking fraud detection, and stock trading need the data streaming into their target systems updated every second.
What if there is an update delay in a ticket-booking system? The system might show that 3000 tickets are still available when in truth only 500 remain. The negative domino effects of inaccurate, unreliable, and untimely data range from dissatisfied customers to significant revenue loss. Streaming ETL can provide businesses with trusted data in real time.
Advantages of Streaming ETL
- Speed: The most attractive feature of Streaming ETL is speed. It offers continuous throughput and, in some use cases, lower latency than Batch ETL.
- Compatible with new technologies: Modern technologies such as cloud platforms and business intelligence use Streaming ETL to extract, transform, and load data to target locations.
- Minimal delay: Data processing starts the instant data arrives. Streaming ETL ensures the user can leverage the data right away.
- Data-driven insights: Streaming ETL helps organizations leverage data coming into their systems in real time, enabling better tracking of data demographics and patterns and allowing more powerful, data-driven decisions to guide the business.
Limitations of Streaming ETL
- Because this model works with live data, there is little time to recover from failures.
- Streaming ETL alone cannot recover data lost while a source is being repaired. Hence one should look for a tool that also offers a modern, multi-modal Change Data Capture component with high availability and failover protection if systems go down.
- This advanced operation requires highly capable platforms and hardware, so performance, latency, and throughput, along with ease of use, must all be evaluated.
Tools and Frameworks
- Kafka, Apache Spark, Vertica, Apache Flume, and Apache Flink are popular tools in this market.
- Apache Spark is one of the most widely used open-source frameworks for streaming ETL because it can scale across multiple nodes to handle petabytes of data.
Batch ETL vs Streaming ETL
Netflix is a streaming company that serves roughly 450 billion events per day to more than 100 million members in over 190 countries. In 2017, Arora, a senior data engineer, gave a talk on “Migrating the Netflix architecture from batch processing to real-time streaming,” arguing that Netflix would serve users better with live updates on a streaming architecture.
Migrating from Batch ETL to Streaming ETL:
The input data stored in S3 buckets for Batch ETL had to be routed through streaming pipelines to enable simultaneous processing.
Netflix’s data engineering team evaluated Streaming ETL solutions against purely batch ETL, looking for a data processing technique that could tolerate massive data loads while meeting the required latency, throughput, and other metrics.
The methodology that replaced Batch ETL is “micro-batching,” a sub-model of streaming ETL. For organizations already running Batch ETL, it is easier to switch to micro-batches, where the interval between loads shrinks from hours to minutes or even seconds.
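Micro-batching simply groups an unbounded stream into small batches; a minimal sketch, with fixed-size batches standing in for the short time windows a real engine would use:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a continuous stream into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Ten incoming events cut into micro-batches of three.
batches = list(micro_batches(range(10), batch_size=3))
```

Shrinking `batch_size` (or the time window) moves the pipeline along the spectrum from hourly batches toward per-event streaming.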
Netflix needed to process real-time events like logs and transactions effectively.
Because streaming ETL doesn’t require as much storage as batch ETL, storage costs dropped significantly. The application needed less turnaround time to keep pace with the speed of data generation, integrated with real-time systems, enabled real-time auditing, and became more efficient at training new machine learning algorithms.
Although some say that Batch ETL is dead, many organizations still leverage batch processing for specific use cases that don’t depend on real-time data. Batch is still frequently used for migrating large data sets, particularly in traditional industries where Streaming ETL is simply not feasible.
Streaming ETL does offer real-time data processing of rapidly changing data sets. Add Change Data Capture into the mix, and the power to capture changes to the data as they happen, and only stream those changes versus the entire data set, makes the Streaming ETL approach all the more dynamic and powerful.
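The idea of streaming only the changes can be illustrated with a snapshot diff, though production CDC tools typically read the database’s transaction log rather than comparing snapshots; the tables and keys here are hypothetical:

```python
def capture_changes(previous, current):
    """Diff two snapshots keyed by primary key and emit only the
    inserts, updates, and deletes between them."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

old_snapshot = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
new_snapshot = {1: {"name": "Ada"}, 2: {"name": "Rob"}, 3: {"name": "Cleo"}}
changes = capture_changes(old_snapshot, new_snapshot)
```

Only two change events are emitted here instead of the full three-row table, which is exactly why CDC keeps streaming pipelines cheap as data grows.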
The most vital aspect of data processing is flexibility. Different projects come with different requirements, and each data processing solution must be evaluated against your use case. Ideally, you can adopt a data integration solution that supports all of your core data use cases, providing flexibility and a future-proof approach to ingestion.