Equalum Releases Enhanced Data Integration Platform, Steps Up Channel Initiatives

With the newest release of its Continuous Data Integration Platform, Equalum supports multiple data integration use cases and is going head-to-head with some of the big data industry’s major players.

By Rick Whiting

Equalum is promising that the latest release of its flagship software can support just about every data integration use case – from batch ETL and change data capture to real-time data streaming.

The general availability of the Equalum Continuous Data Integration Platform 3.0 this week comes amid rising demand for data integration capabilities, driven by increasing data volumes, growing numbers of internal and external data sources, and the need to utilize and manage data that’s distributed across on-premises, multi-cloud and hybrid environments.

The Equalum announcement also comes as the Sunnyvale, Calif.-based company reports increased momentum in the channel among VAR and MSP partners who offer Equalum’s software as a key component of their customers’ digital transformation initiatives.

[Related: The Big Data 100]

“We are the unified platform for mid-market and enterprise data integration solutions. No one is really providing what Equalum has,” said CEO Guy Eilon in an interview with CRN.

Data integration has traditionally included batch ETL (extract, transform and load), data replication and change data capture technologies used to move data from operational systems into data warehouses for business analysis tasks and for copying data for backup and disaster recovery purposes. More recently, demand has been increasing for software that can integrate streaming data in real time…


Batch ETL vs Streaming ETL

We live in a data-driven world where almost everything an individual does online generates data. Be it simple click-throughs or complex transactions, the network keeps a record of every action. This rapidly generated data is collected, processed into the desired formats, and stored in target repositories for future reference.

In this article, we’ll explore a few ETL techniques and take a detailed look at Batch ETL vs Streaming ETL.

What is Extract, Transform and Load?

The ETL process has been in use since the 1970s. ETL stands for ‘Extract, Transform and Load’, the three steps of the technique.

  • Extract: The extract function collects data of all types from various origins. The gathered data may be structured or unstructured, e.g. databases, CSV files, multimedia, numerical datasets, and many more. This data is imported and then consolidated into a single repository.
  • Transform: The transform operation converts the data collected from numerous sources into a format suitable for further operations. This is the most important part of ETL, as data transformation significantly improves data integrity; removing duplicates from the data is one example. After transformation, the data is fully compatible and ready for loading.
  • Load: The load operation stores the converted data in a single database or data warehouse, making it easy to access for later analysis and predictions. There are two fundamental approaches to loading data: full load and incremental load.

In a full load, all transformed data is loaded into the destination, e.g. a database or data warehouse. This option has its limitations for growing data, as datasets become very difficult to handle over time.

An incremental load, on the other hand, is more feasible for day-to-day operations even with exponentially growing data: only the changed or affected data is loaded into the destination.
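To make the distinction concrete, here is a minimal, hypothetical sketch of an incremental load in Python. It assumes a source table with an updated_at watermark column and generic DB-API connections; the table, column, and connection names are illustrative, and the upsert uses SQLite-style syntax.

```python
def incremental_load(source_conn, target_conn, last_loaded_at):
    """Copy only rows changed since the previous run (incremental load).

    source_conn / target_conn are DB-API connections (e.g. sqlite3, psycopg2).
    """
    rows = source_conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_loaded_at,),
    ).fetchall()

    # Upsert only the changed rows into the target; a full load would instead
    # truncate the target and reload every row each time.
    target_conn.executemany(
        "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
        rows,  # SQLite-style upsert; adjust the SQL for your warehouse
    )
    target_conn.commit()

    # The new high-water mark becomes the starting point for the next run.
    return max((r[2] for r in rows), default=last_loaded_at)
```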

Data migration and testing are the primary applications of ETL, which moves data from one server or location to another. Data integration and data warehousing also employ ETL to create a bridge between applications by consolidating their data. Business intelligence, testing, networking, and a few other domains rely heavily on ETL.

Batch ETL

Data experts started working with batch ETL techniques in the early 1970s. Batch ETL is better suited to organizations with heavy data loads that don’t rely on accessing data in real time. Batch ETL gathers data at regular intervals from every available source, such as applications, websites, or databases.

There are numerous ways to form data batches, e.g. hourly, daily, or weekly, depending on the business requirement. Each batch in this process holds a large volume of data. Before loading the data into the target data warehouse, the process transforms the collected data in whatever way suits the business needs.

Batch ETL jobs are always driven by a schedule or a trigger. For example, we can schedule batch ETL jobs to recur as needed using a scheduling tool, or they can be triggered by an event, such as new data arriving in the source folder; both patterns are sketched below.
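As an illustration only, the following hypothetical Python sketch shows both patterns with nothing but the standard library: a daily time-based schedule and an event trigger that fires when new files land in a source folder. The directory path, file pattern, and run_batch body are made up.

```python
import time
from pathlib import Path

LANDING_DIR = Path("/data/incoming")   # hypothetical source folder
BATCH_INTERVAL = 24 * 60 * 60          # run once a day (time-based schedule)

def run_batch(files):
    """Placeholder for the actual extract-transform-load work."""
    print(f"Processing batch of {len(files)} files")

seen = set()
next_run = time.time()
while True:
    # Event trigger: start a batch as soon as new data arrives in the folder...
    new_files = [p for p in LANDING_DIR.glob("*.csv") if p not in seen]
    if new_files:
        run_batch(new_files)
        seen.update(new_files)
    # ...and fall back to the daily schedule even if nothing new arrived.
    if time.time() >= next_run:
        run_batch(sorted(LANDING_DIR.glob("*.csv")))
        next_run += BATCH_INTERVAL
    time.sleep(60)  # poll once a minute
```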

Applications of Batch ETL

Industries such as chemical production, textiles, banking, payroll processing, and hospitals use the batch ETL technique for use cases where updating data in real time is not necessary. Weekly reports, historical analyses and yearly reviews are examples of data processing that doesn’t rely on real-time data access.

The Batch ETL function will collect and load the data in increments into a data lake or data warehouse. The time duration between batch loads may vary according to the use case requirements, workload, and the tool we opt to use for the procedure.

BENEFITS

Simple to implement:
This system does not need to closely monitor newly generated data, so implementation is simple. In most cases, batch processing is preferred for its simplicity, and process monitoring also becomes easy.

Cost-efficient:
The expenditure on Batch ETL is reasonable, as it employs traditional methods and executes them repeatedly for every batch.

Compatible with traditional systems:
Some organizations still use legacy systems and software that are not compatible with more advanced ETL techniques. Batch ETL works with such systems.

Large Volumes of Data:
Batch ETL can be a viable option when we are dealing with huge volumes of data that do not need to be delivered in real-time.

Shortcomings

  • Because batch ETL works with huge amounts of data, a slight failure in one set of data can disrupt the whole batch; a failure in one row among 100 can ultimately cause the remaining 99 rows to fail as well.
  • If the system crashes at the eleventh hour of Batch ETL processing, the entire run fails and must be repeated.
  • Batch ETL mostly suits organizations with repetitive, predictable operations. If a new data type enters the system, it will not be recognized, causing inaccuracies.

Tools and Frameworks

  • Alteryx, IBM InfoSphere DataStage, Microsoft SSIS, Talend DI and Oracle Data Integrator are popular tools employed in Batch ETL.
  • Google BigQuery, MapReduce, and Amazon Redshift are some of the frameworks that support the Batch ETL process.

Streaming ETL

The streaming ETL process is highly preferred by industries that produce a continuous flow of data, much like a rushing river. It is a real-time process that works on the most recently arrived data. This ETL structure neither sits idle for long stretches nor waits for the data lake to fill; instead, it starts executing the extract, transform and load operations as soon as the data streams in. The data may differ in size, velocity, source, and type. More than 60% of industries currently utilize real-time data processing.

Streaming data processing jobs run all the time, continuously processing incoming data into the destination. Compared with batch processing, each data chunk is small, but the real difference is that it is processed in real time.

So, we can say that data will always be up-to-date when we use streaming ETL processing. The business end-users will have a clear picture of data demographics at any given moment and any issues that come in the way can be handled and resolved quickly and efficiently.

As the name suggests, streaming data arrives from the source in the form of streams. Tools like Kafka, ActiveMQ, and RabbitMQ can be used to transport these real-time data streams.

The streaming ETL methodology uses stream-based data pipelines to handle the continuous flow of data from source to destination. These pipelines pull data from various sources and load the processed data into cloud storage such as Amazon S3. Tools for performing streaming ETL, and for converting batch applications to streaming applications, are widely available.
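As a rough illustration (not any particular vendor’s implementation), the sketch below uses the kafka-python and boto3 libraries to consume a stream, apply a small transformation, and land each record in an S3 bucket. The topic name, bucket name, and record fields are all hypothetical.

```python
import json

import boto3                     # pip install boto3
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "orders",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for message in consumer:
    event = message.value
    # Transform: drop records with no amount and normalize the currency field.
    if event.get("amount") is None:
        continue
    event["currency"] = event.get("currency", "USD").upper()

    # Load: write each transformed event to cloud storage (e.g. Amazon S3).
    s3.put_object(
        Bucket="my-streaming-landing-zone",      # hypothetical bucket
        Key=f"orders/{message.partition}/{message.offset}.json",
        Body=json.dumps(event).encode("utf-8"),
    )
```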

Applications of Streaming ETL

Industries that depend heavily on live data choose Streaming ETL. Applications like weather forecasting, ticket booking, banking fraud detection and the stock market need the data streaming into their target systems updated every second.

What if there is an update delay in a ticket booking system? The system might present that there are still 3000 tickets available, when in truth only 500 remain. The negative domino effects of inaccurate, unreliable and untimely data can range from dissatisfied customers to significant revenue loss. Streaming ETL can enable businesses with trusted data in real-time.

Benefits

  • Speed: The most attractive feature of Streaming ETL is speed. Streaming ETL offers continuous throughput and, in some use cases, lower latency than Batch ETL.
  • Compatible with new technologies: Newer technologies such as cloud platforms and business intelligence use Streaming ETL to extract, transform and load data to target locations.
  • Minimum delay: Data processing starts at the very instant the data arrives, so Streaming ETL ensures users can leverage the data right away.
  • Data-Driven Insights: Streaming ETL helps organizations leverage data coming into their systems in real time, enabling better tracking of data demographics and patterns, and allowing more powerful, data-driven decisions to guide the business.

Shortcomings

  • Because this model works with live data, there is no time for recovery when something goes wrong.
  • Streaming ETL alone cannot read data from a source that is down for repair, so look for a tool that also offers a modern, multi-modal Change Data Capture component with high availability and failover protection if systems go down.
  • This advanced operation requires highly capable platforms and hardware, so performance, latency and throughput, along with ease of use, must be evaluated.

Tools and Frameworks

  • Kafka, Apache Spark, Vertica, Apache Flume, and Apache Flink are among the tools available in this market.
  • Apache Spark is one of the most widely used open-source frameworks for streaming ETL because it can scale across multiple nodes to handle petabytes of data without issue.

Batch ETL vs Streaming ETL

Real-time Scenario:


Cause:

Netflix is a streaming company handling 450 billion events for an audience of more than 100 million in over 190 countries. In 2017, Arora, a senior data engineer, gave a presentation on migrating the Netflix architecture from batch processing to real-time streaming, as she felt Netflix would serve users better with live updates under a streaming architecture.

Migrating from Batch ETL to Streaming ETL:


The input data that Batch ETL stored in S3 buckets had to be allowed to pass through streaming pipelines so that operations could proceed concurrently.

Netflix’s data engineering team explored Streaming ETL solutions versus purely batch ETL, looking for a data processing technique that could tolerate massive data loads while meeting the required latency, throughput, and other metrics.

The methodology that replaced Batch ETL is “micro-batching”, a sub-model of streaming ETL. For organizations already working with Batch ETL, it is easier to switch to micro-batches, where the interval between loads comes down from hours to minutes or even seconds; a sketch of the pattern follows below.
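One common way to implement micro-batching is Spark Structured Streaming, where a trigger interval turns a continuous stream into small, frequent batches. The sketch below is a generic, hypothetical example, not Netflix’s actual pipeline; the topic, bucket, and paths are made up, and the Kafka source requires the spark-sql-kafka connector package.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-etl").getOrCreate()

# Read a continuous stream of events from Kafka (topic name is illustrative).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "playback-events")
    .load()
)

# A light transformation: keep the payload and the event timestamp.
cleaned = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

# Micro-batching: instead of one huge nightly batch, process whatever has
# arrived every minute and append it to the target location.
query = (
    cleaned.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")              # hypothetical sink
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```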


Netflix also needed to handle real-time events such as logs and transactions effectively.

Results:

Because streaming ETL doesn’t require as much storage as batch ETL, storage costs were significantly lower. The application needed less turnaround time to keep pace with the speed of data generation, integrated with real-time systems, enabled real-time auditing, and became more efficient at training new machine learning algorithms.

Conclusion:

Although some say that Batch ETL is dead, many organizations still leverage Batch processing for specific use cases not dependent on real-time data. Batch is still frequently used for migrating large data sets, particularly in traditional industries where Streaming ETL is simply not feasible.

Streaming ETL does offer real-time data processing of rapidly changing data sets. Add Change Data Capture into the mix, with the power to capture changes to the data as they happen and stream only those changes rather than the entire data set, and the Streaming ETL approach becomes all the more dynamic and powerful.

The most vital aspect of data processing is flexibility. Different projects come with different requirements, and each data processing solution must be evaluated against your use case. Ideally, you can explore a data integration solution that supports all of your core data use cases to provide flexibility and a future-proof approach to ingestion.

5 Tips to Trusting the Data in your AI Dashboard

If you’ve spent any time in the corporate world, then you know the value of proving ROI. In matters of data, just having access to it isn’t enough. You need to make sense of what the data shows you, understand what’s important to look at and what to toss, gather insights and leverage those insights to make important business adjustments. In today’s ever-changing world, having answers to business problems in real-time makes the pressure even greater.


Most organizations head straight for flashy BI tools and AI dashboards (who doesn’t like a well-designed graph or bar chart that clearly predicts existing trends and future opportunities?). But without the right streaming and continuous data integration needed to provide fresh, reliable, and accurate data in real-time from the backend, you’re left with beautiful dashboards with yesterday’s data. That stale, inaccurate, and untimely information can lead to dire consequences for your business.


Here’s an example: the stockpile of 1,000 items you thought was sitting in your warehouse is actually down to 500 units because of a large order early in the day, but your dashboards show inventory is fully stocked due to insufficient real-time transfer of inventory data. When the next order comes in for 600 units and is promptly sold, how will your customer feel when told their order is 100 units short? Do they wait the two months needed for production? Or just cancel the order altogether? With a real-time, continuous data integration platform fueling your AI and BI systems, you can trust that the insights you’re collecting operate off of fresh, trusted, accurate information. Just having data isn’t enough to prove ROI in any business practice, much less a fancy analytic dashboard. It’s how you aggregate and transform that data, interpret what it’s showing you and apply that to business innovations that drive your company’s competitive edge.

Here are the five steps your organization needs to take in order to get the most out of your data.

#1) Do an audit and analysis of your current data strategy.

Ask yourself pressing questions like: are you meeting SLAs? Are contracts, invoices, inventory, and other business matters being handled in a timely way, or are things falling through the cracks because the data used to make decisions is late, incomplete, or inaccurate? Do you have a streaming-first, real-time approach? Is it as performant as it could be? Your data sources, volumes, and velocities will only increase in the years to come. Do you have a modern, scalable, flexible infrastructure to support your business growth? Or are you relying on legacy tools that will hamper your modernization? Or worse, are you relying on your IT department not only to source an analytics tool but to extract the data from it as well?

#2) Identify business objectives – then look to your data to solve them.

Ask questions like, are we trying to discover new revenue streams? Optimize current operations? Improve pipeline? Eliminate waste? Make sound hiring decisions? These valuable, foundational questions will inform the approach needed to identify, collect, and transform your data to get the answers you need in the time you need them. It’s critical to identify a data integration solution that can support the numerous approaches for data processing like Streaming ETL / ELT / EtLT plus Batch ETL and Replication. Look for flexible, scalable, and robust solutions capable of handling modern and legacy integration with ease.

#3) Take stock of where data lives and how you’re capturing it.

So many data sources…and formats…and languages…and so little time to catch your breath? Data silos got you down? You’re not alone. As you look to modernize, evaluate your current data sources and how you’re structured. Do you have important, private data that has to live on-prem? Are you making the move to the cloud? Maybe deploying a multi-cloud strategy? Only evaluating SaaS solutions that will require someone on staff to manage? Look for a solution that can operate in any of these environments – offering you flexibility and scalability.


Additionally, not all legacy data integration systems are capable of connecting to modern data sources, or able to leverage modern Change Data Capture to look for just the changes to your source data in real-time. Ensure that your data integration solution can access data from any source, in any format either in Streaming or Batch. With so many new, modern sources being added to your data infrastructure, from social platforms to CRMs and other tools, you need to have a solution that equalizes your data. It’s all-important, and when aligned appropriately together, can show you those yellow bricks pointing to the emerald city.

#4) To ETL or ELT – which approach is best? Yes, we’re getting technical here…

Well…it depends on the use case (have you heard that one before?), and not all use cases are created equal. Sure, it’s appealing to push all of your data straight to the cloud, knowing that at some point in the future you will transform, enrich and aggregate it for AI dashboards. But would you do that with your closet? Just take all of the dirty, mismatched, and out-of-date clothes out of the closet and throw them in the back of your car, knowing that at some point in the future you will stop by Goodwill to donate, and the resale shop to make some extra cash? Not efficient.

While extract, load, and transform (or ELT) is the right method to fuel AI and BI insights in many cases, you may also consider some small transformations and data enrichments before your data hits the cloud (also known as EtLT). Applying the “little t” before loading can reduce cloud compute costs, and lead to cleaner, enriched data at your target with faster load times for your AI dashboards. As you look towards use cases like Application Integration or Event Subscriptions by various business users, applying Streaming ETL is best, allowing for transformations to occur in the data pipeline before loading onto the target.
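As a simple, hypothetical illustration of the “little t”, the snippet below applies a few lightweight, row-level transformations (deduplication, type casting, masking an email field) before records are loaded, leaving heavier aggregation for later in the warehouse. The field names and sample rows are made up.

```python
import hashlib

def little_t(records):
    """Apply lightweight, row-level transformations before loading (EtLT)."""
    seen_ids = set()
    for rec in records:
        # Deduplicate on the record id so the target stays clean.
        if rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])

        # Cast the amount to a number and mask personal data before it ever
        # reaches cloud storage; heavy joins and aggregations stay in ELT.
        rec["amount"] = float(rec["amount"])
        rec["email"] = hashlib.sha256(rec["email"].encode()).hexdigest()
        yield rec

# Example: enriched rows flow straight into whatever loader you already use.
rows = [
    {"id": 1, "amount": "19.99", "email": "a@example.com"},
    {"id": 1, "amount": "19.99", "email": "a@example.com"},  # duplicate
]
print(list(little_t(rows)))
```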

Ultimately, look for a solution that can do both. You’ll then have the flexibility to decide as various use cases arise.

#5) Take a cold, hard look at your data and ask yourself: what’s missing?

So often, hidden inside of the piles of data you are collecting lies unrealized opportunity. Are you capturing the buying history of your customers? Maybe identifying a threshold at which they transition to a new direction? Could you leverage your data for anonymized industry benchmarks that would be useful to other customers? Can you see unexpected trends in your customer base pointing towards a new line of business? There may be more to mine from your data than you realize. With the ability to easily and quickly aggregate data and enrich it from various sources, you will gain visibility into what’s possible. Go ahead…surprise yourself!

Being agile with real-time data integration and ultimately, applying it to your business use case, will be the difference between modern and forward-thinking organizations and those that can’t keep up.

Ready to Get Started?

Experience Enterprise-Grade Data Integration + Real-Time Streaming
