Optimizing Performance, Throughput and Source Overhead when using Change Data Capture (CDC)

The growing analytics requirements of business decision makers force data engineers to come up with innovative, effective ways to make data from operational systems available to BI and AI initiatives.

For these decision makers, a key factor is how fresh the data behind their decisions is relative to the current reality of the business (in technical terms, whether the analytic dashboards are based on near real-time data).

Batch ingestion does not meet this requirement, as it enables neither near real-time replication nor streaming ingestion. It is also resource intensive: to avoid overhead on operational systems during business hours, it requires off-hours batch windows, which in some cases are not long enough.

Change Data Capture Ingests and Replicates Streaming Data from Operational Sources with Minimal Overhead, Enabling Real-Time Insights

Change data capture (CDC) enables a different strategy: continuously streaming, ingesting and replicating data from operational systems incrementally, looking only at changes made to the data, usually after a one-time initial extraction of the historical data.

The CDC approach makes it possible to seamlessly stream data from operational systems with minimal overhead (if done right), eliminating long batch windows, providing a real-time view of the data to analytic initiatives and, most importantly, requiring no application changes. In most cases, the application does not even know changes are being tracked.

That said, there are many approaches, potential pitfalls, decision points, optimizations and considerations involved in achieving successful CDC ingestion.

Three Main Considerations for Successful CDC Ingestion

1) Evaluate Performance, Latency and Source Overhead Tradeoffs for various CDC “listening” mechanisms

There are usually multiple ways to extract changes from a data set, depending on the data source type. A CDC replication process is only as quick as its weakest point. Therefore, it does not matter how scalable your replication or ingestion pipeline is if the CDC extraction method used on the source is slow.

EXAMPLE:
In an Oracle database, CDC can be implemented in several ways, such as triggers, JDBC queries based on a date pseudo-column, or redo log extraction. The first two are extremely inefficient, both from a performance perspective and, more importantly, because of the overhead they place on the database.

Database triggers are extremely expensive, synchronous operations that delay every single database transaction in order to execute the trigger. JDBC query-based extraction involves expensive range scans on the data tables that can create significant overhead even if the table is indexed and partitioned correctly.
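To make the query-based cost concrete, here is a minimal sketch of what polling on a date column typically looks like, assuming a hypothetical ORDERS table with a LAST_UPDATED column and the python-oracledb driver. Every poll repeats the range scan on the data table, which is exactly the overhead described above, and deletes are invisible to this approach.

```python
# Hypothetical query-based CDC polling against an Oracle table.
# The table and column names are illustrative, not from any real schema.
import datetime
import time

import oracledb  # python-oracledb driver

POLL_SQL = """
    SELECT order_id, status, last_updated
      FROM orders
     WHERE last_updated > :since
     ORDER BY last_updated
"""

def poll_changes(conn, since):
    """Range-scan the table for rows modified after `since`.

    The cost of every poll lands on the source table itself,
    no matter how well it is indexed or partitioned.
    """
    with conn.cursor() as cur:
        cur.execute(POLL_SQL, since=since)
        rows = cur.fetchall()
    new_watermark = max((r[2] for r in rows), default=since)
    return rows, new_watermark

if __name__ == "__main__":
    conn = oracledb.connect(user="app", password="secret", dsn="db-host/ORCLPDB1")
    watermark = datetime.datetime.utcnow() - datetime.timedelta(minutes=5)
    while True:
        changes, watermark = poll_changes(conn, watermark)
        for row in changes:
            print("change:", row)
        time.sleep(10)  # the polling interval directly trades latency for load
```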

CDC using redo log extraction is the most efficient way to extract changes in an asynchronous fashion, without slowing down operations and without additional I/O on the data tables. But even then, there is more than one way to extract changes from the redo log.

One option is LogMiner, an API provided by Oracle that was originally intended for forensic analysis and manual logical recovery scenarios. LogMiner was never meant to be used for Change Data Capture, but it can be, at least for now, even as Oracle slowly deprecates some of its capabilities. It is important to note that LogMiner is limited to a single core in the database, which roughly translates into a ceiling of about 10,000 changes per second. Beyond that, LogMiner starts accumulating lag that keeps growing until a restream occurs, after which the lag starts growing again.
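For context, a LogMiner-based extraction boils down to starting a mining session at a change number and querying V$LOGMNR_CONTENTS. The sketch below is illustrative only: the options, the redo/archive logs that must be registered via DBMS_LOGMNR.ADD_LOGFILE (omitted here), and the available columns vary by Oracle version.

```python
# Illustrative LogMiner session via python-oracledb. Not production code:
# a real session must also register the relevant redo/archive logs with
# DBMS_LOGMNR.ADD_LOGFILE, and option availability differs across versions.
import oracledb

START_LOGMINER = """
    BEGIN
      DBMS_LOGMNR.START_LOGMNR(
        STARTSCN => :start_scn,
        OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG
                  + DBMS_LOGMNR.COMMITTED_DATA_ONLY);
    END;
"""

FETCH_CHANGES = """
    SELECT scn, operation, seg_owner, table_name, sql_redo
      FROM v$logmnr_contents
     WHERE seg_owner = :owner
"""

def mine_changes(conn, start_scn, owner):
    """Start a LogMiner session at start_scn and read the redo it exposes.

    All of this work runs inside a single database session (one core),
    which is why LogMiner-based CDC tends to top out well before
    high-volume workloads do.
    """
    with conn.cursor() as cur:
        cur.execute(START_LOGMINER, start_scn=start_scn)
        cur.execute(FETCH_CHANGES, owner=owner)
        for scn, op, seg_owner, table, sql_redo in cur:
            yield {"scn": scn, "op": op, "table": f"{seg_owner}.{table}", "redo": sql_redo}
        cur.execute("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")
```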

The most effective method available (albeit tenfold more complicated to build than LogMiner or other approaches) is direct redo log parsing. Direct redo log parsing reads the redo log in binary form and parses the bytes based on offset positions, essentially reverse engineering the binary output. While only a handful of solutions have this capability, the method offers speeds of up to 100,000 changes per second, with even lower overhead on the database.
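Conceptually, a direct parser works like the sketch below. To be clear, the Oracle redo log format is proprietary and undocumented, so the block size, header layout and offsets shown here are entirely hypothetical; the sketch only illustrates the offset-based binary parsing technique itself.

```python
# Purely illustrative offset-based binary parsing. The real redo format is
# proprietary; the block size and header layout below are invented for the
# example and do not correspond to Oracle's actual on-disk structures.
import struct

BLOCK_SIZE = 512          # assumption: fixed-size redo blocks
HEADER_FMT = "<HHIIQ"     # hypothetical header: type, flags, length, object id, scn
HEADER_LEN = struct.calcsize(HEADER_FMT)

def parse_redo_blocks(path):
    """Walk a redo file block by block and decode hypothetical change headers."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                break
            rec_type, flags, length, obj_id, scn = struct.unpack_from(HEADER_FMT, block, 0)
            payload = block[HEADER_LEN:HEADER_LEN + length]
            yield {"type": rec_type, "object": obj_id, "scn": scn, "payload": payload}

# Because parsing happens outside the database, against copies of the redo
# stream, it adds no I/O to the data tables and can be parallelized freely.
```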

Performance, latency and source overhead tradeoffs are the key factors in picking the right CDC strategy.

2) Choose the right persistence layer and level of asynchronous handling of uncommitted changes when reading high-volume transactions

In databases that support transactions (where more than one change to the data is committed or rolled back together), another layer of complexity is added to the replication process: reading uncommitted changes.

There are a few reasons to read uncommitted changes:

  • High-volume transactions can hit memory limits if their changes are only read after the commit has occurred.
  • Persisting uncommitted changes helps maintain the replication process's latency SLAs, since there is no need to wait for the commit before extraction starts. It also helps reduce the time to recovery after a CDC process failure that is not source related.

The challenge, of course, is managing the uncommitted changes. In a rollback scenario, for example, you need to remove these changes from the persistence layer so they are neither replicated nor left to waste disk space. The second important aspect is that persisting the uncommitted data creates another I/O process that can hurt performance, throughput and latency, so it is very important to choose the right persistence layer and the right level of asynchronous handling.
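One possible shape for that handling is sketched below: an in-memory buffer per transaction that spills large transactions to disk, emits changes on commit, and discards them on rollback. The threshold, names and JSON-lines spill format are illustrative assumptions, not a prescribed design.

```python
# Sketch of asynchronous handling of uncommitted changes. Thresholds and the
# spill format are hypothetical; the point is the commit/rollback life cycle.
import json
import os
import tempfile
from collections import defaultdict

SPILL_THRESHOLD = 10_000  # changes held in memory per transaction before spilling

class UncommittedBuffer:
    def __init__(self, spill_dir=None):
        self.mem = defaultdict(list)   # txn_id -> in-memory changes
        self.spill_files = {}          # txn_id -> path of spilled changes
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="cdc-spill-")

    def add(self, txn_id, change):
        self.mem[txn_id].append(change)
        if len(self.mem[txn_id]) >= SPILL_THRESHOLD:
            self._spill(txn_id)

    def _spill(self, txn_id):
        path = self.spill_files.setdefault(
            txn_id, os.path.join(self.spill_dir, f"{txn_id}.jsonl"))
        with open(path, "a") as f:     # the extra I/O process the text warns about
            for change in self.mem[txn_id]:
                f.write(json.dumps(change) + "\n")
        self.mem[txn_id].clear()

    def commit(self, txn_id):
        """Yield all buffered changes for the transaction, spilled ones first."""
        path = self.spill_files.pop(txn_id, None)
        if path and os.path.exists(path):
            with open(path) as f:
                for line in f:
                    yield json.loads(line)
            os.remove(path)
        yield from self.mem.pop(txn_id, [])

    def rollback(self, txn_id):
        """Drop everything: rolled-back changes must not be replicated or kept on disk."""
        self.mem.pop(txn_id, None)
        path = self.spill_files.pop(txn_id, None)
        if path and os.path.exists(path):
            os.remove(path)
```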

3) Synchronize the current state (Initial Capture) of the data with CDC changes to achieve near real-time Replication

A replication often requires first getting the current "version" of the data and then applying all CDC changes to it. We call the first step initial capture. There are a few optimization concerns around initial capture and the synchronization between initial capture and CDC:

  • Initial Capture Should Be Robust and Extract Data Using Bulk, Parallel Processing: Initial capture, the process of extracting the current state of the data set, should be robust and use bulk, parallel processing methods. There is usually more than one way of extracting data from a source, whether through JDBC or APIs. Picking and optimizing the best approach is key to ensuring a multi-object replication maintains acceptable service levels; each source has a unique, optimized way to perform bulk extraction. Restreaming is also very common, for reasons such as nightly loads, out-of-sync objects, failures and more, so initial capture is not a one-off process.
  • Identify Automatic, Reliable Ways to Ensure Proper Synchronization – Avoid Manual Sync: A replication process requires combining the existing state of the data set with the capture of all changes from that point in time onward. That point in time is typically referred to as the synchronization offset. It is usually tracked by a timestamp or, more accurately, by an internal change sequence number. The synchronization offset is required to achieve successful, exactly-once data replication for any source type that stores data at rest. Trying to manually sync the initial capture with CDC changes is a tedious task that is prone to failure, due to the high frequency of changes in the data set. It is important to ensure there are automatic, reliable mechanisms for proper synchronization (see the sketch after this list).
  • Decouple the Initial Capture and CDC Processes to Improve Performance: Initial capture, even when optimized, can still take a long time to complete due to the size of the data set. It is important to eliminate the dependency between the two processes by allowing them to run in parallel. This lets the overall replication reach a "synced" state much sooner, and in most sources the initial capture data and the CDC changes are not extracted from the same logical storage anyway, so leveraging that to improve overall performance and "time to synced" is recommended. This becomes especially critical when handling many objects in a queueing fashion, as the "time to synced" grows further for every consecutive object in the queue.
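The sketch below ties these points together using hypothetical source and target interfaces: record the synchronization offset (for example an Oracle SCN) before the snapshot starts, run the bulk load in parallel chunks, and apply CDC changes idempotently from that offset. The method names are assumptions made for illustration only.

```python
# Sketch of synchronizing an initial capture with CDC via a change sequence
# number. The source/target objects and their methods are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def replicate_table(source, target, table, parallelism=8):
    # 1) Record the synchronization offset BEFORE the snapshot starts, so any
    #    change made during the bulk load is re-delivered by CDC afterwards.
    sync_offset = source.current_change_number()        # e.g. an Oracle SCN

    # 2) Register the change stream from that offset immediately, decoupled
    #    from the snapshot; changes accumulate until they are consumed below.
    cdc_stream = source.stream_changes(table, from_offset=sync_offset)

    # 3) Bulk-load the current state in parallel chunks (key ranges, partitions...).
    def load_chunk(chunk_id):
        rows = source.read_chunk(table, chunk_id, parallelism)
        target.bulk_insert(table, rows)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(load_chunk, range(parallelism)))

    # 4) Apply buffered and ongoing changes. Changes already visible in the
    #    snapshot are reapplied idempotently (upsert by primary key), which is
    #    what makes the automatic synchronization effectively exactly-once.
    for change in cdc_stream:
        target.upsert(table, change)
```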

These are just a few of the optimization challenges and considerations that are key to streaming ingestion using Change Data Capture. Getting the basics right can feel relatively simple, but like anything in life, the devil is in the details, and these are just a few of them. A poorly executed DIY, or even commercially acquired, CDC operation can not only put the analytics project at risk but, more importantly, can endanger the operational system itself, leaving the organization in jeopardy. Using these considerations to drive innovation in data ingestion from operational systems will result in positive gains for BI and AI objectives.

Migrating to the Cloud – 3 Common Use Cases

Migrating to the cloud used to be considered an opportunity for the enterprise – an area of business growth and potential marked as something to pursue, but not necessarily an immediate imperative. With the advent of COVID-19 and the mounting pressure for digital transformation and real-time response to ever-changing consumer behavior, cloud migration and ingestion is no longer a future plan; postponing it has become a business risk. There are many drivers pushing enterprises and businesses toward cloud migration.

A few common use cases are the need for:

  1. Zero-downtime migration from on-prem to the cloud
  2. Enabling Real Time Analytics
  3. Enabling Cloud BI

These new use cases come with challenges to overcome, including:

  1. Migrating data from older, legacy technologies to new, state-of-the-art technologies that are nothing alike. You will have to deal with data type conversions, different syntax, drivers, etc.
  2. Extracting data in real-time in a non-intrusive way that will not negatively impact your operational systems.
  3. Managing and monitoring hundreds and, in some cases, thousands of ingestion processes, and then finding and resolving any errors.
  4. Working with and managing multiple tools to accomplish the desired data ingestion/migration scope of work.

For example, to facilitate zero-downtime migration, you will need a CDC-powered replication tool. To enable real-time analytics, you will need a stream processing tool. For traditional batch ETL, you will need yet another dedicated tool, leaving you with three tools to manage, monitor and pay for.

Many organizations are feeling the pressure of legacy, proprietary tools feeding their architectures, driving up costs and locking them in. Others are looking for avenues to upgrade and/or streamline for reduced cost, real-time insights and ease of use.

Check out Equalum’s answer to these three Cloud Ingestion & Migration Use Cases in these brief platform Demos.

USE CASE #1: Zero Downtime Migration from On-Prem to Cloud

Equalum Example: Replicate Changes from Oracle to AWS Postgres


USE CASE #2: Enabling Real-Time Cloud Analytics

Equalum Example: Stream, Transform & Load Data in Real-Time from On-Prem Kafka to Azure Data Lake


USE CASE #3: Enabling Cloud BI

Equalum Example: On-Prem Batch ETL into Snowflake Data Warehouse

