Data Pipeline - Incremental vs Full Load
Learn the pros, cons, and use cases of two data pipeline design patterns commonly used across the industry: full load and incremental.
I assume you have seen my previous article, where I covered Data Engineering end to end and shared its high-level scope. This article focuses on one of the core components from it, `Design Patterns`, covering two popular approaches: Incremental and Full Load.
In this article, we will cover the important aspects of the Incremental and Full Load patterns. We will go over:
High level examples
Benefits
Challenges with their Solutions
Use cases
Incremental
An incremental pipeline processes data in increments, where an increment can be of any size depending on the use case. It can be either a batch or a streaming pipeline. E.g.
A batch pipeline that processes data in increments, like a daily run that processes the last 24 hours of data from a streaming source, or a daily run that processes one day's worth of data from a batch source such as a file drop (a minimal sketch of this case follows these examples).
A streaming pipeline is an incremental pipeline by default; it processes data in increments or micro-batches, like a streaming pipeline that runs 24/7 fetching data from a queue in near real time.
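Below is a minimal sketch of the daily batch case, assuming a PySpark setup and a source table partitioned by an `event_date` column; all table and column names here are illustrative, not from a specific system.

```python
# Minimal sketch of a daily incremental run (illustrative table/column names).
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental_daily").getOrCreate()

def run_daily(run_date: date) -> None:
    """Process only one day's partition instead of the full history."""
    daily = (
        spark.read.table("raw.events")                        # source table
        .where(F.col("event_date") == F.lit(str(run_date)))   # prune to a single day
        .withColumn("processed_at", F.current_timestamp())    # stand-in for business logic
    )
    daily.write.mode("append").saveAsTable("curated.events")  # append just this increment

run_daily(date.today() - timedelta(days=1))                   # the last 24 hours of data
```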
Benefits
Cost effective, as it processes only a subset of the data.
Quicker to recover and reprocess in case of failures due to shorter run times.
Provides opportunity for shorter SLAs.
Challenges & Solutions
When the upstream data source is batch, backfilling becomes a challenge, as it requires additional functionality depending on the design and requirements. To tackle this:
A separate pipeline may be required to reprocess data from a certain point in history using time ranges (see the backfill sketch below).
If the business logic changes frequently and requires a backfill each time, this may start to feel like a full load approach.
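A backfill can then be a thin driver that replays the regular daily run over an explicit historical range. The sketch below reuses the hypothetical `run_daily` function from the earlier example.

```python
# Hypothetical backfill driver: replays the parameterised daily run over a time range.
from datetime import date, timedelta

def backfill(start: date, end: date) -> None:
    """Re-run the incremental logic for every day in [start, end]."""
    day = start
    while day <= end:
        run_daily(day)              # same business logic, different time window
        day += timedelta(days=1)

backfill(date(2024, 1, 1), date(2024, 1, 31))   # e.g. reprocess January 2024
```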
When the upstream data source is streaming, late-arriving data needs attention: some records will arrive too late to be processed in the run they belong to. To tackle this:
Overlap the time window while handling the resulting duplicates through `Upserts` or an async `DeDup` job, depending on the use case (see the sketch after this list).
Build flexibility into the pipeline to go back to a certain time range and do a mini backfill; this is needed only when there is an incident with the upstream source.
This does not apply if the incremental pipeline is itself streaming, as it will eventually catch up.
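One way to implement the overlap-and-upsert idea is sketched below, assuming the target table format supports `MERGE INTO` (e.g. Delta Lake or Apache Iceberg); the 26-hour window and the table and column names are illustrative.

```python
# Overlapping window + upsert for late-arriving data (illustrative names).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late_data_upsert").getOrCreate()

# Read a window slightly larger than the increment (26h for a daily run) so
# records that arrived late are picked up by the next run.
incoming = (
    spark.read.table("raw.events")
    .where(F.col("ingested_at") >= F.expr("current_timestamp() - INTERVAL 26 HOURS"))
    .dropDuplicates(["event_id"])              # collapse duplicates inside the window
)
incoming.createOrReplaceTempView("incoming")

# Upsert: rows processed by the previous run are updated, new rows are inserted.
spark.sql("""
    MERGE INTO curated.events AS t
    USING incoming AS s
      ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```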
Full Load
A full load pipeline processes the whole dataset on every run. This pattern is suited to batch processing. E.g.
A batch pipeline that processes the entire dataset, like a daily file drop that triggers a pipeline which reads the whole history (a minimal sketch follows).
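A minimal sketch of such a run, assuming PySpark and a CSV landing area; the paths and table names are illustrative.

```python
# Minimal sketch of a full-load run: read everything, overwrite the target snapshot.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("full_load_daily").getOrCreate()

snapshot = (
    spark.read.option("header", True).csv("/landing/customers/")  # whole history every run
    .withColumn("snapshot_date", F.current_date())                # tag the version
)

# Overwrite (or write to a dated location) so each run produces a complete snapshot.
snapshot.write.mode("overwrite").saveAsTable("curated.customers_snapshot")
```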
Benefits
Backfilling is not a concern, as every run processes the whole history; no separate pipeline or functionality is required.
Snapshot versioning makes it easy to compare between different versions of data when needed.
Very easy to build and maintain.
Challenges & Solutions
Compute heavy and slower to process, since every run operates on the full dataset. To tackle this:
Optimizations can be applied to reduce run time and compute cost; one illustrative example follows.
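One example optimization for a full-load job: prune columns early and write the result partitioned, so both this run and downstream reads scan less data. The paths and column names are assumptions for the example.

```python
# Column pruning + partitioned output as one example optimisation (illustrative names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full_load_optimised").getOrCreate()

orders = (
    spark.read.parquet("/landing/orders/")
    .select("order_id", "customer_id", "amount", "order_date")   # read only needed columns
)

(orders
 .repartition("order_date")                # group rows per partition to avoid small files
 .write.mode("overwrite")
 .partitionBy("order_date")                # partitioned target for cheaper downstream reads
 .parquet("/curated/orders/"))
```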
Storage costs can be very high, as it stores multiple versions of a full copy of the data. To tackle this:
Setting retention policies is a good approach (a simple sketch follows).
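A simple sketch of such a retention policy, assuming full-load snapshots are stored under dated folders like `/snapshots/YYYY-MM-DD`; the path and the 30-day window are illustrative.

```python
# Drop snapshot versions older than the retention window (illustrative path/window).
from datetime import date, datetime, timedelta
from pathlib import Path
import shutil

RETENTION_DAYS = 30
cutoff = date.today() - timedelta(days=RETENTION_DAYS)

for snapshot_dir in Path("/snapshots").iterdir():
    try:
        snapshot_date = datetime.strptime(snapshot_dir.name, "%Y-%m-%d").date()
    except ValueError:
        continue                            # ignore folders that are not dated snapshots
    if snapshot_date < cutoff:
        shutil.rmtree(snapshot_dir)         # delete versions past the retention window
```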
Harder to recover from a failed state, as a rerun can take quite a long time. To tackle this:
Checkpointing data into intermediate temporary storage can speed up recovery (see the sketch below).
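A sketch of that idea: persist the expensive intermediate result so a failed downstream stage can resume from it instead of rerunning everything. The paths, tables, and the join itself are illustrative.

```python
# Checkpoint the heavy stage to temporary storage; retries resume from it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full_load_checkpointed").getOrCreate()

CHECKPOINT = "/tmp/checkpoints/orders_enriched"

try:
    enriched = spark.read.parquet(CHECKPOINT)               # resume from the checkpoint
except Exception:                                           # checkpoint missing: recompute
    enriched = (
        spark.read.parquet("/landing/orders/")
        .join(spark.read.parquet("/landing/customers/"), "customer_id")
    )
    enriched.write.mode("overwrite").parquet(CHECKPOINT)    # persist the heavy stage

# The final, cheaper stage can now be retried without repeating the join above.
enriched.groupBy("customer_id").count() \
    .write.mode("overwrite").saveAsTable("curated.order_counts")
```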
In ETL/ELT, the L (load) can take quite a bit of time depending on the data size. To tackle this:
Estimate the load time and set the SLAs accordingly (a back-of-the-envelope example follows).
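A back-of-the-envelope estimate along these lines; the size and throughput numbers are purely illustrative and should come from observed runs.

```python
# Rough load-time estimate to inform the SLA (all numbers illustrative).
dataset_gb = 500                       # size of the full dataset
load_throughput_gb_per_min = 4.0       # observed write throughput of the target
buffer_factor = 1.5                    # headroom for retries and slow days

estimated_load_minutes = dataset_gb / load_throughput_gb_per_min * buffer_factor
print(f"Set the load SLA to roughly {estimated_load_minutes:.0f} minutes")
```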
Use Cases
Let's look at how to pick one over the other using the common factors covered above: cost, SLAs, backfill needs, and recovery time.
Let me know in the comments which one you use and how you make the most of it.
Related content:
Incremental Processing by Netflix
Design Patterns by Joseph Machado