Uber Data Tech Stack
Learn about the Data Tech Stack used by Uber to process trillions of events every day.
Uber handles massive scale, from event data in streams to data at rest in the warehouse. Uber's data stack is solid, leveraging popular open source technologies. It processes trillions of events and petabytes of data every day, as described in Real-time Data Infrastructure. Today, let's dive into the tech stack from a Data Engineering perspective.
The data tech stack is extracted mainly from the Uber Tech Blog.
Since Uber uses both on-premise and multi-cloud infrastructure, it is hard to get a clear picture of exactly what they use for their data needs; I expect it will be a mix of all of them for several years.
Let's go over each component at a high level:
Platform
On Premise
Uber has been maintaining its own data centers for a decade; they have three in total, in California, Arizona, and Virginia.
AWS
Uber has been using a couple of AWS services. I could not find much detail on this, but I assume they are used for certain projects; read more here.
GCP
In 2023, Uber signed a contract to move to the cloud, and they recently started migration planning for batch workloads, according to this recent article.
📖Interesting read: Uber Move to Cloud
Storage
Hudi
Uber uses the Hudi open table format to provide ACID capabilities to the Data Lake, along with many other benefits such as time travel and data compaction.
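To give a feel for how this works, a Hudi upsert is driven by writer options like the ones below. This is only a minimal sketch: the table name, record key, and precombine field are hypothetical examples, though the `hoodie.*` configuration keys themselves are real Hudi options.

```python
# Illustrative Hudi writer options for an ACID upsert into a lakehouse table.
# Table and field names here are hypothetical examples.
hudi_options = {
    "hoodie.table.name": "trips",                           # target table
    "hoodie.datasource.write.recordkey.field": "trip_id",   # key matched on upsert
    "hoodie.datasource.write.precombine.field": "event_ts", # latest record wins
    "hoodie.datasource.write.operation": "upsert",          # update-in-place, not append
}

# With PySpark and the Hudi bundle on the classpath, the write would look roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The `precombine` field is what lets Hudi deduplicate late or replayed events by keeping the record with the newest timestamp.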
HDFS
HDFS is a core part of the Uber Data Platform; it is the source of truth for all kinds of data. HDFS and Hudi work together to provide a seamless Lakehouse architecture.
💡Uber hosts one of the largest Hadoop infrastructures, containing data at exabyte scale.
S3
Uber's core data resides in HDFS, but some projects also leverage cloud technologies like S3; I am not 100% sure whether this is used by the data teams.
Pinot
For low-latency, real-time analytics, Uber provides Pinot as a service as part of their Data Platform.
Processing
Kafka
Uber utilizes Kafka as its centralized messaging system, receiving all types of events from various sources: Elasticsearch, KV stores, Postgres, and microservices.
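Conceptually, Kafka acts as a fan-in bus: many producers append events to named topics, and downstream consumers read those logs independently at their own offsets. Here is a toy in-memory sketch of that pattern (this is not Kafka's actual API, just an illustration of the append-only topic model):

```python
from collections import defaultdict

class ToyEventBus:
    """In-memory stand-in for a Kafka-like set of append-only topic logs."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered event log

    def produce(self, topic, event):
        self.topics[topic].append(event)  # append-only, like a Kafka partition

    def consume(self, topic, offset=0):
        # Consumers track their own offsets; the log itself is never mutated.
        return self.topics[topic][offset:]

bus = ToyEventBus()
bus.produce("rider-events", {"type": "trip_requested", "rider": "r1"})
bus.produce("rider-events", {"type": "trip_completed", "rider": "r1"})
events = bus.consume("rider-events")  # a fresh consumer sees the full log
```

Because consumers only hold an offset, many independent systems (Flink jobs, warehouse loaders, monitoring) can read the same event stream without coordinating with each other.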
Flink
Uber leverages open source Flink for real-time data ingestion. Learn more: Uber Big Data Platform.
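A staple operation in this kind of real-time pipeline is windowed aggregation, e.g. counting events per fixed (tumbling) time window. The pure-Python sketch below illustrates the idea only; it is not Flink's API:

```python
from collections import Counter

def tumbling_window_counts(events, window_sec=60):
    """Count events per fixed-size (tumbling) time window.

    `events` is an iterable of (timestamp_seconds, payload) pairs; the
    result maps window index (timestamp // window_sec) to an event count.
    """
    counts = Counter()
    for ts, _payload in events:
        counts[ts // window_sec] += 1  # bucket by which window the event falls in
    return dict(counts)

# Hypothetical (timestamp_seconds, payload) stream
stream = [(5, "a"), (42, "b"), (61, "c"), (119, "d"), (125, "e")]
windows = tumbling_window_counts(stream)  # → {0: 2, 1: 2, 2: 1}
```

Real Flink jobs add what this sketch omits: event-time watermarks for late data, keyed state, and fault-tolerant checkpoints.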
Presto
Presto is used as the SQL layer on top of the Lakehouse. This gives data consumers the ability to write SQL that reads large-scale data directly from HDFS.
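For example, a data consumer might run an ad-hoc aggregation like the one below. The catalog, table, and column names are hypothetical; the commented-out client code assumes the `presto-python-client` package:

```python
# A hypothetical ad-hoc Presto query over a lakehouse table.
# Catalog/table/column names are made-up examples.
query = """
SELECT city, count(*) AS trips
FROM hive.trips_db.completed_trips
WHERE trip_date = DATE '2024-01-01'
GROUP BY city
ORDER BY trips DESC
LIMIT 10
"""

# With presto-python-client installed, this would be submitted roughly as:
# import prestodb
# conn = prestodb.dbapi.connect(host="presto.example.internal", port=8080,
#                               user="analyst", catalog="hive", schema="trips_db")
# cur = conn.cursor()
# cur.execute(query)
# rows = cur.fetchall()
```

The point of the SQL layer is that this query runs against files in HDFS directly, with no separate load step into a warehouse.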
Spark
Uber uses Spark for batch processing workloads; Spark batch jobs typically read from and write to the Lakehouse. They also provide notebook support for experimental use cases.
Dashboard
Dashbuilder
An in-house custom visualization tool, it is still heavily and widely used.
Looker
Looker is leveraged by some teams for internal reporting, following Uber's agreement with Google last year.
💡Uber also has an in-house search and discovery platform called DataBook.
⭐Interested in more content like this? Check out the Netflix Data Tech Stack
What else to read?
💬Let me know in the comments if I missed or misrepresented something.