Uber Data Tech Stack
Learn about the Data Tech Stack used by Uber to process trillions of events every day.
Uber handles massive scale, from event data in streams to data at rest in the warehouse. Uber's data stack is solid, leveraging popular open source technologies. It processes trillions of events and petabytes of data every day, as described in Real-time Data Infrastructure. Today, let's dive into the tech stack from a Data Engineering perspective.
The data tech stack is extracted mainly from the Uber Tech Blog.
Since Uber uses both on-premise and multi-cloud infrastructure, it is hard to get a clear picture of exactly what they use for their data needs; I expect it will be a mix of all of them for several years.
Let's go over each component at a high level:
Platform
On Premise
Uber has been maintaining its own data centers for a decade; they have three in total, in California, Arizona, and Virginia.
AWS
Uber has been using a couple of AWS services. I could not find much detail on this, but I assume they are used for certain projects; read more here.
GCP
In 2023, Uber signed a contract to move to the cloud, and they recently started migration planning for batch workloads, according to this recent article.
📖Interesting read: Uber Move to Cloud
Storage
Hudi
Uber uses the Hudi open table format to provide ACID capabilities to the Data Lake, along with many other benefits such as time travel and data compaction.
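To give a feel for how this works, a Hudi upsert is driven by writer options like the ones below. This is only a minimal sketch: the table name, record key, and precombine field are hypothetical examples, though the `hoodie.*` configuration keys themselves are real Hudi options.

```python
# Illustrative Hudi writer options for an ACID upsert into a lakehouse table.
# Table and field names here are hypothetical examples.
hudi_options = {
    "hoodie.table.name": "trips",                           # target table
    "hoodie.datasource.write.recordkey.field": "trip_id",   # key matched on upsert
    "hoodie.datasource.write.precombine.field": "event_ts", # latest record wins
    "hoodie.datasource.write.operation": "upsert",          # update-in-place, not append
}

# With PySpark and the Hudi bundle on the classpath, the write would look roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The `precombine` field is what lets Hudi deduplicate late or replayed events by keeping the record with the newest timestamp.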
HDFS
HDFS is a core part of the Uber Data Platform; it is the source of truth for all kinds of data. HDFS and Hudi work together to provide a seamless Lakehouse architecture.
💡Uber hosts one of the largest Hadoop infrastructures, containing data at exabyte scale.
S3
Uber's core data resides in HDFS, but some projects also leverage cloud technologies like S3; I am not 100% sure whether this is used by the data teams.
Pinot
For low-latency, real-time analytics, Uber provides Pinot as a service as part of their Data Platform.
Processing
Kafka
Uber utilizes Kafka as its centralized messaging system, receiving all types of events from various sources: Elasticsearch, KV stores, Postgres, and microservices.
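Conceptually, Kafka acts as a fan-in bus: many producers append events to named topics, and downstream consumers read those logs independently at their own offsets. Here is a toy in-memory sketch of that pattern (this is not Kafka's actual API, just an illustration of the append-only topic model):

```python
from collections import defaultdict

class ToyEventBus:
    """In-memory stand-in for a Kafka-like set of append-only topic logs."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered event log

    def produce(self, topic, event):
        self.topics[topic].append(event)  # append-only, like a Kafka partition

    def consume(self, topic, offset=0):
        # Consumers track their own offsets; the log itself is never mutated.
        return self.topics[topic][offset:]

bus = ToyEventBus()
bus.produce("rider-events", {"type": "trip_requested", "rider": "r1"})
bus.produce("rider-events", {"type": "trip_completed", "rider": "r1"})
events = bus.consume("rider-events")  # a fresh consumer sees the full log
```

Because consumers only hold an offset, many independent systems (Flink jobs, warehouse loaders, monitoring) can read the same event stream without coordinating with each other.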
Flink
Uber leverages open source Flink for real-time data ingestion. Learn more: Uber Big Data Platform.
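A staple operation in this kind of real-time pipeline is windowed aggregation, e.g. counting events per fixed (tumbling) time window. The pure-Python sketch below illustrates the idea only; it is not Flink's API:

```python
from collections import Counter

def tumbling_window_counts(events, window_sec=60):
    """Count events per fixed-size (tumbling) time window.

    `events` is an iterable of (timestamp_seconds, payload) pairs; the
    result maps window index (timestamp // window_sec) to an event count.
    """
    counts = Counter()
    for ts, _payload in events:
        counts[ts // window_sec] += 1  # bucket by which window the event falls in
    return dict(counts)

# Hypothetical (timestamp_seconds, payload) stream
stream = [(5, "a"), (42, "b"), (61, "c"), (119, "d"), (125, "e")]
windows = tumbling_window_counts(stream)  # → {0: 2, 1: 2, 2: 1}
```

Real Flink jobs add what this sketch omits: event-time watermarks for late data, keyed state, and fault-tolerant checkpoints.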
Presto
Presto is used as the SQL layer on top of the Lakehouse. This gives data consumers the ability to write SQL that reads large-scale data directly from HDFS.
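For example, a data consumer might run an ad-hoc aggregation like the one below. The catalog, table, and column names are hypothetical; the commented-out client code assumes the `presto-python-client` package:

```python
# A hypothetical ad-hoc Presto query over a lakehouse table.
# Catalog/table/column names are made-up examples.
query = """
SELECT city, count(*) AS trips
FROM hive.trips_db.completed_trips
WHERE trip_date = DATE '2024-01-01'
GROUP BY city
ORDER BY trips DESC
LIMIT 10
"""

# With presto-python-client installed, this would be submitted roughly as:
# import prestodb
# conn = prestodb.dbapi.connect(host="presto.example.internal", port=8080,
#                               user="analyst", catalog="hive", schema="trips_db")
# cur = conn.cursor()
# cur.execute(query)
# rows = cur.fetchall()
```

The point of the SQL layer is that this query runs against files in HDFS directly, with no separate load step into a warehouse.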
Spark
Uber uses Spark for batch processing workloads; Spark batch jobs typically read from and write to the Lakehouse. They also provide notebook support for experimental use cases.
Dashboard
Dashbuilder
An in-house custom visualization tool, it is still heavily and widely used.
Looker
Looker is leveraged by some teams for internal reporting, following Uber's agreement with Google last year.
💡Uber also has an in-house search and discovery platform called DataBook.
⭐Interested in more content like this? Check out the Netflix Data Tech Stack
What else to read?
💬Let me know in the comments if I missed or misrepresented something.