LinkedIn Data Tech Stack
Learn how LinkedIn handles trillions of events per day from over a billion members using Apache Beam and more.
LinkedIn is a leading tech company leveraging advanced tools for large-scale data processing. Over the years, their engineering teams have made significant contributions by open-sourcing several technologies, including well-known ones like Kafka, Pinot and Samza. Today, we'll explore their data tech stack and get an overview of how they handle massive data processing.
Content is based on multiple sources, including the LinkedIn engineering blog, open source project websites, and news articles; you will find links as you read through the article.
Platform
On Premise
LinkedIn has maintained its own data centers since its inception, with several across the US and one in Singapore.
LinkedIn's initial plan after being acquired by Microsoft was to move to Azure; however, in early 2022 that plan was paused, as per this source. Since the migration was paused only after a few years of work, it is likely that some services already run on Azure.
Storage
HDFS
HDFS has been a core part of LinkedIn's data infrastructure, holding petabytes of data across its Hadoop clusters. LinkedIn has built many open source abstractions to deal with massive-scale challenges; read their article about storage infrastructure.
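To make this concrete, here is a minimal sketch of reading data from HDFS with the open source PyArrow library; the namenode host, port, and paths are illustrative assumptions, not LinkedIn's actual setup.

```python
# A minimal sketch of browsing and reading HDFS data with PyArrow;
# host, port, and paths are hypothetical, not LinkedIn's real layout.
from pyarrow import fs
import pyarrow.parquet as pq

# Connect to the HDFS namenode (requires libhdfs available on the client).
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# List files under an assumed dataset directory.
for info in hdfs.get_file_info(fs.FileSelector("/data/events", recursive=False)):
    print(info.path, info.size)

# Read a Parquet file directly from HDFS into an Arrow table.
table = pq.read_table("/data/events/part-00000.parquet", filesystem=hdfs)
print(table.num_rows)
```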
Iceberg
LinkedIn uses the Iceberg open table format to provide ACID capabilities to the data lake, along with many other benefits such as time travel and data compaction.
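As an illustration of time travel, here is a minimal sketch using Iceberg's Spark SQL syntax; the catalog configuration, table name, and snapshot id are assumptions for the example.

```python
# A minimal sketch of Iceberg time travel from Spark SQL; catalog
# settings and table names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-time-travel")
    # Register an Iceberg catalog; these config values are assumptions.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "hdfs:///warehouse")
    .getOrCreate()
)

# Query the table as it existed at a past point in time.
spark.sql(
    "SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Or pin the read to a specific snapshot id (value here is made up).
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4358109269976099297").show()
```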
OpenHouse
OpenHouse is another recent open source initiative by LinkedIn's data infra teams. It is a control plane for managing Iceberg tables, comprising a RESTful declarative catalog and a suite of data services for table maintenance.
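To give a feel for what "declarative catalog" means, here is a hypothetical sketch of declaring a table's desired state over REST; the endpoint path and payload shape are illustrative only and are not taken from OpenHouse's actual API.

```python
# A hypothetical sketch of a declarative REST catalog call; the URL,
# endpoint, and payload are illustrative, NOT OpenHouse's real API.
import requests

OPENHOUSE_URL = "https://openhouse.example.com"  # assumed deployment URL

table_spec = {
    "databaseId": "analytics",
    "tableId": "page_views",
    # Iceberg-style schema JSON, trimmed to one field for brevity.
    "schema": '{"type": "struct", "fields": [{"id": 1, "name": "member_id", "type": "long", "required": true}]}',
    "timePartitioning": {"columnName": "event_time", "granularity": "DAY"},
    "policies": {"retention": {"count": 30, "granularity": "DAY"}},
}

# Declare the desired table state; the control plane reconciles it and
# its data services handle maintenance (compaction, retention, etc.).
resp = requests.post(
    f"{OPENHOUSE_URL}/v1/databases/analytics/tables",
    json=table_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```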
Pinot
For low-latency, real-time analytics, LinkedIn uses Pinot, which was created at LinkedIn and later became a top-level Apache project.
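Here is a minimal sketch of the kind of low-latency aggregation Pinot serves, using the open source `pinotdb` Python client; the broker host, table, and column names are assumptions.

```python
# A minimal sketch of a real-time analytics query against Pinot via the
# open source pinotdb client; host, table, and columns are hypothetical.
from pinotdb import connect

conn = connect(host="pinot-broker.example.com", port=8099,
               path="/query/sql", scheme="http")
cur = conn.cursor()

# Typical real-time query: aggregate events from the last hour.
cur.execute(
    """
    SELECT country, COUNT(*) AS views
    FROM pageViews
    WHERE tsMillis > ago('PT1H')
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
    """
)
for row in cur:
    print(row)
```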
Processing
Kafka
Kafka is another popular and successful tool built at LinkedIn. LinkedIn leverages Kafka to process trillions of events every day.
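For a sense of the basic produce/consume flow behind those events, here is a minimal sketch with the open source `kafka-python` client; the broker address and topic name are assumptions.

```python
# A minimal sketch of producing and consuming events with kafka-python;
# broker address and topic name are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers=["broker.example.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a tracking event; at LinkedIn's scale these add up to
# trillions of events per day across many topics.
producer.send("page-views", {"member_id": 42, "page": "/feed"})
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers=["broker.example.com:9092"],
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # demonstrate a single record, then stop
```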
Beam
LinkedIn adopted Apache Beam in 2023 to unify its batch and streaming needs. Beam provides a single programming model whose pipelines can run on distributed engines like Spark, Samza, and Flink.
📖 Read article: Revolutionizing Real-Time Streaming Processing
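The key idea is that the same pipeline code is portable across runners. Below is a minimal Beam sketch in Python; the input/output file names are assumptions, and in practice the runner is swapped via launch options (e.g. `--runner=SparkRunner` or `--runner=FlinkRunner`) rather than code changes.

```python
# A minimal sketch of a runner-agnostic Beam pipeline; file names are
# illustrative, and the runner choice is a launch-time option.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner runs locally; the same code targets Spark, Samza, or
# Flink by changing this option at submission time.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.txt")
        | "CountPerElement" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}\t{kv[1]}")
        | "Write" >> beam.io.WriteToText("counts")
    )
```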
Samza
Samza was created at LinkedIn and open-sourced in 2013. It has long been used in streaming workflows, and more recently it serves as a runner for the Apache Beam model, as shared above.
Spark
LinkedIn uses Spark for batch processing workloads that deal with petabyte scale. LinkedIn has customized Spark's shuffle service through an in-house tool called Magnet, whose push-based shuffle design was later contributed to Apache Spark.
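Here is a minimal sketch of a PySpark batch aggregation with the push-based shuffle settings that landed upstream from Magnet (Spark 3.2+, which also requires the external shuffle service and a YARN deployment); the paths and column names are assumptions.

```python
# A minimal sketch of a batch aggregation in PySpark. The push-based
# shuffle configs come from Magnet's upstream contribution (Spark 3.2+);
# paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-aggregation")
    # Enable push-based (Magnet-style) shuffle merging; needs the
    # external shuffle service and a YARN cluster to take effect.
    .config("spark.shuffle.push.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)

events = spark.read.parquet("hdfs:///data/events/2024-01-01")
daily = (
    events.groupBy("member_id")
    .agg(F.count("*").alias("events"), F.countDistinct("page").alias("pages"))
)
daily.write.mode("overwrite").parquet("hdfs:///data/aggregates/2024-01-01")
```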
Trino
Trino is used to query data sitting in HDFS using SQL. It serves both purposes: quick ad hoc analysis and scheduled batch pipelines.
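An ad hoc query might look like the following minimal sketch, using the open source `trino` Python client; the coordinator host, catalog, and table names are assumptions.

```python
# A minimal sketch of an ad hoc SQL query via the trino Python client;
# coordinator host, catalog, schema, and table are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",   # e.g. the Hive connector over data in HDFS
    schema="default",
)
cur = conn.cursor()
cur.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC LIMIT 10"
)
for page, views in cur.fetchall():
    print(page, views)
```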
Dashboard
Tableau
LinkedIn uses Tableau to empower its analytics and sales teams.
Considering their wide range of in-house and open source tools, they may well have an internal dashboarding tool alongside Tableau.
📖 Recommended Reading: LinkedIn Data Infrastructure
Related Content:
💬 LinkedIn is pretty big and most likely uses a lot of other technologies that I could not mention. If you think I missed important ones, feel free to comment below.