
Meta Data Tech Stack
Learn what data tech stack Meta leverages to process and store massive amounts of data every day in its data centers.
Meta is one of the largest tech companies, relying heavily on data to make informed decisions since its early days. It hosts exabyte-scale data in its warehouse while processing terabytes per second from millions of producers.
Meta has open-sourced several tools like Hive and Presto, while others remain internal—some of which we will discuss in today’s article.
Content is based on multiple sources, including the Meta Engineering Blog, Meta research papers, and third-party articles. You will find references as you read.
Today’s post is brought to you by Multiplayer:
Multiplayer makes debugging distributed systems easier with deep session replays. From frontend screens to backend traces, metrics, and logs, you have every detail you need to find and fix a bug in one place.
Platform
On-Premises
Since its inception in 2004, Meta has operated out of its own on-premises data centers spread across the globe, including the United States, Europe, and Asia-Pacific.
💡 Meta has 35 data centers as per this source.
Storage
Hive
Hive was created at Meta and open-sourced in 2008. It is an exabyte-scale data warehouse storing millions of tables across multiple data centers. Hive leverages the ORC columnar format to efficiently store massive datasets.
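To make this concrete, here is a minimal sketch using the open-source PyHive client rather than Meta's internal tooling; the table and column names are hypothetical.

```python
# A minimal sketch using the open-source PyHive client (not Meta's internal
# tooling). The table and column names are made up for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Store the table in ORC, the columnar format the article mentions.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url STRING,
        view_time TIMESTAMP
    )
    STORED AS ORC
""")

cursor.execute("SELECT url, COUNT(*) AS views FROM page_views GROUP BY url")
for url, views in cursor.fetchall():
    print(url, views)
```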
Scuba
Scuba is an in-house tool designed for real-time data analysis. It provides multiple ways to access data at high speed through a UI or programmatic interfaces. Scuba excels at handling ad-hoc queries, with most responses returning in under a second.
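Scuba is internal and has no public API, so the sketch below is purely illustrative: it shows the shape of the kind of ad-hoc, sliced query Scuba answers in under a second. Every name in it is invented.

```python
# Hypothetical sketch: Scuba is Meta-internal and has no public API.
# Every name here is invented to illustrate the shape of an ad-hoc query:
# filter a time window, group by a dimension, aggregate, get an answer fast.
query = {
    "table": "web_requests",                 # a Scuba dataset (hypothetical)
    "time_range": {"last": "10 minutes"},
    "filters": [{"column": "country", "op": "eq", "value": "US"}],
    "group_by": ["endpoint"],
    "aggregations": [{"op": "count"}, {"op": "p99", "column": "latency_ms"}],
}

# A real client would send this to a Scuba endpoint via a UI or programmatic
# interface; here we only show the query shape.
print(query)
```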
💡 At a very high level, Scuba is similar to Apache Druid; read more here.
Laser
Laser is a high-throughput, low-latency key-value storage service built on RocksDB. It reads data in real time from Scribe streams and daily from Hive tables. It powers the Facebook product and integrates with stream processing systems like Puma and Stylus.
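Laser itself is internal, but since it sits on RocksDB, this sketch uses the open-source python-rocksdb binding to illustrate the underlying key-value semantics; the key layout is made up. Laser layers stream ingestion and serving on top of primitives like these.

```python
# Laser is Meta-internal; this only demonstrates the RocksDB key-value
# semantics it is built on, via the open-source python-rocksdb binding.
# The key/value layout is a made-up illustration.
import rocksdb

db = rocksdb.DB("laser_demo.db", rocksdb.Options(create_if_missing=True))

# A stream consumer (fed by Scribe, in Laser's case) upserts values...
db.put(b"user:42:last_seen", b"2024-01-01T12:00:00Z")

# ...and serving paths read them back with low latency.
print(db.get(b"user:42:last_seen"))
```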
📖 Recommended reading: Realtime Data Processing at Facebook
Processing
Scribe
Scribe processes over 2.5 TB per second of input from millions of producers and outputs 7+ TB per second to hundreds of thousands of consumers, primarily sending data to Scuba and Hive. Apache Kafka is an open-source alternative to Scribe.
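Scribe's client libraries are internal, so the sketch below uses the kafka-python client against Apache Kafka, the open-source alternative named above; the broker address and topic name are placeholders.

```python
# Scribe is Meta-internal; this sketch uses kafka-python against Apache
# Kafka, the open-source alternative the article names. Broker address,
# topic, and payload are placeholders.
from kafka import KafkaProducer, KafkaConsumer

# A producer appends events to a buffered, durable stream...
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("click_events", b'{"user_id": 42, "action": "click"}')
producer.flush()

# ...and downstream consumers (Scuba and Hive loaders, in Scribe's case)
# read them independently.
consumer = KafkaConsumer(
    "click_events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break
```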
📖 Recommended Reading: Scribe: Transporting petabytes per hour via a distributed, buffered queueing system
Puma / Swift / Stylus
Meta Engineering has built three in-house tools to read and write data in real time to and from Scribe.
The following descriptions explain where each component sits in the system.
Puma:
Puma is a stream processing system where apps are written in a SQL-like language with Java UDFs.
Puma processes Scribe streams with a few seconds' delay, outputting to another stream, real-time processor, or data store.
It's optimized for compiled queries, not ad-hoc analysis.
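Since Puma is internal, the following Python-embedded string only sketches the shape of a Puma-style app: a SQL-like windowed aggregation over a Scribe stream. The syntax is loosely modeled on the Facebook paper above and is not the exact Puma language.

```python
# Puma is Meta-internal, so this string only sketches the *shape* of a
# Puma-style app. The syntax is illustrative, loosely modeled on the
# "Realtime Data Processing at Facebook" paper, and is not exact.
puma_app = """
CREATE APPLICATION top_events;

CREATE INPUT TABLE events (event_time, event, category, score)
FROM SCRIBE('events_stream') TIME event_time;

CREATE TABLE top_events_5min AS
SELECT category, event, SUM(score) AS score
FROM events [5 minutes]
GROUP BY category, event;
"""
print(puma_app)
```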
Swift:
Swift is a basic stream processing engine providing checkpointing for Scribe.
It offers a simple API to read streams with checkpoints, allowing apps to restart from the latest checkpoint.
Swift is ideal for low-throughput, stateless processing, with client apps often written in scripting languages like Python.
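Swift's real client API is internal, so the Python sketch below only illustrates the checkpoint-and-resume pattern described above, with a local JSON file standing in for Swift's checkpoint store and a list standing in for a Scribe stream.

```python
# Hypothetical sketch: Swift's client API is internal to Meta. This shows
# the checkpoint-and-resume pattern the article describes, with a local
# JSON file standing in for Swift's checkpoint store.
import json
import os

CHECKPOINT = "swift_checkpoint.json"  # stand-in for Swift's checkpoint store

def load_offset():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["offset"]
    return 0

def save_offset(offset):
    with open(CHECKPOINT, "w") as f:
        json.dump({"offset": offset}, f)

stream = [f"event-{i}" for i in range(100)]  # stand-in for a Scribe stream

start = load_offset()  # on restart, resume from the latest checkpoint
for offset in range(start, len(stream)):
    print("processed", stream[offset])  # stateless, low-throughput work
    if offset % 10 == 0:
        save_offset(offset + 1)  # events after this may be replayed on crash
```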
Stylus:
Stylus is a low-level stream processing framework written in C++.
It supports both stateless and stateful processors.
Its processing API is similar to other procedural stream processing systems.
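Stylus is an internal C++ framework, so the Python sketch below only mirrors the stateful-processor pattern it exposes: a processor that consumes events one at a time and keeps state between them. The class and method names are invented; the real framework also handles concerns like sharding and durable state, which this toy omits.

```python
# Hypothetical sketch: Stylus is a Meta-internal C++ framework. This class
# mirrors only the stateful-processor *pattern* the article describes;
# all names are invented.
class StatefulProcessor:
    def __init__(self):
        self.counts = {}  # processor state, e.g. per-key counters

    def process(self, event):
        key = event["key"]
        self.counts[key] = self.counts.get(key, 0) + 1
        return {"key": key, "count": self.counts[key]}  # emitted downstream

proc = StatefulProcessor()
for event in [{"key": "a"}, {"key": "b"}, {"key": "a"}]:
    print(proc.process(event))
```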
Presto
Presto, open-sourced by Meta in 2013, is a successful big data product widely used for SQL-based data processing. At Meta, it enables fast data access, ranging from seconds to minutes, for billions of records.
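As a concrete (non-Meta) example, the open-source presto-python-client can run such queries; the host, catalog, and table names below are placeholders.

```python
# A minimal sketch using the open-source presto-python-client; host,
# catalog, schema, and table names are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()
cursor.execute("SELECT country, COUNT(*) FROM page_views GROUP BY country")
for row in cursor.fetchall():
    print(row)
```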
⭐ If you are interested in learning how big data tech evolved over the years, read Data Processing in 21st Century.
Spark
Spark is another query engine and an alternative to Presto. It offers greater flexibility, allowing users to leverage Java, Scala, and Python APIs for complex transformations.
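Here is a minimal PySpark sketch of the kind of transformation that goes beyond a single SQL statement; the input path and column names are placeholders.

```python
# A minimal PySpark sketch; the input path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.parquet("/data/page_views")  # placeholder path
result = (
    df.filter(F.col("country") == "US")
      .groupBy("url")
      .agg(F.count("*").alias("views"))
      .orderBy(F.desc("views"))
)
result.show(10)
```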
Dataswarm
Dataswarm is an in-house orchestration tool similar to Apache Airflow, built in Python. It enables job orchestration and scheduling in a DAG-based pipeline and supports Presto and Spark jobs.
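Dataswarm is internal, so the sketch below uses Apache Airflow, the open-source tool it is compared to above, to show the same pattern of defining a scheduled pipeline as a Python DAG; the DAG, task names, and shell commands are placeholders.

```python
# Dataswarm is Meta-internal; this sketch shows the analogous pattern in
# Apache Airflow (2.4+ syntax). DAG, task names, and commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    extract = BashOperator(
        task_id="run_presto_query",
        bash_command="presto --execute 'SELECT 1'",  # placeholder command
    )
    transform = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit job.py",  # placeholder command
    )
    extract >> transform  # extract must finish before transform runs
```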
📹 Recommended video: Dataswarm
Dashboard
UniDash
UniDash is an in-house visualization tool for creating dashboards, accessible via a web interface or Python API. It is powered by Presto and includes an additional caching layer through RaptorX.
📖 Read More: High-Level Overview of the internal tech stack
Related Content: Tech Stack Series
💬 Let me know in the comments if I missed something.