
Meta Data Tech Stack
Learn what data tech stack Meta leverages to process and store massive amounts of data every day in its data centers.
Meta is one of the largest tech companies, relying heavily on data to make informed decisions since its early days. It hosts exabyte-scale data in its warehouse while processing terabytes per second from millions of producers.
Meta has open-sourced several tools like Hive and Presto, while others remain internal—some of which we will discuss in today’s article.
Content is based on multiple sources, including the Meta Engineering Blog, Meta research papers, and third-party articles. You will find references as you read.
Today’s post is brought to you by Multiplayer:
Multiplayer makes debugging distributed systems easier with deep session replays. From frontend screens to backend traces, metrics, and logs, you have every detail you need to find and fix a bug in one place.
Platform
On-Premises
Since its inception in 2004, Meta has operated out of its own on-premises data centers spread across the globe, including the United States, Europe, and Asia-Pacific.
💡 Meta has 35 data centers as per this source.
Storage
Hive
Hive was created at Meta and open-sourced in 2008. It is an exabyte-scale data warehouse storing millions of tables across multiple data centers. Hive leverages the ORC columnar format to efficiently store massive datasets.
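To make this concrete, here is a minimal sketch using the open-source PyHive client rather than Meta's internal tooling; the table and column names are hypothetical.

```python
# A minimal sketch using the open-source PyHive client (not Meta's internal
# tooling). The table and column names are made up for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Store the table in ORC, the columnar format the article mentions.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url STRING,
        view_time TIMESTAMP
    )
    STORED AS ORC
""")

cursor.execute("SELECT url, COUNT(*) AS views FROM page_views GROUP BY url")
for url, views in cursor.fetchall():
    print(url, views)
```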
Scuba
Scuba is an in-house tool designed for real-time data analysis. It provides multiple ways to access data at high speed through a UI or programmatic interfaces. Scuba excels at handling ad-hoc queries, with most responses returning in under a second.
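Scuba is internal and has no public API, so the sketch below is purely illustrative: it shows the shape of the kind of ad-hoc, sliced query Scuba answers in under a second. Every name in it is invented.

```python
# Hypothetical sketch: Scuba is Meta-internal and has no public API.
# Every name here is invented to illustrate the shape of an ad-hoc query:
# filter a time window, group by a dimension, aggregate, get an answer fast.
query = {
    "table": "web_requests",                 # a Scuba dataset (hypothetical)
    "time_range": {"last": "10 minutes"},
    "filters": [{"column": "country", "op": "eq", "value": "US"}],
    "group_by": ["endpoint"],
    "aggregations": [{"op": "count"}, {"op": "p99", "column": "latency_ms"}],
}

# A real client would send this to a Scuba endpoint via a UI or programmatic
# interface; here we only show the query shape.
print(query)
```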
💡 At a very high level, Scuba is similar to Apache Druid; read more here.
Laser
Laser is a high-throughput, low-latency key-value storage service built on RocksDB. It reads data in real time from Scribe streams and daily from Hive tables. It powers the Facebook product and integrates with stream processing systems like Puma and Stylus.
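Laser itself is internal, but since it sits on RocksDB, this sketch uses the open-source python-rocksdb binding to illustrate the underlying key-value semantics; the key layout is made up. Laser layers stream ingestion and serving on top of primitives like these.

```python
# Laser is Meta-internal; this only demonstrates the RocksDB key-value
# semantics it is built on, via the open-source python-rocksdb binding.
# The key/value layout is a made-up illustration.
import rocksdb

db = rocksdb.DB("laser_demo.db", rocksdb.Options(create_if_missing=True))

# A stream consumer (fed by Scribe, in Laser's case) upserts values...
db.put(b"user:42:last_seen", b"2024-01-01T12:00:00Z")

# ...and serving paths read them back with low latency.
print(db.get(b"user:42:last_seen"))
```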
📖 Recommended reading: Realtime Data Processing at Facebook
Processing
Scribe
Scribe processes over 2.5 TB per second of input from millions of producers and outputs 7+ TB per second to hundreds of thousands of consumers, primarily sending data to Scuba and Hive. Apache Kafka is an open-source alternative to Scribe.
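Scribe's client libraries are internal, so the sketch below uses the kafka-python client against Apache Kafka, the open-source alternative named above; the broker address and topic name are placeholders.

```python
# Scribe is Meta-internal; this sketch uses kafka-python against Apache
# Kafka, the open-source alternative the article names. Broker address,
# topic, and payload are placeholders.
from kafka import KafkaProducer, KafkaConsumer

# A producer appends events to a buffered, durable stream...
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("click_events", b'{"user_id": 42, "action": "click"}')
producer.flush()

# ...and downstream consumers (Scuba and Hive loaders, in Scribe's case)
# read them independently.
consumer = KafkaConsumer(
    "click_events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break
```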
📖 Recommended Reading: Scribe: Transporting petabytes per hour via a distributed, buffered queueing system
Puma / Swift / Stylus
Meta Engineering has built three in-house tools to read and write data in real time to and from Scribe.
The following descriptions explain where each component sits in the system.
Puma:
Puma is a stream processing system where apps are written in a SQL-like language with Java UDFs.
Puma processes Scribe streams with a few seconds' delay, outputting to another stream, real-time processor, or data store.
It's optimized for compiled queries, not ad-hoc analysis.
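Since Puma is internal, the following Python-embedded string only sketches the shape of a Puma-style app: a SQL-like windowed aggregation over a Scribe stream. The syntax is loosely modeled on the Facebook paper above and is not the exact Puma language.

```python
# Puma is Meta-internal, so this string only sketches the *shape* of a
# Puma-style app. The syntax is illustrative, loosely modeled on the
# "Realtime Data Processing at Facebook" paper, and is not exact.
puma_app = """
CREATE APPLICATION top_events;

CREATE INPUT TABLE events (event_time, event, category, score)
FROM SCRIBE('events_stream') TIME event_time;

CREATE TABLE top_events_5min AS
SELECT category, event, SUM(score) AS score
FROM events [5 minutes]
GROUP BY category, event;
"""
print(puma_app)
```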
Swift:
Swift is a basic stream processing engine providing checkpointing for Scribe.
It offers a simple API to read streams with checkpoints, allowing apps to restart from the latest checkpoint.
Swift is ideal for low-throughput, stateless processing, with client apps often written in scripting languages like Python.
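Swift's real client API is internal, so the Python sketch below only illustrates the checkpoint-and-resume pattern described above, with a local JSON file standing in for Swift's checkpoint store and a list standing in for a Scribe stream.

```python
# Hypothetical sketch: Swift's client API is internal to Meta. This shows
# the checkpoint-and-resume pattern the article describes, with a local
# JSON file standing in for Swift's checkpoint store.
import json
import os

CHECKPOINT = "swift_checkpoint.json"  # stand-in for Swift's checkpoint store

def load_offset():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["offset"]
    return 0

def save_offset(offset):
    with open(CHECKPOINT, "w") as f:
        json.dump({"offset": offset}, f)

stream = [f"event-{i}" for i in range(100)]  # stand-in for a Scribe stream

start = load_offset()  # on restart, resume from the latest checkpoint
for offset in range(start, len(stream)):
    print("processed", stream[offset])  # stateless, low-throughput work
    if offset % 10 == 0:
        save_offset(offset + 1)  # events after this may be replayed on crash
```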
Stylus:
Stylus is a low-level stream processing framework written in C++.
It supports both stateless and stateful processors.
Its processing API is similar to other procedural stream processing systems.
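Stylus is an internal C++ framework, so the Python sketch below only mirrors the stateful-processor pattern it exposes: a processor that consumes events one at a time and keeps state between them. The class and method names are invented; the real framework also handles concerns like sharding and durable state, which this toy omits.

```python
# Hypothetical sketch: Stylus is a Meta-internal C++ framework. This class
# mirrors only the stateful-processor *pattern* the article describes;
# all names are invented.
class StatefulProcessor:
    def __init__(self):
        self.counts = {}  # processor state, e.g. per-key counters

    def process(self, event):
        key = event["key"]
        self.counts[key] = self.counts.get(key, 0) + 1
        return {"key": key, "count": self.counts[key]}  # emitted downstream

proc = StatefulProcessor()
for event in [{"key": "a"}, {"key": "b"}, {"key": "a"}]:
    print(proc.process(event))
```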
Presto
Presto, open-sourced by Meta in 2013, is a successful big data product widely used for SQL-based data processing. At Meta, it enables fast data access, ranging from seconds to minutes, for billions of records.
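As a concrete (non-Meta) example, the open-source presto-python-client can run such queries; the host, catalog, and table names below are placeholders.

```python
# A minimal sketch using the open-source presto-python-client; host,
# catalog, schema, and table names are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()
cursor.execute("SELECT country, COUNT(*) FROM page_views GROUP BY country")
for row in cursor.fetchall():
    print(row)
```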
⭐ If you are interested in learning how big data tech evolved over the years, read Data Processing in 21st Century.
Spark
Spark is another query engine and an alternative to Presto. It offers greater flexibility, allowing users to leverage Java, Scala, and Python APIs for complex transformations.
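Here is a minimal PySpark sketch of the kind of transformation that goes beyond a single SQL statement; the input path and column names are placeholders.

```python
# A minimal PySpark sketch; the input path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.parquet("/data/page_views")  # placeholder path
result = (
    df.filter(F.col("country") == "US")
      .groupBy("url")
      .agg(F.count("*").alias("views"))
      .orderBy(F.desc("views"))
)
result.show(10)
```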
Dataswarm
Dataswarm is an in-house orchestration tool similar to Apache Airflow, built in Python. It enables job orchestration and scheduling in a DAG-based pipeline and supports Presto and Spark jobs.
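Dataswarm is internal, so the sketch below uses Apache Airflow, the open-source tool it is compared to above, to show the same pattern of defining a scheduled pipeline as a Python DAG; the DAG, task names, and shell commands are placeholders.

```python
# Dataswarm is Meta-internal; this sketch shows the analogous pattern in
# Apache Airflow (2.4+ syntax). DAG, task names, and commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    extract = BashOperator(
        task_id="run_presto_query",
        bash_command="presto --execute 'SELECT 1'",  # placeholder command
    )
    transform = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit job.py",  # placeholder command
    )
    extract >> transform  # extract must finish before transform runs
```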
📹 Recommended video: Dataswarm
Dashboard
UniDash
UniDash is an in-house visualization tool for creating dashboards, accessible via a web interface or Python API. It is powered by Presto and includes an additional caching layer through RaptorX.
📖 Read More: High-Level Overview of the internal tech stack
Related Content: Tech Stack Series
💬 Let me know in the comments if I missed something.