Distributed Storage

Cloud Storage Design (S3)

Architecting a planet-scale object store: Decoupling Metadata from Data, Erasure Coding for 99.999999999% durability, and Strong Consistency.

High-Level Architecture

The core principle is separation of concerns: Metadata (the directory structure) is stored separately from Block Data (the actual file content).

Client → Load Balancer → API Service
                             ├──→ Metadata Store (DynamoDB/Cassandra)
                             └──→ Block Store (HDD/SSD Cluster)

Component Responsibilities

  • API Service: Authentication (IAM), Rate Limiting, and Request Orchestration.
  • Metadata Store: Stores `Buckets`, `ObjectKeys`, `ACLs`. Needs fast Key-Value lookups.
  • Block Store: Stores immutable chunks of data (Blobs). Optimized for throughput (MB/s).
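A minimal sketch of the read path this split implies, using in-memory dicts as stand-ins for the two stores. All names and chunk ids here are illustrative, not S3's actual API:

```python
# Hypothetical read path: metadata lookup first, then block fetch.
metadata_store = {
    ("my-bucket", "photos/cat.jpg"): {
        "size": 2048,
        "acl": "private",
        "blocks": ["blk-001", "blk-002"],  # ordered chunk ids in the Block Store
    }
}

block_store = {
    "blk-001": b"\x00" * 1024,
    "blk-002": b"\x01" * 1024,
}

def get_object(bucket: str, key: str) -> bytes:
    # 1. Fast key-value lookup in the Metadata Store
    meta = metadata_store[(bucket, key)]
    # 2. Reassemble the object from immutable chunks in the Block Store
    return b"".join(block_store[block_id] for block_id in meta["blocks"])

obj = get_object("my-bucket", "photos/cat.jpg")
```

Note that the Metadata Store never touches file content; it only maps a name to an ordered list of chunk ids, which is why it can live on a latency-optimized key-value database while blobs sit on cheap high-throughput disks.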

Data Partitioning

You cannot store petabytes on one machine. We rely on Consistent Hashing to distribute data across thousands of nodes.

Namespace Partitioning

Objects are distributed based on a hash of the BucketName + ObjectKey. This prevents hot spots unless a single key gets massive traffic (handled by caching).
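A hedged sketch of what such key-based partitioning might look like; the function name and partition count are illustrative choices, not S3 internals:

```python
import hashlib

def partition_for(bucket: str, key: str, num_partitions: int = 1024) -> int:
    """Map BucketName + ObjectKey to a partition via a stable hash.

    Hashing the full path spreads lexicographically similar keys
    (e.g. timestamped uploads) across partitions, avoiding hot spots.
    """
    digest = hashlib.md5(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

p = partition_for("my-streaming-bucket", "uploads/video.mp4")
```

Because the hash is deterministic, every API server computes the same partition for a given key without coordination.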

Consistent Hashing

A virtual ring topology. When a node is added or removed, only about 1/N of the keys (where N is the number of nodes) need to be remapped, minimizing data shuffling.
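The ring fits in a few lines of Python. This toy `HashRing` (the virtual-node count and hash function are arbitrary choices) illustrates why adding a node moves only a small fraction of keys:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is hashed onto the ring at `vnodes` points,
        # smoothing out the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("my-bucket/photos/cat.jpg")
```

Rebuilding the ring with a fourth node leaves most key-to-node assignments untouched; only keys whose clockwise successor changed are remapped, roughly 1/4 of them.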

Erasure Coding vs Replication

How does S3 achieve 11 9s of durability without bankrupting AWS?

| Strategy | Concept | Storage Overhead | Use Case |
|---|---|---|---|
| 3x Replication | Stores 3 full copies of the data. | 200% (3 GB stored for 1 GB of data) | Fast reads, low-latency access (Hot Tier). |
| Erasure Coding (EC) | Splits data into `n` data chunks + `k` parity chunks. | ~50% (1.5 GB stored for 1 GB of data) | High durability at lower cost (S3 Standard and colder tiers). |

Example (4+2 encoding): split the file into 4 data chunks and compute 2 parity chunks. You can lose ANY 2 of the 6 chunks and still recover the file, yet the overhead is just 50%.
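To make the parity idea concrete, here is a deliberately simplified single-parity (4+1) scheme using byte-wise XOR, which can rebuild any one lost chunk. Real systems use Reed-Solomon-style codes to survive multiple simultaneous losses, as in the 4+2 example above; this sketch only shows the core recovery principle:

```python
def xor_parity(chunks):
    """Compute one parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving_chunks, parity):
    """Rebuild the single missing chunk: XOR the parity with the survivors."""
    missing = bytearray(parity)
    for chunk in surviving_chunks:
        for i, byte in enumerate(chunk):
            missing[i] ^= byte
    return bytes(missing)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]        # 4 data chunks
parity = xor_parity(data)                           # 1 parity chunk (25% overhead)
rebuilt = recover([data[0], data[1], data[3]], parity)  # chunk 2 was "lost"
assert rebuilt == b"CCCC"
```

XOR works here because each byte of the parity is the sum (mod 2, bit-wise) of the corresponding bytes in all data chunks, so XOR-ing out the survivors leaves exactly the missing chunk.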

Implementation: Multipart Upload (Python Boto3)

For large files (>100MB), uploading a single stream is risky: one network failure forces a restart from scratch. Use Multipart Uploads to send chunks in parallel and retry only the parts that fail.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,  # switch to multipart above 25 MB
    max_concurrency=10,
    multipart_chunksize=25 * 1024 * 1024,  # 25 MB per part
    use_threads=True
)

file_path = '/path/to/large_video.mp4'
key = 'uploads/video.mp4'
bucket = 'my-streaming-bucket'

def upload_file():
    # Boto3 handles the complex logic of:
    # 1. InitiateMultipartUpload
    # 2. UploadPart (in parallel threads)
    # 3. CompleteMultipartUpload
    s3.upload_file(file_path, bucket, key, Config=config)
    print("Upload Complete!")

# Presigned URL for Secure Direct-to-S3 Upload (Client-Side)
def generate_presigned_url():
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=3600
    )

Consistency Models: Strong vs Eventual

Historically, S3 offered eventual consistency for overwrite PUTs and DELETEs (updates could take seconds to propagate). In December 2020, AWS made S3 strongly consistent.

How Strong Consistency Works (Simplified)

When you write a new object, the request doesn't return 200 OK until the data is safely replicated to a quorum of nodes. Subsequent GET requests are guaranteed to see the new object immediately.
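The quorum rule behind this guarantee is simple arithmetic: if a write must reach W replicas and a read must consult R replicas out of N total, the read and write sets always overlap whenever W + R > N. A toy check of that condition (this is the textbook quorum model, not S3's published internal protocol):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Quorum overlap rule: any read set must intersect any write set."""
    return w + r > n

# N=3 replicas: writing to 2 and reading from 2 guarantees overlap,
# so every read sees the latest acknowledged write.
assert is_strongly_consistent(3, 2, 2)

# Writing to 1 and reading from 1 allows disjoint sets: a read may
# land on a stale replica (eventual consistency).
assert not is_strongly_consistent(3, 1, 1)
```

The trade-off is latency: the write cannot return 200 OK until W replicas acknowledge, so larger quorums buy consistency at the cost of slower writes.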

Summary

  • Decouple Metadata: Separate the filename/index (DynamoDB) from the blob content (HDD).
  • Erasure Coding: Use EC for massive durability at low cost compared to replication.
  • Multipart Uploads: Essential for reliability and speed with large files.
  • Consistency: Modern object stores prioritize Strong Consistency for a better developer experience (DX).