Distributed Storage

Cloud Storage Design (S3)

Architecting a planet-scale object store: Decoupling Metadata from Data, Erasure Coding for 99.999999999% durability, and Strong Consistency.

High-Level Architecture

The core principle is separation of concerns: Metadata (the directory structure) is stored separately from Block Data (the actual file content).

Client → Load Balancer → API Service
                             ├──→ Metadata Store (DynamoDB/Cassandra)
                             └──→ Block Store (HDD/SSD Cluster)

Component Responsibilities

  • API Service: Authentication (IAM), Rate Limiting, and Request Orchestration.
  • Metadata Store: Stores `Buckets`, `ObjectKeys`, `ACLs`. Needs fast Key-Value lookups.
  • Block Store: Stores immutable chunks of data (Blobs). Optimized for throughput (MB/s).
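A minimal sketch of the read path this split implies, using in-memory dicts as stand-ins for the two stores. All names and chunk ids here are illustrative, not S3's actual API:

```python
# Hypothetical read path: metadata lookup first, then block fetch.
metadata_store = {
    ("my-bucket", "photos/cat.jpg"): {
        "size": 2048,
        "acl": "private",
        "blocks": ["blk-001", "blk-002"],  # ordered chunk ids in the Block Store
    }
}

block_store = {
    "blk-001": b"\x00" * 1024,
    "blk-002": b"\x01" * 1024,
}

def get_object(bucket: str, key: str) -> bytes:
    # 1. Fast key-value lookup in the Metadata Store
    meta = metadata_store[(bucket, key)]
    # 2. Reassemble the object from immutable chunks in the Block Store
    return b"".join(block_store[block_id] for block_id in meta["blocks"])

obj = get_object("my-bucket", "photos/cat.jpg")
```

Note that the Metadata Store never touches file content; it only maps a name to an ordered list of chunk ids, which is why it can live on a latency-optimized key-value database while blobs sit on cheap high-throughput disks.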

Data Partitioning

You cannot store petabytes on one machine. We rely on Consistent Hashing to distribute data across thousands of nodes.

Namespace Partitioning

Objects are distributed based on a hash of the BucketName + ObjectKey. This prevents hot spots unless a single key gets massive traffic (handled by caching).
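A hedged sketch of what such key-based partitioning might look like; the function name and partition count are illustrative choices, not S3 internals:

```python
import hashlib

def partition_for(bucket: str, key: str, num_partitions: int = 1024) -> int:
    """Map BucketName + ObjectKey to a partition via a stable hash.

    Hashing the full path spreads lexicographically similar keys
    (e.g. timestamped uploads) across partitions, avoiding hot spots.
    """
    digest = hashlib.md5(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

p = partition_for("my-streaming-bucket", "uploads/video.mp4")
```

Because the hash is deterministic, every API server computes the same partition for a given key without coordination.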

Consistent Hashing

A virtual ring topology. When a node is added or removed, only about 1/N of the keys (where N is the number of nodes) need to be remapped, minimizing data shuffling.
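The ring fits in a few lines of Python. This toy `HashRing` (the virtual-node count and hash function are arbitrary choices) illustrates why adding a node moves only a small fraction of keys:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is hashed onto the ring at `vnodes` points,
        # smoothing out the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("my-bucket/photos/cat.jpg")
```

Rebuilding the ring with a fourth node leaves most key-to-node assignments untouched; only keys whose clockwise successor changed are remapped, roughly 1/4 of them.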

Erasure Coding vs Replication

How does S3 achieve 11 9s of durability without bankrupting AWS?

| Strategy | Concept | Storage Overhead | Use Case |
|---|---|---|---|
| 3x Replication | Stores 3 full copies of the data. | 200% (3 GB stored for 1 GB of data) | Fast reads, low-latency access (Hot Tier). |
| Erasure Coding (EC) | Splits data into `n` data chunks + `k` parity chunks. | ~50% (1.5 GB stored for 1 GB of data) | High durability at lower cost (S3 Standard and colder tiers). |

Example (4+2 encoding): split the file into 4 data chunks and compute 2 parity chunks. You can lose ANY 2 of the 6 chunks and still recover the file, yet the overhead is just 50%.
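To make the parity idea concrete, here is a deliberately simplified single-parity (4+1) scheme using byte-wise XOR, which can rebuild any one lost chunk. Real systems use Reed-Solomon-style codes to survive multiple simultaneous losses, as in the 4+2 example above; this sketch only shows the core recovery principle:

```python
def xor_parity(chunks):
    """Compute one parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving_chunks, parity):
    """Rebuild the single missing chunk: XOR the parity with the survivors."""
    missing = bytearray(parity)
    for chunk in surviving_chunks:
        for i, byte in enumerate(chunk):
            missing[i] ^= byte
    return bytes(missing)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]        # 4 data chunks
parity = xor_parity(data)                           # 1 parity chunk (25% overhead)
rebuilt = recover([data[0], data[1], data[3]], parity)  # chunk 2 was "lost"
assert rebuilt == b"CCCC"
```

XOR works here because each byte of the parity is the sum (mod 2, bit-wise) of the corresponding bytes in all data chunks, so XOR-ing out the survivors leaves exactly the missing chunk.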

Implementation: Multipart Upload (Python Boto3)

For large files (>100MB), uploading a single stream is risky: one network failure forces a restart from scratch. Use Multipart Uploads to send chunks in parallel and retry only the parts that fail.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,  # switch to multipart above 25 MB
    max_concurrency=10,
    multipart_chunksize=25 * 1024 * 1024,  # 25 MB per part
    use_threads=True
)

file_path = '/path/to/large_video.mp4'
key = 'uploads/video.mp4'
bucket = 'my-streaming-bucket'

def upload_file():
    # Boto3 handles the complex logic of:
    # 1. InitiateMultipartUpload
    # 2. UploadPart (in parallel threads)
    # 3. CompleteMultipartUpload
    s3.upload_file(file_path, bucket, key, Config=config)
    print("Upload Complete!")

# Presigned URL for Secure Direct-to-S3 Upload (Client-Side)
def generate_presigned_url():
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=3600
    )

Consistency Models: Strong vs Eventual

Historically, S3 offered eventual consistency for overwrite PUTs and DELETEs (updates could take seconds to propagate). In December 2020, AWS made S3 strongly consistent.

How Strong Consistency Works (Simplified)

When you write a new object, the request doesn't return 200 OK until the data is safely replicated to a quorum of nodes. Subsequent GET requests are guaranteed to see the new object immediately.
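The quorum rule behind this guarantee is simple arithmetic: if a write must reach W replicas and a read must consult R replicas out of N total, the read and write sets always overlap whenever W + R > N. A toy check of that condition (this is the textbook quorum model, not S3's published internal protocol):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Quorum overlap rule: any read set must intersect any write set."""
    return w + r > n

# N=3 replicas: writing to 2 and reading from 2 guarantees overlap,
# so every read sees the latest acknowledged write.
assert is_strongly_consistent(3, 2, 2)

# Writing to 1 and reading from 1 allows disjoint sets: a read may
# land on a stale replica (eventual consistency).
assert not is_strongly_consistent(3, 1, 1)
```

The trade-off is latency: the write cannot return 200 OK until W replicas acknowledge, so larger quorums buy consistency at the cost of slower writes.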

Summary

  • Decouple Metadata: Separate the filename/index (DynamoDB) from the blob content (HDD).
  • Erasure Coding: Use EC for massive durability at low cost compared to replication.
  • Multipart Uploads: Essential for reliability and speed with large files.
  • Consistency: Modern object stores prioritize Strong Consistency for a better developer experience (DX).