High-Level Architecture
The core principle is separation of concerns: Metadata (the directory structure) is stored separately from Block Data (the actual file content).
[Diagram] Client requests flow through the API Service, which fans out to two backends: the Metadata Store (DynamoDB/Cassandra) and the Block Store (HDD/SSD Cluster).
Component Responsibilities
- API Service: Authentication (IAM), Rate Limiting, and Request Orchestration.
- Metadata Store: Stores `Buckets`, `ObjectKeys`, `ACLs`. Needs fast Key-Value lookups.
- Block Store: Stores immutable chunks of data (Blobs). Optimized for throughput (MB/s).
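The metadata/block split above can be sketched in a few lines. This is illustrative only: in-memory dicts stand in for DynamoDB (metadata) and the block cluster (content), and all names (`put_object`, `get_object`, `BLOCK_SIZE`) are made up for the example.

```python
import hashlib

metadata_store = {}   # object key -> ordered list of block IDs (the "directory")
block_store = {}      # block ID -> raw bytes (the "disk cluster")

BLOCK_SIZE = 4        # tiny for demonstration; real systems use multi-MB blocks

def put_object(key: str, data: bytes) -> None:
    """Split data into blocks, store the blocks, then record block IDs."""
    block_ids = []
    for i in range(0, len(data), BLOCK_SIZE):
        chunk = data[i:i + BLOCK_SIZE]
        block_id = hashlib.sha256(chunk).hexdigest()
        block_store[block_id] = chunk          # write content first
        block_ids.append(block_id)
    metadata_store[key] = block_ids            # commit metadata last

def get_object(key: str) -> bytes:
    """Metadata lookup first, then fetch each block by ID."""
    return b"".join(block_store[b] for b in metadata_store[key])
```

Writing the metadata entry last means a crash mid-upload leaves orphaned blocks (garbage-collectable) rather than a dangling directory entry.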
Data Partitioning
You cannot store Petabytes on one machine. We rely on Consistent Hashing to distribute data across thousands of nodes.
Namespace Partitioning
Objects are distributed based on a hash of `BucketName + ObjectKey`. This prevents hot spots unless a single key receives massive traffic (a case handled by caching).
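A minimal sketch of hash-based placement (the function name and MD5 choice are illustrative, not a real API):

```python
import hashlib

def partition_for(bucket: str, key: str, num_partitions: int) -> int:
    """Map BucketName + ObjectKey to a partition via a stable hash."""
    digest = hashlib.md5(f"{bucket}/{key}".encode()).hexdigest()
    return int(digest, 16) % num_partitions
```

Note the weakness of plain modulo: changing `num_partitions` remaps almost every key, which is exactly the problem consistent hashing solves.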
Consistent Hashing
A virtual ring topology. When a node is added or removed, only about 1/N of the keys need to be remapped, minimizing data shuffling.
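A minimal consistent-hash ring, assuming MD5 as the hash and 100 virtual nodes per physical node (both arbitrary choices for the sketch):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{node}#{v}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))  # first point clockwise of h
        return self.ring[i % len(self.ring)][1]
```

Adding a fourth node to a three-node ring moves only the keys whose clockwise-nearest point now belongs to the new node, roughly a quarter of them, rather than reshuffling everything as modulo hashing would.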
Erasure Coding vs Replication
How does S3 achieve 11 9s of durability without bankrupting AWS?
| Strategy | Concept | Storage Overhead | Use Case |
|---|---|---|---|
| 3x Replication | Stores 3 full copies of data. | 200% Overhead (3GB for 1GB data) | Fast Reads, Low Latency access (Hot Tier). |
| Erasure Coding (EC) | Splits data into `n` data chunks + `k` parity chunks. | ~50% Overhead (1.5GB for 1GB data) | High Durability, Cold/Warm Storage (S3 Standard). |
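The recovery idea behind erasure coding can be shown with the simplest possible code: `n` data chunks plus one XOR parity chunk survives the loss of any single chunk. Real systems use Reed-Solomon codes with `k` parity chunks, but the principle is the same; the function names here are illustrative.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunks):
    """Return the data chunks plus one parity chunk (XOR of them all)."""
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks + [parity]

def recover(stored, lost_index):
    """Rebuild the chunk at lost_index by XOR-ing all survivors."""
    survivors = [c for i, c in enumerate(stored) if i != lost_index]
    rebuilt = survivors[0]
    for c in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, c)
    return rebuilt
```

With 3 data chunks and 1 parity chunk, the overhead is only 33% versus 200% for 3x replication, yet any single lost chunk is recoverable.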
Implementation: Multipart Upload (Python Boto3)
For large files (>100MB), uploading in a single stream is risky: one network failure forces a full restart. Use Multipart Uploads to send chunks in parallel and retry only the failed parts.
```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,   # switch to multipart above 25MB
    max_concurrency=10,                     # parallel upload threads
    multipart_chunksize=25 * 1024 * 1024,   # 25MB per part
    use_threads=True
)

file_path = '/path/to/large_video.mp4'
key = 'uploads/video.mp4'
bucket = 'my-streaming-bucket'

def upload_file():
    # Boto3 handles the complex logic of:
    # 1. InitiateMultipartUpload
    # 2. UploadPart (in parallel threads)
    # 3. CompleteMultipartUpload
    s3.upload_file(file_path, bucket, key, Config=config)
    print("Upload Complete!")
```
Presigned URL for Secure Direct-to-S3 Upload (Client-Side)

```python
def generate_presigned_url():
    # The client PUTs directly to S3 using this URL; the application
    # server never proxies the file bytes.
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=3600  # URL valid for one hour
    )
```
Consistency Models: Strong vs Eventual
Historically, S3 offered Eventual Consistency: updates could take seconds to propagate, so a read immediately after a write might return stale data. In 2020, AWS moved S3 to Strong Consistency.
How Strong Consistency Works (Simplified)
When you write a new object, the request does not return 200 OK until the data is safely replicated to a quorum of nodes. Subsequent GET requests are guaranteed to see the new object immediately.
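The quorum idea can be sketched as follows. Assume N replicas, a write quorum W, and a read quorum R with W + R > N, so every read set overlaps at least one replica that holds the latest acknowledged write. This is a simplified model (writes always land on the first W replicas, reads on the last R), not S3's actual protocol.

```python
N, W, R = 3, 2, 2           # W + R > N guarantees overlap
replicas = [(0, None)] * N  # (version, value) held by each replica

def write(value: str, version: int) -> None:
    """Acknowledge only after W replicas persist the new version."""
    for i in range(W):                      # first W replicas
        replicas[i] = (version, value)

def read() -> str:
    """Query R replicas; the highest version seen wins."""
    version, value = max(replicas[N - R:])  # last R replicas
    return value
```

Here the write set {0, 1} and the read set {1, 2} always share replica 1, so a read issued right after an acknowledged write returns the new value.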
Summary
- Decouple Metadata: Separate the filename/index (DynamoDB) from the blob content (HDD).
- Erasure Coding: Use EC for massive durability at low cost compared to replication.
- Multipart Uploads: Essential for reliability and speed with large files.
- Consistency: Modern object stores prioritize Strong Consistency for better DX.