Scalable Storage

File Upload & Media Management

Designing robust upload pipelines: S3 Presigned URLs, chunked multipart uploads, virus scanning, and global CDN delivery.

High-Level Architecture (Direct Upload)

Avoid uploading files to your backend API servers. It wastes bandwidth and blocks threads. Upload directly to Object Storage (S3).

1. Client Request: App requests an upload URL from the API (GET /api/upload-url).
2. Generate URL: API validates permissions and generates a temporary S3 Presigned URL.
3. Direct Upload: Client uploads the file directly to S3 using the Presigned URL (PUT request).
4. Processing: An S3 event triggers a Lambda function (virus scan, thumbnail generation).
5. CDN Delivery: Processed files are served globally via CloudFront.

S3 Presigned URLs

A secure way to grant temporary upload access without exposing AWS credentials.

✅ Benefits
  • Scalability: Your servers don't handle file data.
  • Security: URL expires quickly (e.g., 5 mins).
  • Access Control: You can pin the Content-Type; enforcing a maximum size requires a presigned POST with a content-length-range condition (a plain presigned PUT cannot cap size).
❌ Limitations
  • Complexity: Client must make two requests (fetch URL, then upload).
  • CORS: The S3 bucket's CORS configuration must allow the upload origin and method.
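
A minimal CORS rule set for browser uploads might look like the following (console-style JSON; the allowed origin is a placeholder for your app's domain):

```json
[
  {
    "AllowedOrigins": ["https://app.example.com"],
    "AllowedMethods": ["PUT"],
    "AllowedHeaders": ["Content-Type"],
    "ExposeHeaders": ["ETag"],
    "MaxAgeSeconds": 3000
  }
]
```

Exposing ETag matters for multipart uploads, where the client must echo each part's ETag back when completing the upload.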
Python (Boto3) Implementation
import boto3
from botocore.exceptions import ClientError

def generate_presigned_url(bucket_name, object_name,
                           content_type='image/jpeg', expiration=300):
    """Generate a presigned URL that lets a client PUT an object to S3."""
    s3_client = boto3.client('s3')
    try:
        # The client must upload with this exact Content-Type header,
        # otherwise S3 rejects the request with a signature mismatch.
        url = s3_client.generate_presigned_url(
            'put_object',
            Params={'Bucket': bucket_name,
                    'Key': object_name,
                    'ContentType': content_type},
            ExpiresIn=expiration,  # seconds
        )
    except ClientError as e:
        print(e)
        return None
    return url

# Usage
url = generate_presigned_url('my-bucket', 'uploads/user1/avatar.jpg')
print(f"Upload here: {url}")

Large Files: Multipart Upload

For files over 100MB, a standard single-request upload must restart from zero if the connection drops. Use Multipart Upload.

Feature          | Standard Upload              | Multipart Upload
-----------------|------------------------------|-------------------------------------
Method           | Single PUT request           | Split into chunks (e.g., 5MB parts)
Failure Recovery | Restart from 0%              | Retry only failed chunks
Speed            | Linear                       | Parallel uploads (faster)
Use Case         | Avatars, documents (<100MB)  | Videos, large archives (>100MB)
Resumable Uploads: Combine Multipart Upload with backend state (e.g., Redis) that tracks which parts have been uploaded. After a disconnect, the client asks "which parts do you already have?" and re-sends only the missing ones.
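
The part bookkeeping behind this is simple arithmetic. A minimal sketch of planning fixed-size parts (1-based part numbers, following S3's convention) and computing which are still missing, where `uploaded` stands in for the set the backend keeps in Redis:

```python
def plan_parts(total_size, part_size=5 * 1024 * 1024):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    parts = []
    offset, part_number = 0, 1
    while offset < total_size:
        length = min(part_size, total_size - offset)  # last part may be short
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts

def missing_parts(total_size, uploaded, part_size=5 * 1024 * 1024):
    """Parts not yet acknowledged by the backend (e.g., a Redis set)."""
    return [p for p in plan_parts(total_size, part_size)
            if p[0] not in uploaded]
```

For a 12 MB file this yields three parts (5 MB, 5 MB, 2 MB); if the backend reports parts 1 and 3 as received, only part 2 is re-sent.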

Security Best Practices

Virus Scanning

Trigger a Lambda function on S3 upload to scan with ClamAV. Quarantine malicious files immediately.

Validate MIME Types

Don't trust the file extension. Inspect the "Magic Bytes" (file signature) to ensure it's a valid image/pdf.
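
A minimal sketch of signature checking with a few well-known magic bytes (JPEG, PNG, PDF only; a production system would cover more formats or use a dedicated library such as python-magic):

```python
from typing import Optional

# Leading bytes ("magic bytes") of a few common formats; extend as needed.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
}

def sniff_mime(data: bytes) -> Optional[str]:
    """Detect the MIME type from the file's leading bytes, or None if unknown."""
    for magic, mime in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None

def is_allowed(data: bytes, claimed_mime: str) -> bool:
    """Accept the upload only if the real bytes match the claimed type."""
    return sniff_mime(data) == claimed_mime
```

A renamed executable claiming to be `avatar.jpg` fails this check because its bytes do not start with the JPEG signature.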

Private Bucket

Keep the S3 bucket Private. Only serve public content via CloudFront (CDN) or Signed URLs for private files.

Media Processing Pipeline

Never serve raw user uploads. They are usually unoptimized (e.g., a 5MB 4K image used as a 50px avatar).

  1. Trigger: S3 ObjectCreated event sends message to SQS.
  2. Worker: Consumes SQS, downloads image.
  3. Process: Resizes (thumbnails), optimizes (WebP/AVIF), strips metadata (EXIF).
  4. Save: Uploads processed versions to a "Public/Processed" bucket.
  5. Serve: Application uses URL of processed version.
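
Keeping the naming in step 4 deterministic lets the application build processed URLs without a database lookup. A sketch, assuming originals land under `uploads/` and variants under `processed/` (the variant names and sizes are illustrative):

```python
import posixpath

# Hypothetical variant set: name -> longest-edge pixels.
VARIANTS = {"thumb": 64, "medium": 512}

def processed_keys(original_key, fmt="webp"):
    """Map an original S3 key to the keys of its processed variants.

    'uploads/user1/avatar.jpg' -> {'thumb': 'processed/user1/avatar_thumb.webp', ...}
    """
    # Strip the leading 'uploads/' prefix, then drop the extension.
    rel = original_key.split("/", 1)[1] if "/" in original_key else original_key
    stem, _ = posixpath.splitext(rel)
    return {name: f"processed/{stem}_{name}.{fmt}" for name in VARIANTS}
```

The worker writes each variant to its derived key; the application only ever links to `processed/...` paths.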

Summary

  • Use Presigned URLs for direct client-to-S3 uploads (scalable).
  • Implement Multipart Uploads for large files to support retries and parallel chunks.
  • Scan every file for Viruses asynchronously.
  • Serve content via CDN (CloudFront/Cloudflare) for speed and caching.
  • Never overwrite the original file; store processed versions separately.