# Building the Heap: Racking 30 Petabytes of Hard Drives for Pretraining

By Standard Intelligence Team | September 30, 2025

---

## Overview

Standard Intelligence built a 30 petabyte (PB) video storage cluster in downtown San Francisco for under $500,000. The cluster stores 90 million hours of video for pretraining AI models aimed at solving computer use. Text language models require roughly 60 TB of data; video training at this scale requires about 500x more storage. Renting colocation space instead of using AWS cut the cost by approximately 40x, from $12 million to $354k per year (including depreciation).

---

## Why Build In-House Storage?

- **ML training tolerates data loss:** Unlike enterprise storage, which requires very high data integrity, pretraining is barely affected by losing 5% of the data.
- **Cost efficiency:** Cloud providers price storage well above its actual cost. Petabyte-scale customers are rare, and for big enterprises storage remains a small fraction of total spend.
- **Feasibility:** A local datacenter deployment is cost-effective and manageable without consuming too much team time.
- **Benchmark:** The Internet Archive also found self-hosted storage to be 10x cheaper than AWS, even after discounts.

---

## Cost Breakdown: Cloud vs. In-House

### Monthly Recurring Costs (Datacenter)

| Item | Cost | Notes |
|-------------|---------------|--------------------------------|
| Internet | $7,500/month | 100 Gbps Dedicated Internet Access (DIA) from Zayo, 1-year term |
| Electricity | $10,000/month | Includes power, cooling, and cabinet space; 1 kW/PB at $330/kW |
| **Total** | $17,500/month | |

### One-Time Costs (CapEx)

| Category | Item | Cost | Details |
|------------------------|-----------------------|----------|---------------------------------------------|
| Storage | 2,400 hard drives | $300,000 | Mostly 12 TB used enterprise SATA/SAS drives |
| Storage infrastructure | NetApp DS4246 chassis | $35,000 | 100 chassis, each 4U, dual SATA/SAS |
| Compute | CPU head nodes | $6,000 | 10 Intel RR2000 servers from eBay |
| Datacenter setup | Installation fee | $38,500 | One-time datacenter install |
| Labor | Contractors | $27,000 | Physical install and wiring help |
| Networking & misc | Cabling, router, NICs | $20,000 | Power cables, Arista router, NICs |
| **Total** | | $426,500 | |

With three-year depreciation of the one-time costs, the total monthly cost is about $29,500 (recurring costs plus depreciation).

### Cloud Pricing Comparison

| Provider | Monthly Cost (30 PB) | Cost per TB per Month | Notes |
|------------------------|--------------------------|-----------------------|---------------------------------------------------|
| AWS | $1,130,000 | $38 | Includes $630k storage + $500k egress |
| Cloudflare R2 | $270,000 (discounted) | $10 | No egress fees; bulk-discounted pricing |
| Backblaze (AI product) | ~$180,000 to $450,000 | $6-15 | Cheaper but limited egress and speed; monthly figure derived from the per-TB price at 30 PB |
| In-house datacenter | $29,500 | $1 | About 40x cheaper than AWS, 10x cheaper than Cloudflare |

---

## Setup Process: "Storage Stacking Saturday" (S3)

- The team held a 36-hour hard drive stacking party in San Francisco with friends, food, and custom-engraved HDDs.
- All 30 PB of hardware was racked and wired quickly.
- Contractors were brought in later for installation support.

### Software

The setup is deliberately simple: about 200 lines of Rust for writes, nginx for reads, and SQLite for metadata.
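The division of labor described here (a small writer, a static file server for reads, SQLite for metadata) could be sketched roughly as follows. This is only an illustration under stated assumptions: the real writer is described as Rust, while this sketch uses Python for brevity, and the `objects` table schema, the `put` and `locate` helpers, and the "most free space" placement rule are all hypothetical rather than taken from the team's code.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path
from typing import List, Optional


def open_index(db_path: str) -> sqlite3.Connection:
    """Open (or create) a metadata index with one row per stored object."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        " key TEXT PRIMARY KEY,"      # logical object name
        " drive TEXT NOT NULL,"       # mount point of the drive holding it
        " rel_path TEXT NOT NULL,"    # file name on that drive
        " size INTEGER NOT NULL)"
    )
    return conn


def pick_drive(drives: List[Path]) -> Path:
    """Assumed placement rule: use the drive with the most free space."""
    return max(drives, key=lambda d: shutil.disk_usage(d).free)


def put(conn: sqlite3.Connection, drives: List[Path], key: str, data: bytes) -> Path:
    """Write one object to exactly one drive (no replication), then index it."""
    drive = pick_drive(drives)
    rel_path = hashlib.sha256(key.encode()).hexdigest()
    target = drive / rel_path
    tmp = target.with_suffix(".tmp")
    tmp.write_bytes(data)
    tmp.replace(target)  # rename within one filesystem, so readers never see a partial file
    with conn:  # commits the transaction on success
        conn.execute(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
            (key, str(drive), rel_path, len(data)),
        )
    return target


def locate(conn: sqlite3.Connection, key: str) -> Optional[Path]:
    """Return the on-disk path for a key, or None if it was never stored."""
    row = conn.execute(
        "SELECT drive, rel_path FROM objects WHERE key = ?", (key,)
    ).fetchone()
    return Path(row[0]) / row[1] if row else None
```

In a layout like this, reads need no custom code: a static file server such as nginx can serve the path that `locate` returns directly from the drive's mount point.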
The team chose simplicity over more complex solutions like Ceph or MinIO because:

- There was no need for elaborate features like redundancy or sharding.
- A simple system is easier to debug and maintain.

All drives are formatted with the XFS filesystem.

---

## Post-Mortem: