AWS has a well-documented approach to protecting data lakes using services like AWS Backup, cross-region replication, vaults, and policy-based automation (AWS, Best Practices for Data Lake Protection with AWS Backup). On paper, it looks comprehensive.
In practice, it doesn’t work.
The problem is not that the recommendations are wrong. It’s that they are built on assumptions that no longer hold once data reaches petabyte scale, and they come from a model where the platform providing the guidance also benefits from the additional storage, replication, and services required to implement it.
The guidance is consistent across AWS materials:
AWS Backup is positioned as a centralized, policy-driven way to manage this (AWS Backup overview). At a high level, the strategy is simple: copy the data, secure the copies, and restore when needed.
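To make that model concrete, here is a minimal boto3 sketch of a policy-driven backup plan that copies every recovery point into a second region. The vault names, account IDs, schedule, and retention values are illustrative placeholders, not AWS recommendations:

```python
import boto3

# Minimal sketch of the "copy the data, secure the copies" model with AWS Backup.
# Vault names, account IDs, schedule, and retention values are placeholders.
backup = boto3.client("backup", region_name="us-east-1")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "data-lake-protection",
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": "primary-vault",
                "ScheduleExpression": "cron(0 5 * * ? *)",  # daily at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        # Every recovery point is duplicated into a second region,
                        # adding stored data and cross-region transfer.
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault"
                        ),
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }
)

# Attach the plan by tag; every matching resource inherits the same policy.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "data-lake-resources",
        "IamRoleArn": "arn:aws:iam::123456789012:role/BackupServiceRole",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "team",
                "ConditionValue": "data-lake",
            }
        ],
    },
)
```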
This model also aligns directly with how cloud platforms monetize data. More copies mean more storage, more replication, more data transfer, and more services managing it. The same platform providing the guidance also benefits from the way that guidance is implemented, creating an inherent tension between recommended architecture and consumption.
This model works for traditional workloads, but it breaks down once data lakes reach petabyte scale and keep growing.
AWS explicitly recommends maintaining multiple copies of backup data across accounts and regions (AWS prescriptive guidance).
That means duplicating large datasets, paying for storage multiple times, operating multiple environments, and expanding the attack surface with each additional copy.
At terabyte scale, this can be manageable. At petabyte scale, it becomes financially and operationally impractical.
A multi-petabyte data lake is not copied once. It is replicated across regions, accounts, and retention windows. Costs compound across storage and data transfer, operational complexity increases, and each additional copy becomes another system that must be secured, monitored, and governed.
At scale, resilience becomes tied to consumption. The more you try to protect, the more infrastructure you are required to maintain, and the cost model grows faster than the data itself.
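A back-of-the-envelope calculation shows how quickly the copies dominate cost. The unit prices below are illustrative assumptions, not quoted AWS list prices:

```python
# Rough cost of copying a petabyte-scale lake.
# Unit prices are illustrative assumptions, not current AWS list prices.
LAKE_TB = 5_000                 # 5 PB data lake
STORAGE_PER_TB_MONTH = 23.0     # assumed object-storage rate, USD per TB-month
TRANSFER_PER_TB = 20.0          # assumed cross-region transfer rate, USD per TB
COPIES = 3                      # primary + cross-region copy + cross-account copy

storage_monthly = LAKE_TB * STORAGE_PER_TB_MONTH * COPIES
initial_transfer = LAKE_TB * TRANSFER_PER_TB * (COPIES - 1)

print(f"Monthly storage across {COPIES} copies: ${storage_monthly:,.0f}")
print(f"One-time transfer to seed the extra copies: ${initial_transfer:,.0f}")
# With these assumptions: ~$345,000 per month in storage and ~$200,000
# just to seed the copies, before operations, monitoring, or restores.
```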
AWS recommends restoring from backup copies when something goes wrong (AWS backup & recovery guidance).
In reality, restoring petabyte-scale data takes time. Rehydrating backups slows pipelines. Rebuilding environments is not instantaneous.
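A rough sense of scale, assuming an aggregate restore throughput (which in practice is bounded by service limits and request parallelism):

```python
# Rough restore-time estimate; throughput figures are assumptions, not service guarantees.
DATA_PB = 1
BYTES = DATA_PB * 10**15

for gbps in (1, 5, 10):  # assumed aggregate restore throughput in GB/s
    seconds = BYTES / (gbps * 10**9)
    print(f"{DATA_PB} PB at {gbps} GB/s is roughly {seconds / 3600:.0f} hours")
# 1 PB at  1 GB/s -> ~278 hours (about 11.5 days)
# 1 PB at  5 GB/s -> ~56 hours
# 1 PB at 10 GB/s -> ~28 hours
```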
For AI systems, this is not a minor inconvenience. Pipelines depend on continuous access to data. Models depend on consistent training inputs. Downstream systems depend on both. Recovery is no longer about getting files back. It is about restoring an entire operating system of data. That process introduces delays at exactly the moment systems need to recover quickly.
AWS Backup centralizes policies, access, backup operations, and retention rules.
That creates a dependency on a single control plane managing an increasingly large and interconnected dataset. As the data lake grows, so does the blast radius of any mistake, with risk scaling alongside size and centralization.
One misconfigured policy, one compromised role, or one lifecycle error can affect the entire environment. Backups can be deleted, copies can be corrupted or poisoned, and retention guarantees can fail.
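As a hypothetical illustration of that blast radius, a single update to a shared backup plan changes retention for every resource selected under it in one call. The plan ID, rule, and retention values below are placeholders:

```python
import boto3

# Hypothetical example: one policy change affects every resource under the plan.
# Plan ID, vault name, and retention values are placeholders.
backup = boto3.client("backup")

backup.update_backup_plan(
    BackupPlanId="11111111-2222-3333-4444-555555555555",
    BackupPlan={
        "BackupPlanName": "data-lake-protection",
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": "primary-vault",
                "ScheduleExpression": "cron(0 5 * * ? *)",
                # A fat-fingered retention value: new recovery points now expire
                # after a single day, across every resource the plan covers.
                "Lifecycle": {"DeleteAfterDays": 1},
            }
        ],
    },
)
```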
The entire approach is reactive. Detect an issue, restore from backup, and recover the system.
AI introduces a different failure mode: silent poisoning.
Small changes to training data, feature values, or transformation logic can alter model behavior without triggering traditional alerts. The data remains available. The backups remain intact. The system continues to run. The outputs are simply wrong.
Backup systems are not designed to detect or prevent this type of failure. They assume data loss or corruption is visible and recoverable. That assumption does not hold for modern data systems.
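A minimal sketch of why backup-level integrity checks miss this: the checksum of a poisoned dataset still matches its own backup copy, because the backup faithfully copies the poisoned data. Only a comparison against an expected statistical baseline (the feature values below are hypothetical) makes the shift visible:

```python
import hashlib
import statistics

# Hypothetical feature values: the "poisoned" set silently alters a few records.
baseline = [0.48, 0.51, 0.50, 0.49, 0.52, 0.50]
poisoned = [0.48, 0.51, 0.80, 0.49, 0.82, 0.50]

def checksum(values):
    return hashlib.sha256(repr(values).encode()).hexdigest()

# Backup-style integrity check: source and copy always match,
# because the copy faithfully preserves the poisoned data.
backup_copy = list(poisoned)
print("backup matches source:", checksum(poisoned) == checksum(backup_copy))  # True

# Distribution-level check against an expected baseline: the shift is visible,
# but only if something other than the backup system is looking for it.
drift = abs(statistics.mean(poisoned) - statistics.mean(baseline))
print("mean drift:", round(drift, 3), "-> alert" if drift > 0.05 else "-> ok")
```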