AWS has a well-documented approach to protecting data lakes using services like AWS Backup, cross-region replication, vaults, and policy-based automation (AWS, Best Practices for Data Lake Protection with AWS Backup). On paper, it looks comprehensive.
In practice, it doesn’t work.
The problem is not that the recommendations are wrong. It’s that they are built on assumptions that no longer hold once data reaches petabyte scale, and they come from a model where the platform providing the guidance also benefits from the additional storage, replication, and services required to implement it.
The guidance is consistent across AWS materials:
AWS Backup is positioned as a centralized, policy-driven way to manage this (AWS Backup overview). At a high level, the strategy is simple: copy the data, secure the copies, and restore when needed.
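To make that model concrete, here is a minimal boto3 sketch of a policy-driven backup plan that copies every recovery point into a second region. The vault names, account IDs, schedule, and retention values are illustrative placeholders, not AWS recommendations:

```python
import boto3

# Minimal sketch of the "copy the data, secure the copies" model with AWS Backup.
# Vault names, account IDs, schedule, and retention values are placeholders.
backup = boto3.client("backup", region_name="us-east-1")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "data-lake-protection",
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": "primary-vault",
                "ScheduleExpression": "cron(0 5 * * ? *)",  # daily at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        # Every recovery point is duplicated into a second region,
                        # adding stored data and cross-region transfer.
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault"
                        ),
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }
)

# Attach the plan by tag; every matching resource inherits the same policy.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "data-lake-resources",
        "IamRoleArn": "arn:aws:iam::123456789012:role/BackupServiceRole",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "team",
                "ConditionValue": "data-lake",
            }
        ],
    },
)
```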
This model also aligns directly with how cloud platforms monetize data. More copies mean more storage, more replication, more data transfer, and more services managing it. The same platform providing the guidance also benefits from the way that guidance is implemented, creating an inherent tension between recommended architecture and consumption.
This model works for traditional workloads, but it breaks down once data lakes reach petabyte scale and keep growing.
AWS explicitly recommends maintaining multiple copies of backup data across accounts and regions (AWS prescriptive guidance).
That means duplicating large datasets, paying for storage multiple times, operating multiple environments, and expanding the attack surface with each additional copy.
At terabyte scale, this can be manageable. At petabyte scale, it becomes financially and operationally impractical.
A multi-petabyte data lake is not copied once. It is replicated across regions, accounts, and retention windows. Costs compound across storage and data transfer, operational complexity increases, and each additional copy becomes another system that must be secured, monitored, and governed.
At scale, resilience becomes tied to consumption. The more you try to protect, the more infrastructure you are required to maintain, and the cost model grows faster than the data itself.
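A back-of-the-envelope calculation shows how quickly the copies dominate cost. The unit prices below are illustrative assumptions, not quoted AWS list prices:

```python
# Rough cost of copying a petabyte-scale lake.
# Unit prices are illustrative assumptions, not current AWS list prices.
LAKE_TB = 5_000                 # 5 PB data lake
STORAGE_PER_TB_MONTH = 23.0     # assumed object-storage rate, USD per TB-month
TRANSFER_PER_TB = 20.0          # assumed cross-region transfer rate, USD per TB
COPIES = 3                      # primary + cross-region copy + cross-account copy

storage_monthly = LAKE_TB * STORAGE_PER_TB_MONTH * COPIES
initial_transfer = LAKE_TB * TRANSFER_PER_TB * (COPIES - 1)

print(f"Monthly storage across {COPIES} copies: ${storage_monthly:,.0f}")
print(f"One-time transfer to seed the extra copies: ${initial_transfer:,.0f}")
# With these assumptions: ~$345,000 per month in storage and ~$200,000
# just to seed the copies, before operations, monitoring, or restores.
```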
AWS recommends restoring from backup copies when something goes wrong (AWS backup & recovery guidance).
In reality, restoring petabyte-scale data takes time. Rehydrating backups slows pipelines. Rebuilding environments is not instantaneous.
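A rough sense of scale, assuming an aggregate restore throughput (which in practice is bounded by service limits and request parallelism):

```python
# Rough restore-time estimate; throughput figures are assumptions, not service guarantees.
DATA_PB = 1
BYTES = DATA_PB * 10**15

for gbps in (1, 5, 10):  # assumed aggregate restore throughput in GB/s
    seconds = BYTES / (gbps * 10**9)
    print(f"{DATA_PB} PB at {gbps} GB/s is roughly {seconds / 3600:.0f} hours")
# 1 PB at  1 GB/s -> ~278 hours (about 11.5 days)
# 1 PB at  5 GB/s -> ~56 hours
# 1 PB at 10 GB/s -> ~28 hours
```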
For AI systems, this is not a minor inconvenience. Pipelines depend on continuous access to data. Models depend on consistent training inputs. Downstream systems depend on both. Recovery is no longer about getting files back. It is about restoring an entire operating system of data. That process introduces delays at exactly the moment systems need to recover quickly.
AWS Backup centralizes policies, access, backup operations, and retention rules.
That creates a dependency on a single control plane managing an increasingly large and interconnected dataset. As the data lake grows, so does the blast radius of any mistake, with risk scaling alongside size and centralization.
One misconfigured policy, one compromised role, or one lifecycle error can affect the entire environment. Backups can be deleted, copies can be corrupted or poisoned, and retention guarantees can fail.
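As a hypothetical illustration of that blast radius, a single update to a shared backup plan changes retention for every resource selected under it in one call. The plan ID, rule, and retention values below are placeholders:

```python
import boto3

# Hypothetical example: one policy change affects every resource under the plan.
# Plan ID, vault name, and retention values are placeholders.
backup = boto3.client("backup")

backup.update_backup_plan(
    BackupPlanId="11111111-2222-3333-4444-555555555555",
    BackupPlan={
        "BackupPlanName": "data-lake-protection",
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": "primary-vault",
                "ScheduleExpression": "cron(0 5 * * ? *)",
                # A fat-fingered retention value: new recovery points now expire
                # after a single day, across every resource the plan covers.
                "Lifecycle": {"DeleteAfterDays": 1},
            }
        ],
    },
)
```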
The entire approach is reactive. Detect an issue, restore from backup, and recover the system.
AI introduces a different failure mode: silent poisoning.
Small changes to training data, feature values, or transformation logic can alter model behavior without triggering traditional alerts. The data remains available. The backups remain intact. The system continues to run. The outputs are simply wrong.
Backup systems are not designed to detect or prevent this type of failure. They assume data loss or corruption is visible and recoverable. That assumption does not hold for modern data systems.
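A minimal sketch of why backup-level integrity checks miss this: the checksum of a poisoned dataset still matches its own backup copy, because the backup faithfully copies the poisoned data. Only a comparison against an expected statistical baseline (the feature values below are hypothetical) makes the shift visible:

```python
import hashlib
import statistics

# Hypothetical feature values: the "poisoned" set silently alters a few records.
baseline = [0.48, 0.51, 0.50, 0.49, 0.52, 0.50]
poisoned = [0.48, 0.51, 0.80, 0.49, 0.82, 0.50]

def checksum(values):
    return hashlib.sha256(repr(values).encode()).hexdigest()

# Backup-style integrity check: source and copy always match,
# because the copy faithfully preserves the poisoned data.
backup_copy = list(poisoned)
print("backup matches source:", checksum(poisoned) == checksum(backup_copy))  # True

# Distribution-level check against an expected baseline: the shift is visible,
# but only if something other than the backup system is looking for it.
drift = abs(statistics.mean(poisoned) - statistics.mean(baseline))
print("mean drift:", round(drift, 3), "-> alert" if drift > 0.05 else "-> ok")
```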