109 Iowa L. Rev. 1087 (2024)



A data-generation revolution is underway. Until recently, most of the data used for algorithmic decision-making was collected from events that took place in the physical world (“collected” data). Yet it is forecast that by 2024, sixty percent of data used to train artificial intelligence systems around the world will be synthetic (!). Synthetic data is artificially generated data that has analytical value. For some purposes, synthetic datasets can replace collected data by preserving or mimicking its properties. For others, synthetic data can complement collected data in ways that increase its accuracy or enhance privacy or security protections. The importance of this data revolution for our economies and societies cannot be overstated. It affects data access and data flows, potentially changing the competitive dynamics in markets where data cannot be easily collected and affecting decision-making in many spheres of our lives. In many ways, synthetic data does to data what synthetic threads did to cotton.

This data-generation revolution requires us to reevaluate and potentially restructure our legal data governance regime, which was designed with collected data in mind. As we show, synthetic data challenges the equilibrium erected by existing laws to ensure the protection of competing values, including data utility, privacy, security, and human rights. For instance, by revolutionizing data access, synthetic data challenges assumptions regarding the height of access barriers to data. As such, it may affect the need for and the application of antitrust and direct regulation to some firms whose comparative advantage is data-based.

Even more importantly, by potentially making data about individuals more granular, and by increasing the accuracy and completeness of data used for decision-making about individuals, synthetic data also challenges the governance structures and basic principles underpinning current privacy laws. Indeed, many argue that synthetic data does not constitute personal data, and thus avoids the application of privacy laws. We challenge this claim. We also show that synthetic data exposes deep conceptual flaws in the data governance framework. It raises fundamental questions, such as whether data that is not linked to a person in the original dataset should still be treated as personal data, and how inferences based on collected data should be treated.

We then reevaluate the justifications for legal requirements regarding data quality, such as data completeness and accuracy, as well as those relating to fair and informed decision-making, such as data transparency and explainability. It is often claimed that such obligations enhance social welfare. Yet, as we show, synthetic data changes the balance between the protected values, potentially leading to different optimal legal requirements in different contexts. For example, where synthetic data significantly increases consumer welfare, yet the underlying processes are not easily explained, requirements to look under the hood of datasets and provide a detailed explanation of what led to a decision might not always be welfare-maximizing.

This Article seeks to bring state-of-the-art data generation methods into the legal debate and to propose legal reforms that capture the unique characteristics of synthetic data. While some of the challenges discussed here also arise with the use of collected data, synthetic data puts these challenges on steroids.

Friday, March 15, 2024