Synthesizing tables—creating artificial datasets that closely resemble real ones—plays a crucial role in supervised machine learning (ML), with a wide range of practical applications. These include data augmentation, where synthetic data enhances training datasets, and the publication of fake tables that maintain the privacy of real data. A core challenge is: given a real table, can we generate a synthetic version that allows ML models, trained on either the real or synthetic table, to perform similarly on an unseen test set?
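The challenge above is commonly evaluated with a "train on real vs. train on synthetic" protocol: fit the same model once on the real table and once on the synthetic one, score both on a held-out test set, and compare. A minimal sketch of that protocol, using a toy nearest-centroid classifier and Gaussian data as hypothetical stand-ins for the model and the tables (none of these specifics come from the paper):

```python
import numpy as np

def centroid_classifier(X, y):
    """Fit a toy nearest-centroid classifier; returns a predict function."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    def predict(X_new):
        # Squared distance from each point to each class centroid.
        d = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        return classes[d.argmin(axis=1)]
    return predict

def utility_gap(X_real, y_real, X_syn, y_syn, X_test, y_test):
    """Accuracy difference: train-on-real minus train-on-synthetic."""
    acc_real = (centroid_classifier(X_real, y_real)(X_test) == y_test).mean()
    acc_syn = (centroid_classifier(X_syn, y_syn)(X_test) == y_test).mean()
    return acc_real - acc_syn

rng = np.random.default_rng(0)

def sample(n):
    # Two well-separated Gaussian classes, as a stand-in for a real table.
    y = rng.integers(0, 2, n)
    X = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(n, 2))
    return X, y

X_real, y_real = sample(500)
X_syn, y_syn = sample(500)   # stands in for a generator's output
X_test, y_test = sample(200)
gap = utility_gap(X_real, y_real, X_syn, y_syn, X_test, y_test)
```

A gap near zero indicates the synthetic table is as useful for training as the real one; here it is small by construction, since the "synthetic" data is drawn from the true distribution.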
Most existing approaches to table synthesis employ deep generative models such as GANs (Generative Adversarial Networks) or VAEs (Variational Autoencoders) to learn the data distribution of a real table from sampled records. However, these methods typically treat records as independent, ignoring potential correlations between them. In practice, this independence assumption is frequently violated: purchase records for the same product, for example, are likely to be correlated, and failing to capture such relationships can yield synthetic tables that differ substantially from the real data. The resulting structural discrepancy can cause ML models trained on synthetic tables to perform markedly worse than models trained on the real ones.
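The effect of such within-group correlation is easy to demonstrate numerically. In this illustrative sketch (not from the paper), each product carries a shared latent price level, so records for the same product are correlated; the variance of per-group means then far exceeds what an i.i.d. model would predict:

```python
import numpy as np

rng = np.random.default_rng(1)
n_products, per_product = 50, 20

# Each product has its own baseline level -- the shared, latent factor.
product_effect = rng.normal(0.0, 2.0, n_products)           # between-group spread
amounts = product_effect[:, None] + rng.normal(0.0, 1.0, (n_products, per_product))

# Under i.i.d. records, the variance of a group mean would be
# total_var / per_product; correlated records inflate it far beyond that.
total_var = amounts.var()
mean_var = amounts.mean(axis=1).var()
iid_prediction = total_var / per_product
inflation = mean_var / iid_prediction
```

A generator that samples records independently would reproduce `total_var` but drastically understate `mean_var`, which is exactly the structural mismatch described above.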
To address these problems, a research team led by Yaoyu ZHU published new research on 15 March 2026 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
Their method explicitly models record correlations by grouping data based on user-defined categorical values (e.g., records with the same (Market, Product) pair should belong to the same group). By leveraging these groups, the team applies conditional GANs to model both discrete (categorical) and continuous (numerical) values within each record, ensuring that both global (overall table) and local (within-group) data distributions are preserved in the synthetic table.
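The grouping idea can be sketched with a much simpler stand-in for the conditional generator: fit per-group statistics keyed on the user-defined categorical values, sample a group according to its frequency, then sample a value within that group. The per-group Gaussian here is a hypothetical simplification (the paper uses conditional GANs), but the two-stage structure is the same:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_group_model(keys, values):
    """Per-group frequency, mean, and std -- a toy stand-in for a
    conditional generator (the paper trains conditional GANs instead)."""
    groups = {}
    for k in set(keys):
        v = values[keys == k]
        groups[k] = (len(v) / len(values), v.mean(), v.std())
    return groups

def sample_synthetic(groups, n):
    ks = list(groups)
    probs = np.array([groups[k][0] for k in ks])
    idx = rng.choice(len(ks), size=n, p=probs)       # pick a group first...
    out_keys = [ks[i] for i in idx]
    out_vals = np.array([rng.normal(groups[ks[i]][1], groups[ks[i]][2])
                         for i in idx])              # ...then sample within it
    return np.array(out_keys), out_vals

# Toy table: a (Market, Product) key flattened to one string, plus a value.
keys = np.array(["US:A"] * 300 + ["US:B"] * 100 + ["EU:A"] * 100)
values = np.concatenate([rng.normal(10, 1, 300),
                         rng.normal(50, 5, 100),
                         rng.normal(20, 2, 100)])
groups = fit_group_model(keys, values)
syn_keys, syn_vals = sample_synthetic(groups, 5000)
```

Sampling the group first preserves the global distribution (group frequencies), while sampling within the chosen group preserves each local distribution.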
In addition, the team extends previous work on differentially private GANs (DPGANs), which ensured privacy only for the discriminator, by further securing the privacy of original data embeddings and sample frequencies. This added layer of protection ensures that the synthetic data not only retains its usefulness but also comes with stronger privacy guarantees.
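Protecting sample frequencies typically means releasing only noisy counts. A minimal sketch using the standard Laplace mechanism, assuming each individual contributes at most one record (so the sensitivity of each count is 1); this illustrates the general technique, not the paper's specific construction:

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_frequencies(counts, epsilon):
    """Laplace mechanism on group counts: adding or removing one record
    changes one count by 1, so sensitivity is 1 and noise scale is 1/epsilon."""
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=len(counts))
    noisy = np.clip(noisy, 0.0, None)   # counts cannot be negative
    return noisy / noisy.sum()          # renormalise to probabilities

counts = np.array([300.0, 100.0, 100.0])   # records per (Market, Product) group
probs = dp_frequencies(counts, epsilon=1.0)
```

The generator then conditions on these noisy frequencies rather than the exact ones, so group sizes in the synthetic table no longer reveal exact counts in the real one.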
Experimental results demonstrate that this approach significantly outperforms current state-of-the-art table synthesis methods for supervised ML tasks, offering both high utility and robust privacy protection.
Journal: Frontiers of Computer Science
Method of Research: Experimental study
Subject of Research: Not applicable
Article Title: Synthesizing tables for supervised learning
Article Publication Date: 15-Mar-2026