Author: Matthew Renze
Published: 2021-04-15

How do we use AI to generate new tabular data from scratch?

In my last article in this series on The AI Developer’s Toolkit, I introduced you to the three most popular AI tools for tabular data analysis. These tools allowed us to extract useful information from tables of data.

However, there are many cases where we want to generate new tabular data from scratch. This set of tasks is referred to as tabular data synthesis.

In this article, I’ll introduce you to the three most popular AI tools for tabular data synthesis.

Data Imputation

Data imputation allows us to predict the value of missing categorical or numerical data. It answers the question “what data should go here?” and it does so in a statistically sensible way.

For example, imagine we have a table of weather data, but we are missing various sensor readings on several days. We provide the data-imputation model with our table of data containing missing values as input. Then the model produces a complete table with those missing values filled in with sensible replacement values as output.

Imputation is useful anytime you have missing values in your data, but you need complete data to perform your task. For example:

  • creating data visualizations that require all rows of data
  • performing aggregate analysis that requires summing values
  • creating machine-learning datasets that require complete data for training
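
To make the idea concrete, here is a minimal sketch of imputation in Python. It is not an AI model. It uses a simple statistical baseline instead, filling numeric gaps with the column mean and categorical gaps with the most common value. The table, the column names, and the impute function are all hypothetical and exist only for illustration.

```python
from statistics import mean, mode

# Hypothetical weather readings; None marks a missing sensor value.
rows = [
    {"day": "Mon", "temp_c": 20.0, "sky": "sunny"},
    {"day": "Tue", "temp_c": None, "sky": "cloudy"},
    {"day": "Wed", "temp_c": 24.0, "sky": None},
    {"day": "Thu", "temp_c": 22.0, "sky": "sunny"},
]


def impute(rows, numeric_cols, categorical_cols):
    """Return a copy of rows with missing values filled in.

    Baseline strategy (an assumption of this sketch, not a learned model):
    numeric columns get the column mean, categorical columns get the mode.
    """
    fills = {}
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    # Build new dicts so the original table is left untouched.
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


completed = impute(rows, numeric_cols=["temp_c"], categorical_cols=["sky"])
for row in completed:
    print(row)
```

A learned imputation model would typically use the other columns in each row to predict the missing value, rather than a single fill value per column, but the input and output have the same shape as in this sketch.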

Data Generation

Data generation allows us to synthesize entire tables of data from scratch. It creates a new table of data while preserving the statistical characteristics of the original data.

For example, if our production database contains sensitive data, but we need a proxy database for QA testing, we can use data generation to create synthetic data. We provide the data-generation model with our original table of data as input. Then the model produces a completely new table of realistic (yet completely synthetic) values as output.

Data generation is useful anytime you need many rows of data that statistically mirror (but are not identical to) your original table of data. For example:

  • testing the performance of applications via load testing
  • creating anonymized datasets to protect sensitive information
  • augmenting small training datasets (like fraud data) for machine learning
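
As a rough illustration, the sketch below "fits" a very simple model to a small table and then samples new rows from it. It assumes each column can be modeled independently (a normal distribution for numbers, observed frequencies for categories), which is a strong simplification. Real data-generation tools also try to preserve the relationships between columns. The table, column names, and the fit and sample functions are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

# Hypothetical original table (imagine these are sensitive production rows).
original = [
    {"amount": 12.5, "channel": "web"},
    {"amount": 40.0, "channel": "store"},
    {"amount": 18.0, "channel": "web"},
    {"amount": 25.5, "channel": "web"},
]


def fit(rows, numeric_cols, categorical_cols):
    """Summarize each column: mean and stdev, or category counts."""
    model = {}
    for col in numeric_cols:
        values = [r[col] for r in rows]
        model[col] = ("numeric", mean(values), stdev(values))
    for col in categorical_cols:
        counts = Counter(r[col] for r in rows)
        model[col] = ("categorical", list(counts), list(counts.values()))
    return model


def sample(model, n, seed=0):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    out = []
    for _ in range(n):
        row = {}
        for col, (kind, a, b) in model.items():
            if kind == "numeric":
                row[col] = rng.gauss(a, b)  # a = mean, b = stdev
            else:
                row[col] = rng.choices(a, weights=b, k=1)[0]  # a = labels, b = counts
        out.append(row)
    return out


model = fit(original, numeric_cols=["amount"], categorical_cols=["channel"])
synthetic = sample(model, 1000)
print(synthetic[:3])
```

None of the synthetic rows is copied from the original table, yet the averages and category proportions stay close to the originals. Note that this naive sampler can produce unrealistic values (such as negative amounts), which is one reason real tools use richer models.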

Data Transformation

Data transformation allows us to automatically convert data from one format into another. It creates data-transformation scripts automatically, based on just a few examples of what we want the transformed data to look like.

For example, we can automatically transform a table of weather data from its original format into the required target format. We provide the data-transformation model with the original table of data and a few examples of what we want the transformed data to look like. Then, the model produces the transformed data as output.

Data transformation is useful for automating a variety of data-processing tasks. For example:

  • working with data in spreadsheets like Excel
  • performing ad hoc analysis on raw or messy data
  • creating transformation steps in your data pipeline or data ETL process
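
The "learn from a few examples" idea can be sketched as a small search. Given a handful of input/output pairs, we try each transformation in a fixed list and keep the ones that reproduce every example. This is only a toy version of the technique. The candidate list, the names, and the learn function are all hypothetical, and real tools search a far larger space of possible programs.

```python
from datetime import datetime


def _reformat_date(src_format, dst_format):
    """Build a function that re-formats a date string."""
    return lambda s: datetime.strptime(s, src_format).strftime(dst_format)


# A tiny, hypothetical set of transformations the search may choose from.
CANDIDATES = {
    "upper": str.upper,
    "lower": str.lower,
    "strip": str.strip,
    "mdy_to_iso": _reformat_date("%m/%d/%Y", "%Y-%m-%d"),
    "dmy_to_iso": _reformat_date("%d/%m/%Y", "%Y-%m-%d"),
}


def learn(examples):
    """Return the names of all candidates consistent with every example."""
    matches = []
    for name, fn in CANDIDATES.items():
        try:
            if all(fn(src) == dst for src, dst in examples):
                matches.append(name)
        except ValueError:
            # The candidate could not parse an example, so it is ruled out.
            pass
    return matches


# A few examples of what we want the transformed data to look like.
examples = [("04/15/2021", "2021-04-15"), ("12/01/2020", "2020-12-01")]
names = learn(examples)
transform = CANDIDATES[names[0]]

# Apply the learned transformation to the rest of the column.
column = ["01/31/2021", "02/01/2021"]
converted = [transform(value) for value in column]
print(names, converted)
```

If more than one candidate matches the examples, the examples are ambiguous, and adding another example is usually the easiest way to narrow the choice.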

As we can see, tabular data synthesis allows us to generate synthetic data for a variety of applications.


If you’d like to learn how to use all of the tools listed above, please watch my online course: The AI Developer’s Toolkit.

The future belongs to those who invest in AI today. Don’t get left behind!

