May 1, 2024 Author: Matthew Renze

What are the key ideas that Artificial Intelligence is built upon?

In my previous article in this series, we discussed some of the key ideas in Computer Science.

In this article, we’ll discuss some of the most important ideas from Data Science.

Once again, I’ll do my best to keep everything as simple and easy to understand as possible.

Bayes Theorem

In 1763, a method created by Thomas Bayes to calculate conditional probabilities was published posthumously. A conditional probability is the likelihood of some event happening, given that another related event has already happened. For example, Bayes Theorem helps predict how likely it is to rain, given that it's cloudy outside.

Essentially, Bayes Theorem helps us to update our predictions based on new evidence. It combines both our prior beliefs and new evidence to make more accurate predictions about the future. Bayes Theorem is the foundation of modern statistics and has led to several subfields, including Bayesian Inference, Bayesian Networks, the Bayesian Brain Hypothesis, and more.
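To make this concrete, here's a minimal sketch of Bayes Theorem in Python, using made-up weather probabilities chosen purely for illustration:

```python
# Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# The weather numbers below are hypothetical, for illustration only.

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) via Bayes Theorem."""
    return p_b_given_a * p_a / p_b

# Suppose: P(rain) = 0.10, P(cloudy) = 0.40, P(cloudy | rain) = 0.90
p_rain_given_cloudy = bayes(0.90, 0.10, 0.40)
print(round(p_rain_given_cloudy, 3))  # 0.225
```

Notice how the low prior belief in rain (10%) gets updated upward (to 22.5%) by the new evidence that it's cloudy.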

The Law of Large Numbers

The Law of Large Numbers was first introduced by Jacob Bernoulli. It states that as the size of a sample of data increases, the sample mean will get closer and closer to the population mean. In statistics, the sample mean is the average of the observations you've made; the population mean is the true average of the entire population.

Essentially, if we keep making observations, our sample will get bigger. The bigger the sample becomes, the more likely the average of our sample will approximate the true average of the entire population. This law also applies to other aspects of a distribution of observations like its variance and shape.
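You can watch this convergence happen with a quick simulation. This sketch rolls a fair six-sided die (population mean 3.5) with increasingly large samples:

```python
import random

random.seed(42)  # fixed seed so the runs are reproducible

# A fair six-sided die has a population mean of 3.5.
# As the sample size grows, the sample mean converges toward it.
for n in (10, 1_000, 100_000):
    sample = [random.randint(1, 6) for _ in range(n)]
    sample_mean = sum(sample) / n
    print(n, round(sample_mean, 3))
```

With each jump in sample size, the sample mean drifts closer to the true value of 3.5.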

The Central Limit Theorem

The Central Limit Theorem (CLT) was developed by several mathematicians, including Abraham de Moivre and Pierre-Simon Laplace, in the 18th and 19th centuries. It says that if you take many samples from any distribution and average each one, the distribution of those sample averages will approximate the normal distribution (i.e., the bell curve).

This works for any distribution of observations (e.g., uniform, binomial, Poisson, etc.), provided you take enough samples. This is important because it allows us to use a bunch of really powerful statistical methods that require normally distributed data — even if our data are not normally distributed.
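Here's a small sketch of the CLT in action: we sample from a uniform distribution (which is flat, not bell-shaped), yet the sample means still cluster in a bell curve around the true mean:

```python
import random
import statistics

random.seed(0)  # reproducible

# Draw 5,000 samples of size 30 from a (non-normal) uniform
# distribution on [0, 1], recording the mean of each sample.
sample_means = [
    statistics.mean(random.uniform(0, 1) for _ in range(30))
    for _ in range(5_000)
]

# The sample means cluster around the true mean of 0.5,
# with a spread of about sqrt(1/12) / sqrt(30) ≈ 0.053.
print(round(statistics.mean(sample_means), 3))
print(round(statistics.stdev(sample_means), 3))
```

A histogram of `sample_means` would show the familiar bell shape, even though the underlying uniform data is completely flat.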

The Pareto Principle

The Pareto Principle, also known as the 80/20 rule, was created by Vilfredo Pareto in the late 1800s. It states that in systems modeled by power-law distributions, roughly 80% of the outputs come from just 20% of the inputs. Inversely, the remaining 20% of the outputs come from the other 80% of the inputs.

A power-law distribution is simply a mathematical curve whose output value varies as a power of the input value. You can find these distributions, and thus the 80/20 rule, all throughout our world in various fields like sociology, economics, computer science, etc. It's a useful guide in many situations.
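We can see the 80/20 pattern emerge by simulating a power-law distribution. This sketch uses a Pareto distribution with a tail index of about 1.16, which is the value that theoretically produces an 80/20 split:

```python
import random

random.seed(1)  # reproducible

# Sample 100,000 values from a Pareto (power-law) distribution;
# a tail index of ~1.16 theoretically yields an 80/20 split.
values = sorted(
    (random.paretovariate(1.16) for _ in range(100_000)), reverse=True
)

# What share of the total do the top 20% of values account for?
top_20_pct = values[: len(values) // 5]
share = sum(top_20_pct) / sum(values)
print(round(share, 2))  # typically close to 0.8
```

Because the tail is so heavy, a small fraction of the values dominates the total, which is exactly the Pareto Principle at work.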

The Bootstrap

In 1979, Bradley Efron introduced a resampling technique called "the Bootstrap." Bootstrapping allows you to estimate the distribution of a statistic (like mean, variance, etc.) by repeatedly sampling from the dataset with replacement. We call these repeated samples with replacement "bootstrap samples."

Essentially, by collecting a bunch of bootstrap samples and calculating their statistics, you end up approximating the distribution of those statistics as if you had collected many real samples. This allows you to estimate the parameters of the population. Honestly, it feels like it couldn't possibly work, but it does. It's magic!
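Here's a minimal sketch of the bootstrap, using a small made-up dataset. We resample it with replacement many times to estimate how uncertain our sample mean is:

```python
import random
import statistics

random.seed(7)  # reproducible

# A small, hypothetical dataset of observations.
data = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]

# Draw 10,000 bootstrap samples (same size, with replacement)
# and record each sample's mean.
boot_means = [
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(10_000)
]

# The spread of the bootstrap means estimates the uncertainty of
# the sample mean; a simple 95% interval takes the middle 95%.
boot_means.sort()
low, high = boot_means[250], boot_means[9_749]
print(round(low, 2), round(high, 2))
```

From just ten observations, resampling gives us a confidence interval for the mean, with no extra data collection required.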

Shannon Entropy

Shannon Entropy is a concept from Information Theory developed in 1948 by Claude Shannon. Like entropy in physics, Shannon Entropy measures the amount of uncertainty or randomness; however, it measures that randomness or uncertainty in data rather than in physical systems.

Essentially, Shannon Entropy quantifies how much information is contained in a sequence of data. The lower the entropy, the more predictable each piece of data is, so the data carry less information, randomness, or uncertainty. The higher the entropy, the less predictable each piece of data is, so each one is more informative.
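Shannon Entropy has a simple formula: it's the sum of -p·log₂(p) over every possible outcome. This sketch computes it for a couple of coin flips:

```python
import math

def shannon_entropy(probs):
    """Entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: 1 bit of information per flip.
print(shannon_entropy([0.5, 0.5]))              # 1.0
# A biased coin is more predictable, so each flip tells us less.
print(round(shannon_entropy([0.9, 0.1]), 3))    # 0.469
```

The fair coin hits the maximum entropy for two outcomes, while the biased coin, being easier to predict, carries less than half a bit per flip.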

The Grammar of Graphics

In 1999, Leland Wilkinson published a book titled The Grammar of Graphics. In it, he developed a systematic way to describe and create data visualizations. Using this visual language, you can create essentially any type of data visualization frequently used in Data Science and AI.

The language has concepts for defining data, geometries, aesthetics, scales, coordinates, etc. You can use this grammar to create basic data visualizations (e.g., bar charts, line charts, pie charts, etc.) or more complex, highly custom, and multi-faceted data visualizations. Check it out by using ggplot2 in R.


Once again, despite the fact that these are relatively complex ideas, it's actually quite easy to understand their key insights. So, I hope that this encourages you to dig deeper into the foundations of AI. It's not as difficult as you might think, provided things are explained in simple and easy-to-understand terms.

To learn more, be sure to check out my latest article in this series on The Ideas That Built AI from Neuroscience.
