What are the key ideas that Artificial Intelligence is built upon?
In my previous article in this series, we discussed some of the key ideas in Neuroscience.
In this final article, we’ll discuss some of the most important ideas from Machine Learning.
As always, I’ll do my best to keep everything as simple and easy to understand as possible.
Imagine you’re hiking in the hills, and a dense fog rolls in. You can’t see where you’re going, but you need to get to the bottom of the hills before a storm. If you can’t see where you’re going, how do you get to the bottom? The answer is you take small steps in the direction of the downward slope.
In AI, we use the same concept to train neural networks. We want to minimize the error between a prediction and the correct answer. However, the error landscape of a neural network is filled with a multitude of hills and valleys. So, with each step, we nudge the weights in the direction of the downward slope (i.e., gradient descent).
If you do this enough times, you’ll eventually reach the bottom of a valley, where the error is close to zero. When the error is minimized, your predictions are most accurate. However, sometimes you’ll get stuck in a valley that isn’t the lowest one possible (a local minimum), so we have a bunch of tricks to avoid that.
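To make that concrete, here is a minimal sketch of gradient descent in plain Python. The error curve, starting point, and learning rate are all made up for illustration; the only point is that repeatedly stepping against the slope walks the weight down into the valley.

```python
# A toy "error landscape": the error as a function of a single weight w.
# This curve is invented purely for illustration; a real network has millions of weights.
def error(w):
    return (w - 3.0) ** 2 + 0.5

def slope(w):
    # Derivative of the error curve at w: which way is uphill, and how steep?
    return 2.0 * (w - 3.0)

w = -4.0              # start somewhere on the hillside
learning_rate = 0.1   # the size of each small step

for step in range(50):
    w -= learning_rate * slope(w)   # take a small step in the downhill direction

print(f"final weight: {w:.3f}, final error: {error(w):.3f}")
# The weight ends up near 3.0, the bottom of this valley, where the error is smallest.
```

In a real neural network, the "weight" is a vector with millions of entries and the slope is computed by backpropagation (more on that below), but the loop looks essentially the same.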
In 1950, Alan Turing proposed a thought experiment to test if a machine was intelligent. He envisioned a test where there was a human judge, a human participant, and a machine participant. The judge would only be able to interact with the other human and the machine via a text-based computer terminal.
The judge would individually ask the human and the machine questions. Then the human and machine would provide their responses. The goal was for the judge to decide which one was the human and which one was the machine. If the machine could fool the judge often enough, then it was likely intelligent.
While the Turing test is a useful thought experiment, we’ve learned since then that LLMs trained on human-written text become very good at simulating the statistical patterns in human language (i.e., a stochastic parrot). However, they don’t understand what they are saying and struggle with reasoning.
In 1986, Rumelhart, Hinton, and Williams formalized an idea called “backpropagation of error”. It’s a technique that allows a neural network (NN) to use feedback from its predictions (i.e., the error) to update the weights in earlier layers of the NN. It’s what allows us to train deep neural networks (DNNs).
Imagine we’re creating an AI to detect cats in photos. We show it a picture and it makes a guess. Then we check whether its guess is correct. If it’s wrong, we generate an error signal and send it backward through the DNN. Along the way, we use the error to make tiny adjustments.
Backpropagation allows us to pull that error signal back through the layers of the NN and update the weights in proportion to their contribution to the error. The weights are like dials that change how the AI makes guesses for each photo. If you do this enough times, you’ll get accurate predictions of a cat or not a cat.
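Here is a minimal sketch of that loop in NumPy for a tiny two-layer network. The “photos” are just random made-up feature vectors and the layer sizes are arbitrary; the part to focus on is how the output error is pulled back through the layers to update both sets of weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 "photos" described by 3 made-up features, labeled cat (1) or not (0).
X = rng.normal(size=(4, 3))
y = np.array([[1.0], [0.0], [1.0], [0.0]])

# A tiny network: 3 inputs -> 4 hidden units -> 1 output.
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for step in range(2000):
    # Forward pass: make a guess for every photo.
    h = sigmoid(X @ W1)        # hidden layer activations
    y_hat = sigmoid(h @ W2)    # predicted probability of "cat"

    # Error signal at the output (squared-error loss with sigmoid).
    output_delta = (y_hat - y) * y_hat * (1 - y_hat)

    # Backpropagation: pull the error back through the hidden layer.
    hidden_delta = (output_delta @ W2.T) * h * (1 - h)

    # Update each weight in proportion to its contribution to the error.
    W2 -= learning_rate * h.T @ output_delta
    W1 -= learning_rate * X.T @ hidden_delta

print(np.round(y_hat, 2).ravel())  # the guesses move toward the labels 1, 0, 1, 0
```

Modern frameworks like PyTorch and TensorFlow compute these gradients automatically, but under the hood they are doing exactly this kind of chain-rule bookkeeping.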
Imagine you have a bunch of apples and bananas. You only have data about their length and width. If you were to plot their lengths and widths on a graph, the apples would end up in the lower left-hand corner (because they are round) and the bananas in the lower right-hand corner (because they are long).
If you want to classify the apples and bananas, you can just draw a straight line through the plot to perfectly separate them into two groups. However, real-world datasets often have more than 2 dimensions and often aren’t perfectly separable by a straight line. This is where the kernel trick helps us out.
Imagine you now have apples and oranges all mixed together on a 2D scatter plot. They overlap because they are all round and similar in size. However, the kernel trick effectively lifts the 2D plot into 3D, floating all of the oranges up the z-axis while all of the apples stay at the bottom.
Now, we can draw a line (or a plane in this case) to perfectly separate them into two groups. That’s the kernel trick. It’s a very clever trick indeed.
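Here is a minimal sketch of that idea in NumPy. The “apples” and “oranges” are made-up 2D points, and the feature map z = x² + y² is chosen purely for illustration: it lifts the data into 3D so a flat plane can separate the two groups.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 2D data: "apples" scattered near the origin, "oranges" on a ring around them.
apples  = rng.uniform(-1.0, 1.0, size=(20, 2))                            # x^2 + y^2 is at most 2
angles  = rng.uniform(0.0, 2.0 * np.pi, size=20)
oranges = np.column_stack([2.5 * np.cos(angles), 2.5 * np.sin(angles)])   # x^2 + y^2 = 6.25

# No straight line in 2D can separate a cluster from the ring that surrounds it.
# Lift each point into 3D with z = x^2 + y^2 (a simple polynomial feature map).
def lift(points):
    z = (points ** 2).sum(axis=1, keepdims=True)
    return np.hstack([points, z])

apples_3d, oranges_3d = lift(apples), lift(oranges)

# In 3D the oranges float up the z-axis while the apples stay near the bottom,
# so the flat plane z = 4 separates the two groups perfectly.
print("highest apple :", round(float(apples_3d[:, 2].max()), 2))   # no higher than 2
print("lowest orange :", round(float(oranges_3d[:, 2].min()), 2))  # exactly 6.25
```

Real kernel methods (for example, support vector machines with a polynomial or RBF kernel) never build the higher-dimensional points explicitly; the kernel function computes the similarities as if the data had been lifted, which is what makes the trick so efficient.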
The Universal Approximation Theorem is a key concept in neural networks (NNs). It states that a neural network with a single hidden layer, given enough neurons, can approximate any continuous function to arbitrary accuracy. A neural network with two or more hidden layers can approximate even more complex functions, including ones that aren’t continuous.
Essentially, this means that a shallow neural network can learn simple input-output relationships, such as predicting temperature, speed, or growth, while a deep neural network can model far more complex relationships, such as those found in text, images, and video. This is a very powerful insight!
If you’re still not convinced, I recommend you check out the articles “Can Neural Networks Really Learn Any Function” and “Visual Proof that Neural Networks Can Compute Any Function”.
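If you’d like to see the theorem in action, here is a minimal sketch of a single-hidden-layer network learning to approximate the continuous function sin(x). The 32 tanh neurons, learning rate, and step count are arbitrary choices; the point is only that one hidden layer with enough neurons can bend itself into the shape of the curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a simple continuous function we want the network to approximate.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

# One hidden layer with 32 tanh neurons (sizes chosen arbitrarily).
W1 = rng.normal(scale=1.0, size=(1, 32))
b1 = np.zeros((1, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
b2 = np.zeros((1, 1))

lr = 0.01
for step in range(5000):
    # Forward pass.
    h = np.tanh(x @ W1 + b1)
    y_hat = h @ W2 + b2

    # Mean-squared error gradients (backpropagation again).
    grad_out = 2 * (y_hat - y) / len(x)
    W2_grad = h.T @ grad_out
    b2_grad = grad_out.sum(axis=0, keepdims=True)
    grad_h = (grad_out @ W2.T) * (1 - h ** 2)
    W1_grad = x.T @ grad_h
    b1_grad = grad_h.sum(axis=0, keepdims=True)

    # Gradient descent on every parameter.
    W1 -= lr * W1_grad; b1 -= lr * b1_grad
    W2 -= lr * W2_grad; b2 -= lr * b2_grad

print("mean absolute error:", float(np.abs(y_hat - y).mean()))  # shrinks as training proceeds
```

Keep in mind that the theorem only guarantees such a network exists; it doesn’t promise that gradient descent will always find it, which is one reason deep networks and the tricks mentioned earlier still matter in practice.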
And there you have it: the ideas that built AI, drawn from computer science, data science, neuroscience, and machine learning. Despite their complexity, I hope that you were able to follow along and understand the key insights of each. I also hope this will encourage you to learn more. AI isn’t all that difficult if you have the right resources and teachers.
If you haven’t already read the previous two articles in this series, I recommend that you do that first. Then, once you’re ready to dig deeper into AI, be sure to check out all of my online courses.