Playground
Neural networks as universal approximators
Neural networks with can approximate any continuous function to arbitrary precision in just one layer and enough neurons. This sounds like a superpower but it comes with a catch: the Universal Approximation Theorem says nothing about how efficiently a network learns, or whether a single layer is sufficient for every problem. The question is not whether a network can learn something; it is how deep it needs to be to do so tractably.
Depth (more layers) adds expressive power that width (more neurons) alone cannot replicate cheaply, if at all. A shallow network may need an exponentially large number of neurons to represent what a deeper network can express with far fewer parameters. The best way to see this concretely is with the simplest possible classification problem: logic gates.
Logic gates and linear separability
A logic gate takes two binary inputs and produces a binary output. Most gates — AND, OR, NAND, NOR — are linearly separable: you can draw a single straight line through the input space that correctly divides the 0-outputs from the 1-outputs. A single-layer perceptron is exactly a learned linear boundary, so it can solve these gates directly.
XOR is different; no single line can separate the 1-outputs from the 0-outputs. A single-layer perceptron network will not be able to converge to a loss near 0 as one point will always be misclassified. You need at least one hidden layer to compose the nonlinear boundary XOR requires. The below plots show the decision boundaries of a single-layer perceptron trained on each gate. What could a non-linear boundary look like for XOR?
Try it yourself
Configure the network architecture below and hit Train. Since the network weights are initialized randomly, you may need to hit Train a few times to see the loss fail to converge. Try different combinations of layers and neurons and watch how the loss curve changes!
| A | B | Target | Predicted |
|---|---|---|---|
| 0 | 0 | 0 | 0.500 |
| 0 | 1 | 1 | 0.500 |
| 1 | 0 | 1 | 0.500 |
| 1 | 1 | 0 | 0.500 |
Architecture: [2, 3, 1] · sigmoid activation · lr = 0.5 · mini-batch gradient descent (batch = 16)