Neural Networks & MLPs
A Neural Network is a series of matrix multiplications interspersed with non-linear functions. The simplest form is a Multi-Layer Perceptron (MLP) — data flows from an input layer, through one or more hidden layers, to an output layer.
Each hidden layer computes: h = activation(W · x + b). Here W is the weight matrix, x is the input, and b is a bias vector. The weights encode how much each input matters; the bias shifts the threshold at which a unit activates.
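The layer computation above can be sketched in a few lines of NumPy. The dimensions (4 inputs, 3 hidden units) and the random weights are purely illustrative:

```python
import numpy as np

def relu(z):
    # ReLU activation: elementwise max(0, z)
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 4))  # weight matrix: 3 hidden units, 4 inputs
b = np.zeros(3)              # bias vector, one entry per hidden unit
x = rng.normal(size=4)       # a single input vector

h = relu(W @ x + b)          # hidden activations
print(h.shape)               # (3,)
```

Each row of W holds the weights of one hidden unit, so `W @ x` produces one pre-activation per unit before the bias and non-linearity are applied.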
The activation function is the source of a network's power. Without it, every layer is just a linear transformation, and stacking them collapses to a single matrix multiply — useless for complex problems. ReLU (max(0, x)) is the dominant choice: it's cheap to compute and doesn't saturate for positive inputs, which mitigates vanishing gradients (though units can "die" if their inputs stay negative). Sigmoid squashes to (0, 1) — useful for probabilities. Tanh squashes to (-1, 1) — often better for hidden layers than sigmoid because its outputs are zero-centered.
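Both points above are easy to verify numerically: two stacked linear layers are exactly one matrix multiply, and the activations differ only in their output ranges. A minimal sketch (all sizes and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x  = rng.normal(size=4)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))

# Without a non-linearity, two layers collapse to a single matrix multiply:
stacked  = W2 @ (W1 @ x)
combined = (W2 @ W1) @ x
print(np.allclose(stacked, combined))  # True

# The common activations at a few sample points:
z = np.array([-2.0, 0.0, 2.0])
print(np.maximum(0.0, z))        # ReLU    -> [0. 0. 2.]
print(1.0 / (1.0 + np.exp(-z)))  # sigmoid -> values in (0, 1)
print(np.tanh(z))                # tanh    -> values in (-1, 1)
```

Inserting any of the non-linearities between W1 and W2 breaks the equality, which is precisely what lets depth add expressive power.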
The Universal Approximation Theorem guarantees that an MLP with even one hidden layer and enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. In practice, deeper networks (more layers) are often far more parameter-efficient than wider ones.
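One concrete way to see the parameter-efficiency point is to count weights and biases for a wide shallow MLP versus a deeper, narrower one. The layer sizes below are made up for illustration (and parameter count alone doesn't capture expressiveness), but the arithmetic shows how quickly a single wide layer grows:

```python
def mlp_params(sizes):
    # sizes: [input, hidden_1, ..., output]
    # Each layer i contributes a (sizes[i+1] x sizes[i]) weight matrix
    # plus sizes[i+1] biases.
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))

wide = mlp_params([64, 1024, 10])      # one wide hidden layer
deep = mlp_params([64, 128, 128, 10])  # two narrower hidden layers
print(wide)  # 76810
print(deep)  # 26122
```

The deeper network here uses roughly a third of the parameters, even though it applies two non-linearities instead of one.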