Machine Learning

“smoothly fitting high-dimensional data points requires not just n parameters, but n × d parameters, where d is the dimension of the input (for example, 784 for a 784-pixel image). In other words, if you want a network to robustly memorize its training data, overparameterization is not just helpful — it’s mandatory.”

https://www.quantamagazine.org/computer-scientists-prove-why-bigger-neural-networks-do-better-20220210/

Sigmoid

Classification: can pick many classes, one, or none (multi-label)
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
It maps any real-valued number to a value between 0 and 1, and is commonly used in logistic regression and neural networks to introduce non-linearity. Because each output is squashed independently, the output values are NOT mutually exclusive: the network can assign a high probability to all of the classes, some of them, or none of them.
Can cause vanishing gradients, since the derivative saturates toward 0 for large |x|.
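
A minimal sketch (assuming NumPy; the logits are made up for illustration) showing sigmoid applied element-wise in a multi-label setting, where each class probability is independent:

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid: maps any real value to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for one sample over 3 labels (multi-label setting).
logits = np.array([2.0, -1.0, 0.5])
probs = sigmoid(logits)
print(probs)        # approx [0.88, 0.27, 0.62] -- each value is independent
print(probs.sum())  # does NOT need to sum to 1
```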

Softmax

Classification: picks exactly one class (multi-class)
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
It transforms a vector of real numbers into a probability distribution, where all values are non-negative and sum to 1, and is typically used in multi-class classification problems. The output values are mutually exclusive: increasing the probability of one class necessarily decreases the others.
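
A minimal sketch (assuming NumPy; the logits are made up for illustration) of a numerically stable softmax; subtracting the max before exponentiating is a standard trick to avoid overflow:

```python
import numpy as np

def softmax(x):
    """Convert a vector of logits into a probability distribution."""
    shifted = x - np.max(x)   # numerical stability: exp() won't overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, -1.0, 0.5])  # hypothetical logits for 3 classes
probs = softmax(logits)
print(probs)        # approx [0.79, 0.04, 0.17] -- one distribution over classes
print(probs.sum())  # 1.0
```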

Loss functions

Mean Squared Error (MSE)

Regression tasks
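
One common form, with $y_i$ the target, $\hat{y}_i$ the prediction, and $n$ the number of samples (symbols chosen here for illustration):
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$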

Negative Log Likelihood

Classification tasks
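
One common form, with $p_\theta(y_i \mid x_i)$ the model's predicted probability (e.g. a softmax output) of the true class $y_i$ for input $x_i$, averaged over $n$ samples (symbols chosen here for illustration); combined with softmax this is the usual cross-entropy loss:
$$\text{NLL} = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$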