Machine Learning
“smoothly fitting high-dimensional data points requires not just n parameters, but n × d parameters, where d is the dimension of the input (for example, 784 for a 784-pixel image). In other words, if you want a network to robustly memorize its training data, overparameterization is not just helpful — it’s mandatory.”
Sigmoid
Classification: many, one, or none (multi-label)
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
It maps any real-valued number to a value between 0 and 1 and is commonly used in logistic regression and neural networks to introduce non-linearity. Because the sigmoid is applied to each output independently, the output values are NOT mutually exclusive: the network can assign a high probability to all of its classes, to some of them, or to none of them.
Can cause vanishing gradients: the curve saturates for large positive or negative inputs, so the gradient approaches zero there.
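A minimal NumPy sketch of both points above (the function names and logit values are illustrative, not from any particular library): each output is squashed independently, and the derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ shrinks toward zero as $|x|$ grows.

```python
import numpy as np

def sigmoid(x):
    # Maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigma(x) * (1 - sigma(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

# Multi-label case: each logit is squashed independently,
# so the outputs need not sum to 1 -- several classes can be "on" at once.
logits = np.array([3.2, -1.0, 2.5])              # illustrative values
print(sigmoid(logits))                           # ~[0.96, 0.27, 0.92]

# Vanishing gradient: the derivative collapses toward 0 as |x| grows.
print(sigmoid_grad(np.array([0.0, 5.0, 10.0])))  # ~[0.25, 0.0066, 0.000045]
```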
Softmax
Classification: picks exactly one (multi-class)
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
It transforms a vector of real numbers into a probability distribution whose values sum to 1, and is often used in multi-class classification problems. The classes are treated as mutually exclusive: raising the probability of one class necessarily lowers the others.
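A small sketch of the formula above (names and values are illustrative). In practice the maximum logit is usually subtracted before exponentiating; this leaves the result unchanged because the constant cancels in the ratio, but it prevents overflow for large logits.

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the output (it cancels in the ratio)
    # but keeps exp() from overflowing on large logits.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # illustrative values
probs = softmax(logits)
print(probs)         # ~[0.659, 0.242, 0.099]
print(probs.sum())   # 1.0 -- a proper probability distribution over the classes
```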
Loss functions
Mean Squared Error (MSE)
Regression tasks
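For reference, the standard form (assuming $n$ samples with targets $y_i$ and predictions $\hat{y}_i$):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$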
Negative Log Likelihood
Classification tasks
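For reference, the standard form (assuming the model assigns probability $p_{y_i}$ to the correct class of sample $i$, e.g. via softmax):

$$\text{NLL} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{y_i}$$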