Tags: quiz-07, SGD, neural networks, gradient descent, lecture-14
True or False: in stochastic gradient descent, each update step moves in the direction of steepest descent of the empirical risk.
False.
Each update step moves in the direction of steepest descent of the risk computed on the mini-batch, not the full training set. The mini-batch gradient is only an approximation of the true gradient of the empirical risk. This is what makes each iteration of SGD fast --- computing the gradient over a small mini-batch of \(m\) points is much cheaper than computing it over all \(n\) points --- but it means that the update direction is noisy.
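A minimal sketch of this point, using squared loss for a linear model (all names and sizes here are illustrative, not from the quiz): each mini-batch gradient is a noisy estimate of the full empirical-risk gradient, but averaged over a disjoint partition of the data the mini-batch gradients recover the full gradient exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 5, 10          # n points, d features, mini-batch size m
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def grad(Xb, yb, w):
    # Gradient of (1/|batch|) * sum (x.w - y)^2 with respect to w
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                          # O(nd) per step
batch_grads = [grad(X[i:i+m], y[i:i+m], w)    # O(md) per step
               for i in range(0, n, m)]

# Individual mini-batch gradients deviate from the full gradient (noise) ...
noise = [np.linalg.norm(g - full) for g in batch_grads]
print(max(noise) > 0)                                   # True

# ... but their average over the partition equals the full gradient.
print(np.allclose(np.mean(batch_grads, axis=0), full))  # True
```

This is why the mini-batch gradient is described as an approximation: any single batch points in a slightly wrong direction, even though the batches are collectively consistent with the empirical risk.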
Tags: quiz-07, SGD, neural networks, gradient descent, lecture-14
True or False: stochastic gradient descent generally requires fewer iterations than gradient descent to converge.
False.
SGD typically requires more iterations than gradient descent to converge, because each update uses a noisy approximation of the true gradient. However, each iteration of SGD is much cheaper (\(O(md)\) vs. \(O(nd)\) for a mini-batch of size \(m \ll n\), where \(d\) is the number of parameters), so SGD often converges faster in terms of total computation time.
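To make the trade-off concrete, here is a toy least-squares comparison (all names, sizes, and the learning rate are illustrative assumptions): one epoch of mini-batch SGD touches each data point once, the same total data cost as a single full-gradient step, yet performs \(n/m\) parameter updates instead of one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, lr = 200, 5, 20, 0.05    # illustrative sizes and step size
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    # Mean squared error over the full training set
    return np.mean((X @ w - y) ** 2)

# One full gradient-descent step: a single update, O(nd) work.
w_gd = np.zeros(d)
w_gd -= lr * 2 * X.T @ (X @ w_gd - y) / n

# One epoch of mini-batch SGD: n/m updates, O(md) work each.
w_sgd = np.zeros(d)
for i in range(0, n, m):
    Xb, yb = X[i:i+m], y[i:i+m]
    w_sgd -= lr * 2 * Xb.T @ (Xb @ w_sgd - yb) / m

print(loss(np.zeros(d)), loss(w_gd), loss(w_sgd))
```

For the same amount of data touched, SGD has taken many cheap, noisy steps while gradient descent has taken one exact step, which is the sense in which SGD "needs more iterations" yet can win on total runtime.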