# How Good is SGD with Random Shuffling?

@article{Safran2019HowGI, title={How Good is SGD with Random Shuffling?}, author={Itay Safran and Ohad Shamir}, journal={ArXiv}, year={2019}, volume={abs/1908.00045} }

We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting has been investigated in several recent works, but the optimal error rates remain unclear. In this…
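The without-replacement heuristic studied in the paper can be sketched as follows: instead of drawing component indices i.i.d. with replacement, each epoch processes all components in a fresh random permutation. This is a minimal illustration on a toy strongly-convex finite sum; the function names and constants are ours, not from the paper.

```python
import numpy as np

def sgd_random_reshuffling(grads, x0, n_epochs, step, seed=0):
    """SGD where each epoch visits the n component gradients in a
    fresh, uniformly random permutation (sampling without replacement),
    rather than drawing indices i.i.d. with replacement."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_epochs):
        for i in rng.permutation(len(grads)):  # one full shuffled pass
            x = x - step * grads[i](x)
    return x

# Toy strongly convex finite sum: f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2,
# whose minimizer is the mean of the a_i.
a = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, ai=ai: x - ai for ai in a]  # gradient of each component
x_star = a.mean()
x_hat = sgd_random_reshuffling(grads, x0=0.0, n_epochs=300, step=0.05)
# x_hat lands near x_star = 2.5 (fixed small step, so not exact)
```

With a constant step size the iterate settles in a neighborhood of the minimizer; the error-rate questions the paper studies concern how quickly (in epochs) this gap shrinks under decaying step sizes, compared with with-replacement sampling.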


#### 24 Citations

Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems

- Computer Science
- ArXiv
- 2021

It is proved that when the condition number is taken into account, without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.

Permutation-Based SGD: Is Random Optimal?

- Computer Science, Mathematics
- ArXiv
- 2021

The results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.

On Tight Convergence Rates of Without-replacement SGD

- Mathematics, Computer Science
- ArXiv
- 2020

This work analyzes step sizes that vary across epochs of without-replacement SGD and shows that the rates hold after $\kappa^c\log(nK)$ epochs for some $c>0$.

Optimal Rates for Random Order Online Optimization

- Computer Science, Mathematics
- ArXiv
- 2021

This analysis relies on novel connections between algorithmic stability and generalization for sampling without replacement, analogous to those studied in the with-replacement i.i.d. setting, as well as on a refined average stability analysis of stochastic gradient descent.

Random Reshuffling with Variance Reduction: New Analysis and Better Rates

- Computer Science, Mathematics
- ArXiv
- 2021

This work provides the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems and obtains the first sublinear rate for general convex problems.

Shuffling Gradient-Based Methods with Momentum

- Computer Science
- ArXiv
- 2020

We combine two advanced ideas widely used in optimization for machine learning: shuffling strategy and momentum technique to develop a novel shuffling gradient-based method with momentum to…

Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization

- Computer Science, Mathematics
- ArXiv
- 2021

This work revisits Wasserstein DRSL through the lens of min-max optimization and derives scalable and efficiently implementable stochastic extra-gradient algorithms which provably achieve faster convergence rates than existing approaches.

On the Comparison between Cyclic Sampling and Random Reshuffling

- Computer Science, Mathematics
- ArXiv
- 2021

A norm, defined based on the sampling order, is introduced to measure the distance to the solution; applied to the proximal Finito/MISO algorithm, it identifies the optimal fixed ordering, which can beat random reshuffling by a factor of up to $\log(n)/n$ in terms of the best-known upper bounds.

Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates

- Computer Science, Mathematics
- ArXiv
- 2019

This work presents two new remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions and establishes local linear-quadratic convergence results.

Random Reshuffling is Not Always Better

- Computer Science
- NeurIPS
- 2020

This work gives a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture that relates to the performance of random reshuffling in learning algorithms, and gives an example of a learning task and algorithm for which with-replacement random sampling outperforms random reshuffling.

#### References

Showing 1–10 of 20 references

Random Shuffling Beats SGD after Finite Epochs

- Mathematics, Computer Science
- ICML
- 2019

It is proved that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate $O(1/T^2 + n^3/T^3)$, where $n$ is the number of components in the objective and $T$ is the total number of iterations.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

- Mathematics, Computer Science
- ICML
- 2012

This paper investigates the optimality of SGD in a stochastic setting, and shows that for smooth problems, the algorithm attains the optimal $O(1/T)$ rate; however, for non-smooth problems the convergence rate with averaging might really be $\Omega(\log(T)/T)$, and this is not just an artifact of the analysis.

Why random reshuffling beats stochastic gradient descent

- Computer Science, Mathematics
- Math. Program.
- 2021

The convergence rate of the random reshuffling method is analyzed and it is shown that when the component functions are quadratics or smooth and the sum function is strongly convex, RR with iterate averaging and a diminishing stepsize converges at rate $\Theta(1/k^{2s})$ with probability one in the suboptimality of the objective value, thus improving upon the $\Omega(1/k)$ rate of SGD.

SGD without Replacement: Sharper Rates for General Smooth Convex Functions

- Mathematics, Computer Science
- ICML
- 2019

The first non-asymptotic results are provided for stochastic gradient descent without replacement applied to general smooth, strongly-convex functions, showing that it converges at a rate of $O(1/K^2)$, while with-replacement SGD is known to converge at an $O(1/K)$ rate.

Stochastic Learning Under Random Reshuffling With Constant Step-Sizes

- Computer Science, Mathematics
- IEEE Transactions on Signal Processing
- 2019

The analysis establishes analytically that random reshuffling outperforms uniform sampling and derives an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement.

Without-Replacement Sampling for Stochastic Gradient Methods

- Computer Science, Mathematics
- NIPS
- 2016

This paper provides competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data, yielding a nearly-optimal algorithm for regularized least squares under broad parameter regimes.

Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences

- Mathematics
- 2012

Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized…

Robust Stochastic Approximation Approach to Stochastic Programming

- Mathematics, Computer Science
- SIAM J. Optim.
- 2009

It is intended to demonstrate that a properly modified SA approach can be competitive and even significantly outperform the SAA method for a certain class of convex stochastic problems.

Convergence Rate of Incremental Gradient and Newton Methods

- Mathematics
- 2015

The incremental gradient method is a prominent algorithm for minimizing a finite sum of smooth convex functions, used in many contexts including large-scale data processing applications and…

Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms

- Mathematics
- 2009

Given a finite set of $m$ examples $z_1,\ldots,z_m$ and a strictly convex differentiable loss function $\ell(z,\theta)$ defined on a parameter vector $\theta \in \mathbb{R}^d$, we are interested in minimizing the cost…