Stein’s example – Wikipedia

Posted on March 28, 2021 by lordneo

Phenomenon in decision theory and estimation theory

In decision theory and estimation theory, Stein’s example (also known as Stein’s phenomenon or Stein’s paradox) is the observation that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected mean squared error) than any method that handles the parameters separately. It is named after Charles Stein of Stanford University, who discovered the phenomenon in 1955.^[1]

An intuitive explanation is that optimizing for the mean-squared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. If one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.

Formal statement

The following is the simplest form of the paradox, the special case in which the number of observations is equal to the number of parameters to be estimated. Let $\boldsymbol{\theta}$ be a vector consisting of $n \geq 3$ unknown parameters. To estimate these parameters, a single measurement $X_i$ is performed for each parameter $\theta_i$, resulting in a vector $\mathbf{X}$ of length $n$. Suppose the measurements are known to be independent, Gaussian random variables, with mean $\boldsymbol{\theta}$ and variance 1, i.e., $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\theta}, \mathbf{I}_n)$. Thus, each parameter is estimated using a single noisy measurement, and each measurement is equally inaccurate.

Under these conditions, it is intuitive and common to use each measurement as an estimate of its corresponding parameter. This so-called “ordinary” decision rule can be written as $\hat{\boldsymbol{\theta}} = \mathbf{X}$, which is the maximum likelihood estimator (MLE). The quality of such an estimator is measured by its risk function. A commonly used risk function is the mean squared error, defined as $\mathbb{E}[\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2]$. Surprisingly, it turns out that the “ordinary” decision rule is suboptimal (inadmissible) in terms of mean squared error when $n \geq 3$. In other words, in the setting discussed here, there exist alternative estimators which always achieve lower mean squared error, no matter what the value of $\boldsymbol{\theta}$ is. For a given $\boldsymbol{\theta}$ one could obviously define a perfect “estimator” which is always just $\boldsymbol{\theta}$, but this estimator would be bad for other values of $\boldsymbol{\theta}$.

The estimators of Stein’s paradox are, for a given $\boldsymbol{\theta}$, better than the “ordinary” decision rule $\mathbf{X}$ for some $\mathbf{X}$ but necessarily worse for others. It is only on average that they are better. More accurately, an estimator $\hat{\boldsymbol{\theta}}_1$ is said to dominate another estimator $\hat{\boldsymbol{\theta}}_2$ if, for all values of $\boldsymbol{\theta}$, the risk of $\hat{\boldsymbol{\theta}}_1$ is lower than, or equal to, the risk of $\hat{\boldsymbol{\theta}}_2$, and if the inequality is strict for some $\boldsymbol{\theta}$. An estimator is said to be admissible if no other estimator dominates it, otherwise it is inadmissible. Thus, Stein’s example can be simply stated as follows: The “ordinary” decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk.

Many simple, practical estimators achieve better performance than the “ordinary” decision rule. The best-known example is the James–Stein estimator, which shrinks $\mathbf{X}$ towards a particular point (such as the origin) by an amount inversely proportional to the distance of $\mathbf{X}$ from that point. For a sketch of the proof of this result, see the sketched proof below. An alternative proof is due to Larry Brown: he proved that the ordinary estimator for an $n$-dimensional multivariate normal mean vector is admissible if and only if the $n$-dimensional Brownian motion is recurrent.^[2] Since the Brownian motion is not recurrent for $n \geq 3$, the MLE is not admissible for $n \geq 3$.
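The domination claim can be probed numerically. The following is a minimal Monte Carlo sketch, not a proof: it assumes the origin-shrinking form of the James–Stein rule, $(1 - (n-2)/\|\mathbf{X}\|^2)\,\mathbf{X}$, and the function names, trial count and seed are illustrative choices rather than anything fixed by the article.

```python
import random


def james_stein(x):
    """Shrink x towards the origin: (1 - (n - 2) / ||x||^2) * x."""
    n = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (n - 2) / norm_sq
    return [factor * v for v in x]


def squared_error(estimate, theta):
    """Squared Euclidean distance between an estimate and the true vector."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


def estimate_risks(theta, trials=20000, seed=0):
    """Monte Carlo estimate of the risk (expected squared error) of both rules.

    Returns (risk_ordinary, risk_james_stein). The same simulated
    measurements are fed to both rules, so the comparison is paired.
    """
    rng = random.Random(seed)
    total_ordinary = 0.0
    total_js = 0.0
    for _ in range(trials):
        # One unit-variance Gaussian measurement per parameter.
        x = [rng.gauss(t, 1.0) for t in theta]
        total_ordinary += squared_error(x, theta)
        total_js += squared_error(james_stein(x), theta)
    return total_ordinary / trials, total_js / trials
```

Calling `estimate_risks([0.0] * 5)` or `estimate_risks([1.0, -2.0, 0.5, 3.0, 0.0])` returns the two estimated risks for a five-parameter problem. The ordinary rule should come out near $n$, and the shrinkage rule below it, with the gap narrowing as $\boldsymbol{\theta}$ moves away from the origin.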

An intuitive explanation

For any particular value of $\boldsymbol{\theta}$ the new estimator will improve at least one of the individual mean square errors $\mathbb{E}[(\theta_i - \hat{\theta}_i)^2]$. This is not hard: for instance, if every component of $\boldsymbol{\theta}$ is between −1 and 1, and the noise standard deviation is $\sigma = 1$, then an estimator that shrinks each component of $\mathbf{X}$ towards 0 by 0.5 (i.e., $\operatorname{sign}(X_i)\max(|X_i| - 0.5, 0)$, soft thresholding with threshold $0.5$) will have a lower mean square error than $\mathbf{X}$ itself. But there are other values of $\boldsymbol{\theta}$ for which this estimator is worse than $\mathbf{X}$ itself. The trick of the Stein estimator, and others that yield the Stein paradox, is that they adjust the shift in such a way that there is always (for any $\boldsymbol{\theta}$ vector) at least one $X_i$ whose mean square error is improved, and its improvement more than compensates for any degradation in mean square error that might occur for another $\hat{\theta}_i$. The trouble is that, without knowing $\boldsymbol{\theta}$, you don’t know which of the $n$ mean square errors are improved, so you can’t use the Stein estimator only for those parameters.
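The soft-thresholding claim can be illustrated for a single coordinate. This is a rough Monte Carlo sketch under the stated assumptions (unit noise variance, threshold 0.5); the helper names, sample size and seed are hypothetical choices made for the illustration.

```python
import random


def soft_threshold(x, threshold=0.5):
    """Return sign(x) * max(|x| - threshold, 0)."""
    magnitude = max(abs(x) - threshold, 0.0)
    return magnitude if x >= 0 else -magnitude


def mse_pair(theta, trials=50000, seed=1):
    """Monte Carlo mean square errors for one coordinate with X ~ N(theta, 1).

    Returns (mse_of_X, mse_of_soft_thresholded_X), both computed from
    the same simulated measurements.
    """
    rng = random.Random(seed)
    err_plain = 0.0
    err_shrunk = 0.0
    for _ in range(trials):
        x = rng.gauss(theta, 1.0)
        err_plain += (x - theta) ** 2
        err_shrunk += (soft_threshold(x) - theta) ** 2
    return err_plain / trials, err_shrunk / trials
```

Comparing `mse_pair(0.5)` with `mse_pair(5.0)` shows both regimes: for a parameter inside $[-1, 1]$ the thresholded estimate should have the smaller error, while for a parameter far from 0 the fixed shift of 0.5 acts as pure bias and the thresholded estimate should be worse than $X$ itself.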

An example of the above setting occurs in channel estimation in telecommunications, for instance, because different factors affect overall channel performance.

Implications

Stein’s example is surprising, since the “ordinary” decision rule is intuitive and commonly used. In fact, numerous methods for estimator construction, including maximum likelihood estimation, best linear unbiased estimation, least squares estimation and optimal equivariant estimation, all result in the “ordinary” estimator. Yet, as discussed above, this estimator is suboptimal.

Example

To demonstrate the unintuitive nature of Stein’s example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein’s example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.

At first sight it appears that somehow we get a better estimator for US wheat yield by measuring some other unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar. However, we have not obtained a better estimator for US wheat yield by itself, but we have produced an estimator for the vector of the means of all three random variables, which has a reduced total risk. This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component. Also, a specific set of the three estimated mean values obtained with the new estimator will not necessarily be better than the ordinary set (the measured values). It is only on average that the new estimator is better.
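The distinction between total risk and the risk of a single component can be made concrete with a small simulation. The sketch below is only an illustration under assumed values: it uses the origin-shrinking James–Stein rule and a hypothetical parameter vector (one large component, nine zeros, all in units of the noise standard deviation), not real wheat or attendance figures. The ordinary estimator has mean square error exactly 1 in every component, so values above 1 mean that component got worse.

```python
import random


def james_stein(x):
    """Shrink x towards the origin: (1 - (n - 2) / ||x||^2) * x."""
    n = len(x)
    norm_sq = sum(v * v for v in x)
    return [(1.0 - (n - 2) / norm_sq) * v for v in x]


def componentwise_mse(theta, trials=20000, seed=2):
    """Monte Carlo mean square error of each James-Stein component."""
    rng = random.Random(seed)
    sums = [0.0] * len(theta)
    for _ in range(trials):
        # Independent unit-variance Gaussian measurements of each parameter.
        x = [rng.gauss(t, 1.0) for t in theta]
        estimate = james_stein(x)
        for i, (e, t) in enumerate(zip(estimate, theta)):
            sums[i] += (e - t) ** 2
    return [s / trials for s in sums]
```

With `theta = [5.0] + [0.0] * 9`, the first entry of `componentwise_mse(theta)` should exceed 1 (that parameter is estimated worse than by its own measurement), while the sum of all entries should fall below 10, the total risk of the ordinary estimator.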

Sketched proof

The risk function of the decision rule $d(\mathbf{x}) = \mathbf{x}$ is

$$R(\theta, d) = \operatorname{E}_{\theta}\left[\|\boldsymbol{\theta} - \mathbf{X}\|^2\right]$$

$$= \int (\boldsymbol{\theta} - \mathbf{x})^T (\boldsymbol{\theta} - \mathbf{x}) \left(\frac{1}{2\pi}\right)^{n/2} e^{(-1/2)(\boldsymbol{\theta} - \mathbf{x})^T (\boldsymbol{\theta} - \mathbf{x})} \, d\mathbf{x}$$

$$= n.$$

Now consider the decision rule

$$d'(\mathbf{x}) = \mathbf{x} - \frac{\alpha}{\|\mathbf{x}\|^2}\mathbf{x},$$

where $\alpha = n - 2$. We will show that $d'$ is a better decision rule than $d$. The risk function is

$$R(\theta, d') = \operatorname{E}_{\theta}\left[\left\|\boldsymbol{\theta} - \mathbf{X} + \frac{\alpha}{\|\mathbf{X}\|^2}\mathbf{X}\right\|^2\right]$$

$$= \operatorname{E}_{\theta}\left[\|\boldsymbol{\theta} - \mathbf{X}\|^2 + 2(\boldsymbol{\theta} - \mathbf{X})^T \frac{\alpha}{\|\mathbf{X}\|^2}\mathbf{X} + \frac{\alpha^2}{\|\mathbf{X}\|^4}\|\mathbf{X}\|^2\right]$$

$$= \operatorname{E}_{\theta}\left[\|\boldsymbol{\theta} - \mathbf{X}\|^2\right] + 2\alpha \operatorname{E}_{\theta}\left[\frac{(\boldsymbol{\theta} - \mathbf{X})^T \mathbf{X}}{\|\mathbf{X}\|^2}\right] + \alpha^2 \operatorname{E}_{\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right],$$

which is a quadratic in $\alpha$. We may simplify the middle term by considering a general “well-behaved” function $h : \mathbf{x} \mapsto h(\mathbf{x}) \in \mathbb{R}$ and using integration by parts. For $1 \leq i \leq n$, for any continuously differentiable $h$ growing sufficiently slowly for large $x_i$, and because the components of $\mathbf{X}$ are independent, we have:

$$\operatorname{E}_{\theta}\left[(\theta_i - X_i) h(\mathbf{X}) \mid X_j = x_j \ (j \neq i)\right] = \int (\theta_i - x_i) h(\mathbf{x}) \left(\frac{1}{2\pi}\right)^{1/2} e^{-(1/2)(\theta_i - x_i)^2} \, dx_i$$

$$= \left[h(\mathbf{x}) \left(\frac{1}{2\pi}\right)^{1/2} e^{-(1/2)(\theta_i - x_i)^2}\right]_{x_i = -\infty}^{\infty} - \int \frac{\partial h}{\partial x_i}(\mathbf{x}) \left(\frac{1}{2\pi}\right)^{1/2} e^{-(1/2)(\theta_i - x_i)^2} \, dx_i$$

$$= -\operatorname{E}_{\theta}\left[\frac{\partial h}{\partial x_i}(\mathbf{X}) \mid X_j = x_j \ (j \neq i)\right].$$

Therefore,

$$\operatorname{E}_{\theta}\left[(\theta_i - X_i) h(\mathbf{X})\right] = -\operatorname{E}_{\theta}\left[\frac{\partial h}{\partial x_i}(\mathbf{X})\right].$$

(This result is known as Stein’s lemma.) Now, we choose

$$h(\mathbf{x}) = \frac{x_i}{\|\mathbf{x}\|^2}.$$

If $h$ met the “well-behaved” condition (it doesn’t, but this can be remedied; see below), we would have

$$\frac{\partial h}{\partial x_i} = \frac{1}{\|\mathbf{x}\|^2} - \frac{2x_i^2}{\|\mathbf{x}\|^4}$$

and so

$$\operatorname{E}_{\theta}\left[\frac{(\boldsymbol{\theta} - \mathbf{X})^T \mathbf{X}}{\|\mathbf{X}\|^2}\right] = \sum_{i=1}^{n} \operatorname{E}_{\theta}\left[(\theta_i - X_i)\frac{X_i}{\|\mathbf{X}\|^2}\right]$$

$$= -\sum_{i=1}^{n} \operatorname{E}_{\theta}\left[\frac{1}{\|\mathbf{X}\|^2} - \frac{2X_i^2}{\|\mathbf{X}\|^4}\right]$$

$$= -(n-2)\operatorname{E}_{\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right].$$
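Stein’s lemma, as used in the step above, can be checked numerically in one dimension. The sketch below is only a sanity check under assumptions of its own: it picks $h = \tanh$ as an example of a smooth, slowly growing function (this choice, the integration window and the step count are illustrative, not from the proof) and evaluates both sides of $\operatorname{E}_{\theta}[(\theta - X)h(X)] = -\operatorname{E}_{\theta}[h'(X)]$ with the trapezoidal rule.

```python
import math


def gaussian_expectation(f, theta, half_width=12.0, steps=48000):
    """Approximate E[f(X)] for X ~ N(theta, 1) with the trapezoidal rule.

    The integral is truncated to [theta - half_width, theta + half_width],
    outside of which the Gaussian density is negligible.
    """
    start = theta - half_width
    step = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = start + k * step
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)
        total += weight * f(x) * density
    return total * step


def stein_lemma_sides(theta):
    """Both sides of E[(theta - X) h(X)] = -E[h'(X)] for h = tanh."""
    lhs = gaussian_expectation(lambda x: (theta - x) * math.tanh(x), theta)
    rhs = -gaussian_expectation(lambda x: 1.0 - math.tanh(x) ** 2, theta)
    return lhs, rhs
```

For any `theta`, the two values returned by `stein_lemma_sides(theta)` should agree up to numerical integration error. The function $x_i / \|\mathbf{x}\|^2$ chosen in the proof is not covered by this check because of its singularity at the origin.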

Then returning to the risk function of $d'$:

$$R(\theta, d') = n - 2\alpha(n-2)\operatorname{E}_{\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right] + \alpha^2 \operatorname{E}_{\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right].$$

This quadratic in $\alpha$ is minimized at $\alpha = n - 2$, giving

$$R(\theta, d') = R(\theta, d) - (n-2)^2 \operatorname{E}_{\theta}\left[\frac{1}{\|\mathbf{X}\|^2}\right]$$

which of course satisfies

$$R(\theta, d') < R(\theta, d),$$

making $d$ an inadmissible decision rule.
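As a quick sanity check on the last formula, consider the special case $\boldsymbol{\theta} = 0$ (an illustrative choice, not part of the argument above). Then $\|\mathbf{X}\|^2$ follows a chi-squared distribution with $n$ degrees of freedom, whose inverse moment is $1/(n-2)$ for $n \geq 3$:

```latex
\|\mathbf{X}\|^2 \sim \chi^2_n
\quad\Longrightarrow\quad
\operatorname{E}_{0}\!\left[\frac{1}{\|\mathbf{X}\|^2}\right] = \frac{1}{n-2},
\qquad
R(0, d') = n - \frac{(n-2)^2}{n-2} = 2 .
```

So at the origin the risk of $d'$ is 2 whatever the dimension, against $n$ for $d$. The inverse moment is infinite for $n \leq 2$, which is consistent with the requirement $n \geq 3$.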

It remains to justify the use of

$$h(\mathbf{X}) = \frac{\mathbf{X}}{\|\mathbf{X}\|^2}.$$

This function is not continuously differentiable, since it is singular at $\mathbf{x} = 0$. However, the function

$$h(\mathbf{X}) = \frac{\mathbf{X}}{\varepsilon + \|\mathbf{X}\|^2}$$

is continuously differentiable, and after following the algebra through and letting $\varepsilon \to 0$, one obtains the same result.

References

Lehmann, E. L.; Casella, G. (1998), “ch.5”, Theory of Point Estimation (2nd ed.), ISBN 0-471-05849-1
Stein, C. (1956). “Inadmissibility of the usual estimator for the mean of a multivariate distribution”. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. pp. 197–206. MR 0084922.
Samworth, R. J. (2012), “Stein’s Paradox” (PDF), Eureka, 62: 38–41
