User:Boris Tsirelson/Sandbox1: Difference between revisions

From Citizendium
imported>Boris Tsirelson (formula formatting)

Conditioning (probability)
16:17, 18 November 2008
Beliefs depend on the available information. This idea is formalized in [[probability theory]] by '''conditioning'''. Conditional [[probability|probabilities]], conditional [[Expected value|expectations]] and conditional [[Probability distribution|distributions]] are treated on three levels: [[Discrete probability distribution|discrete probabilities]], [[probability density function]]s, and [[measure theory]]. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.


==Conditioning on the discrete level==


'''Example.''' A fair coin is tossed 10 times; the [[random variable]] <math>X</math> is the number of heads in these 10 tosses, and <math>Y</math> is the number of heads in the first 3 tosses. Although <math>Y</math> is determined before <math>X</math>, it may happen that someone knows <math>X</math> but not <math>Y</math>.


===Conditional probability===
{{Main|Conditional probability}}
Given that <math>X=1,</math> the conditional probability of the event <math>Y=0</math> is <math>P(Y=0|X=1)=P(Y=0,X=1)/P(X=1)=0.7.</math> More generally,
: <math> \mathbb{P} (Y=0|X=x) = \frac{ \binom 7 x }{ \binom{10} x } = \frac{ 7! (10-x)! }{ (7-x)! 10! } </math>
for <math>x=0,1,\dots,7;</math> otherwise (for <math>x=8,9,10</math>), <math>P(Y=0|X=x)=0.</math> One may also treat the conditional probability as a random variable — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{P} (Y=0|X) = \begin{cases}
  \binom 7 X / \binom{10}X &\text{for } X \le 7,\\
  0 &\text{for } X > 7.
\end{cases} </math>
The expectation of this random variable is equal to the (unconditional) probability <math>\mathbb{P}(Y=0),</math> namely,
: <math>\sum_{x=0}^7 \frac{ \binom 7 x }{ \binom{10}x } \cdot \frac1{2^{10}} \binom{10}x = \frac 1 8 , </math>
which is an instance of the [[law of total probability]] <math>E(P(A|X))=P(A).</math>


Thus, <math>P(Y=0|X=1)</math> may be treated as the value of the random variable <math>P(Y=0|X)</math> corresponding to <math>X=1.</math> <cite id="EP8">On the other hand, <math>P(Y=0|X=1)</math> is well-defined irrespective of other possible values of <math>X</math>.</cite>
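These discrete computations can be verified exactly in a few lines (an illustrative sketch, not part of the original article; the helper names <code>p_joint</code> and <code>p_X</code> are ad hoc):

```python
from fractions import Fraction
from math import comb

# Joint p.m.f. of (Y, X) for 10 fair tosses: Y heads among the first 3,
# X - Y heads among the remaining 7.  (Helper names are ad hoc.)
def p_joint(y, x):
    if 0 <= y <= 3 and 0 <= x - y <= 7:
        return Fraction(comb(3, y) * comb(7, x - y), 2**10)
    return Fraction(0)

def p_X(x):
    return Fraction(comb(10, x), 2**10)

# P(Y=0 | X=x) = C(7,x)/C(10,x) for x = 0,...,7; in particular 0.7 at x = 1
for x in range(8):
    assert p_joint(0, x) / p_X(x) == Fraction(comb(7, x), comb(10, x))
assert p_joint(0, 1) / p_X(1) == Fraction(7, 10)

# Law of total probability: E(P(Y=0 | X)) = P(Y=0) = 1/8
total = sum((p_joint(0, x) / p_X(x)) * p_X(x) for x in range(11))
assert total == Fraction(1, 8)
```

Using exact rationals avoids any floating-point doubt about the identities.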


===Conditional expectation===
{{Main|Conditional expectation}}
Given that <math>X=1,</math> the conditional expectation of the random variable <math>Y</math> is <math>E(Y|X=1)=0.3.</math> More generally,
: <math> \mathbb{E} (Y|X=x) = \frac3{10} x </math>
for <math>x=0,\dots,10.</math> (In this example it appears to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{E} (Y|X) = \frac3{10} X. </math>
The expectation of this random variable is equal to the (unconditional) expectation of <math>Y</math>,
: <math> \mathbb{E} ( \mathbb{E} (Y|X) ) = \sum_x \mathbb{E} (Y|X=x) \mathbb{P} (X=x) = \mathbb{E} (Y), </math>
namely,
: <math>\sum_{x=0}^{10} \frac{3}{10} x \cdot \frac1{2^{10}} \binom{10}x = \frac 3 2 \, , </math> &nbsp; or simply &nbsp; <math> \mathbb{E} \Big( \frac3{10} X \Big) = \frac3{10} \mathbb{E} (X) = \frac3{10} \cdot 5 = \frac32 \, , </math>
which is an instance of the [[law of total expectation]] <math>E(E(Y|X))=E(Y).</math>
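Both the linear form of <math>E(Y|X=x)</math> and the law of total expectation can be checked exactly (a sketch; the helper names are ad hoc):

```python
from fractions import Fraction
from math import comb

def p_joint(y, x):   # joint p.m.f. of (Y, X) for the 10-toss example
    if 0 <= y <= 3 and 0 <= x - y <= 7:
        return Fraction(comb(3, y) * comb(7, x - y), 2**10)
    return Fraction(0)

def p_X(x):
    return Fraction(comb(10, x), 2**10)

def cond_exp_Y(x):   # E(Y | X=x) computed from the definition
    return sum(y * p_joint(y, x) for y in range(4)) / p_X(x)

# E(Y | X=x) = (3/10) x -- linear in this particular example
for x in range(11):
    assert cond_exp_Y(x) == Fraction(3, 10) * x

# Law of total expectation: E(E(Y|X)) = E(Y) = 3/2
total_exp = sum(cond_exp_Y(x) * p_X(x) for x in range(11))
assert total_exp == Fraction(3, 2)
```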


The random variable <math>E(Y|X)</math> is the best predictor of <math>Y</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(Y-f(X))^2</math> over the class of all random variables of the form <math>f(X).</math> This class of random variables remains intact if <math>X</math> is replaced, say, with <math>2X.</math> Thus, <math>E(Y|2X)=E(Y|X).</math> It does not mean that <math>E(Y|2X)=0.3\cdot2X;</math> rather, <math>E(Y|2X)=0.15\cdot2X=0.3X.</math> In particular, <math>E(Y|2X=2)=0.3.</math> More generally, <math>E(Y|g(X))=E(Y|X)</math> for every function <math>g</math> that is one-to-one on the set of all possible values of <math>X</math>. The values of <math>X</math> are irrelevant; what matters is the partition (denote it <math>\alpha_X</math>)
: <math> \Omega = \{ X=x_1 \} \uplus \{ X=x_2 \} \uplus \dots </math>
of the sample space Ω into disjoint sets <math> \{ X=x_n \}. </math> (Here <math> x_1, x_2, \dots </math> are all possible values of <math>X</math>.) Given an arbitrary partition α of Ω, one may define the random variable <math>E(Y|\alpha).</math> Still, <math>E(E(Y|\alpha))=E(Y).</math>


Conditional probability may be treated as a special case of conditional expectation. Namely, <math>P(A|X)=E(Y|X)</math> if <math>Y</math> is the [[indicator function|indicator]] of <math>A</math>. Therefore the conditional probability also depends on the partition <math>\alpha_X</math> generated by <math>X</math> rather than on <math>X</math> itself; <math>P(A|g(X))=P(A|X)=P(A|\alpha),</math> <math>\alpha=\alpha_X=\alpha_{g(X)}.</math>


On the other hand, conditioning on an event <math>B</math> is well-defined, provided that <math>P(B)\ne0,</math> irrespective of any partition that may contain <math>B</math> as one of several parts.


===Conditional distribution===
{{main|Conditional probability distribution}}
Given <math>X=x,</math> the conditional distribution of <math>Y</math> is
: <math> \mathbb{P} ( Y=y | X=x ) = \frac{ \binom 3 y \binom 7 {x-y} }{ \binom{10}x } = \frac{ \binom x y \binom{10-x}{3-y} }{ \binom{10}3 } </math>
for <math>0 \le y \le \min(3,x).</math> It is the [[hypergeometric distribution]] <math>H(x;3,7),</math> or equivalently, <math>H(3;x,10-x).</math> The corresponding expectation <math>0.3x,</math> obtained from the general formula <math> n \frac{R}{R+W} </math> for <math>H(n;R,W),</math> is nothing but the conditional expectation <math>E(Y|X=x)=0.3x.</math>


Treating <math>H(X;3,7)</math> as a random distribution (a random vector in the four-dimensional space of all measures on <math>\{0,1,2,3\}</math>), one may take its expectation, getting the unconditional distribution of <math>Y</math> — the [[binomial distribution]] <math>\text{Bin}(3,0.5).</math> This fact amounts to the equality
: <math> \sum_{x=0}^{10} \mathbb{P} ( Y=y | X=x ) \mathbb{P} (X=x) = \mathbb{P} (Y=y) = \frac1{2^3} \binom 3 y </math>
for <math>y=0,1,2,3;</math> just the law of total probability.
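Both the equivalence of the two hypergeometric forms above and the mixture identity can be confirmed exactly (a sketch; the helper names are ad hoc):

```python
from fractions import Fraction
from math import comb

def p_X(x):
    return Fraction(comb(10, x), 2**10)

def hyper_a(y, x):   # P(Y=y | X=x) written as H(x; 3, 7)
    if 0 <= y <= min(3, x) and x - y <= 7:
        return Fraction(comb(3, y) * comb(7, x - y), comb(10, x))
    return Fraction(0)

def hyper_b(y, x):   # the same distribution written as H(3; x, 10-x)
    if 0 <= y <= x and 3 - y <= 10 - x:
        return Fraction(comb(x, y) * comb(10 - x, 3 - y), comb(10, 3))
    return Fraction(0)

# The two hypergeometric forms agree everywhere
for x in range(11):
    for y in range(4):
        assert hyper_a(y, x) == hyper_b(y, x)

# Mixing over the distribution of X recovers the binomial Bin(3, 0.5)
mixture = [sum(hyper_a(y, x) * p_X(x) for x in range(11)) for y in range(4)]
assert mixture == [Fraction(comb(3, y), 2**3) for y in range(4)]
```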


==Conditioning on the level of densities==
{{main|Probability density function|Conditional probability distribution}}
'''Example.''' A point of the sphere <math>x^2+y^2+z^2=1</math> is chosen at random according to the uniform distribution on the sphere <ref>[[n-sphere#Generating points on the surface of the n-ball]]</ref>
<ref>[http://en.wikibooks.org/wiki/Mathematica/Uniform_Spherical_Distribution wikibooks: Uniform Spherical Distribution]</ref>. The random variables <math>X</math>, <math>Y</math>, <math>Z</math> are the coordinates of the random point. The joint density of <math>X</math>, <math>Y</math>, <math>Z</math> does not exist (since the sphere is of zero volume), but the joint density <math>f_{X,Y}</math> of <math>X</math>, <math>Y</math> exists,
: <math> f_{X,Y} (x,y) = \begin{cases}
   \frac1{2\pi\sqrt{1-x^2-y^2}} &\text{if } x^2+y^2<1,\\
   0 &\text{otherwise}.
  \end{cases} </math>
(The density is non-constant because of a non-constant angle between the sphere and the plane<ref>[[Area#General formula]]</ref>.) The density of <math>X</math> may be calculated by integration,
: <math> f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x,y) \, \mathrm{d}y = \int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}} \frac{ \mathrm{d}y }{ 2\pi\sqrt{1-x^2-y^2} } \, ; </math>
surprisingly, the result does not depend on <math>x</math> in <math>(-1,1),</math>
: <math> f_X(x) = \begin{cases}
  0.5 &\text{for } -1<x<1,\\
  0 &\text{otherwise},
\end{cases} </math>
which means that <math>X</math> is distributed uniformly on <math>(-1,1).</math> The same holds for <math>Y</math> and <math>Z</math> (and in fact, for <math>aX+bY+cZ</math> whenever <math>a^2+b^2+c^2=1</math>).
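Archimedes' observation that each coordinate is uniform on <math>(-1,1)</math> can be checked by simulation; the sampler below normalizes a Gaussian triple, which is one standard way to generate uniform points on the sphere (an illustrative sketch; the sample size and tolerance are arbitrary):

```python
import random
import math

random.seed(0)

def random_point_on_sphere():
    # A normalized triple of independent standard Gaussians is
    # uniformly distributed on the unit sphere.
    while True:
        x, y, z = (random.gauss(0, 1) for _ in range(3))
        r = math.sqrt(x*x + y*y + z*z)
        if r > 1e-12:          # reject the (practically impossible) origin
            return x / r, y / r, z / r

n = 100_000
xs = [random_point_on_sphere()[0] for _ in range(n)]

# X should be uniform on (-1, 1): each quarter-interval gets about 25%
for lo in (-1.0, -0.5, 0.0, 0.5):
    frac = sum(lo < x <= lo + 0.5 for x in xs) / n
    assert abs(frac - 0.25) < 0.01
```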


===Conditional probability===

====Calculation====
Given that <math>X=0.5,</math> the conditional probability of the event <math>Y\le0.75</math> is the integral of the conditional density,
: <math> \begin{align}
& f_{Y|X=0.5}(y) = \frac{ f_{X,Y}(0.5,y) }{ f_X(0.5) } = \begin{cases}
 \frac1{\pi\sqrt{0.75-y^2}} &\text{for } -\sqrt{0.75}<y<\sqrt{0.75},\\
 0 &\text{otherwise};
\end{cases} \\
& \mathbb{P} (Y \le 0.75|X=0.5) = \int_{-\infty}^{0.75} f_{Y|X=0.5}(y) \, \mathrm{d}y = \int_{-\sqrt{0.75}}^{0.75} \frac{ \mathrm{d}y }{ \pi\sqrt{0.75-y^2} } = \frac12 + \frac1\pi \arcsin \sqrt{0.75} = \frac56 \, .
\end{align} </math>
More generally,
: <math> \mathbb{P} (Y \le y|X=x) = \frac12 + \frac1{\pi} \arcsin \frac{ y }{ \sqrt{1-x^2} } </math>
for all <math>x</math> and <math>y</math> such that <math>-1<x<1</math> (otherwise the denominator <math>f_X(x)</math> vanishes) and <math>\textstyle -\sqrt{1-x^2} < y < \sqrt{1-x^2} </math> (otherwise the conditional probability degenerates to 0 or 1). One may also treat the conditional probability as a random variable, — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{P} (Y \le y|X) = \begin{cases}
  0 &\text{for } X^2 \ge 1-y^2 \text{ and } y<0,\\
  \frac12 + \frac1{\pi} \arcsin \frac{ y }{ \sqrt{1-X^2} } &\text{for } X^2 < 1-y^2,\\
  1 &\text{for } X^2 \ge 1-y^2 \text{ and } y>0.
\end{cases} </math>
The expectation of this random variable is equal to the (unconditional) probability,
: <cite id="DPC8"> <math> \mathbb{E} ( \mathbb{P} (Y\le y|X) ) = \int_{-\infty}^{+\infty} \mathbb{P} (Y\le y|X=x) f_X(x) \, \mathrm{d}x = \mathbb{P} (Y\le y), </math> </cite>
which is an instance of the [[law of total probability]] <math>E(P(A|X))=P(A).</math>
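The value 5/6 and the law of total probability can be checked numerically from the arcsin formula (a sketch; the midpoint-rule step count and tolerance are arbitrary):

```python
import math

def cond_cdf(y, x):
    # P(Y <= y | X = x) for a uniform point on the unit sphere
    if x*x >= 1 - y*y:
        return 0.0 if y < 0 else 1.0
    return 0.5 + math.asin(y / math.sqrt(1 - x*x)) / math.pi

# The worked value: P(Y <= 0.75 | X = 0.5) = 5/6
assert abs(cond_cdf(0.75, 0.5) - 5/6) < 1e-12

# Law of total probability, numerically: integrate P(Y<=y | X=x) f_X(x) dx
# over (-1,1) by the midpoint rule, with f_X = 0.5; compare with
# P(Y <= y) = (y+1)/2 since Y is uniform on (-1,1).
y = 0.75
n = 200_000
total = sum(cond_cdf(y, -1 + (2*i + 1) / n) * (2 / n) * 0.5 for i in range(n))
assert abs(total - (y + 1) / 2) < 1e-4
```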


====Interpretation====
The conditional probability <math>P(Y\le0.75|X=0.5)</math> cannot be interpreted as <math>P(Y\le0.75,X=0.5)/P(X=0.5),</math> since the latter gives 0/0. Accordingly, <math>P(Y\le0.75|X=0.5)</math> cannot be interpreted via empirical frequencies, since the exact value <math>X=0.5</math> has no chance to appear at random, not even once during an infinite sequence of independent trials.


The conditional probability can be interpreted as a limit,
: <cite id="DPI5"> <math> \mathbb{P} (Y\le0.75|X=0.5) = \lim_{\varepsilon\to0+} \mathbb{P} ( Y\le0.75 | 0.5-\varepsilon<X<0.5+\varepsilon ) \, . </math> </cite>


===Conditional expectation===
The conditional expectation <math>E(Y|X=0.5)</math> is of little interest; it vanishes just by symmetry. It is more interesting to calculate <math>E(|Z| \mid X=0.5)</math> treating <math>|Z|</math> as a function of <math>X</math>, <math>Y</math>:
: <math> \begin{align}
& |Z| = h(X,Y) = \sqrt{1-X^2-Y^2} \, ; \\
& \mathbb{E} ( |Z| | X=0.5 ) = \int_{-\infty}^{+\infty} h(0.5,y) f_{Y|X=0.5} (y) \, \mathrm{d}y = \int_{-\sqrt{0.75}}^{+\sqrt{0.75}} \sqrt{0.75-y^2} \cdot \frac{ \mathrm{d}y }{ \pi\sqrt{0.75-y^2} } = \frac2\pi \sqrt{0.75} \, .
\end{align} </math>
More generally,
: <math> \mathbb{E} ( |Z| | X=x ) = \frac2\pi \sqrt{1-x^2} </math>
for <math>-1<x<1.</math> One may also treat the conditional expectation as a random variable — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{E} ( |Z| | X ) = \frac2\pi \sqrt{1-X^2} \, . </math>
The expectation of this random variable is equal to the (unconditional) expectation of <math>|Z|,</math>
: <math> \mathbb{E} ( \mathbb{E} ( |Z| | X ) ) = \int_{-\infty}^{+\infty} \mathbb{E} ( |Z| | X=x ) f_X(x) \, \mathrm{d}x = \mathbb{E} (|Z|) \, , </math>
namely,
: <math> \int_{-1}^{+1} \frac2\pi \sqrt{1-x^2} \cdot \frac{ \mathrm{d}x }2 = \frac12 \, , </math>
which is an instance of the [[law of total expectation]] <math>E(E(Y|X))=E(Y).</math>


The random variable <math>E(|Z||X)</math> is the best predictor of <math>|Z|</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(|Z|-f(X))^2</math> on the class of all random variables of the form <math>f(X).</math> Similarly to the discrete case, <math>E(|Z||g(X))=E(|Z||X)</math> for every measurable function <math>g</math> that is one-to-one on <math>(-1,1).</math>
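A quick numerical check of <math>E(|Z| \mid X=x)=\tfrac2\pi\sqrt{1-x^2}</math> and of the total-expectation integral (a sketch; sample sizes and tolerances are arbitrary):

```python
import math
import random

random.seed(1)

def cond_exp_absZ(x):
    # E(|Z| | X = x) from the text
    return (2 / math.pi) * math.sqrt(1 - x*x)

# Monte Carlo check of E(|Z|) = E(E(|Z| | X)) = 1/2 on the uniform sphere
n = 200_000
acc = 0.0
for _ in range(n):
    while True:
        x, y, z = (random.gauss(0, 1) for _ in range(3))
        r = math.sqrt(x*x + y*y + z*z)
        if r > 1e-12:
            break
    acc += abs(z / r)
assert abs(acc / n - 0.5) < 0.01

# Numeric check of the integral: int (2/pi) sqrt(1-x^2) * dx/2 = 1/2
m = 100_000
integral = sum(cond_exp_absZ(-1 + (2*i + 1) / m) * (2 / m) * 0.5 for i in range(m))
assert abs(integral - 0.5) < 1e-4
```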


===Conditional distribution===
Given <math>X=x,</math> the conditional distribution of <math>Y</math>, given by the density <math>f_{Y|X=x}(y),</math> is the (rescaled) arcsin distribution; its cumulative distribution function is
: <math> F_{Y|X=x} (y) = \mathbb{P} ( Y \le y | X = x ) = \frac12 + \frac1\pi \arcsin \frac{y}{\sqrt{1-x^2}} </math>
for all <math>x</math> and <math>y</math> such that <math>x^2+y^2<1.</math> The corresponding expectation of <math>h(x,Y)</math> is nothing but the conditional expectation <math>E(h(X,Y)|X=x).</math> The [[Mixture density|mixture]] of these conditional distributions, taken for all <math>x</math> (according to the distribution of <math>X</math>), is the unconditional distribution of <math>Y</math>. This fact amounts to the equalities
: <math> \begin{align}
& \int_{-\infty}^{+\infty} f_{Y|X=x} (y) f_X(x) \, \mathrm{d}x = f_Y(y) \, , \\
& \int_{-\infty}^{+\infty} F_{Y|X=x} (y) f_X(x) \, \mathrm{d}x = F_Y(y) \, .
\end{align} </math>


==What conditioning is not==
On the discrete level conditioning is possible only if the condition is of nonzero probability (one cannot divide by zero). On the level of densities, conditioning on <math>X=x</math> is possible even though <math>P(X=x)=0.</math> This success may create the illusion that conditioning is ''always'' possible. Regretfully, it is not, for several reasons presented below.


===Geometric intuition: caution===
The result <math>P(Y\le0.75|X=0.5)=5/6,</math> mentioned above, is geometrically evident in the following sense. The points <math>(x,y,z)</math> of the sphere <math>x^2+y^2+z^2=1</math> satisfying the condition <math>x=0.5</math> form a circle <math>y^2+z^2=0.75</math> of radius <math> \sqrt{0.75} </math> in the plane <math>x=0.5.</math> The inequality <math>y\le0.75</math> holds on an arc. The length of the arc is 5/6 of the length of the circle, which is why the conditional probability is equal to 5/6.


This successful geometric explanation may create the illusion that the following question is trivial.
: A point of a given sphere is chosen at random (uniformly). Given that the point lies on a given plane, what is its conditional distribution?


It may seem evident that the conditional distribution must be uniform on the given circle (the intersection of the given sphere and the given plane). Sometimes it really is, but in general it is not. In particular, <math>Z</math> is distributed uniformly on <math>(-1,+1)</math> and independent of the ratio <math>Y/X,</math> thus, <math>P(Z\le0.5|Y/X)=0.75.</math> On the other hand, the inequality <math>z\le0.5</math> holds on an arc of the circle <math>x^2+y^2+z^2=1,</math> <math>y=cx</math> (for any given <math>c</math>). The length of the arc is 2/3 of the length of the circle. However, the conditional probability is 3/4, not 2/3. This is a manifestation of the classical Borel paradox<ref>{{harvnb|Pollard|2002|loc=Sect. 5.5, Example 17 on page 122}}</ref> <ref>{{harvnb|Durrett|1996|loc=Sect. 4.1(a), Example 1.6 on page 224}}</ref>.
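The paradox can be seen in simulation: a thin "wedge" around <math>Y/X=c</math> and a thin slab around the plane <math>y=cx</math> both shrink to the same circle, yet give different conditional probabilities (an illustrative sketch; the value of <math>c</math>, the widths, the sample size, and the tolerances are arbitrary choices):

```python
import math
import random

random.seed(2)

c = 1.0              # the plane y = c x; any c exhibits the paradox
eps = 0.05           # half-width of both shrinking conditioning events
n = 500_000
hits_ratio = ok_ratio = 0
hits_slab = ok_slab = 0
for _ in range(n):
    x, y, z = (random.gauss(0, 1) for _ in range(3))
    r = math.sqrt(x*x + y*y + z*z)
    x, y, z = x/r, y/r, z/r            # uniform point on the sphere
    if x != 0 and abs(y/x - c) < eps:  # condition via the ratio Y/X
        hits_ratio += 1
        ok_ratio += (z <= 0.5)
    if abs(y - c*x) < eps:             # condition via a slab around the plane
        hits_slab += 1
        ok_slab += (z <= 0.5)

# Same limiting circle, different answers: about 3/4 versus about 2/3
assert abs(ok_ratio / hits_ratio - 0.75) < 0.03
assert abs(ok_slab / hits_slab - 2/3) < 0.03
```

The slab conditioning converges to arc length (uniform on the circle), while the ratio conditioning inherits the independence of <math>Z</math> from <math>Y/X</math>.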


{{quote|Appeals to symmetry can be misleading if not formalized as invariance arguments.|Pollard<ref name="Pollard-5.5-122">{{harvnb|Pollard|2002|loc=Sect. 5.5, page 122}}</ref>}}


===The limiting procedure===
Given an event <math>B</math> of zero probability, the formula <math>\textstyle \mathbb{P} (A|B) = \mathbb{P} ( A \cap B ) / \mathbb{P} (B) </math> is useless. However, one can try <math>\textstyle \mathbb{P} (A|B) = \lim_{n\to\infty} \mathbb{P} ( A \cap B_n ) / \mathbb{P} (B_n) </math> for an appropriate sequence of events <math>B_n</math> of nonzero probability such that <math>B_n \downarrow B</math> (that is, <math>\textstyle B_1 \supset B_2 \supset \dots </math> and <math>\textstyle B_1 \cap B_2 \cap \dots = B </math>). One example is given [[#DPI5|above]]. Two more examples are [[Wiener process#Related processes|Brownian bridge and Brownian excursion]].
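This limiting procedure can be illustrated by simulation: condition on <math>B_n=\{|X-0.5|<\varepsilon\}</math>, which has positive probability, and let <math>\varepsilon</math> shrink (a sketch; the sample size, <math>\varepsilon</math>, and the tolerance are arbitrary):

```python
import math
import random

random.seed(3)

def cond_prob(eps, n=500_000):
    # Estimate P(Y <= 0.75 | |X - 0.5| < eps) for a uniform point
    # on the unit sphere (B_n has positive probability for eps > 0).
    hits = ok = 0
    for _ in range(n):
        x, y, z = (random.gauss(0, 1) for _ in range(3))
        r = math.sqrt(x*x + y*y + z*z)
        x, y = x/r, y/r
        if abs(x - 0.5) < eps:
            hits += 1
            ok += (y <= 0.75)
    return ok / hits

# Shrinking the conditioning event approaches P(Y <= 0.75 | X = 0.5) = 5/6
estimate = cond_prob(0.05)
assert abs(estimate - 5/6) < 0.02
```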


In the latter two examples the law of total probability is irrelevant, since only a single event (the condition) is given. In contrast, in the example [[#DPI5|above]] the law of total probability [[#DPC8|applies]], since the event <math>X=0.5</math> is included into a family of events <math>X=x</math> where <math>x</math> runs over <math>(-1,1),</math> and these events are a partition of the probability space.


In order to avoid paradoxes (such as the [[Borel's paradox]]), the following important distinction should be taken into account. If a given event is of nonzero probability then conditioning on it is well-defined (irrespective of any other events), as was noted [[#EP8|above]]. In contrast, if the given event is of zero probability then conditioning on it is ill-defined unless some additional input is provided. Wrong choice of this additional input leads to wrong conditional probabilities (expectations, distributions). In this sense, "''the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.''" ([[Andrey Kolmogorov|Kolmogorov]]; quoted in <ref name="Pollard-5.5-122">{{harvnb|Pollard|2002|loc=Sect. 5.5, page 122}}</ref>).


The additional input may be (a) a symmetry (invariance group); (b) a sequence of events <math>B_n</math> such that <math>B_n \downarrow B,</math> <math>P(B_n)>0;</math> (c) a partition containing the given event. Measure-theoretic conditioning (below) investigates Case (c), discloses its relation to (b) in general and to (a) when applicable.


Some events of zero probability are beyond the reach of conditioning. An example: let <math>X_n</math> be independent random variables distributed uniformly on <math>(0,1),</math> and <math>B</math> the event "<math>X_n \to 0</math> as <math>\textstyle n\to\infty</math>"; what about <math>P(X_n<0.5|B)</math>? Does it tend to 1, or not? Another example: let <math>X</math> be a random variable distributed uniformly on <math>(0,1),</math> and <math>B</math> the event "<math>X</math> is a rational number"; what about <math>P(X=1/n|B)</math>?
The only answer is that, once again,
{{quote|the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.|Kolmogorov, quoted in <ref name="Pollard-5.5-122"/>}}
==Conditioning on the level of measure theory==
{{main|Conditional expectation}}
'''Example.''' Let <math>Y</math> be a random variable distributed uniformly on <math>(0,1),</math> and <math>X=f(Y)</math> where <math>f</math> is a given function. Two cases are treated below: <math>f=f_1</math> and <math>f=f_2,</math> where <math>f_1</math> is the continuous piecewise-linear function
: <math> f_1(y) = \begin{cases}
  3y &\text{for } 0 \le y \le 1/3,\\
  1.5(1-y) &\text{for } 1/3 \le y \le 2/3,\\
  0.5 &\text{for } 2/3 \le y \le 1,
\end{cases} </math>
and <math>f_2</math> is the [[Weierstrass function]].


===Geometric intuition: caution===
Given <math>X=0.75,</math> two values of <math>Y</math> are possible, 0.25 and 0.5. It may seem evident that both values are of conditional probability 0.5 just because one point is [[Congruence (geometry)|congruent]] to another point. However, this is an illusion; see below.


===Conditional probability===
The conditional probability <math>P(Y \le 1/3|X)</math> may be defined as the best predictor of the indicator
: <math> I = \begin{cases}
  1 &\text{if } Y \le 1/3,\\
  0 &\text{otherwise},
\end{cases} </math>
given <math>X</math>. That is, it minimizes the mean square error <math>E(I-g(X))^2</math> on the class of all random variables of the form <math>g(X).</math>


In the case <math>f=f_1</math> the corresponding function <math>g=g_1</math> may be calculated explicitly,<ref group="details">
Proof:
: <math> \begin{align}
& \mathbb{E} \, ( I - g(X) )^2 = \int_0^1 ( I - g(f_1(y)) )^2 \, \mathrm{d}y \\
& = \int_0^{1/3} ( 1 - g(3y) )^2 \, \mathrm{d}y + \int_{1/3}^{2/3} g^2 ( 1.5(1-y) ) \, \mathrm{d}y + \int_{2/3}^1 g^2 (0.5) \, \mathrm{d}y \\
& = \frac13 \int_0^{0.5} (1-g(x))^2 \, \mathrm{d}x + \frac13 g^2(0.5) + \frac13 \int_{0.5}^1 ( (1-g(x))^2 + 2g^2(x) ) \, \mathrm{d}x \, ;
\end{align} </math>
it remains to note that <math>(1-a)^2+2a^2</math> is minimal at <math>a=1/3.</math>
</ref>
: <math> g_1(x) = \begin{cases}
  1 &\text{for } 0 < x < 0.5,\\
  0 &\text{for } x = 0.5,\\
  1/3 &\text{for } 0.5 < x < 1;
\end{cases} </math>
alternatively, the limiting procedure
: <math> g_1(x) = \lim_{\varepsilon\to0+} \mathbb{P} ( Y \le 1/3 \, | \, x-\varepsilon \le X \le x+\varepsilon ) </math>
may be used,
giving the same result.


Thus, <math>P(Y \le 1/3|X)=g_1(X).</math> The expectation of this random variable is equal to the (unconditional) probability, <math>E(P(Y \le 1/3|X))=P(Y \le 1/3),</math> namely,
: <math> 1 \cdot \mathbb{P} (X<0.5) + 0 \cdot \mathbb{P} (X=0.5) + \frac13 \cdot \mathbb{P} (X>0.5) = 1 \cdot \frac16 + 0 \cdot \frac13 + \frac13 \cdot \Big( \frac16 + \frac13 \Big) = \frac13 \, , </math>
which is an instance of the [[law of total probability]] <math>E(P(A|X))=P(A).</math>
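For <math>f=f_1</math> the explicit values of <math>g_1</math> can also be checked by simulation. The Python sketch below is ours, not the article's: it groups samples by <math>X<0.5,</math> <math>X=0.5,</math> <math>X>0.5</math> and compares the empirical conditional frequencies of the event <math>Y \le 1/3</math> with the values 1, 0, 1/3.

```python
import random

random.seed(1)

def f1(y):
    # The piecewise-linear function f_1 of the example.
    if y <= 1/3:
        return 3 * y
    if y <= 2/3:
        return 1.5 * (1 - y)
    return 0.5

n = 300_000
hits = {"lo": [0, 0], "atom": [0, 0], "hi": [0, 0]}  # [successes, total]
for _ in range(n):
    y = random.random()
    x = f1(y)
    key = "atom" if x == 0.5 else ("lo" if x < 0.5 else "hi")
    hits[key][1] += 1
    hits[key][0] += y <= 1/3

g_lo = hits["lo"][0] / hits["lo"][1]        # ~ 1   (g_1 on (0, 0.5))
g_atom = hits["atom"][0] / hits["atom"][1]  # ~ 0   (g_1 at the atom 0.5)
g_hi = hits["hi"][0] / hits["hi"][1]        # ~ 1/3 (g_1 on (0.5, 1))
total = sum(v[0] for v in hits.values()) / n  # ~ P(Y <= 1/3) = 1/3
```

The last average is the law of total probability again: averaging <math>g_1(X)</math> recovers the unconditional probability 1/3.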


In the case <math>f=f_2</math> the corresponding function <math>g=g_2</math> probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically. Indeed, the [[Lp_space#Hilbert_spaces|space]] L<sub>2</sub> (Ω) of all square integrable random variables is a [[Hilbert space]]; the indicator <math>I</math> is a vector of this space; and random variables of the form <math>g(X)</math> are a (closed, linear) subspace. The [[Hilbert_space#Orthogonal_complements_and_projections|orthogonal projection]] of this vector to this subspace is well-defined. It can be computed numerically, using [[Galerkin method|finite-dimensional approximations]] to the infinite-dimensional Hilbert space.
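The finite-dimensional approximation can be sketched crudely in code. The Python fragment below is an illustration of ours, not the article's method: a truncated Weierstrass-type series (parameters <math>a=1/2,</math> <math>b=3</math> are our assumption) stands in for <math>f_2</math>, and a piecewise-constant basis on bins of <math>X</math>-values plays the role of the finite-dimensional subspace onto which the indicator is projected.

```python
import math
import random

random.seed(2)

def f2(y, terms=12):
    # Truncated Weierstrass-type series (a = 1/2, b = 3) -- an assumed
    # stand-in for the Weierstrass function f_2 of the example.
    return sum(0.5 ** k * math.cos(3 ** k * math.pi * y) for k in range(terms))

n = 100_000
data = []
for _ in range(n):
    y = random.random()
    data.append((f2(y), 1 if y <= 1/3 else 0))

# Project the indicator onto a piecewise-constant (bin-wise) basis:
# within each bin of X-values, the projection is the empirical frequency.
lo = min(x for x, _ in data)
hi = max(x for x, _ in data) + 1e-9
nbins = 200
count = [0] * nbins
success = [0] * nbins
for x, ind in data:
    k = int((x - lo) / (hi - lo) * nbins)
    count[k] += 1
    success[k] += ind

g2_hat = [success[k] / count[k] if count[k] else 0.0 for k in range(nbins)]

# Averaging the projection recovers P(Y <= 1/3) = 1/3 (total probability).
mean_g = sum(success) / n
```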


Once again, the expectation of the random variable <math>P(Y \le 1/3|X)=g_2(X)</math> is equal to the (unconditional) probability, <math>E(P(Y \le 1/3|X))=P(Y \le 1/3),</math> namely,
: <math> \int_0^1 g_2 (f_2(y)) \, \mathrm{d}y = \frac13 \, . </math>


However, the Hilbert space approach treats <math>g_2</math> as an equivalence class of functions rather than an individual function. Measurability of <math>g_2</math> is ensured, but continuity (or even [[Riemann integrability]]) is not. The value <math>g_2(0.5)</math> is determined uniquely, since the point 0.5 is an atom of the distribution of <math>X</math>. Other values <math>x</math> are not atoms, thus, corresponding values <math>g_2(x)</math> are not determined uniquely. In this sense, "''the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.''" ([[Andrey Kolmogorov|Kolmogorov]]; quoted in <ref name="Pollard-5.5-122"/>).


Alternatively, the same function <math>g</math> (be it <math>g_1</math> or <math>g_2</math>) may be defined as the [[Radon-Nikodym derivative]]
: <math> g = \frac{ \mathrm{d}\nu }{ \mathrm{d}\mu } \, , </math>
where measures μ, ν are defined by
: <math> \begin{align}
  \mu (B) &= \mathbb{P} ( X \in B ) \, , \\
  \nu (B) &= \mathbb{P} ( X \in B, \, Y \le 1/3 )
\end{align} </math>
for all Borel sets <math> B \subset \mathbb R. </math> That is, μ is the (unconditional) distribution of <math>X</math>, while ν is one third of its conditional distribution,
: <math> \nu (B) = \mathbb{P} ( X \in B | Y \le 1/3 ) \mathbb{P} ( Y \le 1/3 ) = \frac13 \mathbb{P} ( X \in B | Y \le 1/3 ) \, . </math>


Both approaches (via the Hilbert space, and via the Radon-Nikodym derivative) treat <math>g</math> as an equivalence class of functions; two functions <math>g</math> and <math>g'</math> are treated as equivalent, if <math>g(X)=g'(X)</math> almost surely. Accordingly, the conditional probability <math>P(Y \le 1/3|X)</math> is treated as an equivalence class of random variables; as usual, two random variables are treated as equivalent if they are equal almost surely.


===Conditional expectation===
The conditional expectation <math>E(Y|X)</math> may be defined as the best predictor of <math>Y</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(Y-h(X))^2</math> on the class of all random variables of the form <math>h(X).</math>


In the case <math>f=f_1</math> the corresponding function <math>h=h_1</math> may be calculated explicitly,<ref group="details">
Proof:
<math> \begin{align}
& \mathbb{E} \, ( Y - h(X) )^2 = \int_0^1 ( y - h(f_1(y)) )^2 \, \mathrm{d}y \\
& = \int_0^{1/3} ( y - h(3y) )^2 \, \mathrm{d}y + \int_{1/3}^{2/3} ( y - h(1.5(1-y)) )^2 \, \mathrm{d}y + \int_{2/3}^1 ( y - h(0.5) )^2 \, \mathrm{d}y \\
& = \frac13 \int_0^1 \Big( \frac x3 - h(x) \Big)^2 \, \mathrm{d}x + \frac23 \int_{0.5}^1 \Big( 1 - \frac{2x}3 - h(x) \Big)^2 \, \mathrm{d}x + \int_{2/3}^1 ( y - h(0.5) )^2 \, \mathrm{d}y \, ;
\end{align} </math>
it remains to minimize pointwise in <math>h(x)</math> and in <math>h(0.5).</math>
</ref>
: <math> h_1(x) = \begin{cases}
  x/3 &\text{for } 0 < x < 0.5,\\
  5/6 &\text{for } x = 0.5,\\
  (2-x)/3 &\text{for } 0.5 < x < 1;
\end{cases} </math>
alternatively, the limiting procedure
: <math> h_1(x) = \lim_{\varepsilon\to0+} \mathbb{E} ( Y \, | \, x-\varepsilon \le X \le x+\varepsilon ) </math>
may be used,
giving the same result.


Thus, <math>E(Y|X)=h_1(X).</math> The expectation of this random variable is equal to the (unconditional) expectation, <math>E(E(Y|X))=E(Y),</math> namely,
: <math> \begin{align}
& \int_0^1 h_1(f_1(y)) \, \mathrm{d}y = \int_0^{1/6} \frac{3y}3 \, \mathrm{d}y + \\
& \quad + \int_{1/6}^{1/3} \frac{2-3y}3 \, \mathrm{d}y + \int_{1/3}^{2/3} \frac{ 2 - 1.5(1-y) }{ 3 } \, \mathrm{d}y + \int_{2/3}^1 \frac56 \, \mathrm{d}y = \frac12 \, ,
\end{align} </math>
which is an instance of the [[law of total expectation]] <math>E(E(Y|X))=E(Y).</math>
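The explicit predictor <math>h_1</math> can be sanity-checked by simulation. A Python sketch (ours, with <math>h_1</math> taken in the closed form <math>x/3,</math> <math>5/6,</math> <math>(2-x)/3</math> derived for this example): its average reproduces <math>E(Y)=1/2,</math> and perturbing it can only increase the mean square error.

```python
import random

random.seed(3)

def f1(y):
    # The piecewise-linear function f_1 of the example.
    if y <= 1/3:
        return 3 * y
    if y <= 2/3:
        return 1.5 * (1 - y)
    return 0.5

def h1(x):
    # Closed-form best predictor E(Y | X = x) for f = f_1.
    if x < 0.5:
        return x / 3
    if x == 0.5:
        return 5 / 6
    return (2 - x) / 3

n = 300_000
ys = [random.random() for _ in range(n)]
preds = [h1(f1(y)) for y in ys]

mean_h = sum(preds) / n                                     # ~ E(Y) = 1/2
mse_h1 = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
mse_alt = sum((y - (p + 0.05)) ** 2 for y, p in zip(ys, preds)) / n
```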


In the case <math>f=f_2</math> the corresponding function <math>h=h_2</math> probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically in the same way as <math>g_2</math> above, as the orthogonal projection in the Hilbert space. The law of total expectation holds, since the projection cannot change the scalar product with the constant 1 belonging to the subspace.


Alternatively, the same function <math>h</math> (be it <math>h_1</math> or <math>h_2</math>) may be defined as the [[Radon-Nikodym derivative]]
: <math> h = \frac{ \mathrm{d}\nu }{ \mathrm{d}\mu } \, , </math>
where measures μ, ν are defined by
: <math> \begin{align}
  \mu (B) &= \mathbb{P} ( X \in B ) \, , \\
  \nu (B) &= \mathbb{E} ( Y, \, X \in B )
\end{align} </math>
for all Borel sets <math> B \subset \mathbb R. </math> Here <math>E(Y;A)</math> is the restricted expectation, not to be confused with the conditional expectation <math>E(Y|A)=E(Y;A)/P(A).</math>


===Conditional distribution===
{{main|Disintegration theorem|Regular conditional probability}}
In the case <math>f=f_1</math> the conditional [[cumulative distribution function]] may be calculated explicitly, similarly to <math>g_1.</math> The limiting procedure gives
: <math> \begin{align}
& F_{Y|X=0.75} (y) = \mathbb{P} ( Y \le y | X = 0.75 ) = \\
& = \lim_{\varepsilon\to0+} \mathbb{P} ( Y \le y \, | \, 0.75-\varepsilon \le X \le 0.75+\varepsilon ) = \begin{cases}
  0 &\text{for } -\infty < y < 1/4,\\
  1/6 &\text{for } y = 1/4,\\
  1/3 &\text{for } 1/4 < y < 1/2,\\
  2/3 &\text{for } y = 1/2,\\
  1 &\text{for } 1/2 < y < \infty,
\end{cases} \end{align} </math>
which cannot be correct, since a cumulative distribution function must be [[right-continuous]]!


This paradoxical result is explained by measure theory as follows. For a given <math>y</math> the corresponding <math>F_{Y|X=x}(y)=P(Y \le y|X=x)</math> is well-defined (via the Hilbert space or the Radon-Nikodym derivative) as an equivalence class of functions (of <math>x</math>). Treated as a function of <math>y</math> for a given <math>x</math> it is ill-defined unless some additional input is provided. Namely, a function (of <math>x</math>) must be chosen within every (or at least almost every) equivalence class. A wrong choice leads to wrong conditional cumulative distribution functions.


A right choice can be made as follows. First, <math>F_{Y|X=x}(y)=P(Y \le y|X=x)</math> is considered for rational numbers <math>y</math> only. (Any other dense countable set may be used equally well.) Thus, only a countable set of equivalence classes is used; all choices of functions within these classes are mutually equivalent, and the corresponding function of rational <math>y</math> is well-defined (for almost every <math>x</math>). Second, the function is extended from rational numbers to real numbers by right continuity.


In general the conditional distribution is defined for almost all <math>x</math> (according to the distribution of <math>X</math>), but sometimes the result is continuous in <math>x</math>, in which case individual values are acceptable. In the considered example this is the case; the correct result for <math>x=0.75,</math>
: <math> \begin{align}
& F_{Y|X=0.75} (y) = \mathbb{P} ( Y \le y | X = 0.75 ) = \\
& = \begin{cases}
  0 &\text{for } -\infty < y < 1/4,\\
  1/3 &\text{for } 1/4 \le y < 1/2,\\
  1 &\text{for } 1/2 \le y < \infty
\end{cases} \end{align} </math>
shows that the conditional distribution of <math>Y</math> given <math>X=0.75</math> consists of two atoms, at 0.25 and 0.5, of probabilities 1/3 and 2/3 respectively.
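The two atoms and their weights show up directly in simulation. A Python sketch of ours: sample <math>Y</math> uniformly, keep only samples with <math>X=f_1(Y)</math> in a small window around 0.75, and see which preimage each surviving <math>Y</math> came from.

```python
import random

random.seed(4)

def f1(y):
    # The piecewise-linear function f_1 of the example.
    if y <= 1/3:
        return 3 * y
    if y <= 2/3:
        return 1.5 * (1 - y)
    return 0.5

eps = 0.01
near_quarter = 0   # samples from the branch through y = 0.25
near_half = 0      # samples from the branch through y = 0.5
for _ in range(500_000):
    y = random.random()
    if abs(f1(y) - 0.75) < eps:
        if abs(y - 0.25) < 0.05:
            near_quarter += 1
        else:
            near_half += 1

total = near_quarter + near_half
p25 = near_quarter / total   # ~ 1/3, the weight of the atom at 0.25
p50 = near_half / total      # ~ 2/3, the weight of the atom at 0.5
```

The steeper branch (slope 3) contributes less probability to the window than the flatter one (slope 1.5), which is exactly why the two congruent points carry unequal conditional weights.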


Similarly, the conditional distribution may be calculated for all <math>x</math> in <math>(0,0.5)</math> or <math>(0.5,1).</math>


The value <math>x=0.5</math> is an atom of the distribution of <math>X</math>, thus, the corresponding conditional distribution is well-defined and may be calculated by elementary means (the denominator does not vanish); the conditional distribution of <math>Y</math> given <math>X=0.5</math> is uniform on <math>(2/3,1).</math> Measure theory leads to the same result.


The mixture of all conditional distributions is the (unconditional) distribution of <math>Y</math>.


The conditional expectation <math>E(Y|X=x)</math> is nothing but the expectation with respect to the conditional distribution.


In the case <math>f=f_2</math> the corresponding <math>F_{Y|X=x}(y)=P(Y \le y|X=x)</math> probably cannot be calculated explicitly. For a given <math>y</math> it is well-defined (via the Hilbert space or the Radon-Nikodym derivative) as an equivalence class of functions (of <math>x</math>). The right choice of functions within these equivalence classes may be made as above; it leads to correct conditional cumulative distribution functions, thus, conditional distributions. In general, conditional distributions need not be [[discrete probability distribution|atomic]] or [[Absolutely continuous random variable|absolutely continuous]] (nor mixtures of both types). Probably, in the considered example they are [[Singular distribution|singular]] (like the [[Cantor distribution]]).


Once again, the mixture of all conditional distributions is the (unconditional) distribution, and the conditional expectation is the expectation with respect to the conditional distribution.

Revision as of 06:08, 29 June 2009


===Conditional probability===
{{main|Conditional probability}}
Given that <math>X=1,</math> the conditional probability of the event <math>Y=0</math> is <math>P(Y=0|X=1)=0.7.</math> More generally,
: <math> \mathbb{P} (Y=0|X=x) = \frac{ \binom7x }{ \binom{10}x } </math>
for <math>0 \le x \le 7;</math> otherwise (for <math>7 < x \le 10</math>), <math>P(Y=0|X=x)=0.</math> One may also treat the conditional probability as a random variable, — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{P} (Y=0|X) = \frac{ \binom7X }{ \binom{10}X } \, . </math>
The expectation of this random variable is equal to the (unconditional) probability, <math>E(P(Y=0|X))=P(Y=0),</math> namely,
: <math> \sum_{x=0}^7 \mathbb{P} (Y=0|X=x) \mathbb{P} (X=x) = \mathbb{P} (Y=0) = \frac18 \, , </math>
which is an instance of the law of total probability <math>E(P(A|X))=P(A).</math>

Thus, <math>P(Y=0|X=1)</math> may be treated as the value of the random variable <math>P(Y=0|X)</math> corresponding to <math>X=1.</math> On the other hand, <math>P(Y=0|X=1)</math> is well-defined irrespective of other possible values of <math>X</math>.
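The coin example is small enough to verify by exhaustive enumeration rather than formulas. A Python sketch (ours): enumerate all <math>2^{10}</math> equally likely toss sequences and compute the conditional probability and the total-probability identity exactly, with rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# All 2^10 equally likely toss sequences (1 = heads).
outcomes = list(product((0, 1), repeat=10))

def P(event):
    # Probability of an event, given as a predicate on a toss sequence.
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def X(w):
    return sum(w)        # heads in all 10 tosses

def Y(w):
    return sum(w[:3])    # heads in the first 3 tosses

# P(Y = 0 | X = 1) = P(Y = 0, X = 1) / P(X = 1)
p_cond = P(lambda w: Y(w) == 0 and X(w) == 1) / P(lambda w: X(w) == 1)

# Law of total probability: sum over x of P(Y = 0 | X = x) P(X = x) = P(Y = 0).
tot = Fraction(0)
for x in range(11):
    px = P(lambda w: X(w) == x)
    joint = P(lambda w: Y(w) == 0 and X(w) == x)
    tot += (joint / px) * px
```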

===Conditional expectation===
{{main|Conditional expectation}}
Given that <math>X=1,</math> the conditional expectation of the random variable <math>Y</math> is <math>E(Y|X=1)=0.3.</math> More generally,
: <math> \mathbb{E} (Y|X=x) = \frac3{10} x </math>
for <math>0 \le x \le 10.</math> (In this example it appears to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable, — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{E} (Y|X) = \frac3{10} X \, . </math>
The expectation of this random variable is equal to the (unconditional) expectation of <math>Y</math>, <math>E(E(Y|X))=E(Y),</math> namely,
: <math> \sum_{x=0}^{10} \mathbb{E} (Y|X=x) \mathbb{P} (X=x) = \mathbb{E} (Y) \, , </math>
  or simply
: <math> \mathbb{E} \Big( \frac{3X}{10} \Big) = \frac{ 3 \, \mathbb{E} (X) }{10} = \frac{3 \cdot 5}{10} = 1.5 \, , </math>
which is an instance of the law of total expectation <math>E(E(Y|X))=E(Y).</math>

The random variable <math>E(Y|X)</math> is the best predictor of <math>Y</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(Y-f(X))^2</math> on the class of all random variables of the form <math>f(X).</math> This class of random variables remains intact if <math>X</math> is replaced, say, with <math>2X.</math> Thus, <math>E(Y|2X)=E(Y|X).</math> It does not mean that <math>E(Y|2X)=0.3 \times 2X;</math> rather, <math>E(Y|2X)=0.15 \times 2X=0.3X.</math> In particular, <math>E(Y|2X)=E(Y|X).</math> More generally, <math>E(Y|g(X))=E(Y|X)</math> for every function <math>g</math> that is one-to-one on the set of all possible values of <math>X</math>. The values of <math>E(Y|X)</math> are irrelevant; what matters is the partition (denote it <math>\alpha_X</math>)
: <math> \Omega = \{ X=x_1 \} \uplus \{ X=x_2 \} \uplus \dots </math>
of the sample space Ω into disjoint sets <math>\{X=x_n\}.</math> (Here <math>x_1, x_2, \dots</math> are all possible values of <math>X</math>.) Given an arbitrary partition α of Ω, one may define the random variable <math>E(Y|\alpha).</math> Still, <math>E(E(Y|\alpha))=E(Y).</math>

Conditional probability may be treated as a special case of conditional expectation. Namely, <math>P(A|X)=E(I_A|X)</math> if <math>I_A</math> is the indicator of <math>A</math>. Therefore the conditional probability also depends on the partition <math>\alpha_X</math> generated by <math>X</math> rather than on <math>X</math> itself; <math>P(A|g(X))=P(A|X)=P(A|\alpha),</math> <math>\alpha=\alpha_X=\alpha_{g(X)}.</math>

On the other hand, conditioning on an event <math>B</math> is well-defined, provided that <math>P(B) \ne 0,</math> irrespective of any partition that may contain <math>B</math> as one of several parts.
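The invariance under one-to-one functions can also be checked exactly by enumeration. A Python sketch (ours): group the <math>2^{10}</math> outcomes by the value of <math>g(X)</math>; for <math>g(x)=x</math> and <math>g(x)=2x</math> the groups — and hence the conditional expectations — coincide.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product((0, 1), repeat=10))  # all toss sequences, 1 = heads

def cond_exp_table(g):
    # E(Y | g(X) = t) for every attainable value t of g(X),
    # computed by grouping the equally likely outcomes.
    sums = {}
    counts = {}
    for w in outcomes:
        t = g(sum(w))                     # g applied to X = total heads
        sums[t] = sums.get(t, 0) + sum(w[:3])   # accumulate Y
        counts[t] = counts.get(t, 0) + 1
    return {t: Fraction(sums[t], counts[t]) for t in sums}

e_given_x = cond_exp_table(lambda x: x)       # E(Y | X = x) = 0.3 x
e_given_2x = cond_exp_table(lambda x: 2 * x)  # same partition, same values
```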

===Conditional distribution===
{{main|Conditional probability distribution}}
Given <math>X=x,</math> the conditional distribution of <math>Y</math> is
: <math> \mathbb{P} ( Y=y | X=x ) = \frac{ \binom3y \binom7{x-y} }{ \binom{10}x } = \frac{ \binom xy \binom{10-x}{3-y} }{ \binom{10}3 } </math>
for <math>0 \le y \le \min(3,x).</math> It is the [[hypergeometric distribution]] <math>\mathrm{H}(x;3,7),</math> or equivalently, <math>\mathrm{H}(3;x,10-x).</math> The corresponding expectation <math>0.3x,</math> obtained from the general formula <math>n \tfrac{R}{R+W}</math> for <math>\mathrm{H}(n;R,W),</math> is nothing but the conditional expectation <math>E(Y|X=x)=0.3x.</math>

Treating <math>\mathbb{P}(Y=\cdot|X=x)</math> as a random distribution (a random vector in the four-dimensional space of all measures on <math>\{0,1,2,3\}</math>) one may take its expectation, getting the unconditional distribution of <math>Y</math>, — the [[binomial distribution]] <math>\mathrm{Bin}(3,0.5).</math> This fact amounts to the equality
: <math> \sum_{x=0}^{10} \mathbb{P} ( Y=y | X=x ) \mathbb{P} (X=x) = \mathbb{P} (Y=y) = \frac1{2^3} \binom3y </math>
for <math>y=0,1,2,3;</math> just the law of total probability.
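The hypergeometric formula can be confirmed against brute-force enumeration. A Python sketch (ours), comparing <math>P(Y=y|X=x)</math> obtained from the <math>2^{10}</math> outcomes with the binomial-coefficient expression, exactly:

```python
from fractions import Fraction
from itertools import product
from math import comb

outcomes = list(product((0, 1), repeat=10))  # all toss sequences, 1 = heads

def p_joint(y, x):
    # P(Y = y, X = x) by direct enumeration.
    n = sum(1 for w in outcomes if sum(w[:3]) == y and sum(w) == x)
    return Fraction(n, len(outcomes))

ok = True
for x in range(11):
    px = Fraction(comb(10, x), 2 ** 10)
    for y in range(0, min(3, x) + 1):
        lhs = p_joint(y, x) / px                        # P(Y = y | X = x)
        rhs = Fraction(comb(3, y) * comb(7, x - y), comb(10, x))
        ok = ok and lhs == rhs
```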

==Conditioning on the level of densities==
{{main|Probability density function|Conditional probability distribution}}
'''Example.''' A point of the sphere <math>x^2+y^2+z^2=1</math> is chosen at random according to the uniform distribution on the sphere [1] [2]. The random variables <math>X</math>, <math>Y</math>, <math>Z</math> are the coordinates of the random point. The joint density of <math>X</math>, <math>Y</math>, <math>Z</math> does not exist (since the sphere is of zero volume), but the joint density of <math>X</math>, <math>Y</math> exists,
: <math> f_{X,Y} (x,y) = \begin{cases}
  \frac1{ 2\pi \sqrt{ 1-x^2-y^2 } } &\text{if } x^2+y^2<1,\\
  0 &\text{otherwise}.
\end{cases} </math>
(The density is non-constant because of a non-constant angle between the sphere and the plane[3].) The density of <math>X</math> may be calculated by integration,
: <math> f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y} (x,y) \, \mathrm{d}y = \int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}} \frac{ \mathrm{d}y }{ 2\pi \sqrt{ 1-x^2-y^2 } } \, ; </math>
surprisingly, the result does not depend on <math>x</math> in <math>(-1,1),</math>
: <math> f_X(x) = \begin{cases}
  0.5 &\text{for } -1<x<1,\\
  0 &\text{otherwise},
\end{cases} </math>
which means that <math>X</math> is distributed uniformly on <math>(-1,1).</math> The same holds for <math>Y</math> and <math>Z</math> (and in fact, for <math>aX+bY+cZ</math> whenever <math>a^2+b^2+c^2=1</math>).
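The surprising uniform marginal is easy to probe by simulation. A Python sketch (ours): sample uniform points on the unit sphere and compare the empirical distribution of the first coordinate with the uniform law on <math>(-1,1).</math>

```python
import math
import random

random.seed(6)

def sphere_point():
    # Uniform point on the unit sphere via a normalized Gaussian vector.
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        r = math.sqrt(sum(t * t for t in v))
        if r > 1e-12:
            return [t / r for t in v]

n = 200_000
xs = [sphere_point()[0] for _ in range(n)]

# If X is uniform on (-1, 1) then P(X <= t) = (t + 1) / 2; e.g. 0.65 at t = 0.3.
frac_below = sum(x <= 0.3 for x in xs) / n
mean_x = sum(xs) / n
```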

===Conditional probability===
====Calculation====
Given that <math>X=0.5,</math> the conditional probability of the event <math>Y \le 0.75</math> is the integral of the conditional density,
: <math> \begin{align}
& f_{Y|X=0.5} (y) = \frac1{ \pi \sqrt{ 0.75-y^2 } } \quad \text{for } -\sqrt{0.75}<y<\sqrt{0.75} \, ; \\
& \mathbb{P} ( Y \le 0.75 | X = 0.5 ) = \int_{-\infty}^{0.75} f_{Y|X=0.5} (y) \, \mathrm{d}y = \int_{-\sqrt{0.75}}^{0.75} \frac{ \mathrm{d}y }{ \pi \sqrt{ 0.75-y^2 } } = \frac56 \, .
\end{align} </math>
More generally,
: <math> \mathbb{P} ( Y \le y | X = x ) = \frac12 + \frac1{\pi} \arcsin \frac{y}{\sqrt{1-x^2}} </math>
for all <math>x</math> and <math>y</math> such that <math>-1<x<1</math> (otherwise the denominator vanishes) and <math>\textstyle -\sqrt{1-x^2}<y<\sqrt{1-x^2}</math> (otherwise the conditional probability degenerates to 0 or 1). One may also treat the conditional probability as a random variable, — a function of the random variable <math>X</math>, namely,
: <math> \mathbb{P} ( Y \le y | X ) = \frac12 + \frac1{\pi} \arcsin \frac{y}{\sqrt{1-X^2}} \, . </math>
The expectation of this random variable is equal to the (unconditional) probability,
: <math> \mathbb{E} ( \mathbb{P} ( Y \le y | X ) ) = \mathbb{P} ( Y \le y ) \, , </math>
which is an instance of the law of total probability <math>E(P(A|X))=P(A).</math>

====Interpretation====
The conditional probability <math>P(Y \le 0.75|X=0.5)</math> cannot be interpreted as <math>P(Y \le 0.75, X=0.5)/P(X=0.5),</math> since the latter gives 0/0. Accordingly, <math>P(Y \le 0.75|X=0.5)</math> cannot be interpreted via empirical frequencies, since the exact value <math>X=0.5</math> has no chance to appear at random, not even once during an infinite sequence of independent trials.

The conditional probability can be interpreted as a limit,
: <math> \mathbb{P} ( Y \le 0.75 | X = 0.5 ) = \lim_{\varepsilon\to0+} \mathbb{P} ( Y \le 0.75 \, | \, 0.5-\varepsilon < X < 0.5+\varepsilon ) \, . </math>

===Conditional expectation===

The conditional expectation <math>E(Y|X=0.5)</math> is of little interest; it vanishes just by symmetry. It is more interesting to calculate <math>E(|Z| \,|\, X=0.5)</math> treating <math>|Z|</math> as a function of <math>X</math>, <math>Y</math>:

:<math> |Z| = h(X,Y) = \sqrt{1-X^2-Y^2}; </math>
:<math> E(|Z| \,|\, X=0.5) = \int_{-\infty}^{+\infty} h(0.5,y) f_{Y|X=0.5}(y) \, dy = \int_{-\sqrt{0.75}}^{+\sqrt{0.75}} \sqrt{0.75-y^2} \cdot \frac{dy}{\pi\sqrt{0.75-y^2}} = \frac2\pi \sqrt{0.75}. </math>

More generally,

:<math> E(|Z| \,|\, X=x) = \frac2\pi \sqrt{1-x^2} </math>

for <math>-1<x<1.</math> One may also treat the conditional expectation as a random variable, — a function of the random variable <math>X</math>, namely,

:<math> E(|Z| \,|\, X) = \frac2\pi \sqrt{1-X^2}. </math>

The expectation of this random variable is equal to the (unconditional) expectation of <math>|Z|,</math>

:<math> E ( E(|Z| \,|\, X) ) = E|Z|, </math>

namely,

:<math> \int_{-1}^{+1} E(|Z| \,|\, X=x) f_X(x) \, dx = \int_{-1}^{+1} \frac2\pi \sqrt{1-x^2} \cdot \frac{dx}2 = \frac12, </math>

which is an instance of the law of total expectation <math>E ( E(Y|X) ) = E(Y).</math>

The random variable <math>E(|Z| \,|\, X)</math> is the best predictor of <math>|Z|</math> given <math>X</math>. That is, it minimizes the mean square error <math>E ( |Z| - f(X) )^2</math> on the class of all random variables of the form <math>f(X).</math> Similarly to the discrete case, <math>E(|Z| \,|\, g(X)) = E(|Z| \,|\, X)</math> for every measurable function <math>g</math> that is one-to-one on <math>(-1,1).</math>
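The instance of the law of total expectation above can be checked numerically. A Python sketch (an illustration added to this draft) integrates the closed form <math>E(|Z|\,|\,X=x) = (2/\pi)\sqrt{1-x^2}</math> against <math>f_X = 1/2</math> by the midpoint rule:

```python
import math

def cond_exp_absz(x):
    """Closed form E(|Z| | X = x) = (2/pi) sqrt(1 - x^2)."""
    return 2 / math.pi * math.sqrt(1 - x*x)

assert abs(cond_exp_absz(0.5) - 2 / math.pi * math.sqrt(0.75)) < 1e-15

# Law of total expectation: integrate E(|Z| | X=x) f_X(x) over (-1,1)
# with f_X = 1/2, by the midpoint rule; the result must be E|Z| = 1/2
# (|Z| has the same distribution as |U| for U uniform on (-1,1)).
n = 100_000
h = 2.0 / n
total = sum(cond_exp_absz(-1 + (k + 0.5) * h) * 0.5 * h for k in range(n))
assert abs(total - 0.5) < 1e-6
```

The midpoint rule handles the square-root behavior at the endpoints well enough for the stated tolerance.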

===Conditional distribution===

Given <math>X=x,</math> the conditional distribution of <math>Y</math>, given by the density <math>f_{Y|X=x}(y),</math> is the (rescaled) arcsin distribution; its cumulative distribution function is

:<math> F_{Y|X=x}(y) = P(Y\le y|X=x) = \frac12 + \frac1\pi \arcsin \frac{y}{\sqrt{1-x^2}} </math>

for all <math>x</math> and <math>y</math> such that <math>x^2+y^2<1.</math> The corresponding expectation of <math>h(x,Y)</math> is nothing but the conditional expectation <math>E(h(X,Y)|X=x).</math> The mixture of these conditional distributions, taken for all <math>x</math> (according to the distribution of <math>X</math>) is the unconditional distribution of <math>Y</math>. This fact amounts to the equalities

:<math> \int_{-1}^{+1} f_{Y|X=x}(y) f_X(x) \, dx = f_Y(y), </math>
:<math> \int_{-1}^{+1} F_{Y|X=x}(y) f_X(x) \, dx = F_Y(y), </math>

the latter being the instance of the law of total probability mentioned above.

==What conditioning is not==

On the discrete level, conditioning is possible only if the condition is of nonzero probability (one cannot divide by zero). On the level of densities, conditioning on <math>X=x</math> is possible even though <math>P(X=x)=0.</math> This success may create the illusion that conditioning is always possible. Regretfully, it is not, for several reasons presented below.

===Geometric intuition: caution===

The result <math>P(Y\le0.75|X=0.5)=5/6,</math> mentioned above, is geometrically evident in the following sense. The points <math>(x,y,z)</math> of the sphere <math>x^2+y^2+z^2=1,</math> satisfying the condition <math>x=0.5,</math> are a circle <math>y^2+z^2=0.75</math> of radius <math>\sqrt{0.75}</math> on the plane <math>x=0.5.</math> The inequality <math>y\le0.75</math> holds on an arc. The length of the arc is 5/6 of the length of the circle, which is why the conditional probability is equal to 5/6.

This successful geometric explanation may create the illusion that the following question is trivial.

A point of a given sphere is chosen at random (uniformly). Given that the point lies on a given plane, what is its conditional distribution?

It may seem evident that the conditional distribution must be uniform on the given circle (the intersection of the given sphere and the given plane). Sometimes it really is, but in general it is not. Especially, <math>Z</math> is distributed uniformly on <math>(-1,+1)</math> and independent of the ratio <math>Y/X,</math> thus, <math>P(Z\le0.5|Y/X)=0.75.</math> On the other hand, the inequality <math>z\le0.5</math> holds on an arc of the circle <math>x^2+y^2+z^2=1,</math> <math>y=cx</math> (for any given <math>c</math>). The length of the arc is 2/3 of the length of the circle. However, the conditional probability is 3/4, not 2/3. This is a manifestation of the classical Borel paradox [4] [5].

Appeals to symmetry can be misleading if not formalized as invariance arguments.

—Pollard[6]

Another example. A random rotation of the three-dimensional space is a rotation by a random angle around a random axis. Geometric intuition suggests that the angle is independent of the axis and distributed uniformly. However, the latter is wrong; small values of the angle are less probable.
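The Borel paradox above can be reproduced numerically: conditioning on a small ratio <math>|Y/X|</math> and conditioning on a thin band <math>|Y|<\delta</math> both shrink to the same great circle <math>y=0</math>, yet lead to different limits. A Python sketch (an illustration added to this draft; <math>\delta=0.05</math> is an arbitrary choice):

```python
import random, math

random.seed(3)

def sphere_point():
    """Uniform point on the unit sphere via normalized Gaussian triples."""
    while True:
        u, v, w = (random.gauss(0.0, 1.0) for _ in range(3))
        r = math.sqrt(u*u + v*v + w*w)
        if r > 1e-12:
            return u/r, v/r, w/r

# Two shrinking neighbourhoods of the same great circle y = 0:
#   A: |Y/X| < delta  (fixing the ratio, as in P(Z <= 0.5 | Y/X))
#   B: |Y|   < delta  (a thin band around the circle)
delta, n = 0.05, 400_000
hits_a = hits_b = good_a = good_b = 0
for _ in range(n):
    x, y, z = sphere_point()
    if x != 0 and abs(y / x) < delta:
        hits_a += 1
        good_a += (z <= 0.5)
    if abs(y) < delta:
        hits_b += 1
        good_b += (z <= 0.5)

# Conditioning via the ratio gives 3/4; via the band, 2/3.
assert abs(good_a / hits_a - 0.75) < 0.03
assert abs(good_b / hits_b - 2/3) < 0.03
```

The two answers differ by about 0.083, far more than the statistical error at these sample sizes, so the discrepancy is not noise.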

===The limiting procedure===

Given an event <math>B</math> of zero probability, the formula <math>\textstyle P(A|B) = P(A \cap B) / P(B)</math> is useless, however, one can try <math>\textstyle P(A|B) = \lim_{n\to\infty} P(A \cap B_n) / P(B_n)</math> for an appropriate sequence of events <math>B_n</math> of nonzero probability such that <math>B_n \downarrow B</math> (that is, <math>\textstyle B_1 \supset B_2 \supset \dots</math> and <math>\textstyle B_1 \cap B_2 \cap \dots = B</math>). One example is given above. Two more examples are Brownian bridge and Brownian excursion.

In the latter two examples the law of total probability is irrelevant, since only a single event (the condition) is given. In contrast, in the example above the law of total probability applies, since the event <math>X=0.5</math> is included into a family of events <math>X=x</math> where <math>x</math> runs over <math>(-1,1),</math> and these events are a partition of the probability space.

In order to avoid paradoxes (such as the Borel paradox), the following important distinction should be taken into account. If a given event is of nonzero probability then conditioning on it is well-defined (irrespective of any other events), as was noted above. In contrast, if the given event is of zero probability then conditioning on it is ill-defined unless some additional input is provided. Wrong choice of this additional input leads to wrong conditional probabilities (expectations, distributions). In this sense, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." (Kolmogorov; quoted in [6]).

The additional input may be (a) a symmetry (invariance group); (b) a sequence of events <math>B_n</math> such that <math>B_n \downarrow B;</math> (c) a partition containing the given event. Measure-theoretic conditioning (below) investigates case (c), discloses its relation to (b) in general and to (a) when applicable.

Some events of zero probability are beyond the reach of conditioning. An example: let <math>X_n</math> be independent random variables distributed uniformly on <math>(0,1),</math> and <math>B</math> the event "<math>X_n \to 0</math> as <math>n\to\infty</math>"; what about <math>P(X_n<0.5|B)?</math> Does it tend to 1, or not? Another example: let <math>X</math> be a random variable distributed uniformly on <math>(0,1),</math> and <math>B</math> the event "<math>X</math> is a rational number"; what about <math>P(X=1/n|B)?</math> The only answer is that, once again,

the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.

—Kolmogorov, quoted in [6]

==Conditioning on the level of measure theory==

''For more information, see:'' [[Conditional expectation]].

'''Example.''' Let <math>Y</math> be a random variable distributed uniformly on <math>(0,1),</math> and <math>X=f(Y)</math> where <math>f</math> is a given function. Two cases are treated below: <math>X=f_1(Y)</math> and <math>X=f_2(Y),</math> where <math>f_1</math> is the continuous piecewise-linear function

:<math> f_1(y) = \begin{cases} 3y & \text{for } 0 \le y \le \frac13, \\ 1.5(1-y) & \text{for } \frac13 \le y \le \frac23, \\ 0.5 & \text{for } \frac23 \le y \le 1, \end{cases} </math>

and <math>f_2</math> is the Weierstrass function.

===Geometric intuition: caution===

Given <math>X=0.75,</math> two values of <math>Y</math> are possible, 0.25 and 0.5. It may seem evident that both values are of conditional probability 0.5 just because one point is congruent to another point. However, this is an illusion; see below.

===Conditional probability===

The conditional probability <math>P(Y\le1/3|X)</math> may be defined as the best predictor of the indicator

:<math> I = \begin{cases} 1 & \text{if } Y \le \frac13, \\ 0 & \text{otherwise,} \end{cases} </math>

given <math>X</math>. That is, it minimizes the mean square error <math>E ( I - g(X) )^2</math> on the class of all random variables of the form <math>g(X).</math>

In the case <math>f=f_1</math> the corresponding function <math>g=g_1</math> may be calculated explicitly,[details 1]

:<math> g_1(x) = \begin{cases} 1 & \text{for } 0<x<0.5, \\ 0 & \text{for } x=0.5, \\ \frac13 & \text{for } 0.5<x<1. \end{cases} </math>

Alternatively, the limiting procedure may be used,

:<math> g_1(x) = \lim_{\varepsilon\to0+} P ( Y\le1/3 \,|\, x-\varepsilon \le X \le x+\varepsilon ), </math>

giving the same result.

Thus, <math>P(Y\le1/3|X) = g_1(X).</math> The expectation of this random variable is equal to the (unconditional) probability, <math>E ( P(Y\le1/3|X) ) = P(Y\le1/3),</math> namely,

:<math> 1 \cdot P(X<0.5) + 0 \cdot P(X=0.5) + \frac13 \cdot P(X>0.5) = 1 \cdot \frac16 + 0 \cdot \frac13 + \frac13 \cdot \frac12 = \frac13, </math>

which is an instance of the law of total probability <math>E ( P(A|X) ) = P(A).</math>
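Both <math>g_1</math> and the law of total probability can be checked by simulation. A Python sketch (an illustration added to this draft; the window width <math>\varepsilon=0.01</math> is an arbitrary choice):

```python
import random

random.seed(4)

def f1(y):
    """The piecewise-linear function f1 from the example."""
    if y <= 1/3:
        return 3 * y
    if y <= 2/3:
        return 1.5 * (1 - y)
    return 0.5

def g1(x):
    """Claimed best predictor g1(x) = P(Y <= 1/3 | X = x)."""
    if x == 0.5:
        return 0.0
    return 1.0 if x < 0.5 else 1/3

# Limiting procedure: P(Y <= 1/3 | x - eps <= X <= x + eps) at x = 0.75.
eps, hits, good = 0.01, 0, 0
while hits < 20_000:
    y = random.random()
    if abs(f1(y) - 0.75) <= eps:
        hits += 1
        good += (y <= 1/3)
assert abs(good / hits - g1(0.75)) < 0.02

# Law of total probability: E g1(X) = P(Y <= 1/3) = 1/3.
n = 200_000
mean_g1 = sum(g1(f1(random.random())) for _ in range(n)) / n
assert abs(mean_g1 - 1/3) < 0.01
```

Near <math>x=0.75</math> the window catches twice as much mass from the downward branch of <math>f_1</math> as from the upward one, which is exactly the 1/3 in <math>g_1</math>.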

In the case <math>f=f_2</math> the corresponding function <math>g=g_2</math> probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically. Indeed, the space <math>L_2(\Omega)</math> of all square integrable random variables is a Hilbert space; the indicator <math>I</math> is a vector of this space; and random variables of the form <math>g(X)</math> are a (closed, linear) subspace. The orthogonal projection of this vector to this subspace is well-defined. It can be computed numerically, using finite-dimensional approximations to the infinite-dimensional Hilbert space.

Once again, the expectation of the random variable <math>P(Y\le1/3|X) = g_2(X)</math> is equal to the (unconditional) probability, <math>E ( P(Y\le1/3|X) ) = P(Y\le1/3),</math> namely,

:<math> \int_0^1 g_2 (f_2(y)) \, dy = \frac13. </math>

However, the Hilbert space approach treats <math>g_2</math> as an equivalence class of functions rather than an individual function. Measurability of <math>g_2</math> is ensured, but continuity (or even Riemann integrability) is not. The value <math>g_2(0.5)</math> is determined uniquely, since the point 0.5 is an atom of the distribution of <math>X</math>. Other values <math>x</math> are not atoms, thus, corresponding values <math>g_2(x)</math> are not determined uniquely. Once again, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." (Kolmogorov; quoted in [6]).

Alternatively, the same function <math>g</math> (be it <math>g_1</math> or <math>g_2</math>) may be defined as the Radon-Nikodym derivative

:<math> g = \frac{d\nu}{d\mu}, </math>

where measures μ, ν are defined by

:<math> \mu(B) = P(X \in B), </math>
:<math> \nu(B) = P ( X \in B,\ Y \le \tfrac13 ) </math>

for all Borel sets <math>B \subset \mathbb{R}.</math> That is, μ is the (unconditional) distribution of <math>X</math>, while ν is one third of its conditional distribution,

:<math> \nu(B) = P ( X \in B \,|\, Y \le \tfrac13 ) \, P ( Y \le \tfrac13 ) = \tfrac13 P ( X \in B \,|\, Y \le \tfrac13 ). </math>

Both approaches (via the Hilbert space, and via the Radon-Nikodym derivative) treat <math>g</math> as an equivalence class of functions; two functions <math>g</math> and <math>g'</math> are treated as equivalent, if <math>g(X) = g'(X)</math> almost surely. Accordingly, the conditional probability <math>P(Y\le1/3|X)</math> is treated as an equivalence class of random variables; as usual, two random variables are treated as equivalent if they are equal almost surely.

===Conditional expectation===

The conditional expectation <math>E(Y|X)</math> may be defined as the best predictor of <math>Y</math> given <math>X</math>. That is, it minimizes the mean square error <math>E ( Y - h(X) )^2</math> on the class of all random variables of the form <math>h(X).</math>

In the case <math>f=f_1</math> the corresponding function <math>h=h_1</math> may be calculated explicitly,[details 2]

:<math> h_1(x) = \begin{cases} \frac{x}{3} & \text{for } 0<x<0.5, \\ \frac56 & \text{for } x=0.5, \\ \frac23 - \frac{x}{3} & \text{for } 0.5<x<1. \end{cases} </math>

Alternatively, the limiting procedure may be used,

:<math> h_1(x) = \lim_{\varepsilon\to0+} E ( Y \,|\, x-\varepsilon \le X \le x+\varepsilon ), </math>

giving the same result.

Thus, <math>E(Y|X) = h_1(X).</math> The expectation of this random variable is equal to the (unconditional) expectation, <math>E ( E(Y|X) ) = E(Y),</math> namely,

:<math> \int_0^1 h_1 (f_1(y)) \, dy = \frac12, </math>

which is an instance of the law of total expectation <math>E ( E(Y|X) ) = E(Y).</math>
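Similarly, <math>h_1</math> can be checked at the atom <math>x=0.5</math>, where elementary conditioning applies, together with the law of total expectation. A Python sketch (an illustration added to this draft):

```python
import random

random.seed(5)

def f1(y):
    """The piecewise-linear function f1 from the example."""
    if y <= 1/3:
        return 3 * y
    if y <= 2/3:
        return 1.5 * (1 - y)
    return 0.5

def h1(x):
    """Claimed best predictor h1(x) = E(Y | X = x)."""
    if x == 0.5:
        return 5/6
    return x/3 if x < 0.5 else 2/3 - x/3

# Atom at x = 0.5: elementary conditioning; Y is uniform on (2/3, 1),
# so the conditional mean must be 5/6.
hits, total = 0, 0.0
for _ in range(300_000):
    y = random.random()
    if f1(y) == 0.5:
        hits += 1
        total += y
assert abs(total / hits - h1(0.5)) < 0.01

# Law of total expectation: E h1(X) = E Y = 1/2.
n = 200_000
mean_h1 = sum(h1(f1(random.random())) for _ in range(n)) / n
assert abs(mean_h1 - 0.5) < 0.01
```

The exact float comparison `f1(y) == 0.5` works here because the third branch of `f1` returns the literal 0.5.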

In the case <math>f=f_2</math> the corresponding function <math>h=h_2</math> probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically in the same way as above, — as the orthogonal projection in the Hilbert space. The law of total expectation holds, since the projection cannot change the scalar product with the constant 1 belonging to the subspace.

Alternatively, the same function <math>h</math> (be it <math>h_1</math> or <math>h_2</math>) may be defined as the Radon-Nikodym derivative

:<math> h = \frac{d\nu}{d\mu}, </math>

where measures μ, ν are defined by

:<math> \mu(B) = P(X \in B), </math>
:<math> \nu(B) = E ( Y;\ X \in B ) </math>

for all Borel sets <math>B \subset \mathbb{R}.</math> Here <math>E(Y; A)</math> is the restricted expectation, not to be confused with the conditional expectation <math>E(Y|A) = E(Y; A) / P(A).</math>

===Conditional distribution===

''For more information, see:'' [[Disintegration theorem]] and [[Regular conditional probability]].

In the case <math>f=f_1</math> the conditional cumulative distribution function may be calculated explicitly, similarly to <math>g_1.</math> The limiting procedure gives

:<math> F_{Y|X=0.75}(y) = P ( Y \le y \,|\, X = 0.75 ) = \lim_{\varepsilon\to0+} P ( Y \le y \,|\, 0.75-\varepsilon \le X \le 0.75+\varepsilon ) = \begin{cases} 0 & \text{for } -\infty<y<\frac14, \\ \frac16 & \text{for } y=\frac14, \\ \frac13 & \text{for } \frac14<y<\frac12, \\ \frac23 & \text{for } y=\frac12, \\ 1 & \text{for } \frac12<y<\infty, \end{cases} </math>

which cannot be correct, since a cumulative distribution function must be right-continuous!

This paradoxical result is explained by measure theory as follows. For a given <math>y</math> the corresponding <math>F_{Y|X=x}(y) = P(Y\le y|X=x)</math> is well-defined (via the Hilbert space or the Radon-Nikodym derivative) as an equivalence class of functions (of <math>x</math>). Treated as a function of <math>y</math> for a given <math>x</math> it is ill-defined unless some additional input is provided. Namely, a function (of <math>x</math>) must be chosen within every (or at least almost every) equivalence class. Wrong choice leads to wrong conditional cumulative distribution functions.

A right choice can be made as follows. First, <math>F_{Y|X=x}(y) = P(Y\le y|X=x)</math> is considered for rational <math>y</math> only. (Any other dense countable set may be used equally well.) Thus, only a countable set of equivalence classes is used; all choices of functions within these classes are mutually equivalent, and the corresponding function of rational <math>y</math> is well-defined (for almost every <math>x</math>). Second, the function is extended from rational numbers to real numbers by right continuity.

In general the conditional distribution is defined for almost all <math>x</math> (according to the distribution of <math>X</math>), but sometimes the result is continuous in <math>x</math>, in which case individual values are acceptable. In the considered example this is the case; the correct result for <math>x=0.75,</math>

:<math> F_{Y|X=0.75}(y) = P ( Y \le y \,|\, X = 0.75 ) = \begin{cases} 0 & \text{for } -\infty<y<\frac14, \\ \frac13 & \text{for } \frac14 \le y < \frac12, \\ 1 & \text{for } \frac12 \le y < \infty, \end{cases} </math>

shows that the conditional distribution of <math>Y</math> given <math>X=0.75</math> consists of two atoms, at 0.25 and 0.5, of probabilities 1/3 and 2/3 respectively.
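The two atoms can be seen in simulation by conditioning on a small window around <math>x=0.75</math>. A Python sketch (an illustration added to this draft):

```python
import random

random.seed(6)

def f1(y):
    """The piecewise-linear function f1 from the example."""
    if y <= 1/3:
        return 3 * y
    if y <= 2/3:
        return 1.5 * (1 - y)
    return 0.5

# Given X near 0.75, Y concentrates near the two atoms 0.25 and 0.5,
# with conditional probabilities 1/3 and 2/3.
eps, hits, near_quarter, near_half = 0.01, 0, 0, 0
while hits < 20_000:
    y = random.random()
    if abs(f1(y) - 0.75) <= eps:
        hits += 1
        near_quarter += (abs(y - 0.25) < 0.1)
        near_half += (abs(y - 0.5) < 0.1)

assert near_quarter + near_half == hits   # every accepted y is near an atom
assert abs(near_quarter / hits - 1/3) < 0.02
assert abs(near_half / hits - 2/3) < 0.02
```

As the window shrinks, the clusters tighten around 0.25 and 0.5 while the proportions 1/3 and 2/3 persist.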

Similarly, the conditional distribution may be calculated for all <math>x</math> in <math>(0,0.5)</math> or <math>(0.5,1).</math>

The value <math>x=0.5</math> is an atom of the distribution of <math>X</math>, thus, the corresponding conditional distribution is well-defined and may be calculated by elementary means (the denominator does not vanish); the conditional distribution of <math>Y</math> given <math>X=0.5</math> is uniform on <math>(2/3,1).</math> Measure theory leads to the same result.

The mixture of all conditional distributions is the (unconditional) distribution of <math>Y</math>.

The conditional expectation <math>E(Y|X=x)</math> is nothing but the expectation with respect to the conditional distribution.

In the case <math>f=f_2</math> the corresponding <math>F_{Y|X=x}(y) = P(Y\le y|X=x)</math> probably cannot be calculated explicitly. For a given <math>y</math> it is well-defined (via the Hilbert space or the Radon-Nikodym derivative) as an equivalence class of functions (of <math>x</math>). The right choice of functions within these equivalence classes may be made as above; it leads to correct conditional cumulative distribution functions, thus, conditional distributions. In general, conditional distributions need not be atomic or absolutely continuous (nor mixtures of both types). Probably, in the considered example they are singular (like the Cantor distribution).

Once again, the mixture of all conditional distributions is the (unconditional) distribution, and the conditional expectation is the expectation with respect to the conditional distribution.

==Technical details==

  1. Proof:
:<math> E ( I - g(X) )^2 = \int_0^{1/3} ( 1 - g(3y) )^2 \, dy + \int_{1/3}^{2/3} g^2 ( 1.5(1-y) ) \, dy + \int_{2/3}^1 g^2(0.5) \, dy </math>
:<math> = \frac13 \int_0^{0.5} ( 1 - g(x) )^2 \, dx + \frac13 \int_{0.5}^1 \Big( ( 1 - g(x) )^2 + 2 g^2(x) \Big) \, dx + \frac13 g^2(0.5); </math>
:it remains to note that <math>(1-a)^2+2a^2</math> is minimal at <math>a = \frac13.</math>
  2. Proof:
:<math> E ( Y - h(X) )^2 = \int_0^1 ( y - h(f_1(y)) )^2 \, dy; </math>
:it remains to note that <math>\frac13 ( \frac{x}3 - a )^2 + \frac23 ( 1 - \frac{2x}3 - a )^2</math> is minimal at <math>a = \frac23 - \frac{x}3,</math> and <math>\int_{2/3}^1 (y-a)^2 \, dy</math> is minimal at <math>a = \frac56.</math>

==See also==

==Notes==

  1. n-sphere#Generating points on the surface of the n-ball
  2. wikibooks: Uniform Spherical Distribution
  3. Area#General formula
  4. Pollard 2002, Sect. 5.5, Example 17 on page 122
  5. Durrett 1996, Sect. 4.1(a), Example 1.6 on page 224
  6. Pollard 2002, Sect. 5.5, page 122

==References==

* Durrett, Richard (1996), ''Probability: theory and examples'' (Second ed.)
* Pollard, David (2002), ''A user's guide to measure theoretic probability'', Cambridge University Press

[[Category:Probability theory]]