import jax
from jax import random
from jax import numpy as jnp
from jax import scipy as sp
import matplotlib.pyplot as plt
import ipywidgets as widgets
Posterior Distribution for the Bernoulli distribution
Some new terms:
- Prior Probability: the probability assigned to a parameter value (or group) before the data are collected.
- Posterior Probability: the probability of the parameter value (or group) after conditioning on the observed data.
Recall: the Bernoulli distribution is a discrete distribution for a random variable that can take one of two states.
For example, weather can be good or bad. We can encode weather as \(W \in \{\text{bad}, \text{good}\}\), where bad is encoded as 0 and good as 1.
Now suppose \(W\) follows a Bernoulli distribution with some parameter \(\theta\), i.e. \(W \sim \text{Bern}(\theta)\). We only know the form of the distribution; we do not know the parameter \(\theta\), so our task is to estimate it.
Thus we want to find \(\theta\) given the data, i.e. \(P(\theta|D)\). To do this, we treat \(\theta\) itself as a random variable and, in particular, give it a beta prior. This beta distribution has two hyperparameters, \(\alpha\) and \(\beta\).
Thus \(\theta\) ~ Beta(\(\alpha, \beta\)) and W ~ Bern(\(\theta\))
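As a minimal sketch of this generative model (the hyperparameters, PRNG key, and variable names below are arbitrary illustration choices, not part of the derivation), we can first draw \(\theta\) from the Beta prior and then draw weather observations from Bern(\(\theta\)):
# Sketch: sample from the generative model theta ~ Beta(alpha, beta), W ~ Bern(theta).
# The hyperparameters and sample size are arbitrary illustration choices.
key = random.PRNGKey(42)
key_theta, key_w = random.split(key)
alpha0, beta0 = 2.0, 3.0
theta_sample = random.beta(key_theta, alpha0, beta0)      # one draw of theta from the prior
w_sample = random.bernoulli(key_w, theta_sample, (10,))   # ten weather observations given theta
theta_sample, w_sample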
Now the joint pdf: \[\begin{equation} p(\theta,D) = p(\theta) p(D|\theta) \end{equation}\] Here \(p(\theta)\) is the prior probability since it is the probability of a parameter/group. \(p(D|\theta)\) is the likelihood.
\[\begin{equation} p(\theta,D) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=0}^{n-1}\theta^{w_i}(1-\theta)^{1-w_i}}{B(\alpha,\beta)} \end{equation}\]
By Bayes' theorem:
\[\begin{align} P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)} = \frac{P(\theta,D)}{P(D)} \end{align}\]
\[\begin{equation} P(\theta|D) \propto P(\theta,D) \end{equation}\]
Thus the conditional is proportional to the joint pdf. \(P(D)\) is difficult to calculate, as we will see at the end.
Now,
\[\begin{equation} P(\theta|D) \propto \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=0}^{n-1}\theta^{w_i}(1-\theta)^{1-w_i}}{B(\alpha,\beta)} \end{equation}\]
\[\begin{equation} P(\theta|D) \propto \theta^{\alpha-1}\cdot\prod_{i=0}^{n-1}\theta^{w_i}\cdot(1-\theta)^{\beta-1}\cdot\prod_{i=0}^{n-1}(1-\theta)^{1-w_i} \end{equation}\]
\[\begin{equation} P(\theta|D) \propto \theta^{\alpha-1}\cdot\theta^{\sum_{i=0}^{n-1}w_i}\cdot(1-\theta)^{\beta-1}\cdot(1-\theta)^{\sum_{i=0}^{n-1}(1-w_i)} \end{equation}\]
\[\begin{equation} P(\theta|D) \propto \theta^{\alpha+\sum_{i=0}^{n-1}w_i-1}\cdot(1-\theta)^{\beta+N-\sum_{i=0}^{n-1}w_i-1} \end{equation}\]
\[\begin{equation} \alpha' = \alpha+\sum_{i=0}^{n-1}w_i \ \ \text{and} \ \ \beta' = \beta+N-\sum_{i=0}^{n-1}w_i \end{equation}\]
Here \(N = n\) is the number of observations, and we used \(\sum_{i=0}^{n-1}(1-w_i) = N-\sum_{i=0}^{n-1}w_i\).
\[\begin{equation} P(θ|D) ∝ θ^{α'-1}(1-θ)^{β'-1} \end{equation}\]
\[\begin{equation} \theta \mid D \sim \text{Beta}(\alpha',\beta') \end{equation}\]
Thus, the posterior of the parameter \(\theta\) is again a beta distribution. The beta distribution is therefore called a conjugate prior for the Bernoulli, since:
- Prior: Beta
- Model: Bernoulli
- Posterior: Beta again
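As a quick numerical check of this conjugacy (a sketch only; the toy data w_toy and hyperparameters a0, b0 below are arbitrary), we can evaluate prior times likelihood on a grid, normalize it numerically, and compare with the Beta(\(\alpha', \beta'\)) density:
# Sketch: check numerically that prior x likelihood matches Beta(alpha', beta').
# w_toy, a0, b0 are arbitrary illustration choices.
w_toy = jnp.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
a0, b0 = 2.0, 2.0
a_post = a0 + w_toy.sum()                      # alpha' = alpha + sum(w_i)
b_post = b0 + w_toy.shape[0] - w_toy.sum()     # beta'  = beta + N - sum(w_i)

grid = jnp.linspace(0.01, 0.99, 99)
likelihood = jnp.prod(grid[:, None] ** w_toy * (1 - grid[:, None]) ** (1 - w_toy), axis=1)
unnorm = sp.stats.beta.pdf(grid, a0, b0) * likelihood
unnorm = unnorm / (unnorm.sum() * (grid[1] - grid[0]))   # normalize numerically
jnp.max(jnp.abs(unnorm - sp.stats.beta.pdf(grid, a_post, b_post)))  # ~0 up to grid error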
key = random.PRNGKey(0)
weather = jax.random.bernoulli(key, 0.5, (1, 100))
weather
Array([[ True, False, False, False, True, False, False, True, False,
True, True, True, False, False, True, False, True, True,
True, False, True, False, False, False, False, False, True,
True, False, True, False, True, True, False, True, False,
False, True, False, True, False, True, False, False, False,
False, True, True, True, False, False, False, False, False,
True, True, False, False, False, True, False, False, True,
False, False, False, False, True, True, False, True, False,
False, False, False, True, False, False, True, False, False,
False, True, True, True, True, True, False, True, False,
True, True, True, True, False, True, True, False, False,
True]], dtype=bool)
# Interactive plot function
def plot_beta_pdf(alpha, beta):
    mu = jnp.linspace(0, 1, 100)
    pdf = sp.stats.beta.pdf(mu, alpha, beta)
    plt.plot(mu, pdf)
    plt.xlabel(r"x")
    plt.ylabel("PDF")
    plt.title("Beta Distribution")
    plt.xlim(0, 1)
    plt.ylim(0, 13)
    plt.savefig("Bell curve.png", dpi=400)

alpha = widgets.FloatSlider(value=1, min=0.3, max=100, description="alpha")
beta = widgets.FloatSlider(value=1, min=0.3, max=100, description="beta")
ui = widgets.VBox([alpha, beta])
output = widgets.interactive_output(plot_beta_pdf, {'alpha': alpha, 'beta': beta})
display(ui, output)
alpha = 2.0
beta = 1.5
# Calculating the posterior parameters
alpha_prime = alpha + weather.sum()
beta_prime = beta + 100 - weather.sum()
alpha_prime
Array(47., dtype=float32, weak_type=True)
beta_prime
Array(56.5, dtype=float32, weak_type=True)
Maximum A Posteriori Estimate (MAP) for Bernoulli Distribution
Motivation:
In the last presentation we covered MLE (Maximum Likelihood Estimation), but when the number of samples is small, MLE can overfit. For example, in a coin toss with a small sample, the estimated probability of heads may be far from 0.5, whereas the Bayesian treatment we just saw pulls the estimate back towards the probability we defined at the beginning.
This is because MLE solves \(\theta^{*} = \operatorname{argmax}_{\theta \in [0,1]} \, l(D;\theta)\), where \(l\) is the log-likelihood function.
We are simply maximising the likelihood and not encoding any prior knowledge. It may be that we do have prior knowledge and wish to take it into account while estimating the parameter.
Now we know that,
\[\begin{equation} P(\theta|D) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=0}^{n-1}\theta^{w_i}(1-\theta)^{1-w_i} = M(D|\theta;\alpha,\beta) \end{equation}\]
Thus MAP: \(\theta^{*}_{MAP} = \operatorname{argmax}_{\theta \in [0,1]} M(D|\theta;\alpha,\beta)\).
Since the maximum is easier to find in log space, we take the logarithm.
\[\begin{align} m(D|\theta;\alpha,\beta) &= \log(M(D|\theta;\alpha,\beta)) \\ &= \log\!\left(\theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=0}^{n-1}\theta^{w_i}(1-\theta)^{1-w_i}\right) \\ &= \sum_{i=0}^{n-1}\left(\log\theta^{w_i}+\log(1-\theta)^{1-w_i}\right) + \log\theta^{\alpha-1} + \log(1-\theta)^{\beta-1} \\ &= \sum_{i=0}^{n-1}\left(w_i\log\theta + (1-w_i)\log(1-\theta)\right) + (\alpha-1)\log\theta + (\beta-1)\log(1-\theta) \end{align}\]
Setting the derivative to zero:
\[\begin{equation} \frac{\partial m}{\partial\theta} = \sum_{i=0}^{n-1}\left(\frac{w_i}{\theta}-\frac{1-w_i}{1-\theta}\right) + \frac{\alpha-1}{\theta} - \frac{\beta-1}{1-\theta} = 0 \end{equation}\]
Multiplying through by \(\theta(1-\theta)\):
\[\begin{equation} \sum_{i=0}^{n-1}\left(w_i(1-\theta)-\theta(1-w_i)\right) + (\alpha-1)(1-\theta) - \theta(\beta-1) = \sum_{i=0}^{n-1}(w_i - \theta) + \alpha - \alpha\theta - 1 + 2\theta - \theta\beta = 0 \end{equation}\]
\[\begin{equation} \sum_{i=0}^{n-1}w_i + \alpha - 1 - \theta(N+\alpha+\beta-2) = 0 \end{equation}\]
\[\begin{equation} \theta^{*}_{MAP} = \frac{\sum_{i=0}^{n-1}w_i + \alpha - 1}{N+\alpha+\beta-2} \end{equation}\]
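As a sanity check on this closed form (a sketch; the toy sample w_toy and hyperparameters a0, b0 are arbitrary), we can use jax.grad to confirm that the derivative of the log-posterior \(m\) vanishes at \(\theta^{*}_{MAP}\):
# Sketch: the gradient of the log-posterior m should be ~0 at the closed-form MAP.
# w_toy, a0, b0 are arbitrary illustration choices.
def log_posterior(theta, w, a, b):
    return (jnp.sum(w * jnp.log(theta) + (1 - w) * jnp.log(1 - theta))
            + (a - 1) * jnp.log(theta) + (b - 1) * jnp.log(1 - theta))

w_toy = jnp.array([1.0, 0.0, 1.0, 1.0, 0.0])
a0, b0 = 2.0, 2.0
N = w_toy.shape[0]
theta_map = (w_toy.sum() + a0 - 1) / (N + a0 + b0 - 2)
jax.grad(log_posterior)(theta_map, w_toy, a0, b0)   # should be approximately 0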
theta_weather_true = 0.8
key = random.PRNGKey(0)
weather = jax.random.bernoulli(key, 0.8, (1, 100))
weather
Array([[ True, False, False, True, True, False, True, True, True,
True, True, True, False, True, True, False, True, True,
True, True, True, False, True, True, True, True, True,
True, True, True, False, True, True, True, True, True,
True, True, False, True, True, True, True, False, False,
False, True, True, True, False, True, True, False, True,
True, True, True, False, True, True, True, False, True,
False, True, False, True, True, True, False, True, True,
False, True, True, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, False, False,
True]], dtype=bool)
theta_weather_mle = weather.mean()
theta_weather_mle
Array(0.77, dtype=float32)
alpha = 50.0
beta = 10.0
theta_weather_map = (weather.sum() + alpha - 1) / (100 + alpha + beta - 2)
theta_weather_map
Array(0.79746836, dtype=float32, weak_type=True)
It is not always true that MAP is better than MLE. But when the dataset is corrupted, i.e. contains noisy observations, a reasonable prior lets MAP give a better estimate than MLE.
key = random.PRNGKey(0)
weather_corrupt = jax.random.bernoulli(key, 0.1, (1, 100))
weather_corrupt
Array([[ True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
True, False, False, False, False, False, False, False, False,
False, False, True, False, True, False, False, False, False,
False, True, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, False, False, False, False, False, False, False,
True, True, False, False, False, False, False, False, False,
False]], dtype=bool)
weather_corrupt_mean = weather_corrupt.mean()
weather_corrupt_mean
Array(0.08, dtype=float32)
alpha = 50.0
beta = 10.0
theta_weather_map_corrupt = (weather_corrupt.sum() + alpha - 1) / (100 + alpha + beta - 2)
theta_weather_map_corrupt
Array(0.3607595, dtype=float32, weak_type=True)
Predictive Posterior for the Bernoulli
Motivation: So far we can assess the probability of a parameter given the data, but what if we want to measure the probability of a good day tomorrow, given that, say, the previous two days had good weather?
We introduce a new random variable named \(T\) for tomorrow. Assume \(T\sim Bern(θ)\). Now,
\[\begin{align} P(\theta,D,T) &= P(\theta)P(D|\theta)P(T|\theta) \\ P(\theta,T|D)P(D) &= P(\theta)P(D|\theta)P(T|\theta) \end{align}\]
\[\begin{equation} P(\theta,T|D) = \frac{P(\theta).P(D|\theta)}{P(D)}.P(T|\theta) \end{equation}\]
Simplifying by Bayes' theorem: \(P(\theta,T|D) = P(\theta|D)\,P(T|\theta)\). Now \(\theta\) is a continuous random variable, since it takes values in the range \([0,1]\). Hence to find \(P(T|D)\) we need to eliminate \(\theta\); to do this, we integrate over \(\theta\).
\[\begin{equation} P(T|D) = \int_\theta \mathrm{d}\theta \, P(\theta,T|D) = \int_\theta \mathrm{d}\theta \, P(\theta|D).P(T|\theta) \end{equation}\]
\(P(\theta|D)\) is the posterior, which is a beta distribution with parameters \(\alpha'\) and \(\beta'\). \(P(T|\theta)\) is the Bernoulli distribution assumed initially. \(\theta\) runs from 0 to 1 in the integration.
\[\begin{equation} P(T|D) = \int_0^{1} \mathrm{d}\theta \, \text{Beta}(\theta;\alpha',\beta')\,\text{Bern}(T;\theta) \end{equation}\]
This integration is hard to solve.
Observe!
\[\begin{equation} P(T|D) =\int_\theta \mathrm{d}\theta \, P(\theta|D).P(T|\theta) = E_{\theta\sim P(\theta|D)}P(T|\theta) \end{equation}\]
Thus this is the expectation of \(P(T|\theta)\) over \(\theta\) distributed according to the posterior \(P(\theta|D)\). When the integral is intractable (not solvable in closed form), it can be approximated by sampling \(\theta\) from the posterior.
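Here is a minimal sketch of that sampling approach, reusing the posterior parameters alpha_prime and beta_prime computed earlier (the PRNG key and sample count are arbitrary): draw \(\theta\) from Beta(\(\alpha',\beta'\)) and average \(P(T=1|\theta) = \theta\) over the draws.
# Sketch: Monte Carlo estimate of P(T=1 | D) by sampling theta from the posterior.
# Reuses alpha_prime and beta_prime from earlier; sample count is arbitrary.
key_mc = random.PRNGKey(1)
theta_samples = random.beta(key_mc, alpha_prime, beta_prime, (10_000,))
theta_samples.mean()    # estimate of P(T=1 | D), since P(T=1 | theta) = theta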
But in the case of this integral, a solution exists which is:
\[\begin{equation} \frac{(\alpha')^{T}.(\beta')^{1-T}}{\alpha'+\beta'} = P(T|D) \end{equation}\]
Now if \(T = 1\), \[\begin{equation} P(T|D) = \frac{\alpha'}{\alpha'+\beta'} \end{equation}\] and similarly if \(T = 0\)
\[\begin{equation} P(T|D) = \frac{\beta'}{\alpha'+\beta'} \end{equation}\]
\[\begin{equation} \alpha' = \alpha + \sum_{i=0}^{n-1}w_i \end{equation}\] Here the summation term is the actual number of good-weather days observed, while \(\alpha\) acts as a pseudo-count of good-weather days: it encodes our prior knowledge of how likely good weather is before seeing the data.
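With the posterior parameters computed earlier for the first weather dataset (\(\alpha' = 47\), \(\beta' = 56.5\)), the closed-form predictive probability of good weather tomorrow is simply:
# Closed-form predictive posterior P(T=1 | D) = alpha' / (alpha' + beta'),
# using alpha_prime and beta_prime from earlier (about 0.45 here).
alpha_prime / (alpha_prime + beta_prime)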
Marginal for the Bernoulli
Recall from earlier that \(P(D)\) is difficult to calculate.
Since \(P(\theta,D) = P(\theta)\,P(D|\theta)\), we can obtain \(P(D)\) by integrating the joint pdf with respect to \(\theta\) from 0 to 1. Here "marginal" means the distribution of a single random variable: we marginalize \(\theta\) out of the joint pdf.
\[\begin{equation} P(D) = \int_\theta \mathrm{d}\theta \, P(\theta,D) \end{equation}\]
\[\begin{equation} P(D) = \int_0^{1} \mathrm{d}\theta \, \frac{\theta^{\alpha+\sum_{i=0}^{n-1}w_i-1}(1-\theta)^{\beta+N-\sum_{i=0}^{n-1}w_i-1}}{B(\alpha,\beta)} \end{equation}\] It is hard to find the anti-derivative of this. One way is to estimate it by sampling, since it is the expectation of \(P(D|\theta)\) with \(\theta\) drawn from the prior \(p(\theta)\).
The solution is: \[\begin{equation} P(D) = \frac{B(\alpha',\beta')}{B(\alpha,\beta)} \end{equation}\] where \(\alpha'\) and \(\beta'\) are the posterior parameters found earlier.
Note that this is a single probability, not a distribution, since \(P(D)\) is the probability of the observed dataset: \(P(D) = P(W = D)\), i.e. the probability that the weather observations equal the dataset we have.
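As a sketch of how \(P(D)\) can be evaluated numerically (working in log space for stability, and reusing the prior \(\alpha = 2.0\), \(\beta = 1.5\) together with alpha_prime and beta_prime from earlier), we can use \(\log B(a,b) = \log\Gamma(a) + \log\Gamma(b) - \log\Gamma(a+b)\):
# Sketch: marginal likelihood P(D) = B(alpha', beta') / B(alpha, beta), in log space.
# Reuses the prior alpha = 2.0, beta = 1.5 and the posterior parameters from earlier.
from jax.scipy.special import gammaln

def log_beta_fn(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return gammaln(a) + gammaln(b) - gammaln(a + b)

log_p_data = log_beta_fn(alpha_prime, beta_prime) - log_beta_fn(2.0, 1.5)
jnp.exp(log_p_data)     # P(D): a single, very small probability, not a distribution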