1. Problem statement
Edmond is an intellectually outstanding student. One day, he is tasked with measuring the heights of all the students in his school. The task requires filling in three fields: name, gender, and height. Unfortunately, he forgets to fill in the gender column and worries that the teacher might notice and think poorly of him. Being a smart and trustworthy student, Edmond cannot let that happen. Determined to fix the mistake, he searches for an efficient and accurate way to fill in the missing information. That is how he comes across the concept of “Expectation-Maximisation”.
2. Expectation-Maximisation (EM)
In general, the Expectation-Maximisation (EM) algorithm is used to estimate hidden variables or distributions from observed data when some information is incomplete or missing. It iteratively alternates between assigning probabilities to the missing data (Expectation step) and optimizing parameters based on these assignments (Maximisation step).
In Edmond’s case, the observed data consists of the names and heights of the students, while the missing data is the gender. By applying EM, Edmond can infer the gender for each recorded height based on statistical patterns, such as the distribution of heights typically associated with different genders in the school population. This allows him to fill in the missing fields accurately, even though the gender information was initially incomplete.
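To see what this looks like in practice, here is a minimal sketch using scikit-learn's GaussianMixture, which runs EM under the hood. The heights below are made-up illustrative values, not real measurements, and the two components stand in for the two gender groups:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical heights in cm; in Edmond's case these would come from his sheet.
heights = np.array([150, 152, 155, 158, 160, 165,
                    168, 170, 172, 175, 178, 180], dtype=float).reshape(-1, 1)

# Fit a two-component Gaussian mixture; EM alternates E and M steps internally.
gm = GaussianMixture(n_components=2, random_state=0).fit(heights)

print(gm.means_.ravel())          # estimated mean height of each group
print(gm.predict_proba(heights))  # soft assignment of each student to a group
```

Each row of `predict_proba` gives the probability that a student belongs to each group, which is exactly the kind of soft guess Edmond needs for the missing gender field.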
3. How it works
Given a statistical model that generates a set \(\mathbf{X}\) of observed data, a set of unobserved latent data or missing values \(\mathbf{Z}\), and a vector of unknown parameters \(\theta\), along with a likelihood function \(L(\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z} \mid \theta)\), the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximising the marginal likelihood of the observed data:
\[L(\theta; \mathbf{X}) = p(\mathbf{X} \mid \theta) = \int p(\mathbf{X}, \mathbf{Z} \mid \theta) \, d\mathbf{Z} = \int p(\mathbf{X} \mid \mathbf{Z}, \theta) \, p(\mathbf{Z} \mid \theta) \, d\mathbf{Z}.\]
However, this quantity is often intractable, since \(\mathbf{Z}\) is unobserved and the distribution of \(\mathbf{Z}\) is unknown before \(\theta\) has been obtained.
The EM Algorithm
The EM algorithm seeks to find the maximum likelihood estimate of the marginal likelihood by iteratively applying these two steps:
Expectation step (E step): Define \(Q(\theta \mid \theta^{(t)})\) as the expected value of the log likelihood function of \(\theta\), with respect to the current conditional distribution of \(\mathbf{Z}\) given \(\mathbf{X}\) and the current estimates of the parameters \(\theta^{(t)}\):
\[Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim p(\cdot \mid \mathbf{X}, \theta^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \theta) \right].\]
Maximisation step (M step): Find the parameters that maximise this quantity:
\[\theta^{(t+1)} = \underset{\theta}{\operatorname{arg\,max}} \, Q(\theta \mid \theta^{(t)}).\]
More succinctly, we can write it as one equation:
\[\theta^{(t+1)} = \underset{\theta}{\operatorname{arg\,max}} \, \mathbb{E}_{Z \sim p(\cdot \mid \mathbf{X}, \theta^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \theta) \right].\]
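In code, the whole procedure is just a loop that alternates these two steps until the parameters (or the likelihood) stop changing. The skeleton below is a minimal sketch; `e_step`, `m_step`, and `log_likelihood` are hypothetical problem-specific callables that you supply, not a library API:

```python
import numpy as np

def em(x, theta0, e_step, m_step, log_likelihood, n_iters=100, tol=1e-8):
    """Generic EM loop; the three callables encode the specific model."""
    theta = theta0
    ll_old = -np.inf
    for _ in range(n_iters):
        stats = e_step(x, theta)       # E step: posterior over Z given theta^(t)
        theta = m_step(x, stats)       # M step: argmax of Q(theta | theta^(t))
        ll = log_likelihood(x, theta)  # marginal log-likelihood, for monitoring
        if ll - ll_old < tol:          # each iteration can only increase it
            break
        ll_old = ll
    return theta
```

Because each iteration can only increase the marginal log-likelihood, the loop converges to a stationary point, though not necessarily to the global maximum.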
4. A hands-on example
One famous algorithm usually associated with EM is the Gaussian Mixture Model. Basically, a Gaussian Mixture Model is an unsupervised method that is mostly used to cluster data. In this section, in order to give you a better understanding of how EM and Gaussian mixtures work, I will use a simple binomial mixture example.
Imagine that you have two coins with unknown probabilities of heads, denoted \(p\) and \(q\) respectively. The first coin is chosen with probability \(\pi\) and the second is chosen with probability \(1 - \pi\). The chosen coin is flipped once and the result is recorded: \(x = \{1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1\}\) (Heads = 1, Tails = 0). Let \(Z_i \in \{0, 1\}\) denote which coin was used on each toss, with \(Z_i = 1\) meaning the first coin was used.
Applying EM to the example, we start by writing down \(Q(\theta \mid \theta^{(t)})\), the expected complete-data log-likelihood, which takes the form of a weighted binary cross entropy (BCE) between the two component distributions:
\[Q(\theta \mid \theta^{(t)}) = \mathbb{E} \left[ \sum_{i=1}^{n} z_i \log(\pi p^{x_i}(1-p)^{1-x_i}) + (1-z_i) \log((1-\pi) q^{x_i} (1-q)^{1-x_i}) \right]\]
\[= \sum_{i=1}^n \mathbb{E}[z_i \mid x_i, \theta^{(t)}] \left[ \log \pi + x_i \log p + (1 - x_i) \log (1 - p) \right]\]
\[+ \left( 1 - \mathbb{E}[z_i \mid x_i, \theta^{(t)}] \right) \left[ \log (1 - \pi) + x_i \log q + (1 - x_i) \log (1 - q) \right]\]
Next, we compute \(\mathbb{E}[z_i \mid x_i, \theta^{(t)}]\):
\[\mu_i^{(t)} = \mathbb{E}[z_i \mid x_i, \theta^{(t)}] = p(z_i = 1 \mid x_i, \theta^{(t)})\]
\[= \frac{p(x_i \mid z_i = 1, \theta^{(t)}) \, p(z_i = 1 \mid \theta^{(t)})}{p(x_i \mid \theta^{(t)})}\]
\[= \frac{\pi^{(t)} [p^{(t)}]^{x_i} [(1 - p^{(t)})]^{1-x_i}}{\pi^{(t)} [p^{(t)}]^{x_i} [(1 - p^{(t)})]^{1-x_i} + (1 - \pi^{(t)}) [q^{(t)}]^{x_i} [(1 - q^{(t)})]^{1-x_i}}\]
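For a whole sample this responsibility can be computed in one vectorised pass. The snippet below is a small sketch in NumPy, where `pi`, `p`, and `q` stand for the current estimates \(\pi^{(t)}, p^{(t)}, q^{(t)}\):

```python
import numpy as np

def responsibilities(x, pi, p, q):
    """mu_i = P(z_i = 1 | x_i, theta^(t)) for a 0/1 array of tosses x."""
    coin1 = pi * p**x * (1 - p)**(1 - x)        # joint prob. of x_i and z_i = 1
    coin2 = (1 - pi) * q**x * (1 - q)**(1 - x)  # joint prob. of x_i and z_i = 0
    return coin1 / (coin1 + coin2)              # Bayes' rule, as in the formula above
```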
Maximising \(Q(\theta \mid \theta^{(t)})\) with respect to \(\theta\) yields the update equations:
\[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial \pi} = 0 \implies \pi^{(t+1)} = \frac{1}{n} \sum_i \mu_i^{(t)}\]
\[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial p} = 0 \implies p^{(t+1)} = \frac{\sum_i \mu_i^{(t)} x_i}{\sum_i \mu_i^{(t)}}\]
\[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial q} = 0 \implies q^{(t+1)} = \frac{\sum_i (1 - \mu_i^{(t)}) x_i}{\sum_i (1 - \mu_i^{(t)})}.\]
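Putting the E and M steps together gives the full procedure. The sketch below runs it on the toss sequence from the example with arbitrary starting values for \(\pi, p, q\) (EM is sensitive to initialisation, so these are only illustrative):

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1], dtype=float)  # observed tosses

def em_two_coins(x, pi=0.4, p=0.7, q=0.3, n_iters=200, tol=1e-8):
    for _ in range(n_iters):
        # E step: mu_i = P(z_i = 1 | x_i, theta^(t))
        coin1 = pi * p**x * (1 - p)**(1 - x)
        coin2 = (1 - pi) * q**x * (1 - q)**(1 - x)
        mu = coin1 / (coin1 + coin2)
        # M step: closed-form updates derived above
        pi_new = mu.mean()
        p_new = (mu * x).sum() / mu.sum()
        q_new = ((1 - mu) * x).sum() / (1 - mu).sum()
        converged = max(abs(pi_new - pi), abs(p_new - p), abs(q_new - q)) < tol
        pi, p, q = pi_new, p_new, q_new
        if converged:
            break
    return pi, p, q

print(em_two_coins(x))  # estimated (pi, p, q)
```

With half heads and half tails in such a short sequence, the estimates depend heavily on the starting point, which is a useful reminder that EM only finds a local maximum of the likelihood.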
5. Conclusion
EM is an algorithm that helps us recover the hidden structure we assume underlies the data. For example, in the coin example above, we assume that the tosses come from two different coins (or two different people) and estimate, for each observation, which of the two is more likely to have produced it. Similarly, in the case of Edmond, EM can help him distinguish between the heights of male and female students.
In this blog, I have recapped the motivation behind EM and explained how it works, and I hope it helps. Thanks for reading, and see you in the next blog!