Machine Learning(2)Estimate the probability density -- Mixture of Gaussians



Chenjing Ding
2018/02/21


| notation | meaning |
| --- | --- |
| $M$ | number of mixture components |
| $p(j)$ | weight of mixture component $j$ |
| $p(x \mid \theta_j)$ | $j$-th mixture component |
| $p(x \mid \theta)$ | mixture density |
| $\theta_j$ | parameters of the $j$-th component |

1. Mixture of Multivariate Gaussians

In some cases a single Gaussian distribution cannot represent $p(x \mid \theta)$ well (see the red model in Figure 1), so in this chapter we estimate the mixture density of multivariate Gaussians.

1.1 Obtaining the mixture density

Weight of mixture component:

$$p(j) = \pi_j$$

Mixture component:

$$p(x \mid \theta_j)$$

Mixture density:

$$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$$
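
To make the definition concrete, here is a minimal NumPy/SciPy sketch that evaluates such a mixture density; the variable names (`weights`, `means`, `covs`) and the example parameter values are illustrative choices, not from the text.

```python
# Minimal sketch: evaluate p(x | theta) = sum_j p(j) * N(x; mu_j, Sigma_j)
# for M = 2 bivariate Gaussian components (all parameter values are made up).
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.4, 0.6])                             # p(j) = pi_j, sums to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]       # mu_j
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]     # Sigma_j

def mixture_density(x, weights, means, covs):
    """p(x | theta) = sum_j p(j) * p(x | theta_j)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([1.0, 1.0]), weights, means, covs))
```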


Figure 1: mixture density

2. Maximum Likelihood

Using maximum likelihood to estimate $\mu_j$:

$$
\begin{aligned}
E_n(\theta) &= \ln p(x_n \mid \theta), \qquad
E(\theta) = \sum_{n=1}^{N} E_n(\theta) = \sum_{n=1}^{N} \ln p(x_n \mid \theta) \\
\frac{\partial E(\theta)}{\partial \mu_j}
&= \sum_{n=1}^{N} \frac{\partial p(x_n \mid \theta) / \partial \mu_j}{p(x_n \mid \theta)}
 = \sum_{n=1}^{N} \frac{p(j)\, \partial p(x_n \mid \theta_j) / \partial \mu_j}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)} \\
&= \sum_{n=1}^{N} \frac{p(j)\, \Sigma^{-1} (x_n - \mu_j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}
 = \Sigma^{-1} \sum_{n=1}^{N} (x_n - \mu_j)\, \gamma_j(x_n), \\
\text{where } \gamma_j(x_n) &= \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}.
\end{aligned}
$$

Setting $\partial E(\theta) / \partial \mu_j = 0$ gives

$$\mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}$$

Problem with estimating $\mu_j$:
$\mu_j$ depends on $\gamma_j(x_n)$, and $\gamma_j(x_n)$ in turn depends on $\mu_j$, so there is no closed-form analytical solution.
$$\gamma_J(x_n) = \frac{p(J)\, p(x_n \mid \theta_J)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)} = \frac{p(x_n \mid j = J, \theta)\, p(J)}{p(x_n \mid \theta)} = \frac{p(x_n, j = J \mid \theta)}{p(x_n \mid \theta)} = p(j = J \mid x_n, \theta)$$
Thus $\gamma_j(x_n)$ represents the "responsibility" of component $j$ for the mixture density given $x_n$. If we can estimate $\gamma_j(x_n)$, then we can obtain $\mu_j$; K-Means clustering is helpful for this.
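
The circular dependency is easy to see in code: the responsibilities need the current parameters, and the mean update is a responsibility-weighted average of the data. This is only an illustrative sketch (the helper names `responsibilities` and `update_means` are mine); it reuses the mixture parameters from the sketch above.

```python
# Sketch of the two quantities from the derivation above: gamma_j(x_n) and the
# resulting mean update mu_j. Note that each needs the other, hence no closed form.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """gamma[n, j] = p(j) p(x_n | theta_j) / sum_k p(k) p(x_n | theta_k)."""
    num = np.stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                    for w, m, c in zip(weights, means, covs)], axis=1)   # (N, M)
    return num / num.sum(axis=1, keepdims=True)   # rows sum to 1: gamma is the posterior p(j | x_n)

def update_means(X, gamma):
    """mu_j = sum_n gamma_j(x_n) x_n / sum_n gamma_j(x_n)."""
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```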

3. K-Means Clustering

K-Means clustering assigns each data point to one of K clusters according to its distance to the mean (centroid) of each cluster.

3.1 Steps

step1: Initialization: pick K arbitrary centroids (cluster means).

step2: Assign each sample to the closest centroid.

step3: Adjust each centroid to be the mean of the samples assigned to it.

step4: Go to step 2 until the centroids no longer change in step 3. (A minimal code sketch of this loop, together with the objective function, follows section 3.2.)


Figure 2: the process of K-Means clustering (K = 2)

3.2 Objective function

K-Means optimizes the following objective function:

$$L = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2, \qquad
r_{nk} = \begin{cases} 1, & k = \arg\min_{k'} \lVert x_n - \mu_{k'} \rVert^2 \\ 0, & \text{else} \end{cases}$$

$r_{nk}$ is an indicator variable that checks whether $\mu_k$ is the nearest cluster center to point $x_n$.
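
The sketch below (my own function and variable names, not from the text) implements the loop from section 3.1 and evaluates the objective $L$ above with NumPy; it does not handle empty clusters.

```python
# Minimal K-Means sketch: step 1 (init), step 2 (assign), step 3 (update),
# step 4 (repeat until the centroids stop changing), plus the objective L.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]    # step 1: arbitrary centroids
    for _ in range(n_iters):
        # step 2: assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # (N, K)
        labels = dists.argmin(axis=1)                                           # r_nk as an index
        # step 3: move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):               # step 4: stop when nothing changes
            break
        centroids = new_centroids
    L = np.sum(dists.min(axis=1) ** 2)    # L = sum_n sum_k r_nk ||x_n - mu_k||^2
    return centroids, labels, L
```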

3.3 Advantages and Disadvantages

Advantages:

  • simple and fast to compute
  • converges to a local minimum of the within-cluster squared error

Disadvantages:

  • sensitive to initialization
  • sensitive to outliers
  • difficult to set K properly
  • only detects spherical clusters

    Figure 3: the problem with K-Means clustering (K = 2)

4. EM Algorithm

Once we have used K-Means clustering to get the mean of each cluster, we have $\theta_j = (\mu_j, \Sigma_j)$ and can estimate the "responsibility" $\gamma_j(x_n)$ of component $j$ for the mixture density.

4.1 K-Means Clustering Revisited

step1: Initialization: pick K arbitrary centroids [compute $\theta_j^0 = (\mu_j^0, \Sigma_j^0)$]

step2: Assign each sample to the closest centroid. [compute $\gamma_j(x_n)$, the E-step]

step3: Adjust the centroids to be the means of the samples assigned to them. [compute $\theta_j^\tau = (\mu_j^\tau, \Sigma_j^\tau)$, the M-step]

step4: Go to step 2 (until no change)

The process is almost the same as K-Means clustering, but in K-Means each point is hard-assigned to a single cluster, so there is no soft responsibility like $\gamma_j(x_n)$.

4.2 E-step & M-step

E-step: softly assign samples to mixture components

$$\gamma_j(x_n) = \frac{p(j)\, p(x_n \mid \theta_j)}{\sum_{k=1}^{M} p(x_n \mid \theta_k)\, p(k)}, \qquad j = 1, \dots, K, \quad n = 1, \dots, N$$

M-step: re-estimate the parameters (separately for each mixture component) based on the soft assignments.

$$
\begin{aligned}
\hat{N}_j &= \sum_{n=1}^{N} \gamma_j(x_n), \qquad
\hat{p}(j) = \frac{\hat{N}_j}{N} \\
\hat{\mu}_j^{\,new} &= \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}, \qquad
\hat{\Sigma}_j^{\,new} = \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\,new})(x_n - \hat{\mu}_j^{\,new})^{T}
\end{aligned}
$$
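
A compact sketch of these two steps, iterated a fixed number of times; it reuses the `responsibilities` helper from the sketch in section 2, and the function and variable names are mine rather than from the text.

```python
# Sketch of EM for a mixture of Gaussians: E-step (soft assignments) followed
# by the M-step re-estimation formulas above, repeated for n_iters passes.
import numpy as np

def m_step(X, gamma):
    """Re-estimate p(j), mu_j and Sigma_j from the soft assignments gamma (N, M)."""
    N, _ = X.shape
    Nj = gamma.sum(axis=0)                        # N_j = sum_n gamma_j(x_n)
    weights = Nj / N                              # p(j) = N_j / N
    means = (gamma.T @ X) / Nj[:, None]           # mu_j^new
    covs = []
    for j in range(gamma.shape[1]):
        diff = X - means[j]                                          # (N, D)
        covs.append((gamma[:, j, None] * diff).T @ diff / Nj[j])     # Sigma_j^new
    return weights, means, covs

def em(X, weights, means, covs, n_iters=50):
    for _ in range(n_iters):
        gamma = responsibilities(X, weights, means, covs)   # E-step
        weights, means, covs = m_step(X, gamma)             # M-step
    return weights, means, covs
```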

4.3 Advantages

  • Very general, can represent any (continuous) distribution.
  • Once trained, very fast to evaluate.
  • Can be updated online.

4.4 Caveats

  1. Introduce regularization
    Instead of $\Sigma^{-1}$, use $(\Sigma + \sigma I)^{-1}$ to avoid a collapsing covariance ($\Sigma \to 0$) driving $p(x_n \mid \theta_j)$ to infinity.
  2. Initialize with K-Means to get better results
    Typical steps:
    Run K-Means several times (e.g. 10 to 100 runs)
    Pick the best result (lowest objective L)
    Use this result to initialize EM (see the sketch after this list)
  3. EM for MoG is computationally expensive
  4. The number of mixture components K must be selected properly (a model selection problem)
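
A short sketch of caveats 1 and 2; the value of `sigma`, the number of restarts, and the helper names are illustrative assumptions, and it reuses the `kmeans` function from the sketch in section 3.

```python
# Sketch: regularize each covariance with sigma * I (caveat 1) and initialize
# the EM parameters from the best of several K-Means runs (caveat 2).
import numpy as np

def regularize(covs, sigma=1e-6):
    """Use (Sigma + sigma * I) instead of Sigma so every covariance stays invertible."""
    return [c + sigma * np.eye(c.shape[0]) for c in covs]

def init_from_kmeans(X, K, n_restarts=10):
    """Run K-Means several times, keep the run with the lowest objective L, build theta_j^0."""
    best = min((kmeans(X, K, seed=s) for s in range(n_restarts)), key=lambda run: run[2])
    centroids, labels, _ = best
    weights = np.array([(labels == k).mean() for k in range(K)])                    # p(j)^0
    covs = regularize([np.cov(X[labels == k], rowvar=False) for k in range(K)])     # Sigma_j^0
    return weights, centroids, covs
```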