Jurg Ott / 14 January 2001
Rockefeller University, New York
ott@linkage.rockefeller.edu

NOCOM and COMPMIX programs

INTRODUCTION

The NOCOM program is one of the Linkage Utility programs, which may be found at ftp://linkage.rockefeller.edu/software/utilities/. It estimates the parameters (means, variances, component proportions) of a mixture of normal distributions for independent observations (quantitative data). A power transformation of the data is optional, with the exponent estimated as well if desired; thus, for example, lognormal distributions may also be analyzed. The estimation is iterative and is based on a 'counting' (EM) algorithm (Hasselblad 1966, Ott 1979), starting from a set of initial parameter values furnished by the user. NOCOM is written in Fortran and has been compiled with the GNU G77 compiler.
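
For readers who want to see the general idea in code, the following Python sketch carries out the counting (EM) iteration for a mixture of normal distributions with a common standard deviation. It is an illustration only, not NOCOM's Fortran code; the function name and defaults are ours, and NOCOM's refinements (convergence tests, separate variances, the power transformation) are omitted.

# Illustrative 'counting' (EM) iteration for a mixture of normal
# distributions with a common standard deviation; variable names are
# ours, and NOCOM's refinements are omitted.
import numpy as np

def em_normal_mixture(x, means, sd, props, n_iter=100):
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    props = np.asarray(props, dtype=float)
    for _ in range(n_iter):
        # E step ('counting'): posterior probability of each component
        # for every observation, given the current parameter values
        dens = np.exp(-0.5 * ((x[:, None] - means) / sd) ** 2) / sd
        post = props * dens
        post /= post.sum(axis=1, keepdims=True)
        # M step: weighted proportions, means and common variance
        weight = post.sum(axis=0)
        props = weight / len(x)
        means = (post * x[:, None]).sum(axis=0) / weight
        sd = np.sqrt((post * (x[:, None] - means) ** 2).sum() / len(x))
    loglik = np.log((props * np.exp(-0.5 * ((x[:, None] - means) / sd) ** 2)
                     / (sd * np.sqrt(2.0 * np.pi))).sum(axis=1)).sum()
    return means, sd, props, loglik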

As mentioned above, for a given value of the exponent in the power transformation, the parameters are estimated iteratively by the EM algorithm. When the exponent also has to be estimated, its value is changed in steps, starting with an initial step size D. At each new exponent value, the other parameters are first approximately adjusted for the change in the exponent and then estimated by the counting method. The step size for the exponent is halved whenever the direction of the search changes, until it falls below the tolerance TOL. If the exponent falls below the lower bound, RLB, or exceeds the upper bound, RUB, an error message is issued. In programming the method, special care was taken to allow initial mean estimates to be far from their maximum likelihood values. A schematic version of this search is sketched below.
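
The following Python sketch shows the general shape of such a step-halving search. It is schematic rather than a transcription of the program's logic; fit_mixture() is a hypothetical placeholder standing in for one complete EM fit at a fixed exponent that returns the maximized log likelihood.

# Schematic step-halving search over the exponent; fit_mixture() is a
# hypothetical stand-in for one full EM fit at a fixed exponent and is
# assumed to return the maximized log likelihood.
def search_exponent(fit_mixture, e, d, tol, rlb, rub):
    best_e, best_ll = e, fit_mixture(e)
    while abs(d) > tol:
        e_new = best_e + d
        if e_new < rlb or e_new > rub:
            raise ValueError("exponent outside [RLB, RUB]")
        ll = fit_mixture(e_new)
        if ll > best_ll:
            best_e, best_ll = e_new, ll   # keep moving in this direction
        else:
            d = -d / 2.0                  # reverse direction and halve the step
    return best_e, best_ll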

ESTIMATING MIXTURE PARAMETERS

To estimate parameters such as means and proportions, suitable starting values are required. A sensible approach to finding starting values is to produce a histogram of the data and visually identify component means (e.g., using the HIST program, which is one of the Linkage Utility programs). A common standard deviation may be estimated by focusing on one of the components, preferably the predominant one. One imagines a normal density curve for this component and identifies its maximum point and a point of inflection on either side of the maximum. The (horizontal) distance between the maximum and either inflection point is one standard deviation. In practice, starting with too high a value for the standard deviation is better than starting with one that is too small. An example is given below.
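
For readers without the HIST program at hand, a crude text histogram in the same spirit can be produced with a few lines of Python; the function name, number of classes, and bar width below are arbitrary choices, not HIST's.

# Crude text histogram for eyeballing component means and spread;
# not the HIST program, just a stand-in with arbitrary formatting.
import numpy as np

def text_histogram(x, n_classes=20, width=50):
    counts, edges = np.histogram(x, bins=n_classes)
    for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
        bar = "*" * int(round(width * c / counts.max()))
        print(f"{lo:8.2f} - {hi:8.2f} | {bar}")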

TESTING FOR A MIXTURE

Consider the null hypothesis, H0: u1 = u2, and the alternative hypothesis, H1: u1 ≠ u2, where ui is the mean of the i-th component. To test whether H1 is significantly better than H0, one looks at the log likelihood obtained under each hypothesis. Then, one forms the statistic G2 = 2[ln(L1) - ln(L0)], where Li is the maximum likelihood obtained under the i-th hypothesis.

Because u1 = u2 forces p1 = 1, G2 does not have an asymptotic chi-square distribution. Nonetheless, it is often taken to be approximately chi-square with 2 df. Computer simulations show, however, that the p-value so obtained is somewhat too small, that is, the test based on the chi-square distribution with 2 df is liberal (nonconservative) (Thode et al. 1988).

For a chi-square test with 2 df at the 5% significance level, the threshold of the test statistic is equal to 5.99. Instead of this limit, according to Thode et al. (1988), the somewhat higher limit, 6.08 + 4.51/√n, should be used, where n is the number of observations.
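
As a minimal illustration, the following Python sketch computes G2 from the two maximized log likelihoods and compares it with this sample-size-adjusted 5% limit. The function name is ours, and the one-component log likelihood in the example call is a made-up figure.

# Likelihood ratio statistic G2 and the 5% limit of Thode et al. (1988);
# the one-component log likelihood below is illustrative only.
import math

def mixture_test(loglik_h1, loglik_h0, n):
    g2 = 2.0 * (loglik_h1 - loglik_h0)
    limit = 6.08 + 4.51 / math.sqrt(n)   # 5% significance level
    return g2, limit, g2 > limit

# e.g., with -315.053 for two components (see the example below), a
# hypothetical -319.2 for one component, and n = 200 observations:
g2, limit, significant = mixture_test(-315.053, -319.2, 200)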

DATA INPUT

Observations (all larger than zero except when a constant exponent of 1 is specified; see below) are expected in free format in a file called nocom.dat. Output will be written to the nocom.out file. Parameter values are entered interactively as follows:

1) Exponent, e, for power transformation of original values (y) into transformed
    values (x), where x = (y^e - 1)/e + e. Enter e = 1 for no transformation

2) Estimate exponent (1) or keep it constant (0)?

3) Number, NC, of components of the normal mixture

For a single component (NC=1), skip items 4 through 11

4) Fixed ratios of standard deviations (2) or a separate standard deviation for each component (1)? Note that parameter estimation tends to be unstable when a separate standard deviation is estimated for each component; thus, option 2 is recommended. A special case of option 2 is equal standard deviations for all components. While the user may still enter different standard deviations, the program estimates a single variance factor by which each variance is increased or decreased.

Items 5 through 11, below, must be repeated for each component:
---------
5) Mean of the i-th component

6) Estimate this mean (1) or keep its value constant (0)?

For variance option (1) [see 4)]:
7) Standard deviation for the i-th component
8) Estimate this standard deviation (1) or keep it constant (0)?

For variance option (2):
7) Standard deviation for the 1st component or, for subsequent components, the ratio of the component's standard deviation to that of the 1st component

9) Proportion (weight) of the i-th component

10) Estimate this proportion (1) or keep constant (0)?
----------

For variance option (2):
11) Estimate standard deviations (1) or keep constant (0)?

12) Enter one of the following codes for further calculations:
      =  0 to stop
      =  1 to start over (data will be read again)
      =  2 for new initial values with the previously used data
      =  3 for a new exponent only, with the same data as before

The program does all computations on transformed values, x. Input (initial estimates) and output are also in terms of the transformed (x) rather than the original (y) observations.
Note: Exponent e = 1 is equivalent to x = y; e = 0 is equivalent to x = ln(y); e = 0.5 is equivalent to x = √y (square root transformation).
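
The transformation itself is easy to reproduce. The following Python sketch (an illustrative re-implementation, not NOCOM code) applies x = (y^e - 1)/e + e and treats e = 0 as the limiting logarithmic case.

# Power transformation used by NOCOM, with e = 0 taken as the limit ln(y);
# this is an illustrative re-implementation, not NOCOM code.
import numpy as np

def power_transform(y, e):
    y = np.asarray(y, dtype=float)
    if e == 0:
        return np.log(y)
    return (y ** e - 1.0) / e + e

# e = 1 reproduces y itself; e = 0.5 gives 2*sqrt(y) - 1.5, a linear
# function of sqrt(y), which is why it amounts to a square root transformation.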

The program dimensions are set to accommodate up to 10,000 observations and up to 9 components of a mixture of normal distributions. At every 25th iteration, and at the final iteration, results are printed (the iteration number is the first number printed). With exponent estimation, the final iteration may not be the one with the highest log likelihood. Therefore, the maximum log likelihood and its associated exponent estimate are printed after the last iteration.

As a training set, 200 observations are provided in the nocom.dat file. These were generated with a random number generator for normal variables with standard deviation 2; the first 150 observations have mean 10, the next 50 have mean 15. Following the general guideline above, to obtain reasonable starting values for the means, standard deviation and proportions, one first looks at the distribution of these 200 observations using the HIST program. With 20 classes, one sees two modes and might come up with the starting values shown in the following table. Running NOCOM with a constant exponent of 1, two components, and a common standard deviation (option 2), one obtains the following results:

                                Common
                  Mean1  Mean2 Std.Dev. Prop.1 Prop.2  Ln(Lik)
True values       10     15      2       0.75   0.25
Starting values   10.5   17.0    2.5     0.8    0.2   -322.678
Final values       9.98  15.53   2.07    0.79   0.21  -315.053
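
The same exercise can be approximated outside NOCOM as a cross-check. The following Python sketch simulates data with this structure and fits a two-component mixture with a common (tied) variance using scikit-learn; it is not NOCOM, and the simulated values will not reproduce nocom.dat exactly.

# Independent cross-check of the training example: simulate 150 N(10,2)
# and 50 N(15,2) observations and fit a two-component mixture with a
# common (tied) variance.  This is not NOCOM and will not match nocom.dat.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(10, 2, 150), rng.normal(15, 2, 50)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, covariance_type="tied",
                     means_init=[[10.5], [17.0]]).fit(y)
print("means:", gm.means_.ravel())
print("common sd:", np.sqrt(gm.covariances_).item())
print("proportions:", gm.weights_)
print("ln(likelihood):", gm.score(y) * len(y))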

The program may be run in batch mode with the command, for example,
  nocom <nocomstart.dat
(see the nocomstart.dat file).

COMPMIX PROGRAM

The COMPMIX program assumes a mixture of normal distributions with known parameters for each component in that mixture; these parameters may have been estimated by the NOCOM program. For a given observed value, COMPMIX then calculates the conditional probability that it belongs to the i-th component. The program is interactive and self-explanatory. Input data are read from a file named at execution time.
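
The calculation is Bayes' rule applied to the known component densities, as in the following Python sketch; the function name and the example call (using the fitted values from the NOCOM example above) are illustrative.

# Posterior probability that an observation belongs to each component,
# given known means, standard deviations and proportions (Bayes' rule);
# an illustrative re-implementation, not COMPMIX code.
import numpy as np
from scipy.stats import norm

def component_posteriors(x, means, sds, props):
    dens = np.asarray(props) * norm.pdf(x, loc=means, scale=sds)
    return dens / dens.sum()

# e.g., an observation of 12.0 under the fitted values of the NOCOM example:
print(component_posteriors(12.0, [9.98, 15.53], [2.07, 2.07], [0.79, 0.21]))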

REFERENCES

Hasselblad V (1966) Estimation of parameters for a mixture of normal distributions. Technometrics 8, 431-444

Ott J (1979) Detection of rare major genes in lipid levels. Hum Genet 51, 79-91

Thode HC, Finch SJ, Mendell NR (1988) Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals. Biometrics 44, 1195-1201