Color Tolerance: A comparison of the method of constant stimuli and gray scale comparison method on a CRT.

David C. Wilbur


Introduction

The Munsell Color Science Laboratory has been actively engaged in color science research since its inception in 1983. One aspect of this research has been the evaluation of visual color differences with the goal of improved specification of instrumental tolerances for industrial color control. In 1995, the Munsell Industrial Color Difference Evaluation Consortium was established specifically to improve the effectiveness of automated industrial color-difference evaluation.

In an industrial setting it is useful to be able to accept or reject batches based on instrumental measurements of the colorimetric values of samples from the batch. Instrumental measurements are used in formulae which determine the tolerance of acceptable color difference. These formulae are derived from empirical studies using visual assessment of suprathreshold color differences. 1

The main purpose of this research is to compare two methods for deriving suprathreshold color tolerances. If differences exist then these differences are due solely to discrepancies between the two psychophysical methods since the same stimuli were presented in both experiments. Disparities in results may, in part, explain differences between laboratories employing the different techniques.

The two methods being compared in this research are the method of constant stimuli and the gray scale comparison. Both are psychophysical techniques that can be used for measuring suprathreshold color tolerances.

The null hypothesis of this research is that there is no difference between the method of constant stimuli and the gray scale comparison method. The alternative hypothesis is that there is a difference between the two methods when they are used to derive color tolerances. The research hypothesis is stated as follows: in comparing the method of constant stimuli with the gray scale comparison, it is expected that significant differences will exist between the two psychophysical techniques.


Background

Application of current quality assessment tests from hardcopy stimuli to stimuli generated on a CRT will enhance automation efforts. Before a CRT was used to present stimuli, experiments were run using physical stimuli. Physical stimuli have been composed of dyed wool samples, primed aluminum panels that had been sprayed with an automotive lacquer, or prints from an electrophotographic printer. These samples had to be measured and then sorted to make sample pairs that varied in different directions such as L*, C* or H*. Often this procedure would be done for 300 to 400 samples. In addition to the many hours needed to produce raw samples, a great deal of time and labor was needed to prepare the samples for experimentation. Color tolerance measurements are difficult and often expensive to run. A new, more efficient method was therefore needed to reduce the time, labor and cost associated with implementing and running color-tolerance experiments. The computer-controlled CRT provides an extremely efficient solution to this problem.

In addition to increased efficiency a CRT display makes possible the study of additional parametric effects on judgements of color tolerances. Parametric effects to be tested might include the following: surround and background luminance, sample size, sample separation, sample luminance, and texture. These effects can be quickly manipulated on a CRT display. In addition, the use of a CRT decreases uncertainty in experiments due to the increased stability of the experimental viewing conditions. The CRT allows the experimenter to know exact colorimetric values of the CRT through simple measurement techniques. Once the monitor has been calibrated it becomes trivial to change stimuli in certain directions based on preliminary results of the experiment. This alone reduces implementation time of the experiment. Based on results from previously performed experiments the utilization of a CRT for determining suprathreshold color tolerances is feasible and produces results comparable with those using object-color samples.

The proposed research utilizes a Sony Trinitron luminance enhanced CRT which will present the stimuli for the comparison of the two methods. The method of constant stimuli and the grayscale comparison method have been previously performed and verified as accurate psychophysical methods using samples consisting of wound thread. This project will provide a qualitative assessment of the two techniques when performed on a CRT.

Once this comparison exists between the two methods, experimenters from various color labs can judge which method may be better suited for determining suprathreshold color tolerances. It may be that both methods return the same results and there is no difference between them. In that case, factors such as implementation time, run time and observer preference can be evaluated. However, based on the differences presented, experimenters may also be able to decide which of the two methods is the better overall metric for deriving suprathreshold color tolerances using a CRT.


Theory

Visual psychophysics concerns the study of stimulus-response relationships. Psychophysical methods when employed properly have proven their capacity to produce large amounts of valuable information about the senses. These noninvasive procedures have a special value in that they can be freely applied to human subjects. This research utilizes the proven capacity of psychophysical experiments to test the limits of the human visual system to obtain accurate color tolerance measurements for the industrial community2.

There are many discussions concerning which psychophysical methods are better suited for measuring suprathreshold color tolerances. The two methods being compared in this experiment are the method of constant stimuli and the gray scale comparison method.

In the method of constant stimuli an observer is asked to make a greater than/less than choice based on the stimuli presented. In this case the observer will be presented with two pairs of stimuli, an anchor pair and a sample pair. The observer's task is to decide whether the color difference between the anchor pair is greater than or less than the color difference between the sample pair and to respond with a keystroke that corresponds to the pair with the greater color difference.

There are many ways in which direction of color difference could have been manipulated. In this experiment the direction was chosen with the stimuli changing in L*, C* and H*. In order to assure that results were independent of each other the order of the sample pair presentation was completely randomized. This type of experiment is also known as a pass/fail or yes/no experiment.

A drawback of the method of constant stimuli is that it consumes large amounts of time and is taxing on the observer, who may be asked to look at a large number of stimuli that look alike. These judgements are very difficult to make. Another serious problem with this method is the bias introduced by the stimulus range selected by the experimenter. Known as the range effect, it tends to shift the calculated 50% point towards the center of the range. One explanation for this is that observers may be tempted to equalize their yes and no responses over the course of the session. The proper attitude of an observer would be to take each new trial as it comes and forget about responses to previous trials. As mentioned earlier, the method of constant stimuli requires a very large number of trials to produce stable results. Other methods, such as the method of adjustment or the method of limits, can be used to determine a threshold more quickly. However, the method of constant stimuli allows the determination of the complete psychometric function.

Use of a rating scale is the most common method for determining relationships among stimuli. The use of a gray scale step wedge with equal-size increments provides the observer's estimate of the proportion of difference between the stimulus pair and the anchor pair. Analysis of the data using several published techniques provides a color difference threshold.

As with the method of constant stimuli there are also many potential problems with the rating scale method. Rating scale methods are often subject to the effects of observer adaptation to the distribution of stimuli. There may be a tendency for observers not to use the extremes of the scale. A related problem is that some observers may use the end points of the scale and then find that some of the stimuli are greater than or less than those assigned to the end points. This makes it impossible to discriminate between near-extreme stimuli.

It is hypothesized that the difference thresholds measured by one method will correlate well with those measured by some other procedure. In addition to comparing the results of the methods, the amount of labor, time and ease of use for the experimenter and the observer will also be considered. Based on these factors, a recommendation can be made on which procedure is better for determining tolerance thresholds.


Methods

Generating the Stimuli

Suprathreshold tolerances were measured around three color centers in three directions. The three color centers are shown in the following table (Table I).

Table I. The CIELAB values of the three color centers studied.

Color    L*     a*      b*     C*     h°
Red      47.4   61.9    49.5   79.3   38.7
Green    65.0   -67.6   54.6   86.9   141.0
Gray     60.0   0.0     0.0    0.0    0.0

Each color center was varied in L*, C*, and H* in steps of 0.3 ΔE*ab. For each color center the ΔL*, ΔC*ab, and ΔH*ab values were calculated for every possible pair. From this list, 10 sample pairs were selected so that ΔL* varied in steps of 0.3 ΔE*ab, 10 sample pairs were selected so that ΔC* varied in steps of 0.3 ΔE*ab, and 10 sample pairs were selected so that ΔH* varied in steps of 0.3 ΔE*ab. The tables in Appendix A show the L*, a*, b* values for each color center and color difference pair in each direction.

In each case color pairs were selected so that at least 95% of the total color difference was in the desired direction. The range for the sample pairs started at 0.3 ΔE*ab and extended to 3.0 ΔE*ab. Therefore there were up to 30 sample pairs for each color center (10 per direction). In total 70 sample pairs were generated: 30 each for red and green, and 10 for gray, for which only ΔL* results are reported. The same sets of sample pairs were used in both experiments. During the preliminary testing phase of the experiment it became apparent that the range was too small to capture the thresholds for the green color center. For these cases, additional sample pairs were generated and added to the upper end of the range and sample pairs were removed from the lower end.
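The selection rule above can be sketched in a few lines. This is a minimal illustration, not the code used in the experiment: the function names are hypothetical, and "at least 95% in the desired direction" is assumed here to mean that the absolute component (ΔL*, ΔC*ab or ΔH*ab) is at least 0.95 of the total ΔE*ab.

```python
import math

def lab_differences(lab1, lab2):
    """Return (dL, dC, dH, dE) between two CIELAB colors given as (L*, a*, b*)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    dL = L2 - L1
    dC = math.hypot(a2, b2) - math.hypot(a1, b1)      # chroma difference
    dE = math.sqrt(dL ** 2 + (a2 - a1) ** 2 + (b2 - b1) ** 2)
    # Hue difference is what remains once lightness and chroma are removed;
    # clamp at zero to absorb floating-point rounding.
    dH = math.sqrt(max(dE ** 2 - dL ** 2 - dC ** 2, 0.0))
    return dL, dC, dH, dE

def fraction_in_direction(lab1, lab2, direction):
    """Share of the total difference that lies along 'L', 'C' or 'H' (assumed criterion)."""
    dL, dC, dH, dE = lab_differences(lab1, lab2)
    component = {"L": dL, "C": dC, "H": dH}[direction]
    return abs(component) / dE if dE > 0 else 0.0

# Red color center from Table I and two candidate partners 0.3 units away.
red = (47.4, 61.9, 49.5)
lightness_partner = (47.4 + 0.3, 61.9, 49.5)
chroma = math.hypot(red[1], red[2])
scale = (chroma + 0.3) / chroma                      # move outward along the same hue
chroma_partner = (red[0], red[1] * scale, red[2] * scale)

keep_lightness_pair = fraction_in_direction(red, lightness_partner, "L") >= 0.95
keep_chroma_pair = fraction_in_direction(red, chroma_partner, "C") >= 0.95
```

A candidate pair would be kept for a given direction only when the corresponding flag is true.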

Monitor Calibration:

This research project utilized a Sony Trinitron Multiscan 15sf monitor that had been modified by Sony to increase the peak luminance value. The luminance level had been increased to achieve higher absolute light levels for the experiments. To ensure that the values measured from the monitor are accurate we had to calibrate the monitor.

Calibration refers to achieving a predefined set-up for a display system. Characterization is always required in order to define color in a device-independent fashion.

Calibration was accomplished using the LMT C-1200 colorimeter and the LMT luminance meter in order to measure the colorimetry values of the monitor.

Computer-controlled CRT displays can be described by a two-stage model. The first stage is nonlinear and relates digital counts to the photometric scalars for each channel. The second stage consists of a linear transformation matrix, which relates the photometric scalars of each channel to tristimulus values. The first step in the calibration process was to linearize the monitor. Once the monitor was linearized and RGB values were determined, it was possible to determine XYZ values, and from the XYZ values we determined the CIELAB values. CIELAB values were determined using the CIE 1931 color matching functions (the 2° standard observer). Once we knew the output CIELAB values we worked backward through the process. For example, we knew that we wanted samples to vary in steps of 0.3 ΔE*ab; knowing the output CIELAB values, we were able to specify exactly which RGB values to display so that the samples displayed by the monitor varied by 0.3 ΔE*ab. These target values were used to generate stimuli, but the actual measured values of the stimuli were used in the analysis.
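The forward and backward passes through the two-stage model can be sketched as follows. This is only an illustration under stated assumptions: the nonlinear stage is reduced to a single hypothetical gamma of 2.2, the matrix uses sRGB-like primaries as a placeholder rather than the measured values for the monitor in this study, and the display white is taken as the CIELAB reference white.

```python
GAMMA = 2.2                       # hypothetical; a real display needs measured curves
M = [[0.4124, 0.3576, 0.1805],    # placeholder scalar-to-XYZ matrix (sRGB-like primaries)
     [0.2126, 0.7152, 0.0722],
     [0.0193, 0.1192, 0.9505]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

M_INV = inverse3(M)
WHITE = matvec(M, [1.0, 1.0, 1.0])   # tristimulus values of the display white

def _f(t):
    d = 6 / 29
    return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

def _f_inv(f):
    d = 6 / 29
    return f ** 3 if f > d else 3 * d * d * (f - 4 / 29)

def counts_to_lab(counts):
    """Digital counts (0-255 per channel) to CIELAB."""
    scalars = [(c / 255) ** GAMMA for c in counts]        # stage 1: nonlinear
    xyz = matvec(M, scalars)                              # stage 2: linear matrix
    fx, fy, fz = (_f(v / w) for v, w in zip(xyz, WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def lab_to_counts(lab):
    """CIELAB back to the nearest digital counts (out-of-gamut scalars are clamped)."""
    L, a, b = lab
    fy = (L + 16) / 116
    fx, fz = fy + a / 500, fy - b / 200
    xyz = [_f_inv(f) * w for f, w in zip((fx, fy, fz), WHITE)]
    scalars = [min(max(s, 0.0), 1.0) for s in matvec(M_INV, xyz)]
    return tuple(round(255 * s ** (1 / GAMMA)) for s in scalars)

# Working backward: request a color 0.3 L* units lighter than a displayed one.
start = counts_to_lab((180, 90, 60))
target_counts = lab_to_counts((start[0] + 0.3, start[1], start[2]))
```

Because digital counts are quantized, the displayed color only approximates the target, which is one reason to analyze measured rather than target values.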

The calibration process used in this research is identical to that used by Berns et al3.

Experiment:

Visual assessments took place over a 2-week period. Thirty color-normal subjects, aged 19 to 49, took part in the experiment. The subject pool consisted of 5 females and 25 males with varying degrees of experience in judging color differences.

The experiment was conducted in a darkened room. Observers adapted for one minute to the background display before beginning the session. The adaptation screen consisted of a neutral gray background with L*=50. This same background was used in both the MCS and GSC experiments.

To ensure that results did not depend on which experiment was presented first, the order of presentation of the experiments was randomized.

Gray Scale Comparison

The gray scale comparison method used for visual assessment is similar to that used by Luo and Rigg7.

The patches in this experiment were 4 cm x 4 cm. The patches subtended a visual angle of approximately 4.4° when viewed at a normal viewing distance of about 52 cm.
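The stated visual angle follows from the patch size and viewing distance. The short check below uses the standard formula for the angle subtended by an object centered on the line of sight; the function name is illustrative.

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Angle subtended by an object of the given size, centered on the line of sight."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

patch_angle = visual_angle_deg(4.0, 52.0)   # roughly 4.4 degrees
```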

Table IX. The colorimetric values for the GSC display

Object       L*     a*   b*
Border       100    0    0
Background   50.0   0    0
Standard     40.0   0    0
Patch 1      40.5   0    0
Patch 2      41.0   0    0
Patch 3      41.5   0    0
Patch 4      42.0   0    0
Patch 5      42.5   0    0

Method of Constant Stimuli:

The method of constant stimuli was similar to that used by Berns et al. In this experiment samples were arranged vertically with a 1-pixel black line separating the samples in the standard pair and in the sample pair. The samples were arranged as shown in Fig. 2. The anchor pair was always presented on the left side of the display; however, the top and bottom position of the anchor pair was randomized for each trial. The trials for the three color centers were intermixed and presented in a different random order for each observer. The randomization of stimuli is intended to minimize sequential effects; it has been shown that observers are incapable of ignoring their previous responses. Observers free-viewed the samples at a normal viewing distance for a CRT display. A white border (close to D65) surrounded the display, defining the reference white. The anchor pair, consisting of two uniform patches, had a ΔE*ab of 1.0. The colorimetric values of the display are shown in Table X. Each observer was asked to judge which pair exhibited the larger color difference by comparing a sample pair with the standard pair5.
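The per-observer randomization described above might be set up along these lines. This is a sketch only: the trial dictionary layout, the use of integer pair identifiers and the seeding scheme are assumptions made for illustration, not the experiment's actual Matlab code.

```python
import random

def build_trials(pair_ids, observer_seed):
    """Return one observer's trial list: shuffled order, random anchor position per trial."""
    rng = random.Random(observer_seed)          # seeded so a session can be reproduced
    trials = [
        {"pair_id": pid, "anchor_position": rng.choice(("top", "bottom"))}
        for pid in pair_ids
    ]
    rng.shuffle(trials)                          # intermix all color centers and directions
    return trials

pair_ids = list(range(70))                       # the 70 sample pairs, by hypothetical index
session = build_trials(pair_ids, observer_seed=1)
```

Using a separate seed per observer gives each observer a different order while keeping every session reproducible.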

Patches in this experiment were also 4 cm x 4 cm and subtended a visual angle of approximately 4.4° when viewed at a normal viewing distance of 52 cm.

Table X. The colorimetric values for the MCS display

Object       L*     a*   b*
Border       100    0    0
Background   50.0   0    0
Standard 1   40.0   0    0
Standard 2   41.0   0    0

 

Table XI. Summary of experimental conditions for each experiment.

Psychophysical Method   Viewing Background   Gap        No. of Pairs   No. of observers   No. of assessments
Gray Scale              Gray                 Hairline   70             30                 2100
Constant Stimuli        Gray                 Hairline   70             30                 2100

Results

As mentioned earlier, a panel of 30 observers assessed each pair. For both the gray scale comparison and the method of constant stimuli each observer assessed each pair once.

Method of Constant Stimuli:

The method of constant stimuli data were analyzed using probit analysis, a univariate statistical method that locates a median threshold from binary-choice (in this case greater than/less than) visual data. Probit analysis employs the hypothesis that frequency-of-rejection data follow a cumulative normal distribution. The chi-squared test is used to test this hypothesis. Using probit analysis, tolerance thresholds corresponding to the pair difference were determined for each color center and color difference direction. The overall fit of the data was assessed by a chi-square test. The T50 (50% tolerance level) corresponds to the median color tolerance. The precision of the tolerances is evaluated by the associated fiducial limits. Conceptually, the 95% fiducial limits correspond to roughly two standard errors (1.96) on either side of the estimate.
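To make the procedure concrete, the sketch below fits a probit model by Fisher scoring and reads off T50. It is not the SAS procedure used in this study: the response counts are made up for illustration, and the interval is a delta-method approximation rather than true fiducial limits.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def fit_probit(x, n, r, iterations=100, tol=1e-10):
    """Fit P(x) = Phi(a + b*x) to r 'greater' responses out of n judgements per level."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        s0 = s1 = s2 = u0 = u1 = 0.0
        for xi, ni, ri in zip(x, n, r):
            eta = a + b * xi
            p = min(max(STD_NORMAL.cdf(eta), 1e-12), 1 - 1e-12)
            phi = STD_NORMAL.pdf(eta)
            w = ni * phi * phi / (p * (1 - p))        # Fisher information weight
            g = (ri - ni * p) * phi / (p * (1 - p))   # score contribution
            s0 += w
            s1 += w * xi
            s2 += w * xi * xi
            u0 += g
            u1 += g * xi
        det = s0 * s2 - s1 * s1
        da = (s2 * u0 - s1 * u1) / det                # solve the 2x2 scoring step
        db = (s0 * u1 - s1 * u0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    t50 = -a / b                                      # difference judged larger 50% of the time
    var_a, var_b, cov_ab = s2 / det, s0 / det, -s1 / det
    se_t50 = math.sqrt(var_a + 2 * t50 * cov_ab + t50 ** 2 * var_b) / abs(b)
    return a, b, t50, se_t50

# Hypothetical data: ten difference levels, 30 observers each, counts invented for the example.
levels = [0.3 * k for k in range(1, 11)]
observers = [30] * 10
greater = [2, 6, 13, 20, 25, 28, 30, 30, 30, 30]

a, b, t50, se_t50 = fit_probit(levels, observers, greater)
approx_limits = (t50 - 1.96 * se_t50, t50 + 1.96 * se_t50)
```

A goodness-of-fit chi-square would be computed separately by comparing observed and predicted counts at each level.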

Probit analysis is available through the statistical package SAS. SAS is available on the Rochester Institute of Technology's VAX computing system. The following figures illustrate the probit fit, the associated fiducial limits and the raw data.

Fig. 3. Probit fit for ΔL* for the red color center (probability vs. ΔL*).

Fig. 4. Probit fit for ΔC* for the red color center (probability vs. ΔC*).

Fig. 5. Probit fit for ΔH* for the red color center (probability vs. ΔH*).

Fig. 6. Probit fit for ΔL* for the green color center (probability vs. ΔL*).

Fig. 7. Probit fit for ΔC* for the green color center (probability vs. ΔC*).

Fig. 8. Probit fit for ΔH* for the green color center (probability vs. ΔH*).

Fig. 9. Probit fit for ΔL* for the gray color center (probability vs. ΔL*).

Grayscale Comparison Method:

For the experiment conducted using the gray scale comparison method, the raw data in grade (G) units were transformed to the visual difference for each pair, ΔV, using Eq. 1:

$$\Delta V = 0.1390 + 0.3748\,G + 0.0552\,G^{2} - 0.0052\,G^{3} \tag{1}$$

The coefficients in Eq. 1 were obtained by fitting a third-order polynomial equation between the ΔE and grade values. Using Eq. 1, ΔV values were calculated for each observer. The ΔV values were summed across all observers then divided by 30 (number of observers) to obtain an average ΔV value for each sample pair. The corresponding color direction values for each sample pair (L*, C*, H*) were plotted versus the average ΔV value obtained for that particular color difference pair. A linear regression line was fitted to the data and an r² value was calculated. The r² value was calculated to assess the fit of the linear regression to the data points. Threshold values were determined by finding the color difference ΔL*, ΔC* or ΔH* which corresponds to a visual difference of 1 ΔV.
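The analysis steps for the gray scale data can be sketched as follows. The polynomial coefficients are those of Eq. 1; the grades and ΔL* values are hypothetical and cover only three observers and five pairs to keep the example short.

```python
def grade_to_dv(g):
    """Eq. 1: convert a gray scale grade G to visual difference, delta V."""
    return 0.1390 + 0.3748 * g + 0.0552 * g ** 2 - 0.0052 * g ** 3

def mean(values):
    return sum(values) / len(values)

def linear_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r_squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data: delta L* of five sample pairs and the grades three observers gave them.
pair_dl = [0.3, 0.6, 0.9, 1.2, 1.5]
grades = [
    [0.5, 1.0, 2.0, 2.5, 3.0],
    [0.0, 1.5, 1.5, 2.0, 3.5],
    [1.0, 1.0, 2.0, 3.0, 3.0],
]

# Average delta V per pair across observers, then regress delta L* on delta V.
mean_dv = [mean([grade_to_dv(obs[i]) for obs in grades]) for i in range(len(pair_dl))]
slope, intercept, r_squared = linear_fit(mean_dv, pair_dl)
threshold = intercept + slope * 1.0               # delta L* at a visual difference of 1 delta V
```

The regression takes ΔV as the independent variable, matching the axes of the figures, so the threshold is read directly from the fitted line at ΔV = 1.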

Fig. 10. Linear regression fit of ΔL* for the red color center (ΔL* vs. ΔV).

Fig. 11. Linear regression fit of ΔC* for the red color center (ΔC* vs. ΔV).

Fig. 12. Linear regression fit of ΔH* for the red color center (ΔH* vs. ΔV).

Fig. 13. Linear regression fit of ΔL* for the green color center (ΔL* vs. ΔV).

Fig. 14. Linear regression fit of ΔC* for the green color center (ΔC* vs. ΔV).

Fig. 15. Linear regression fit of ΔH* for the green color center (ΔH* vs. ΔV).

Fig. 16. Linear regression fit of ΔL* for the gray color center (ΔL* vs. ΔV).

Method Comparison:

The results were used to investigate the differences between the method of constant stimuli and the gray scale comparison. The visual results from the psychophysical methods are presented in different forms: visual difference (ΔV_GS) for the gray scale comparison and visual probability (P) for the method of constant stimuli. (The visual probability is the percentage of observations in which the sample pair was judged to have a larger difference than the standard pair.) To compare the results from each experiment, the ΔV threshold values from the gray scale comparison and the corresponding color direction threshold values from the method of constant stimuli were plotted versus color direction. Also plotted are the normalized data points from the gray scale comparison. The data were normalized by adding to each GSC threshold the difference between the mean of the MCS thresholds and the mean of the GSC thresholds, so that the MCS and normalized GSC results have the same mean value.

Fig. 17. Method comparison (comparison of techniques, plotted by color direction).

Discussion

Method of Constant Stimuli

The data from the method of constant stimuli were analyzed using probit analysis. The results are plotted in Fig. 3 through Fig. 9. Each of the graphs shows the probit fit to the raw data, the raw data and the associated fiducial limits. Because the chi-square value was small (p>0.1000), the fiducial limits were calculated using a t value of 1.96. All of the raw data from the MCS experiment fit the probit model with 95% accuracy. The threshold was taken as the 50% probability point, the point where 50% of the observers judged the color difference greater than the standard and 50% judged it less than the standard. The experimental threshold values for each color center in each direction are given in Table XII.

Table XII. Threshold values from the MCS

Color Direction   Red     Green   Gray
ΔL*               1.00    1.227   1.325
ΔC*               4.090   5.787
ΔH*               2.405   3.331

Since the probit model fit the data with 95% accuracy the relative accuracy of the threshold values is also believed to be very good.

Gray Scale Comparison

The data from the gray scale comparison were analyzed by calculating ΔV values using Eq. 1 for each observer and calculating an average ΔV value for each color center in each direction across all 30 observers. A linear regression was fit to the data and the r² value was calculated. Low r² values show that the raw data do not follow a linear regression very well.

The experimental threshold values for each color center in each direction obtained from the gray scale experiment are given in Table XIII.

Table XIII. Threshold values from the GSC

Color Direction   Red     Green   Gray
ΔL*               0.015   0.310   0.752
ΔC*               1.115   2.034
ΔH*               1.015   0.189

Method Comparison:

The MCS was compared with the GSC by comparing the suprathreshold color tolerances determined from each experiment. The 50% probability point in the MCS was compared with the 1 ΔV value from the GSC. In addition, the GSC data were normalized by adding a constant offset so that the new data had the same average value as the data from the MCS experiment (Table XIII). No consistent trends were seen in the difference between the two sets of results.

Table XIII. Normalized threshold values from the GSC

Color Direction   Red     Green   Gray
ΔL*               1.977   2.272   2.714
ΔC*               3.077   3.996
ΔH*               2.977   2.151

An overall mean and standard deviation were calculated from the normalized GSC data and the MCS data. The difference between the two was calculated for each color center in each direction, and the mean was calculated by taking the average of these differences. This resulted in an overall difference between the two experiments of 1.14 ΔE*ab with a standard deviation of 0.38. The data are shown in Table XIV.

Table XIV. Mean and standard deviation between the experiments.

ΔL* Red   ΔC* Red   ΔH* Red   ΔL* Green   ΔC* Green   ΔH* Green   ΔL* Gray   Average   StdDev
0.98      1.01      0.57      1.04        1.79        1.18        1.39       1.14      0.38
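The normalization and the summary statistics can be re-derived from the threshold values in the MCS and GSC tables. The sketch below assumes that the tabulated differences are absolute differences and that the standard deviation is the sample (n-1) form; both assumptions reproduce the tabulated values to two decimals.

```python
from statistics import mean, stdev

# Thresholds from the MCS and GSC tables, keyed by (direction, color center).
mcs = {("L", "red"): 1.00, ("C", "red"): 4.090, ("H", "red"): 2.405,
       ("L", "green"): 1.227, ("C", "green"): 5.787, ("H", "green"): 3.331,
       ("L", "gray"): 1.325}
gsc = {("L", "red"): 0.015, ("C", "red"): 1.115, ("H", "red"): 1.015,
       ("L", "green"): 0.310, ("C", "green"): 2.034, ("H", "green"): 0.189,
       ("L", "gray"): 0.752}

# Constant offset that gives the GSC thresholds the same mean as the MCS thresholds.
offset = mean(mcs.values()) - mean(gsc.values())
gsc_normalized = {key: value + offset for key, value in gsc.items()}

# Absolute difference per color center and direction, then the overall summary.
differences = {key: abs(mcs[key] - gsc_normalized[key]) for key in mcs}
mean_difference = mean(differences.values())
sd_difference = stdev(differences.values())
```

The offset works out to about 1.962, which matches the gap between the raw and normalized GSC tables.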

The correlation coefficient between the MCS data and the normalized GSC data was calculated to be .792. A correlation coefficient this low suggests only limited agreement between the results of the two methods. The differences may in part explain the differing results obtained from other color laboratories. More data might have revealed the nature of the differences, but determining the cause of the differences between the two methods was outside the scope of this experiment.


Conclusion

The results support the research hypothesis that differences do exist between the MCS and the GSC methods for deriving suprathreshold color tolerances. There are not sufficient data to see any major trends that may explain the cause of these differences or how these differences could lead to differences in the derivation of color difference formulae. However, the research shows that the differences between the results of the two methods are due solely to differences between the two psychophysical techniques, since the same stimuli were presented in each experiment. These differences may in part explain the discrepancies found between laboratories employing the different techniques.

Based on the results the MCS was determined to be a better metric for deriving suprathreshold color tolerances. The determination of this conclusion is based on the following factors.

First, the method of constant stimuli was easier to implement in code. Matlab provides a relatively easy environment for implementing psychophysical experiments, due in part to the Psychophysics Toolbox, which can be downloaded free of charge from the internet. The amount of programming for this experiment was less than that for the GSC method: the MCS required 300 lines of code while the GSC required 400 lines. Ease of programming allows for a faster cycle time from start to finish. The actual code can be viewed in Appendix A of this report.

Second, according to the subjects participating in the experiment, the MCS was the easier of the two experiments to perform. The main reasons given were the overall time to complete the experiment and the relative ease of the decisions involved. The average time to complete the MCS portion of the experiment was 7-10 minutes. The average time to complete the GSC portion of the experiment was 10-15 minutes.

Last and most importantly, the results obtained from the MCS were the more precise. Probit analysis provides 95% confidence intervals for each of the color centers in each direction. The results from the GSC method were far less accurate. Fitting a linear regression to the GSC data gave average r² values of less than .70.

Given the ease of implementation, precision and reliability of the results the MCS method is the clear choice for deriving suprathreshold color tolerances.

