Task Dependency of Eye Fixations & Development of a Portable Eye Tracker

Jeffrey M. Cunningham


Background

Vision provides us with more information about our surroundings than any other sense. Our eyes are capable of both fine resolution in the center of the retina (the fovea) and high sensitivity, but poor resolution, in the periphery. This is hardly noticed, however, since a variety of eye movements move the fovea as necessary, very quickly and with great accuracy. Below is a bit of terminology - a list of eye movements, their characteristics and functions. Carpenter (1) provides a detailed analysis of these movements:

  1. Drifts - slow movements of the eye away from the center of fixation that occur while fixating (looking at a single point); they are small in absolute terms, but relatively large and slow compared to tremors.
  2. Tremors - small movements that also occur during fixation, but are much quicker, smaller, and of higher frequency than drifts.
  3. Vestibular Reflex - movements made to keep an image relatively still on the retina while the head or body is in motion.
  4. Optokinesis - movements similar to vestibular movements, but driven by a large part of the visual field (generally, the background) moving relative to the head.
  5. Smooth Pursuit - similar to optokinesis, except that instead of the eye compensating for a moving background, it tracks an object moving at a modest speed.
  6. Vergence - the convergence or divergence of the eyes as an object moves toward or away from the head.
  7. Saccades - large and very quick movements of the eyes, generally used to move the fovea to an area of interest with minimal interruption of visual perception.

Fixations in a scene were the focus of these experiments. Saccades were the movements of most concern, since they are responsible for moving from one point of fixation to the next and are readily apparent with eye tracking equipment.

The combination of saccades and fixations allows objects of interest to be foveated quickly, compared to the slow head movements that would otherwise be required to move the fovea. This is not to say head movements are not incorporated where large changes in fixation are required. Alfred Yarbus (2) pointed out that in natural conditions the amplitude of eye movements usually does not exceed 20 degrees, and Lancaster (3) found that about 99% of eye movements are saccades of less than 15 degrees in amplitude. Because of the high speed of saccades, these eye movements obscure the visual scene (due to the blurring associated with the movement) only about 5% of the time.

The connection between fixations and attention has long been assumed. There is also the question of how the saccadic landing point is determined. A recent study confirms that people are not able to attend to one location while saccading, or preparing to saccade, to a different location (4). Thus, it is safe to conclude that immediately after a saccade, the subject's attention is at the landing position of the saccade. However, since it is also possible to fixate in one location and shift one's attention to another, the time frame before such a shift in attention occurs is unknown. For these experiments, it was assumed that the subjects generally did not shift their attention away from the point of fixation during fixations, since there was no need to - they were free to move their heads and eyes.

While investigating the use of eye movements in perception, Alfred Yarbus had seven subjects look at a painting (Repin's "An Unexpected Visitor") after being given one of the following instructions: (1) estimate the wealth of the family in the picture; (2) give the ages of the people; (3) surmise what the family had been doing before the arrival of the "unexpected visitor"; (4) remember the clothes worn by the people; (5) remember the positions of the people and objects in the room; (6) estimate how long the "unexpected visitor" had been away from the family; or (7) simply look at the painting with no further instructions (free viewing). His results show that the patterns of eye movements change with the task at hand. For example, while estimating how long the visitor had been away, viewing time was devoted almost exclusively to the faces of the people in the picture (Yarbus, 1967. 2).

Figure 1. "Seven records of eye movements by one subject. 1) Free examination. Before the subsequent recordings, the subject was asked to 2) estimate the wealth of the family; 3) give the ages of the people; 4) surmise what the family had been doing before the arrival of the 'unexpected visitor'; 5) memorize the clothes worn by the people; 6) memorize the location of the people and objects in the room; and 7) estimate how long the 'unexpected visitor' had been away." (2).

 

Goals

This research project had two major goals. The first was modeled after Yarbus' work with "An Unexpected Visitor." Yarbus designed a number of suction cups with small mirrors attached that were stuck to the eye for short periods of time (less than 15 minutes). The mirror directed a beam of light onto a piece of photosensitive media, and in this way the eye movements were recorded. During the recording, the subject's eyelids had to be taped back, the head clamped in place, and a bright light source pointed toward the eye. Yarbus is widely quoted, and it was of interest to verify his results with a modern video tracker. The eye tracker used in these experiments was a system of two small CCD cameras connected to a personal computer and video processing hardware. This set-up was far more comfortable than Yarbus' and provided a more natural viewing situation for the subjects. It is uncertain how many subjects Yarbus used for this experiment - he presents the free-viewing records of seven subjects, but only one subject's records for the remaining tasks. It seems the subjects also viewed the same painting seven times, so by the time they were asked to memorize the locations of the people and objects in the painting, they had already viewed it for some 15 minutes. For this project, nine subjects and three images were used, such that each subject saw each image only once. It is hoped that these results are therefore less biased by the experimental technique.

The second objective was to construct an eye tracking video system more portable than the current system used in the Visual Perception Laboratory, Applied Science Laboratories' model 501 video eye tracker. The design of ASL's system is discussed in the following section. More "portable" implies a lighter and smaller system. As the ASL system can cause discomfort and/or headaches from its headband, it was also hoped to make the new eye tracker more comfortable. The ultimate goal was a system with which someone's eye movements could be tracked while walking, driving, or performing similar activities.

Methods

Resources

Center for Imaging Science's Visual Perception Laboratory at R.I.T.


Eye Tracker (ASL 5000)

As mentioned above, the video eye tracker, ASL's model 5000 (see Figure 2 for a diagram of the optics), consists of two cameras attached to an adjustable headband that secures easily to the subject's head.

Figure 2 -- Diagram of head mounted optics (5).

 

One camera is monochromatic and is directed, via a mirror and a beamsplitter, at the left eye, which is coaxially illuminated with near-infrared radiation. The retina reflects a good portion of this radiation, as does the first surface of the cornea. These two reflections are analyzed by the ASL hardware (with ASL's accompanying software) to determine the direction of gaze. The analysis involves identifying the corneal and retinal reflections via a thresholding operation, representing each reflection as a circle, finding the center of each circle, and calculating the vector from one center to the other.

Because the relationship between the two reflections (the vector from one to the other) is used in the calculation of point-of-gaze, rather than a single reflection represented in an x-y plane as in older systems, this system is less sensitive to movement of the optics relative to the eye. When the optics move in relation to the head, the vector between the two reflections changes very little; when the eye moves in relation to the head, the vector changes greatly. This allows the eye tracker to be used without prohibiting head movements, which opens the possibility of obtaining accurate results in situations such as walking and driving. Earlier systems required a table-mounted device and a bite bar for the subject.

Calibration to each subject is required, but is a simple task. The subject is told to look at nine calibration points in the scene, and the software uses the eye positions at these points to fit a polynomial for interpolating gaze position between the calibration points.

The second camera is directed at the scene, either by use of the beamsplitting visor (coaxial) or by simply pointing the camera forward, in the same direction as the eyes (direct). The latter method introduces parallax error, but produces a much better image of the scene and is easier to set up; this is how the scene camera was directed for this study. After the computation by the ASL hardware, a crosshair corresponding to point-of-gaze is overlaid onto the scene camera image.

The system samples at the standard video field frequency of 60 Hz; however, two samples were averaged for each reading, reducing the effective sampling rate to 30 Hz. This video signal was sent to a Hi-8mm VCR, where it was analyzed frame by frame. The tape analysis is an easy (albeit tedious) method of cataloguing the subjects' fixations on particular objects, from which time and frequency information can be collected. An existing C program was modified to ease the collection of this data.
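
To make the vector method concrete, below is a minimal sketch in C of the point-of-gaze calculation described above. The reflection centers are taken as inputs (the real system finds them by thresholding the eye image), and the second-order calibration polynomial and its coefficients are hypothetical stand-ins for the fit produced by the nine-point calibration; the actual ASL processing is proprietary and surely differs in detail.

    #include <stdio.h>

    /* A minimal sketch of the pupil/corneal-reflection vector method
     * described above. Reflection centers are inputs here; the
     * polynomial coefficients are illustrative stand-ins for the fit
     * produced by the nine-point calibration. */

    typedef struct { double x, y; } Point;

    /* Vector from the corneal reflection to the retinal (pupil)
     * reflection. A uniform shift of both reflections - the headgear
     * slipping on the head - leaves this difference nearly unchanged,
     * while a rotation of the eye changes it strongly. */
    static Point gaze_vector(Point pupil, Point cr)
    {
        Point v;
        v.x = pupil.x - cr.x;
        v.y = pupil.y - cr.y;
        return v;
    }

    /* Map the vector into scene-image coordinates with a second-order
     * polynomial, one set of six coefficients per axis. */
    static double poly2(const double c[6], Point v)
    {
        return c[0] + c[1] * v.x + c[2] * v.y + c[3] * v.x * v.y
             + c[4] * v.x * v.x + c[5] * v.y * v.y;
    }

    int main(void)
    {
        /* illustrative coefficients only - a real fit comes from the
         * nine calibration points */
        const double cx[6] = { 320.0, 14.2, -0.8, 0.02, 0.11, -0.05 };
        const double cy[6] = { 240.0,  0.6, 13.7, -0.01, -0.04, 0.09 };

        Point pupil = { 97.0, 64.0 };
        Point cr    = { 92.5, 61.0 };
        Point v     = gaze_vector(pupil, cr);

        printf("point of gaze: (%.1f, %.1f)\n", poly2(cx, v), poly2(cy, v));
        return 0;
    }

Note that nine calibration points overdetermine the six coefficients per axis, which is one reason the calibration is a quick procedure; the crosshair overlay is simply this (x, y) drawn onto the scene video.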

At 30 Hz, the system only records fixations lasting at least 33 ms. This should not be a problem, since the minimum fixation period has been found to be between 120 and 350 ms (Carpenter, 1988. 1). This relatively low sampling frequency does not record many of the other eye movements, such as tremors. This is acceptable, since the focus of the data collection was on fixation points, also taken to be the points of attention.

In summary, Figure 3 diagrams the components of the ASL 5000 system:

Figure 3 - Diagram of components (5).

Part 1 -- Task Dependency

The first experiment was based on Yarbus' work involving a reproduction of Repin's "An Unexpected Visitor." To reduce the artifacts of Yarbus' methods, the present project used a modern, less obtrusive eye tracker (as described above), nine subjects, three images (digital prints of photographs), and three tasks. Each subject viewed all three images, but viewed each image only once (one task per image). For some subjects a task was repeated - all subjects viewed all three images, but not all subjects performed all three tasks.

The average age of the subjects was approximately 20 years. The main source of subjects was a freshman imaging science class, with five men and four women. Their eyesight was normal or corrected to normal. The eye tracker worked well for subjects wearing contact lenses, but eye tracking was not attempted on subjects wearing glasses. Each subject was calibrated and then presented with the three images, one at a time, as described above. After the task was given, the image was presented, and eye tracking data collection began at that point.

Yarbus had seven tasks for each subject, but this experiment used only three. This allowed the results of more than one subject to be averaged for any given task-image pair while maintaining a reasonable number of subjects. The tasks were as follows (they are meant to be similar to tasks Yarbus used):

  1. Free viewing - no set objective. The subject looked at whatever they preferred and was instructed to let the experimenter know when they wanted to move on to the next image.
  2. Memorization - the subject was asked to memorize the image and was given one minute to look at it. After the minute was up, the subject had to sketch what they remembered of the image.
  3. Ages - the subject was asked to give the ages of the people in the image (the subject responded while viewing).

Yarbus used one image for all of his tasks; this experiment made use of three. The images were digital, photo-quality prints, approximately 11" by 17". Each image showed three or four people interacting, with a simple activity taking place. The goal was to have similar themes and situations across the images, while preventing the subject from using information from a previous image to help with the present task.

The gaze paths across each picture were easily obtained via the computer and video set-up, and these records could be analyzed to determine whether the task at hand changed fixation patterns. For example, the percent of time a subject fixated on faces can be obtained. These percentages were averaged, and an ANOVA was performed to test for statistical significance.

Part 2 -- Portable Eye Tracking

The second part of this project was the construction of a portable eye tracker. The existing system (ASL's 501, as described earlier) is designed to be portable. The controller box, which performs the video processing, operates on 12 V and also provides power to the two camera control boxes; these three pieces of electronics fit into a backpack. The headband allows free movement of the head, but most people find it uncomfortable after 30 to 60 minutes of use. To reduce weight and size, smaller cameras with on-board controllers replaced the existing cameras and controller boxes. These were mounted on a baseball cap, which is more comfortable than the rigid plastic headband. The large IR-reflecting visor was replaced with a smaller monocle reflector. Combined with a camcorder and two camcorder batteries, this system is smaller, lighter, and more comfortable than the ASL system. Note that the same ASL eye tracking controller was used to perform the eye tracking and the gaze position video overlay.

Results

Part 1 -- Task Dependency

VARIABLES: Task and image were varied. The tasks were the three described in the Methods section: free viewing, memorization, and ages.

The images were:

  1. Doc - A staged photograph of a typical doctor's office scene (Figure 4).
  2. Shoe - A photograph of two women and a small boy (Figure 5).
  3. Ropes - A photograph of a ropes course at a summer camp (Figure 6).

Figure 4 - "Doc" image.

Figure 5 - "Shoe" image.

Figure 6 - "Ropes" image.


SUBJECT INFORMATION: The subjects were primarily drawn from Jeff Pelz's SIMG-203 freshman class. The average age was 19, and there were five men and five women in the study. Table 1 lists information about the subjects used to collect the task dependency data, including age, sex, and whether they were wearing contact lenses during the experiment. The last two columns of Table 1 contain the image-task pairings, listed in the order in which they were performed; this order was randomized. A total of ten subjects participated, but only eight are listed below: Subject 1 was used to collect preliminary data, and the VCR stopped taping while working with Subject 7, so no task dependency data was collected for that subject.

Table 1. Subject information.

Subject #  Age  Sex  Contact Lenses?  Images (in order)         Tasks (in order)
2          22   M    Yes              Doc, Shoe, Ropes          Ages, Mem, Free
3          18   M    Yes              Doc, Ropes, Shoe          Free, Mem, Ages
4          ?    M    Yes              Ropes, Doc, Shoe          Mem, Free, Ages
5          18   M    No               Shoe, Ropes, Doc          Mem, Free, Free
6          19   F    Yes              Shoe, (no good), Ropes    Free, (no good), Ages
8          20   F    Yes              Shoe, Doc, Ropes          Ages, Mem, Mem
9          19   F    Yes              Doc, Shoe, Ropes          Free, Free, Mem
10         19   F    Yes              Ropes, Doc, Shoe          Ages, Mem, Mem

DATA: The raw data was a video stream from the controller box. The video was captured onto video tape and analyzed after the subject had left. In the analysis, each of the three images was divided into 22 to 25 segments. These divisions were dictated by significant objects and regions within the image; for example, each person's head was a separate region, and in the Doc image other segments included the painting hanging in the background and the handshake. With this segmentation, fixation durations for the segments could be calculated. An example of the output from the C program used to aid in the tape analysis follows:

19, Empty Chair, 1431433, 00:23:51:13
23, Floor, 1431433, 00:23:51:13
17, Phone, 1431433, 00:23:51:13
5, B's Upper Body, 1431599, 00:23:51:18
1, A's Head, 1431699, 00:23:51:21
5, B's Upper Body, 1432300, 00:23:52:09
24, Misc, 1432566, 00:23:52:17
7, C's Head, 1432699, 00:23:52:21
14, Painting, 1433066, 00:23:53:02
7, C's Head, 1433433, 00:23:53:13
24, Misc, 1433566, 00:23:53:17
10, D's Head, 1433733, 00:23:53:22
22, Blank Space - Upper Right, 1434633, 00:23:54:19
12, D's Lower Body, 1434833, 00:23:54:25
10, D's Head, 1434933, 00:23:54:28
12, D's Lower Body, 1435233, 00:23:55:07
11, D's Upper Body, 1435433, 00:23:55:13
12, D's Lower Body, 1436066, 00:23:56:02
11, D's Upper Body, 1436666, 00:23:56:20
10, D's Head, 1437066, 00:23:57:02
11, D's Upper Body, 1438133, 00:23:58:04
...

The above sample is from a subject performing the memorization task with the Doc image. The first number refers to a segment of the image (in this case, 19 refers to the empty chair in the scene), followed by the segment's label. The second number is the internal time code from the Hi-8mm VCR, and the last field is the time code expressed in hours, minutes, seconds, and frame number (30 frames per second).
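
For illustration, the duration bookkeeping can be sketched in C directly from this record format. This is a sketch, not the actual modified tape-analysis program; judging from the sample above, the internal time code appears to advance in milliseconds (the 166-count step from 1431433 to 1431599 spans five frames at 30 frames per second).

    #include <stdio.h>
    #include <string.h>

    /* Minimal sketch: sum fixation durations per image segment from
     * records of the form
     *     segment#, label, internal-timecode, hh:mm:ss:ff
     * An entry lasts until the next entry begins, so its duration is
     * the next internal time code minus its own (the final entry has
     * no closing time and is dropped). */

    #define MAX_SEGMENTS 26   /* the images used 22 to 25 segments */

    int main(void)
    {
        long total[MAX_SEGMENTS] = { 0 };
        char label[MAX_SEGMENTS][64] = { { 0 } };
        char name[64];
        int  seg, prev_seg = -1;
        long tc, prev_tc = 0;

        /* read records such as "19, Empty Chair, 1431433, 00:23:51:13" */
        while (scanf(" %d, %63[^,], %ld, %*s", &seg, name, &tc) == 3) {
            if (prev_seg >= 0)
                total[prev_seg] += tc - prev_tc;  /* close previous fixation */
            if (seg < 0 || seg >= MAX_SEGMENTS)
                continue;
            strncpy(label[seg], name, 63);
            prev_seg = seg;
            prev_tc  = tc;
        }

        for (seg = 0; seg < MAX_SEGMENTS; seg++)
            if (total[seg] > 0)
                printf("%2d  %-26s %8ld\n", seg, label[seg], total[seg]);
        return 0;
    }

Fed the log on standard input, this prints per-segment totals in time-code counts; dividing by the grand total gives the fixation percentages that appear in the tables that follow.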

This data was then read into an Excel(r) spreadsheet. The spreadsheet calculated the duration of each entry (the time until the next entry began) and summed the durations to find a total fixation time for each of the 22 to 25 image segments. The following table (Table 2) was created from the same data the above example was drawn from:

Table 2. Excel spreadsheet - summation of fixation times.


Three of these tables were created for each subject (except Subject 6, for whom one trial was unusable), as each subject saw three image-task pairs. Nine tables (Tables 3 - 11) give the averages for the nine image-task pairs (the red arrows highlight entries with fixation percentages greater than or equal to 10%):

The tables show that the distribution of fixation times varies with the task at hand. The most obvious difference comes with the "Ages" task. Here, as expected, most subjects spent a great deal of time fixating on heads and faces (collectively segmented into the entries listed in the tables above as A's, B's, C's, and D's heads). For example, in Table 11, the subjects on average spent most of their viewing time on five of the events, or image segments; four of those five were the heads of the people in the image. ("A's head" refers to the head of the leftmost person, and the B through D labels follow from left to right.)

Before attributing this variation to task dependency, the statistical significance of the factors was evaluated using ANOVA. A general linear model was used, and F*-tests were performed to determine the significance of the effect the task had on the response; tests were also performed for image and image-task interaction effects. These tests require a single measure for the "response." In this study, three measures were used, and three sets of ANOVA tests were performed. In the first ANOVA test, the measure was the percent of the total viewing time dedicated to fixating on heads. The second and third tests used fixations on persons (head plus upper and lower body) and on inanimate objects (painting, desk, ropes, car, etc.) as the measures.
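
As an illustration of how the three measures relate, the sketch below (in C, with made-up totals and an assumed segment grouping) collapses one trial's per-segment fixation totals into the three response values. Each trial contributes one value per measure; the Total row (DF = 21) in the output below implies that 22 trials entered each analysis.

    #include <stdio.h>

    /* Sketch: collapse one trial's per-segment fixation totals into
     * the three ANOVA response measures. The grouping of segments into
     * heads, bodies, and inanimate objects is illustrative; in the
     * study it came from the hand segmentation of each image. */

    enum Group { HEAD, BODY, OBJECT, OTHER };

    int main(void)
    {
        long       t[] = { 5200, 3100,  900, 2400,   700, 1800 }; /* made-up */
        enum Group g[] = { HEAD, HEAD, BODY, OBJECT, OTHER, OBJECT };
        int n = sizeof t / sizeof t[0];

        long total = 0, heads = 0, persons = 0, objects = 0;
        for (int i = 0; i < n; i++) {
            total += t[i];
            if (g[i] == HEAD)                 heads   += t[i];
            if (g[i] == HEAD || g[i] == BODY) persons += t[i];
            if (g[i] == OBJECT)               objects += t[i];
        }

        printf("heads %.3f  persons %.3f  objects %.3f\n",
               (double)heads / total, (double)persons / total,
               (double)objects / total);
        return 0;
    }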

The ANOVA testing was done in MiniTab(r), a statistical software package. The data was prepared in Excel(r) and copied into the MiniTab(r) spreadsheet. The program performs the F*-tests; an effect with a p-value at or below 0.05 was considered too significant to ignore. The textual output was captured and is presented below.

Where the percent of fixations on heads is the measure:

General Linear Model

Factor      Levels  Values
image            3  1 2 3
task             3  1 2 3

Analysis of Variance for Y

Source       DF   Seq SS   Adj SS   Adj MS      F      P
image         2  0.08555  0.03200  0.01600   2.30  0.139
task          2  0.77753  0.67211  0.33605  48.34  0.000
image*task    4  0.03643  0.03643  0.00911   1.31  0.317
Error        13  0.09037  0.09037  0.00695
Total        21  0.98989
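
As a check on reading this output: each F* value is the adjusted mean square of the effect divided by the adjusted mean square of the error. For the task effect above, 0.33605 / 0.00695 ≈ 48.35, matching the F column to within the rounding of the displayed mean squares.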

Where the percent of fixations on persons is the measure:

General Linear Model

Factor      Levels  Values
image            3  1 2 3
task             3  1 2 3

Analysis of Variance for Y

Source       DF    Seq SS    Adj SS    Adj MS      F      P
image         2  0.039436  0.039742  0.019871   4.16  0.040
task          2  0.632528  0.532564  0.266282  55.70  0.000
image*task    4  0.063318  0.063318  0.015829   3.31  0.045
Error        13  0.062151  0.062151  0.004781
Total        21  0.797432

Where the percent of fixations on inanimate objects is the measure:

General Linear Model

Factor      Levels  Values
image            3  1 2 3
task             3  1 2 3

Analysis of Variance for Y

Source       DF    Seq SS    Adj SS    Adj MS      F      P
image         2  0.112498  0.129396  0.064698  10.79  0.002
task          2  0.367235  0.305425  0.152713  25.48  0.000
image*task    4  0.051320  0.051320  0.012830   2.14  0.134
Error        13  0.077925  0.077925  0.005994
Total        21  0.608977

Table 12. ANOVA summary.

           Task F*  Image F*  Task*Image F*  Task p  Image p  Task*Image p
Heads        48.34      2.30           1.31   0.000    0.139         0.317
Persons      55.70      4.16           3.31   0.000    0.040         0.045
Objects      25.48     10.79           2.14   0.000    0.002         0.134

Part 2 -- Portable Eye Tracking

An eye tracker was set up on a baseball cap, as described in the Methods section; this section provides photographs and line drawings of it. Battery life was a concern. Two lithium-ion rechargeable camcorder batteries were wired in series to power the controller box (which operates on 12 V, as noted earlier) and, through it, the cameras and IR illuminator. The camcorder was powered by its own battery. In one trial, the pair of batteries powered the controller, cameras, and illuminator for just over three hours. Further trials were not performed, as the battery life was unlikely to vary by more than an hour and the requirement was only two hours.

Components:

Figure 7. The components of the portable, baseball-cap eye tracker.

 

Figure 8. Frontal view of the optics mounted on the baseball cap.

Figure 9. Side view of the optics mounted on the baseball cap.

 

Figure 10. Top view of the optics mounted on the baseball cap.

Figure 11. Bottom view of the optics mounted on the baseball cap.

Figure 12. Sketch showing orientation of eye camera in relation to the eye.

Table 13. Pin assignments for control box cable.

Pin Assignment
7 IR power (+5.5 V).
11 Scene camera signal.
12 Eye camera ground.
13 Eye camera signal.
19 IR power ground.
23 Scene camera ground.
24 Scene camera power.
25 Eye camera power.

Conclusions

The data collected in this study regarding the task dependency of eye fixations agrees with the results found by Alfred Yarbus (1967, 2). As seen in Figure 1, Yarbus recorded the path of eye movements; with his seven tasks, he demonstrated that the traced paths certainly looked different for each task. Figure 1 shows only one subject, and is the only diagram of this sort in his book. It shows a qualitative difference, but does not lead to quantitative data. In this study, the segmentation of the images was qualitative, but all data beyond that point was quantitative. The amount of time each subject spent looking at the image segments was tallied and converted into percentages, and an ANOVA was performed using groups of image segments for the response variable. For all measures (percent of time fixating on heads, persons, and objects), task had a significant effect (p < 0.0005). The p-value is the probability of observing an effect at least this large if the factor actually has no effect.

Unfortunately, it is not as easy to dismiss the effect that the specific image had on the subjects' responses. Ideally, the images would differ but the subjects would respond to them in a similar manner. When the percent of time fixating on heads was used as the measure of the response, there was no significant relationship between image and eye fixations. There may be a small relationship when fixations on persons are used as the measure, and image had a significant effect when object fixations were the measure. This last finding is not too surprising, considering that the surroundings and objects are what changed most between the images. For example, the Doc image contains many clearly distinguishable objects, while in the Ropes image the objects are smaller and more difficult to pick out from their surroundings. A better choice of images would probably eliminate this partial image dependency. This is not to say that eye movements are not expected to vary with image, because they will; but when the images are somewhat similar, as they were in this case, the image dependency should be minimal.

While the nature of the data and the means of collecting it differ from Yarbus' study 30-plus years ago, the interpretation remains the same. As investigations in visual perception, Yarbus' work and this study show that the method of gathering visual information is not solely determined by the physiology of the eye. Near the center of the retina there is a high concentration of photoreceptors, in an area called the fovea. This is what makes eye movements necessary: the eye must be moved so that an image of the area of interest falls on the foveal region. However, visual search patterns are governed on both physiological and cognitive levels, as demonstrated by the task dependency shown in Yarbus' work and this one. The subjects' ability to vary eye movement patterns is a cognitive function, presumably making the gathering of visual information more efficient. An inefficient gathering method would be to visually sample the image so as to foveate every part of it (perhaps scanning in raster lines, as a TV image is drawn) and discard the irrelevant information after sampling. None of the subjects did anything of the sort; instead, fixations were concentrated on the parts of the image most likely to help complete the task. In Yarbus' case, to complete the task "memorize the clothes worn by the people in the image," the subject (whose eye movements are displayed in Figure 1) spent nearly all of his or her time looking at the people; in fact, one can tell where the people in the painting are just from the eye movement pattern. This is an efficient way to complete the task, since the relevant information is located only in the portions of the image occupied by people. In this study, when the subjects were asked to give the ages of the people in the image, nearly all viewing time was spent fixating on the people's faces - also an efficient visual search, as most cues for age are found in the face. Such results demonstrate an important cognitive role in visual search.

While the results of the task dependency part of this project were satisfactory, the results of improving the portability of the eye tracker are incomplete. The baseball cap version of the eye tracker, as depicted in the Results section, was indeed lighter, smaller, and more comfortable than the existing eye tracker. The system was capable of tracking an eye, and the batteries provided ample running time. And, except for the head-mounted optics, the hardware fits into a small backpack or hip pack. These were all goals of this project. Unfortunately, there was no time left at the end of the project for pilot and comparison testing with this eye tracker. While it looks like a feasible eye tracking system, and one worth continuing, the work on this part of the project is not finished. Specifically, two issues remain to be addressed:

  1. Pilot testing should be performed, including a comparison to the existing tracker in terms of comfort over long durations, ease of use, and ability to maintain an eye image (for proper tracking).
  2. Tests should also be performed under various lighting conditions, from very dark to very bright. One problem encountered with the existing eye tracker was that when light levels were low, the pupil opened enough that the retinal reflection became as bright as the corneal reflection, and the corneal reflection was obscured.
