Task Dependency of Eye Fixations & Development of a Portable Eye Tracker
Jeffrey M. Cunningham
Vision provides more information about our surroundings than any other sense. Our eyes offer fine resolution at the center of the retina (the fovea) and high sensitivity, but poor resolution, in the periphery. This is hardly noticed, however, because a variety of eye movements can move the fovea where it is needed, quickly and with great accuracy. A list of eye movements, with their characteristics and functions, follows; Carpenter (1) provides a detailed analysis of these movements:
Fixations within a scene were the focus of these experiments. Saccades were the movements of most concern, since they move the eye from one point of fixation to the next and are readily apparent in eye tracking records.
The combination of saccades and fixations allows objects of interest to be foveated quickly, compared to the slow head movements that would otherwise be required to move the fovea. This is not to say that head movements are not used where large changes in fixation are required. Alfred Yarbus (2) pointed out that under natural conditions the amplitude of eye movements usually does not exceed 20 degrees, and Lancaster (3) found that about 99% of eye movements are composed of saccades of less than 15 degrees in amplitude. Because saccades are so fast, these movements obscure the visual scene (due to the blurring associated with the movement) only about 5% of the time.
The connection between fixations and attention has long been assumed. There is also the question of how the saccadic landing point is determined. A recent study confirms that people are unable to attend to one location while making or preparing a saccade to a different location (4). It is therefore safe to conclude that immediately after a saccade, the subject's attention is at the saccade's landing position. However, since it is also possible to fixate one location while shifting attention to another, how long attention remains at the landing point is unknown. For these experiments, it was assumed that the subjects generally did not shift their attention away from the point of fixation during fixations, since there was no need to: they were free to move their head and eyes.
While investigating the use of eye movements in perception, Alfred Yarbus had seven subjects look at a painting (Repin's "An Unexpected Visitor") after being given one of the following instructions: (1) estimate the wealth of the family in the picture; (2) give the ages of the people; (3) surmise what the family had been doing before the arrival of the "unexpected visitor"; (4) remember the clothes worn by the people; (5) remember the position of the people and objects in the room; (6) estimate how long the "unexpected visitor" had been away from the family; or (7) simply look at the painting with no further instructions (free viewing). His results show that the pattern of eye movements changes with the task at hand. For example, while estimating how long the visitor had been away, viewing time was devoted almost exclusively to the faces of the people in the picture (2).

Figure 1. "Seven records of eye movements by one subject. 1) Free examination. Before the subsequent recordings, the subject was asked to 2) estimate the wealth of the family; 3) give the ages of the people; 4) surmise what the family had been doing before the arrival of the 'unexpected visitor'; 5) memorize the clothes worn by the people; 6) memorize the position of the people and objects in the room; and 7) estimate how long the 'unexpected visitor' had been away." (2).
This research project had two major goals. The first was modeled after Yarbus' work with "An Unexpected Visitor." Yarbus designed a number of suction cups with small mirrors attached that were stuck to the eye for short periods of time (less than 15 minutes). The mirror directed a beam of light onto photosensitive media, recording the eye movements. During recording, the subject's eyelids had to be taped back, the head clamped in place, and a bright light source pointed at the eye. Yarbus is widely quoted, and it was of interest to verify his results with a modern video tracker. The eye tracker used in these experiments was a system of two small CCD cameras connected to a personal computer and video processing hardware. This set-up was far more comfortable than Yarbus' apparatus and provided a more natural viewing situation for the subjects. It is uncertain how many subjects Yarbus used for this experiment: he presents the free-viewing records of seven subjects, but records from only one subject for the remaining tasks. It also appears each subject viewed the same painting seven times, so by the time they were asked to memorize the locations of the people and objects in the painting, they had already viewed it for 15 minutes. For this project, nine subjects and three images were used, such that each subject saw each image only once. It is hoped that these results are less biased by experimental technique.
The second objective was to construct an eye tracking video system that is more portable than the system currently used in the Visual Perception Laboratory: Applied Science Laboratories' model 501 video eye tracker. The design of ASL's system is discussed in the following section. "More portable" implies a lighter and smaller system. Because the ASL system's headband can cause discomfort and headaches, it was also hoped to make the new eye tracker more comfortable. The ultimate goal was a system that could track someone's eye movements while walking, driving, or performing similar activities.
Center for Imaging Science's Visual Perception Laboratory at R.I.T.
Eye Tracker (ASL 5000)
As mentioned above, ASL's model 5000 video eye tracker (see Figure 2 for a diagram of the optics) consists of two cameras attached to an adjustable headband that secures easily to the subject's head.

Figure 2 -- Diagram of head mounted optics (5).
One camera is monochrome and is directed, via a mirror and a beamsplitter, at the left eye, which is coaxially illuminated with near-infrared radiation. The retina reflects a good portion of this radiation, as does the first surface of the cornea. These two reflections are analyzed by the hardware to determine the direction of gaze; this was the eye tracking system used, with ASL's accompanying software and hardware. The analysis involves identifying the corneal and retinal reflections via a thresholding operation, representing each reflection as a circle, finding the center of each circle, and calculating the vector from one center to the other. Because the relationship between the two reflections (the vector from one to the other) is used to calculate point-of-gaze, rather than a single reflection represented in an x-y plane as in older systems, this system is less sensitive to movement of the optics relative to the eye. When the optics move relative to the head, the vector between the two reflections changes very little, but when the eye moves relative to the head, the vector changes greatly. This allows the eye tracker to be used without prohibiting head movements, which opens the possibility of obtaining accurate results in situations such as walking and driving. Earlier systems required a table-mounted device and a bite bar for the subject.

Calibration to each subject is required, but it is a simple task. The subject is told to look at nine calibration points in the scene, and the software uses the eye positions at those points to fit a polynomial for interpolating gaze position between the calibration points.

The second camera is directed at the scene, either by use of the beamsplitting visor (coaxial) or by simply pointing the camera forward, in the same direction as the eyes (direct). The latter method introduces parallax error, but it produces a much better image of the scene and is easier to set up; this is how the scene camera was directed for this study. After computation by the ASL hardware, a crosshair corresponding to point-of-gaze is overlaid on the scene camera image. The system samples at the standard video field frequency of 60 Hz; however, two samples were averaged for each reading, reducing the effective sampling rate to 30 Hz. This video signal was sent to a Hi-8mm VCR, and the tape was analyzed frame by frame. The tape analysis is an easy (albeit tedious) method of cataloguing the subjects' fixations on particular objects, and time and frequency information can be collected. An existing C program was modified to ease the collection of this data.
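The ASL computation runs in proprietary hardware, but the geometry is simple enough to sketch. Below is a minimal illustration, in Python with NumPy, of the steps just described: thresholding out the two reflections, taking the vector between their centers, and fitting a polynomial to the nine calibration points. The function names, and the choice of a second-order polynomial with these particular terms, are illustrative assumptions, not ASL's actual algorithm.

```python
# A simplified sketch of the pupil/corneal-reflection vector method and
# nine-point calibration described above. Illustrative only.
import numpy as np

def blob_center(mask):
    """Center of mass of a thresholded blob (a stand-in for fitting a
    circle to a reflection and taking the circle's center)."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def gaze_vector(eye_image, pupil_thresh, cr_thresh):
    """Vector from the bright-pupil (retinal) reflection to the corneal
    reflection. The corneal glint is the brightest region, so a second,
    higher threshold separates it from the backlit pupil."""
    cr_mask = eye_image >= cr_thresh
    pupil_mask = (eye_image >= pupil_thresh) & ~cr_mask
    return blob_center(cr_mask) - blob_center(pupil_mask)

def fit_calibration(vectors, scene_points):
    """Least-squares fit of a second-order polynomial mapping pupil-CR
    vectors to scene coordinates, from the nine calibration points."""
    vx, vy = np.asarray(vectors, float).T
    A = np.column_stack([np.ones_like(vx), vx, vy, vx * vy, vx**2, vy**2])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(scene_points, float), rcond=None)
    return coeffs  # shape (6, 2): one column of terms each for scene x and y

def point_of_gaze(vector, coeffs):
    """Interpolate gaze position in the scene from a new pupil-CR vector."""
    vx, vy = vector
    return np.array([1.0, vx, vy, vx * vy, vx**2, vy**2]) @ coeffs
```

Because only the difference between the two reflection centers enters the calibration, a small shift of the headgear moves both centers together and largely cancels, which is the slippage tolerance described above.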
At 30 Hz, the system only records fixations lasting at least 33 ms. This should not be a problem, since the minimum fixation duration has been found to be between 120 and 350 ms (1). This relatively low sampling frequency misses many of the other eye movements, such as tremors. That is acceptable, since the data collection focused on fixation points, also taken to be the points of attention.
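To make the sampling argument concrete, here is a small sketch (in Python; not the lab's actual analysis program) of how 30 Hz samples could be collapsed into fixations. Since each averaged sample covers about 33 ms, a fixation at the ~120 ms minimum still spans roughly four consecutive samples, so single stray samples can be discarded.

```python
# A minimal sketch of collapsing 30 Hz gaze samples into fixations.
SAMPLE_MS = 1000 / 30      # duration of one averaged sample, ~33 ms
MIN_FIXATION_MS = 120      # lower bound on fixation duration (1)

def group_fixations(segment_labels):
    """Collapse consecutive samples landing on the same image segment
    into runs, keeping only runs long enough to count as fixations."""
    runs, start = [], 0
    for i in range(1, len(segment_labels) + 1):
        if i == len(segment_labels) or segment_labels[i] != segment_labels[start]:
            duration_ms = (i - start) * SAMPLE_MS
            if duration_ms >= MIN_FIXATION_MS:
                runs.append((segment_labels[start], duration_ms))
            start = i
    return runs

# Four samples (~133 ms) on a head counts as a fixation; one stray sample does not.
print(group_fixations(["A's Head"] * 4 + ["Misc"] + ["C's Head"] * 6))
```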
In summary, the components of the ASL 5000 system are shown in Figure 3:

Figure 3 - Diagram of components (5).
The first experiment was based on Yarbus' work with the reproduction of Repin's "An Unexpected Visitor." To reduce the artifacts of Yarbus' technique, the present project used a modern, less obtrusive eye tracker (described above), nine subjects, three images (digital prints of photographs), and three tasks. Each subject viewed all three images, but viewed each image only once (one task per image). Some tasks were repeated: all subjects viewed all three images, but not all subjects performed all three tasks.
The average age of the subjects was approximately 20 years. The subjects were drawn mainly from a freshman imaging science class; there were five men and four women. Their eyesight was normal or corrected to normal. The eye tracker worked well for subjects wearing contact lenses, but tracking was not attempted on subjects wearing glasses. Each subject was calibrated and then presented with the three images, one at a time, as described above. The task was given first; eye tracking data collection began when the image was presented.
Yarbus had seven tasks for each subject; this experiment used only three. This allows the results of more than one subject to be averaged for any given task-image pair while maintaining a reasonable number of subjects. The tasks, meant to be similar to tasks Yarbus used, were free examination of the image ("Free"), giving the ages of the people in the image ("Ages"), and memorizing the people and objects in the image ("Mem").
Yarbus used one image for all of his tasks; this experiment made use of three. The images were digital, photo-quality prints, approximately 11" by 17". Each image showed three or four people interacting in a simple activity. The goal was to have similar themes and situations across the images while preventing the subject from using information from a previous image to help with the present task.
The path of the gaze across each picture was easily obtained via the computer and video set-up, and these records can be analyzed to determine whether the task at hand changed fixation patterns. The percentage of time the subject fixated on faces, for example, can be obtained. These percentages were averaged, and an analysis of variance (ANOVA) was performed to test for statistical significance.
The second part of this project was the construction of a portable eye tracker. The existing system (ASL's 501, described earlier) is designed to be portable: the controller box, which performs the video processing, operates on 12 V and also powers the two camera control boxes, and these three pieces of electronics fit into a backpack. The headband allows free movement of the head, but most people find it uncomfortable after 30 to 60 minutes of use. To reduce weight and size, smaller cameras with on-board controllers replaced the existing cameras and controller boxes. These were mounted on a baseball cap, which is more comfortable than the rigid plastic headband. The large IR-reflecting visor was replaced with a smaller, monocle reflector. Combined with a camcorder and two camcorder batteries, this system is smaller, lighter, and more comfortable than the ASL system. Note that the same ASL eye tracking controller was used to perform eye tracking and gaze-position video overlay.
VARIABLES: Task and image were varied. The tasks were the three described in the Methods section: Free, Ages, and Mem. The images were:
Figure 4 - "Doc" image.
Figure 5 - "Shoe" image.
Figure 6 - "Ropes" image.
SUBJECT INFORMATION: The subjects were primarily chosen from Jeff Pelz's SIMG-203 freshman class. The average age was 19, and there were five men and five women in the study. Table 1 lists basic information about the subjects used to collect the task-dependency data, including age, sex, and whether they wore contact lenses during the experiment. The last two columns of Table 1 contain the image-task pairings, listed in the order in which they were performed; the order in which the subjects viewed the image-task pairs was randomized. I worked with a total of ten subjects, but only eight are listed below: Subject 1 was used to collect preliminary data, and the VCR stopped taping while working with Subject 7, so no data pertaining to the task dependency investigation was collected for that subject.
Table 1. Subject information.

| Subject # | Age | Sex | Contact Lenses? | Images (order viewed) | Tasks (order performed) |
|---|---|---|---|---|---|
| 2 | 22 | M | Yes | 1. Doc, 2. Shoe, 3. Ropes | 1. Ages, 2. Mem, 3. Free |
| 3 | 18 | M | Yes | 1. Doc, 2. Ropes, 3. Shoe | 1. Free, 2. Mem, 3. Ages |
| 4 | ? | M | Yes | 1. Ropes, 2. Doc, 3. Shoe | 1. Mem, 2. Free, 3. Ages |
| 5 | 18 | M | No | 1. Shoe, 2. Ropes, 3. Doc | 1. Mem, 2. Free, 3. Free |
| 6 | 19 | F | Yes | 1. Shoe, 2. (no good), 3. Ropes | 1. Free, 2. (no good), 3. Ages |
| 8 | 20 | F | Yes | 1. Shoe, 2. Doc, 3. Ropes | 1. Ages, 2. Mem, 3. Mem |
| 9 | 19 | F | Yes | 1. Doc, 2. Shoe, 3. Ropes | 1. Free, 2. Free, 3. Mem |
| 10 | 19 | F | Yes | 1. Ropes, 2. Doc, 3. Shoe | 1. Ages, 2. Mem, 3. Mem |
DATA: The raw data was a video stream from the controller box. The video was captured onto videotape and analyzed after the subject had left. For the analysis, each of the three images was divided into 22 to 25 segments, dictated by significant objects and regions within the image. For example, each person's head was a separate region; in the Doc image, other segments included the painting hanging in the background and the handshake. With this segmentation, fixation durations for the segments could be calculated. An example of the output from the C program used to aid in the tape analysis follows:
19, Empty Chair, 1431433, 00:23:51:13
23, Floor, 1431433, 00:23:51:13
17, Phone, 1431433, 00:23:51:13
5, B's Upper Body, 1431599, 00:23:51:18
1, A's Head, 1431699, 00:23:51:21
5, B's Upper Body, 1432300, 00:23:52:09
24, Misc, 1432566, 00:23:52:17
7, C's Head, 1432699, 00:23:52:21
14, Painting, 1433066, 00:23:53:02
7, C's Head, 1433433, 00:23:53:13
24, Misc, 1433566, 00:23:53:17
10, D's Head, 1433733, 00:23:53:22
22, Blank Space - Upper Right, 1434633, 00:23:54:19
12, D's Lower Body, 1434833, 00:23:54:25
10, D's Head, 1434933, 00:23:54:28
12, D's Lower Body, 1435233, 00:23:55:07
11, D's Upper Body, 1435433, 00:23:55:13
12, D's Lower Body, 1436066, 00:23:56:02
11, D's Upper Body, 1436666, 00:23:56:20
10, D's Head, 1437066, 00:23:57:02
11, D's Upper Body, 1438133, 00:23:58:04
...
The above sample is from the data collected from a subject performing the memorization task with the Doc image. The first number refers to a segment of the image; in this case, 19 refers to the empty chair in the scene. The second number is the internal time code from the Hi-8mm VCR, and the last field is the time code expressed as hours, minutes, seconds, and frame number (30 frames per second). Each entry marks the start of a look at a new segment.
This data was then read into an Excel® spreadsheet. The spreadsheet calculated the duration of each entry and summed the durations to find a total fixation time for each of the 22 to 25 image segments. The following table (Table 2) was created from the same data as the example above:
Table 2. Excel spreadsheet summation of fixation times.

Three of these tables were created for each subject (except Subject 6, for whom one trial was unusable), as each subject saw three image-task pairs. Nine tables (Tables 3 - 11) give the averages for the nine image-task pairs (red arrows highlight entries with fixation percentages greater than or equal to 10%):









The distribution of fixation times clearly varies with the task at hand. The most obvious difference comes with the "Ages" task. Here, as expected, most subjects spent a great deal of time fixating on heads and faces (collectively segmented into the entries listed in the tables above as A's, B's, C's, and D's heads). For example, in Table 11, the subjects on average spent most of their viewing time fixating on five image segments, four of which were the heads of the people in the image. "A's head" refers to the head of the leftmost person, and the B through D labels follow in the same left-to-right fashion.
Before attributing this variation to task dependency, the statistical significance of the factors was evaluated using ANOVA. A general linear model was used, and F*-tests were performed to determine the significance of the task's effect on the response. Tests were also performed for image and image-task interaction effects. These tests require a single measure for the "response." In this study, three measures were used, and three sets of ANOVA tests were performed. In the first, the measure was the percentage of total viewing time dedicated to fixating on heads. The second and third used fixations on persons (head plus upper and lower body) and fixations on inanimate objects (painting, desk, ropes, car, etc.) as measures.
The ANOVA testing was done in MiniTab®, a statistical software package. The data was prepared in Excel® and copied into the MiniTab® spreadsheet. The program performs the F*-tests; an effect with a p-value of 0.05 or less was considered significant. The textual output was captured and is presented below.
Where the percent of fixations on heads is the measure:
General Linear Model
Factor Levels Values
image 3 1 2 3
task 3 1 2 3
Analysis of Variance for Y
Source DF Seq SS Adj SS Adj MS F P
image 2 0.08555 0.03200 0.01600 2.30 0.139
task 2 0.77753 0.67211 0.33605 48.34 0.000
image*task 4 0.03643 0.03643 0.00911 1.31 0.317
Error 13 0.09037 0.09037 0.00695
Total 21 0.98989
Where the percent of fixations on persons is the measure:
General Linear Model
Factor Levels Values
image 3 1 2 3
task 3 1 2 3
Analysis of Variance for Y
Source DF Seq SS Adj SS Adj MS F P
image 2 0.039436 0.039742 0.019871 4.16 0.040
task 2 0.632528 0.532564 0.266282 55.70 0.000
image*task 4 0.063318 0.063318 0.015829 3.31 0.045
Error 13 0.062151 0.062151 0.004781
Total 21 0.797432
Where the percent of fixations on inanimate objects is the measure:
General Linear Model
Factor Levels Values
image 3 1 2 3
task 3 1 2 3
Analysis of Variance for Y
Source DF Seq SS Adj SS Adj MS F P
image 2 0.112498 0.129396 0.064698 10.79 0.002
task 2 0.367235 0.305425 0.152713 25.48 0.000
image*task 4 0.051320 0.051320 0.012830 2.14 0.134
Error 13 0.077925 0.077925 0.005994
Total 21 0.608977
Table 12. ANOVA summary.
| Measure | Task F* | Image F* | Task*Image F* | Task p-value | Image p-value | Task*Image p-value |
|---|---|---|---|---|---|---|
| Heads | 48.34 | 2.30 | 1.31 | 0.000 | 0.139 | 0.317 |
| Persons | 55.70 | 4.16 | 3.31 | 0.000 | 0.040 | 0.045 |
| Objects | 25.48 | 10.79 | 2.14 | 0.000 | 0.002 | 0.134 |
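The same general linear model can be reproduced in modern open-source tools. The sketch below shows how such a two-way ANOVA might be run in Python with statsmodels instead of MiniTab®; the data file name and column names are assumptions, and statsmodels' Type II sums of squares play roughly the role of MiniTab's adjusted sums of squares here.

```python
# A sketch of the image x task general linear model in Python (statsmodels).
# Assumed layout: one row per trial with columns 'image', 'task', and
# 'prop_heads' (fraction of viewing time spent on heads).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("fixation_measures.csv")  # hypothetical data file

# Model the response with image, task, and their interaction as factors.
model = smf.ols("prop_heads ~ C(image) * C(task)", data=df).fit()

# F-tests for each effect; typ=2 requests Type II sums of squares.
print(anova_lm(model, typ=2))
```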
An eye tracker was built on a baseball cap, as described in the Methods section; photographs and line drawings of it are provided in this section. Battery life was a concern. Two lithium-ion rechargeable camcorder batteries were used in series to power the controller box, which in turn powered the cameras and IR illuminator; the camcorder is powered by its own battery. In one trial, the pair of batteries powered the controller, cameras, and illuminator for just over three hours. Further trials were not performed, as the variability is unlikely to be more than an hour and the requirement was a battery life of two hours.
Components:
Figure 7. The components of the portable baseball-cap eye tracker.
Figure 8. Frontal view of the optics mounted on
the baseball cap.
Figure 9. Side view of the optics mounted on
the baseball cap.
Figure 10. Top view of the optics mounted on
the baseball cap.
Figure 11. Bottom view of the optics mounted on
the baseball cap.
Figure 12. Sketch showing orientation of eye
camera in relation to the eye.
Table 13. Pin assignments for control box cable.
| Pin | Assignment |
|---|---|
| 7 | IR power, +5.5 V |
| 11 | Scene camera signal |
| 12 | Eye camera ground |
| 13 | Eye camera signal |
| 19 | IR power ground |
| 23 | Scene camera ground |
| 24 | Scene camera power |
| 25 | Eye camera power |
The data collected in this study regarding the task dependency of eye fixations agrees with the results found by Alfred Yarbus (2). As seen in Figure 1, Yarbus recorded the path of eye movements; with his seven tasks, he demonstrated that the traced paths certainly look different for each task. Figure 1 shows only one subject, and it is the only diagram of this sort in his book. It shows a qualitative difference but does not lead to quantitative data. In this study, the segmentation of the images was qualitative, but all data after that was quantitative. The amount of time each subject spent looking at each image segment was tallied and converted into percentages, and an ANOVA was performed, using groups of image segments for the response variable. For all measures (percentage of time fixating on heads, persons, and objects), task had a significant effect (p < 0.0005). The p-value is the probability of obtaining a result at least this extreme if the factor truly had no effect.

Unfortunately, it is not as easy to dismiss the effect that the specific image had on the subjects' responses. Ideally, the images would differ, but the subjects would respond to them in a similar manner. When the percentage of time fixating on heads was the measure, there was no significant relationship between image and eye fixations. There may be a small relationship when fixations on persons was the measure, and image had a significant effect when object fixations were the measure. This last finding is not too surprising, considering that the surroundings and objects are what changed most between the images. For example, the Doc image contains many clearly distinguishable objects, while in the Ropes image the objects are finer and more difficult to pick out from their surroundings. A better choice of images would probably eliminate this partial image dependency. This is not to say that eye movements are not expected to vary with image; they will. But when the images are somewhat similar, as they were in this case, the image dependency should be minimal.
While the nature of the data, and the means of collecting it, differ from those of Yarbus' study 30-plus years ago, the interpretation remains the same. As investigations in visual perception, Yarbus' work and this study show that the method of gathering visual information is not solely determined by the physiology of the eye. In the eye, near the center of the retina, there is a high concentration of photoreceptors in an area called the fovea. This fact is what makes eye movements necessary: they allow the eye to move so that an image of the area of interest is formed on the foveal region. However, visual search patterns are governed on both physiological and cognitive levels, as demonstrated by the task dependency shown in Yarbus' work and in this one. The subjects' ability to vary eye movement patterns is a cognitive function, presumably making the gathering of visual information more efficient. An inefficient gathering method would be to visually sample the image so as to foveate every part of it (perhaps scanning in raster lines, as a TV image is drawn), and to throw away irrelevant information after sampling. None of the subjects appear to have done anything of the sort; instead, fixations were concentrated on the parts of the image most likely to complete the task.

In Yarbus' case, to complete the task "Memorize the clothes worn by the people in the image," the subject whose eye movements are displayed in Figure 1 spent nearly all of the viewing time looking at the people. In fact, one can tell where the people in the painting were just from the eye movement pattern. This is an efficient way to complete the task, since relevant information is located only in the portions of the image occupied by people. In this study, when the subjects were asked to "give the ages of the people in the image," nearly all the viewing time was spent fixating on the people's faces. This is also an efficient visual search, as most cues for age are found in the face. Such results demonstrate an important cognitive role in visual searches.
While the results of the task-dependency part of this project were satisfactory, the results of improving the portability of the eye tracker are incomplete. The baseball cap version of the eye tracker, as depicted in the Results section, was indeed lighter, smaller, and more comfortable than the existing eye tracker. The system was also capable of tracking an eye, and the batteries provided ample running time. And, except for the head-mounted optics, the hardware fits into a small backpack or hip pack. These were all goals of this project. Unfortunately, there was no time left at the end of this project for pilot and comparison testing with this eye tracker. While this appears to be a feasible eye tracking system, and worth continuing, work remains on this part of the project. Specifically, two issues I wanted to address are:
Pilot testing should also be performed. This testing should include a comparison with the existing tracker in terms of comfort over long durations, ease of use, and ability to maintain an eye image (for proper tracking). Tests should also be performed under various lighting conditions, from very dark to very bright. One problem I encountered while using the existing eye tracker was that when light levels were low, the pupil opened enough that the retinal image became as bright as the corneal reflection, obscuring the corneal reflection.