Visual Representations... Chapter 2 (J. B. Pelz)

Visual Representations in a Natural Visuo-motor Task

Chap. 2: The Block Copying Paradigm


The task selected was one in which a subject is required to manually duplicate a multi-color pattern of blocks. The block-copying task is primarily sensorimotor and is naturally broken down into a series of sub-tasks, each tightly coupled to externally observable gaze and hand movements. As a result, the subject's immediate cognitive state can be meaningfully inferred by examining those movements. The task requires that the subject interact with the environment, and requires fine manual manipulation.

The task, reminiscent of the block manipulation simulations of Agre and Chapman [1987] and Whitehead and Ballard [1990], required a subject to manually duplicate a pattern made of colored Duplo® blocks. The subject was seated in front of a 110 cm x 75 cm board set at 10° from vertical that was divided into three sections: the "model" area, the "resource" area, and the "workspace" area (see Figure 2.1). The model area contained the eight-block pattern to be duplicated, the resource area contained twelve blocks from which blocks could be selected to construct the copy, and the subject was instructed to build the copy in the workspace. The selected task is a generalization of the task used by Ballard, Hayhoe, Li, & Whitehead [1992], in which subjects manipulated colored patterns on a Macintosh CRT using a mouse. In that earlier study the subject's head was fixed to allow eye position to be monitored using an SRI tracker. By contrast, the task used in the present experiments allows the examination of natural, unconstrained movements of the eye, head, and hand. The block-copying paradigm is a natural, multi-step task requiring coordination of the eyes, head, and hand, yet the task is sufficiently explicit that cognitive operations can be inferred from the subject's behavior.

Figure 2.1 Layout of experimental working plane containing the model, resource, and workspace.

In order to complete the block-copying task, the subject must:

1) Perform a series of hand movements, moving blocks from the resource to the workspace

2) Use vision to gather information about the model configuration and blocks available in the resource area, and to guide the hand movements described in 1) above

3) Use working memory to retain information about the model pattern and to monitor progress in the task

The above list is ordered. The task cannot be completed without performing item 1), the series of hand movements. The restriction that a single hand be used to move the blocks forces the blocks to be copied serially. The order in which the blocks are selected and which blocks are moved from the resource area are left to the subject. (There are always more blocks in the resource area than are needed to complete the copy.) Items 2) and 3), use of vision and working memory, are less rigidly constrained; there are many strategies a subject could adopt. While the subject is free to adopt any strategy, gathering information about the model configuration and guiding hand movements to pick up blocks in the resource area and place them in the workspace are almost always accomplished by foveating the point of attention, so the eye movements can be used to infer the underlying cognitive operations. Therefore the block-copying task requires eye movements tied to relatively simple cognitive processes. Cognitive events are intimately tied to objects, their features, and/or movements made by the subject in completing the task. If, after placing a block in the workspace, the subject makes a fixation in the model area while moving the hand towards the resource area, we can make two inferences: first, that the subject is selecting the next block to copy and determining its color so that a block in the resource area can be targeted, and second, that the subject chose to make these decisions based on the current model fixations and not solely on information from previous model fixations retained in working memory.


Monitoring eye position

Monocular (left) eye position was monitored with an Applied Science Laboratories ('ASL') Model E4000SU eyetracker and a 386 lab computer. The ASL is a headband mounted, video-based, IR reflection eyetracker. Figure 2.2 shows the eyetracker in use.

Figure 2.2 Author wearing headband-mounted eyetracker.

A collimated infrared emitting diode (IRED) illuminates the eye, resulting in a 'bright-pupil' retroreflection from the subject's retina, and a first surface reflection at the cornea (the first Purkinje image). A monochrome CCD camera (without the standard IR rejection filter) is aligned coaxially with the illuminator to image the eye. Figure 2.3 a) shows the bright-pupil and first Purkinje images as captured by the eye-camera. The eye-camera image is digitized and thresholded at two levels in real-time by the ASL control unit. The two threshold levels are adjusted manually so that pixels within the bright pupil are above threshold at one level, while only those pixels within the corneal reflection are above threshold at the second level. The centroids of the pupil and first Purkinje image are then computed by the lab computer. The ASL control unit overlays crosshairs indicating the pupil and first Purkinje centroids on the image from the eye camera. Figure 2.3 b) shows the resulting image as displayed on the 'eye monitor.'

Figure 2.3 L) Raw video frame from ASL's 'eye camera' showing bright pupil and first Purkinje image. R) Crosshairs mark centroids of pupil and first Purkinje image computed by the ASL.
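The two-level thresholding and centroid step described above can be illustrated with a minimal sketch. This is not the ASL's implementation: the function name `centroid_above`, the toy 5x5 image, and the threshold values are all assumptions chosen for illustration.

```python
def centroid_above(image, threshold):
    """Return the (row, col) centroid of pixels brighter than threshold.

    Returns None when no pixel exceeds the threshold (feature not found).
    """
    count = 0
    row_sum = 0
    col_sum = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value > threshold:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        return None
    return (row_sum / count, col_sum / count)


# Toy "eye image" (hypothetical values): background 10, bright pupil 120,
# and a single corneal-reflection pixel at 250.
image = [
    [10, 10, 10, 10, 10],
    [10, 120, 250, 120, 10],
    [10, 120, 120, 120, 10],
    [10, 120, 120, 120, 10],
    [10, 10, 10, 10, 10],
]

# Lower threshold: pupil pixels (and the reflection inside it) are above it.
pupil_centroid = centroid_above(image, 100)
# Higher threshold: only the corneal reflection is above it.
reflection_centroid = centroid_above(image, 200)
```

In this toy image the pupil centroid falls at the center of the 3x3 bright region, while the reflection centroid sits one row above it.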

Tracking both pupil and first Purkinje images makes the system less sensitive to movement of the tracker with respect to the head because translation of the eye's image (caused by headband movement) causes both pupil and first Purkinje images to move together, while rotation causes differential motion of the two centroids. To reduce eye movement artifacts due to headband movement, eye-in-head position is calculated based on the relative location of the two centroids whenever both are present in the eye-camera image. If the system loses the first Purkinje image, eye position is calculated based on the pupil image alone until the first Purkinje image is re-acquired.
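A small sketch of why the relative location of the two centroids is insensitive to headband slippage: a pure translation of the eye image shifts both centroids equally, so their difference is unchanged. The function name, the returned mode labels, and the coordinates are hypothetical, and the actual mapping from this difference to eye-in-head angle is the ASL's proprietary calibration, which is not modeled here.

```python
def eye_in_head_signal(pupil, reflection):
    """Return (mode, vector) used as the raw eye-position signal.

    Uses the pupil-minus-reflection difference when both centroids are
    available, and falls back to the pupil centroid alone otherwise.
    """
    if pupil is None:
        return None
    if reflection is None:
        return ("pupil_only", pupil)
    difference = (pupil[0] - reflection[0], pupil[1] - reflection[1])
    return ("pupil_minus_reflection", difference)


# Hypothetical centroids (pixels) before and after the headband slips.
pupil = (100.0, 80.0)
reflection = (96.0, 78.0)
slip = (5.0, -3.0)
pupil_shifted = (pupil[0] + slip[0], pupil[1] + slip[1])
reflection_shifted = (reflection[0] + slip[0], reflection[1] + slip[1])

before = eye_in_head_signal(pupil, reflection)
after = eye_in_head_signal(pupil_shifted, reflection_shifted)
fallback = eye_in_head_signal(pupil, None)
```

Here `before` and `after` carry the same difference vector, whereas the pupil-only fallback would have changed by the full slip.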

Because the system is video-based, eye position signals are limited to 60 Hz when a single interlace field is used for each eye position computation, or 30 Hz when a full frame (odd and even interlace fields) is used. The latency of the eye position signal is between two and three video fields (33 to 50 msec). The accuracy of the ASL's eye-in-head signal is approximately 1 degree over a central 40 degree field.
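The sampling-rate and latency figures above follow directly from the 60 Hz video field rate. A short worked check (the helper name is an assumption):

```python
FIELD_RATE_HZ = 60.0  # interlaced video: 60 fields per second


def fields_to_msec(n_fields, field_rate_hz=FIELD_RATE_HZ):
    """Convert a number of video fields to milliseconds."""
    return 1000.0 * n_fields / field_rate_hz


field_based_rate_hz = FIELD_RATE_HZ        # one field per eye-position sample
frame_based_rate_hz = FIELD_RATE_HZ / 2    # two fields (one frame) per sample
latency_low_msec = fields_to_msec(2)       # about 33 msec
latency_high_msec = fields_to_msec(3)      # 50 msec
```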

Gaze position (integrated eye-in-head and head-position and orientation) is calculated by the ASL using the eye-in-head signal described above and a head position/orientation signal from a magnetic field head-tracking system (see description of Ascension Technology 6DFOB below). The ASL reports gaze position as the X-Y intersection of the line-of-sight with the working surface, identified as the 'calibration' plane. The position and orientation of the calibration plane are defined by entering the three-dimensional coordinates of three points on the plane into the ASL. The coordinates are entered indirectly by measuring the distance and angle of each point with respect to a fixed point. The magnetic field transmitter is used as the origin; the distance to each of the three points (A, B, & C in Figure 2.4) is measured and entered manually. The unit-vector from the transmitter to each point is entered using a gimbal attached to the transmitter that holds a HeNe LASER and a magnetic field receiver. The gimbal is directed to each point (using the LASER to ensure an accurate angle), and the orientation of the magnetic field receiver is read by the ASL.

Figure 2.4 The working plane is defined by locating three points (A, B, C) on the plane with respect to the center of the magnetic tracker's transmitter.

Eye-in-head, head orientation and position, and gaze intercept are available on an RS-232C serial interface from the ASL. The digital data stream was collected on an Apple Macintosh 840AV computer for storage and analysis. In addition to this digital data stream, the ASL provides a video record of eye position. The headband holds a miniature "scene-camera" to the left of the subject's head, aimed at the scene (see Figure 2.2). The ASL creates a crosshair overlay indicating eye-in-head position that is merged with the video from the scene-camera, providing a video record of the scene from the subject's perspective on the scene-monitor, along with a crosshair indicating the intersection of the subject's gaze with the working plane (see Figure 2.5). Because the scene-camera moves with the head, the eye-in-head signal indicates the gaze point with respect to the world. Head movements appear on the record as full field image motion. The scene-camera can be fitted with a range of lenses. Figure 2.5 was made with the 3.5 mm wide-angle lens, and shows the barrel distortion typical of such lenses.
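The three-point definition of the calibration plane described above (a measured distance plus a direction from the transmitter for each of points A, B, and C) can be sketched as follows. The ASL's internal computation is not documented here; the function names, coordinate convention, and numerical values are assumptions for illustration only.

```python
import math


def point_from_measurement(distance, direction):
    """Turn a measured distance and a direction vector into a 3-D point.

    The direction is normalized first, so it need not be an exact unit vector.
    The transmitter is taken as the origin.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    return tuple(distance * d / norm for d in direction)


def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plane_from_points(a, b, c):
    """Return (point_on_plane, unit_normal) for the plane through a, b, c."""
    normal = cross(subtract(b, a), subtract(c, a))
    length = math.sqrt(sum(n * n for n in normal))
    if length == 0.0:
        raise ValueError("points are collinear; they do not define a plane")
    return a, tuple(n / length for n in normal)


# Hypothetical measurements (cm): three points on a plane 50 cm from the origin.
A = point_from_measurement(50.0, (1.0, 0.0, 0.0))
B = point_from_measurement(math.sqrt(2600.0), (50.0, 10.0, 0.0))
C = point_from_measurement(math.sqrt(2600.0), (50.0, 0.0, 10.0))
plane_point, plane_normal = plane_from_points(A, B, C)
```

With these inputs the recovered unit normal points along the x axis, as expected for a plane at constant x.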

Figure 2.5 A video frame from the ASL's 'scene-monitor' shows gaze position in the scene with a white crosshair overlay on the image from the headband mounted scene-camera

Because the scene camera is not coaxial with the line of sight, calibration of the video signal is strictly correct for only a single distance. All gaze points are in the plane of the working board, and subjects typically do not change their distance from the board substantially, so the parallax error is not significant in this task, though it can be significant in tasks not constrained to a near-vertical plane. The parallax error can be eliminated by repositioning the scene-camera below the visor so that it is collinear with the eye-camera (see Figure 2.6). While this orientation eliminates parallax error, it severely restricts the field of view of the scene-camera. In addition, image contrast and chroma are reduced due to the poor reflectance below 800 nm and flare from the IRED illuminator.

The eye-in-space signal calculated by the ASL by integrating the eye-in-head and head position/orientation signals is not affected by parallax -- the scene camera is used only during calibration when the distance to the scene is fixed. After initial calibration, the gaze intersection is calculated by projecting the eye-in-head position onto a 'virtual calibration plane' at the same distance as the calibration plane during calibration. The vector defined by the eye center and the intersection with the 'virtual plane' is then rotated based on the head position/orientation signal, and projected onto the working plane.
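The final projection step (rotate the eye-in-head gaze vector by the head orientation, then intersect it with the working plane) amounts to a ray-plane intersection. The sketch below is a simplified illustration, not the ASL's algorithm: it rotates only about a vertical (z) axis for head azimuth, and the coordinate convention, function names, and numbers are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate_azimuth(vector, azimuth_deg):
    """Rotate a vector about the z axis by the given head azimuth (degrees)."""
    angle = math.radians(azimuth_deg)
    x, y, z = vector
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
        z,
    )


def intersect_ray_plane(origin, direction, plane_point, plane_normal):
    """Return the point where a ray meets a plane, or None if it never does."""
    denominator = dot(direction, plane_normal)
    if abs(denominator) < 1e-12:
        return None  # ray is parallel to the plane
    offset = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(offset, plane_normal) / denominator
    if t < 0:
        return None  # plane is behind the eye
    return tuple(o + t * d for o, d in zip(origin, direction))


# Hypothetical setup (cm): eye at the origin, working plane at x = 60.
eye_position = (0.0, 0.0, 0.0)
eye_in_head_direction = (1.0, 0.0, 0.0)   # looking straight ahead in the head
gaze_direction = rotate_azimuth(eye_in_head_direction, 45.0)
gaze_point = intersect_ray_plane(
    eye_position, gaze_direction, (60.0, 0.0, 0.0), (1.0, 0.0, 0.0)
)
```

With the head turned 45 degrees, the straight-ahead gaze lands 60 cm to the side of the point directly in front of the eye.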

The ASL was calibrated for each subject before each trial session. The subject was fitted with a biteboard and seated a comfortable distance from the work surface, typically 60 - 75 cm. Calibrating the ASL requires three steps -- 1) entering the position of the three reference points on the calibration plane (see Figure 2.4), 2) locating the calibration points (9 or 17 points; see Figure 2.7), and 3) recording the subject's pupil and first Purkinje centroids as each point in the calibration target is fixated.

Figure 2.6 Alternate arrangement of scene-camera eliminates parallax error by aligning axes of eye- and scene-cameras.

The first step was described above. In the second step, the calibration points are located on the work surface by marking them on the scene monitor and entering their absolute, three-dimensional coordinates in the ASL "environment" file. The three points used to locate the calibration plane (points A, B, & C in Figure 2.4) are the bottom-right, bottom-left, and top-left points in the calibration array, respectively. In the final step, the subject is steadied by a biteboard and instructed to fixate each calibration target in turn, so that raw pupil and first Purkinje images can be grabbed at each point. The calibration function is determined by a proprietary algorithm based on the known target positions and the raw pupil and corneal reflection positions. The calibration can be performed with 9 or 17 points, as shown in Figure 2.7. The 17-point calibration target increases accuracy by allowing the target points to cover a larger area while reducing the area over which eye-position data must be interpolated. The 17-point target is especially critical when the scene-camera is fitted with a wide-angle lens that suffers from barrel distortion.

Monitoring head and hand position

The ASL relies on a magnetic head-tracker to monitor the position and orientation of the head. An Ascension Technology magnetic field tracker (Model 6DFOB, "Flock") was used to monitor the position and orientation of the head and the hand. The 6DFOB system can 'daisy-chain' multiple receivers with a single transmitter. The transmitter unit was mounted above and in front of the subject's head. The transmitter contains three orthogonal coils that are energized in turn. The receiver unit contains three orthogonal 'antennae' coils which detect the transmitter's signals. Position and orientation of the receiver are determined from the absolute and relative strengths of the transmitter/receiver coil pairs. The position of the sensor is reported as the (x, y, z) position with respect to the transmitter, and orientation as azimuth, elevation, and roll angles.

Figure 2.7 L) Nine and R) 17 point calibration targets for the ASL eyetracker.

To allow measurements over a range of transmitter to receiver distances, the Ascension 6DFOB adjusts the transmitter's field strength based on the distance of the receiver from the transmitter. The maximum distance at which a clear signal can be detected is approximately three feet. The quality of the received signal varies with the strength of the transmitter, but too strong a signal saturates the receiver. The maximum field strength is 1 Earth field. Position and orientation values are encoded as 16-bit integers. Distances (x, y, z) are scaled from -36" to 36", yielding a precision of 0.001" (72"/2^16), or 0.003 cm. Orientation angles (azimuth, elevation, and roll) are scaled from -180° to 180°, with a precision of 0.005° (360°/2^16), or 1/3 min arc.
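The quoted precision values follow from dividing each full-scale range by the number of 16-bit codes. A short worked check (the helper name is an assumption):

```python
def step_size(full_range, bits=16):
    """Smallest representable increment for a range encoded in `bits` bits."""
    return full_range / (2 ** bits)


position_step_inches = step_size(72.0)            # -36" to 36" spans 72"
position_step_cm = position_step_inches * 2.54    # inches to centimeters
angle_step_deg = step_size(360.0)                 # -180 to 180 degrees
angle_step_arcmin = angle_step_deg * 60.0         # degrees to minutes of arc
```

These evaluate to roughly 0.0011 inches (about 0.003 cm) and 0.0055 degrees (about one third of a minute of arc), matching the rounded figures in the text.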

The Ascension system has a range of temporal filter options that can be selected with software commands. There are two classes of filters: 'AC' and 'DC.' The AC filters are band-block filters designed to filter out signals caused by environmental sources operating at around 60 Hz; i.e., 120 VAC line supply, video monitors, and lighting equipment. There are three settings for the AC filters: i) 'AC filters off', ii) 'AC Narrow', and iii) 'AC Wide.' The two AC filter options (narrow and wide) differ in the width of the band of frequencies blocked by the filters. Removing frequency components in this range is not detrimental in itself because there is no appreciable component of head or hand movements beyond about 20 Hz [Rosenbaum 1991], but there is no way to implement such a filter in real-time without introducing a delay in the reported position and orientation values. Unlike the 'AC' filters, the 'DC' filter does not filter out a fixed band of frequencies. Instead, it is an adaptive filter that monitors the sensor's reported values over a period, and adjusts its time constant based on recent position/angle history. If the sensor has shown little motion for a given period, the time constant is increased in an effort to reduce steady-state, or 'DC' position/orientation reports. When sensor movement is detected in this mode, it is at first suppressed by the long time constant on the presumption that it represents noise rather than real movement of the sensor. If the movement continues it is assumed to represent real motion of the sensor, and the time constant is reduced. The user can set upper and lower bounds on the time constant to limit the degree to which the adaptive filter can adjust to varying inputs, and the speed at which the filter adapts to sensor motion. The default filter configuration ('AC wide' and 'DC' filters on) produces very low noise output at the cost of increased temporal lag between sensor movement and position reporting. 
Because the DC filters actively vary their time constants, the actual lag introduced by the 'DC' filter is not constant.
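The behavior described for the adaptive 'DC' filter (heavy smoothing while the sensor is still, initial suppression of a new movement, then faster tracking if the movement persists) can be imitated with an exponential smoother whose gain adapts to recent motion. This is only a conceptual sketch and not Ascension's algorithm; the function name, parameters, and threshold values are all assumptions. A small gain corresponds to a long time constant.

```python
def adaptive_smooth(samples, gain_min=0.05, gain_max=0.8,
                    motion_threshold=0.5, adapt_rate=0.5):
    """Exponential smoothing with a gain that rises during sustained motion.

    gain_min and gain_max bound the gain (and so the effective time constant);
    adapt_rate sets how quickly the gain moves toward its current target.
    """
    if not samples:
        return []
    output = [samples[0]]
    gain = gain_min
    for sample in samples[1:]:
        error = abs(sample - output[-1])
        # Large disagreement between input and output suggests real motion.
        target = gain_max if error > motion_threshold else gain_min
        gain += adapt_rate * (target - gain)
        output.append(output[-1] + gain * (sample - output[-1]))
    return output


# A step movement: initially suppressed, then tracked.
step_response = adaptive_smooth([0.0] * 5 + [10.0] * 10)
# Small jitter around zero stays heavily smoothed.
jitter_response = adaptive_smooth([0.0, 0.2, 0.0, 0.2])
```

Because the gain changes with the input history, the lag between the input step and the output is not fixed, which mirrors the point made above about the variable lag of the 'DC' filter.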

The "Flock" sensors used for head and hand tracking were characterized to determine the accuracy and noise in the measurement system with and without the default filters. The three-dimensional position signal accuracy was dependent on the separation between transmitter and receiver. Absolute error was below 0.2 cm when the receiver was within 40 cm of the transmitter, but increased dramatically beyond that distance. Errors reached approximately 1 cm at a distance of 65 cm, and 5 cm at a distance of 85 cm. Orientation values (computed based on the relative strength of the three channels) are less sensitive to distance. Errors were below 5 minutes of arc, and unaffected by distance out to 65 cm. See the Appendix for details of the Ascension 6DFOB calibration. Two Ascension 6DFOB units were used in the experiments. One receiver was attached to the eyetracker's headband to monitor head movements; the second receiver was taped to the subject's thumb to monitor hand movements. The hand position data was sent via a serial connection to the lab computer. The head position data was reported directly to the ASL's PC where it was integrated with the eye position signal to calculate the integrated gaze position. The raw head position signal was also sent to the Macintosh lab computer where it was logged along with gaze, eye, and hand movement signals. In some of the later experiments, the second Flock unit (used to track hand movements) was replaced with a Polhemus "Fastrak" model. This allowed a separate transmitter (dedicated to monitoring hand position) to be placed behind the board so that the distance between transmitter and receiver was minimized. Because the noise levels rise as the distance between transmitter and receiver increases, there is an advantage to keeping the distance small.

The headband mounted scene-camera was equipped with a 3.5 mm focal length wide-angle lens providing a wide field (approx. 110° horizontal) that included the board and the subject's hand. A Sony EVO-9650A Hi-8 video deck was used to record the video from the scene camera with gaze position overlay. A Hi-8 format timecode (30 frames per second) was recorded with each frame, allowing timing measurements to be made on playback. The video deck was connected to the lab computer with an RS-232C serial interface so that it could be controlled automatically by the computer. In addition to starting and ending recording sessions, the interface allowed the program to poll the deck to determine the timecode being written to tape. This timecode was written to the header of the digital data file for each trial so that the data stream and video record could be correlated.
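Correlating the video record with the data stream comes down to converting timecodes to elapsed time at 30 frames per second. The sketch below assumes an "HH:MM:SS:FF" string representation, which is an illustrative choice and not necessarily how the deck reports its timecode; the function names and example values are likewise hypothetical.

```python
def timecode_to_seconds(timecode, frames_per_second=30):
    """Convert an 'HH:MM:SS:FF' timecode string to seconds."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / frames_per_second


def seconds_into_trial(event_timecode, trial_start_timecode):
    """Time of a video event relative to the timecode stored in the trial header."""
    return timecode_to_seconds(event_timecode) - timecode_to_seconds(trial_start_timecode)


# Hypothetical values: trial header timecode and a fixation seen on playback.
offset = seconds_into_trial("00:01:12:15", "00:01:10:00")
```

At 30 frames per second each frame spans about 33 msec, which sets the timing resolution of measurements made from the tape.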


The model area contained the pattern to be duplicated (a pattern of eight blocks of four colors), the resource area contained the blocks to be used to construct the copy, and the copy was constructed in the workspace. The blocks were square, approximately 2.5 cm on a side. The subject was instructed to duplicate the pattern as quickly as possible without making errors using one hand (of the subject's choice), but was otherwise free to choose any strategy and sequence of movements to accomplish the task. Ten subjects performed the basic block-copying task, completing from 60 to 210 block moves each (mean 160). Subjects were seated approximately 60 - 75 cm from the board (each subject selected the distance to be a comfortable reaching distance from the board). At the typical distance, the board subtended approximately 80 x 60 degrees of visual angle. Each block subtended approximately 2.25 degrees.
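The visual angles quoted above can be checked with the standard formula for an object of size s viewed from distance d, namely 2 * atan(s / (2d)). The sketch assumes the object is centered on and perpendicular to the line of sight (the 10 degree tilt of the board is ignored) and uses the midpoint of the stated 60 - 75 cm range as the viewing distance; both are simplifying assumptions.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an object centered on the line of sight."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


viewing_distance_cm = 67.5  # assumed midpoint of the 60 - 75 cm range
board_width_deg = visual_angle_deg(110.0, viewing_distance_cm)
board_height_deg = visual_angle_deg(75.0, viewing_distance_cm)
block_deg = visual_angle_deg(2.5, viewing_distance_cm)
```

At this distance the board comes out near 78 x 58 degrees and a block near 2.1 degrees, in line with the approximate values given in the text; the exact numbers depend on the distance each subject chose.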

Subjects were fitted with the ASL eyetracker, and a calibration was performed. Early experiments were done with a 9 point calibration target; later the 17 point system was implemented. The outermost calibration points formed a rectangle approximately 25 x 20 degrees (the exact value depended on viewing distance, which was set by the subject). The eyetracker was calibrated at the beginning of each block of trials, and zeroed on a central fixation point just before each trial began.

Scoring of block-move strategies

The sequential nature of the block movements (enforced by requiring that blocks be moved with one hand) makes it convenient to break up the sequence of movements in each 8-block trial into smaller subtasks. Taking the movement of each block as a sub-task, it is useful to consider the sequence of eye, head, and hand movements taking place for each block move. We consider each 'block move' sub-task to begin when the previous block has been placed in the workspace and the eyes move away from that area. The sub-task is completed when the present block is "dropped" in the workspace. Each block move is made up of a set of lower-level subtasks; the subject must find (or have remembered) the color and position of the block to be moved, move the hand to "pick up" a block of the same color in the resource area, then return to the workspace to add the block to the correct position in the duplicate being constructed. It is instructive to examine the pattern of fixations used in copying each block. Those fixations serve several purposes: gathering information about the model pattern, the location of particular blocks in the resource area, monitoring progress in the workspace, and guiding hand movements picking up and dropping the blocks.

The gaze- and hand-movement 'primitives' making up each block move were used to describe the subjects' strategies. Gaze changes are labeled by the areas in which fixations occur. The primitives were labeled:

M - fixation in the [m]odel area

P - block pickup (with fixation in the resource area)

D - block [d]rop (with fixation in the workspace area)

"Pickup" and "Drop" events are used to label hand movements, and it is understood that Pickup ('P') and Drop ('D') events are accompanied by fixations in the resource and workspace areas, respectively. While the gaze often left the resource area before a block was picked up, there was almost always a fixation associated with the pickup and drop events.

By reviewing the videotaped records in slow motion, the sequence of fixations and hand movements used to copy each block was used to label the strategy used for that block. For example, Figure 2.8 shows a typical 'block move' sequence schematically. The sequence begins with the first fixation after the previous block has been placed in the workspace. In this case the gaze moves first to the [M]odel area ('M-...'), then to the resource area, where the hand [P]icks up a block ('M-P-...'), then returns to the [M]odel area ('M-P-M-...'), and finally to the workspace, where the block is '[D]ropped' in position ('M-P-M-D'). The block move is thus labeled as an 'MPMD' sequence.

Figure 2.8 Sequence of gaze changes making up an 'MPMD' block move.


A striking aspect of subjects' performance was the highly stereotyped behavior in completing the task, marked by frequent fixations in the model area and a relatively small number of strategies. Approximately 90% of the block moves were classified into the four strategies shown in Table 2.1. Figure 2.9 a), b), and c) show the sequence of fixations for block move sequences labeled MPD, PMD, and PD, respectively.

While these four strategies were most common in the basic experiments, it was necessary to define a fifth category because in some experiments subjects made even more frequent model fixations. Block moves were categorized as ">MPMD" sequences if the subject looked more than twice into the model area during a single block move. A small number of block moves (3.3% across 10 subjects) still did not fit into any of the above categories and were labeled "other." These were typically block moves in which the subject had difficulty picking up or placing a block, or in which the subject made an error. Table 2.2 shows the six categories used to score the block move strategies.

Table 2.1 Four strategies into which ninety percent of the subjects' block moves were categorized

    Model-Pickup-Model-Drop        "MPMD"
    Model-Pickup-Drop              "MPD"
    Pickup-Model-Drop              "PMD"
    Pickup-Drop                    "PD"

Figure 2.9 Sequences of gaze changes making up a) an 'MPD' block move, b) a 'PMD' block move, and c) a 'PD' block move.

Table 2.2 Ninety-five percent of the subjects' block moves were categorized into the first five strategies shown.

    "> Model-Pickup-Model-Drop"    "> MPMD"
    Model-Pickup-Model-Drop        "MPMD"
    Model-Pickup-Drop              "MPD"
    Pickup-Model-Drop              "PMD"
    Pickup-Drop                    "PD"
    any other sequence             "other"
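The scoring scheme in Table 2.2 can be expressed as a small classifier over the M/P/D primitives. In the study the labels were assigned by reviewing videotape, not by software, so this is purely illustrative. The function name is hypothetical, and the requirement that a ">MPMD" move contain exactly one pickup and end with its single drop is an added assumption; the text only specifies more than two looks into the model area.

```python
BASIC_STRATEGIES = ("MPMD", "MPD", "PMD", "PD")


def classify_block_move(events):
    """Map a sequence of 'M', 'P', 'D' primitives to a Table 2.2 category."""
    sequence = "".join(events)
    if sequence in BASIC_STRATEGIES:
        return sequence
    model_looks = sequence.count("M")
    # Assumption: a scorable move has one pickup and ends with its one drop.
    well_formed = (
        sequence.count("P") == 1
        and sequence.count("D") == 1
        and sequence.endswith("D")
    )
    if model_looks > 2 and well_formed:
        return ">MPMD"
    return "other"


labels = [classify_block_move(s) for s in ("MPMD", "PD", "MPMMD", "MPMPD")]
```

The last example contains a second pickup (for instance after a fumbled block) and is therefore scored as "other".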

Basic Features of Task Performance

Relative Frequency of Strategies

A striking feature of subjects' performance in the task was the consistent reliance on frequent reference to the model. Subjects often averaged over two saccades into the model area in the course of moving each of the eight blocks. This frequent reference to the model is surprising because humans are capable of holding several items of information in short term memory; certainly enough information to remember the position and color of a few blocks. All subjects but one averaged more than one fixation in the model area for each block copied. The average number of model references per block copied was 1.51; subject jw averaged only 0.93 model references per block. Figure 2.10 shows the mean relative frequency of each of the six strategies listed in Table 2.2, averaged across 14 subjects. Figure 2.11 shows the relative frequency of strategy use for each subject. In nine of the fourteen subjects, over 50% of block moves were accomplished with two or more model fixations. In twelve of fourteen, block moves with two or more model fixations were more common than any other strategy.

Figure 2.10 Mean relative frequency of six strategies for fourteen subjects.

Figure 2.11 Individual relative frequency histograms of strategies used in block-copying task by fourteen subjects.

Change in Strategy Use Over the Eight-Block Trial

While subjects' performance in the task suggests that very little information about the model pattern is remembered from one block to the next, examination of the strategies used as a function of the serial block number (i.e., over the course of each trial) does suggest that some information about the model is acquired during the trial. Figure 2.12 shows that the low-memory MPMD strategy is most common for the first block, while the PD strategy increases over the trial. So, while subjects tend to rely on frequent eye movements rather than working memory, it is evident that they do use information from previous model references to some extent -- some internal representation of the model pattern is apparently built up over the many fixations used in the construction of the duplicate pattern.

Number of References to the Model per Block Copied

A useful metric of the degree to which working memory is used in task performance is the average number of model references per block copied. Figure 2.13 shows the change in the frequency of fixations in the model area as a function of serial block order. There is a significant decrease in the number of model references after the first block (P<0.005), and on the last block of the eight-block pattern (P<0.05). Blocks 2 through 7 are relatively stable at approximately 1.4 looks per block.
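The metric described above, mean model references per block as a function of serial block position, can be computed from the raw primitive sequences as sketched below. The function name and the example sequences are invented for illustration and are not data from the study. Raw sequences are used in place of category labels because a ">MPMD" label does not record how many model looks occurred.

```python
def model_references_by_position(trials):
    """Mean number of 'M' primitives at each serial block position.

    `trials` is a list of trials; each trial is a list of per-block
    primitive sequences such as "MPMD" or "PD", in the order copied.
    """
    totals = {}
    counts = {}
    for trial in trials:
        for position, sequence in enumerate(trial, start=1):
            totals[position] = totals.get(position, 0) + sequence.count("M")
            counts[position] = counts.get(position, 0) + 1
    return {position: totals[position] / counts[position]
            for position in sorted(totals)}


# Two made-up three-block trials (real trials contained eight blocks).
example_trials = [
    ["MPMD", "MPD", "PD"],
    ["MPMD", "PMD", "PD"],
]
mean_references = model_references_by_position(example_trials)
```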

Trial Duration

Subjects were instructed to complete the task as quickly as possible without making errors. In addition to monitoring strategy use by each subject, the trial duration was recorded. Figure 2.14 shows the average trial duration for each subject, and the

Figure 2.12 Variation in strategy use over the eight-block trial.

Figure 2.13 Average number of model references per block as a function of serial block order

Fixation Durations in Model Area

Because our understanding of the common occurrence of the MPMD strategy suggests that different information is extracted in the first and second model fixations, individual model fixation durations were examined to learn whether the fixation durations might indicate the complexity of the information gathered. For example, subjects might require less time to bind the color of a target block than to bind its location in the model during an MPMD sequence, or the model reference in an MPD sequence might be longer than that in a PMD sequence. Four subjects' records (eb, sc, jw, and mh) were analyzed to extract the time spent in the model area during the first model reference in an MPMD sequence (labeled MPMD1), and the second model reference (labeled MPMD2). The fixations for the model reference in MPD and PMD sequences were also measured to see whether there were significant differences between model fixation durations in those strategies. The results were idiosyncratic, as seen in Table 2.3 a). While a significant difference between the first and second model fixations (MPMD1 & MPMD2) was found in three of the four subjects, two (eb and sc) spent less time in the first model reference, and the third (jw) spent more time in the first model reference. The fourth subject (mh) spent less time during the first model reference, like eb & sc, but the difference was not significant. The inconsistency between subject jw and the other subjects can be understood, however, by noting that subject jw performed the task very differently than other subjects. While on average the MPMD strategy was used in approximately 40% of block moves, that double-look strategy accounted for only half as many of jw's block moves. The average subject moved only 12% of the blocks using the PD strategy; jw was three times more likely to copy a block without a model reference. 
To determine whether the differences in model reference durations were due to different strategy use among the subjects, the number of fixations made during each model reference was analyzed. While eb, sc, & mh each averaged between 1.3 and 1.8 fixations during the first model reference (MPMD1), jw averaged 2.6 fixations. There was no such difference in the second reference (MPMD2); all four subjects averaged between 1.5 and 1.8 fixations. Table 2.3 b) shows the average duration of MPMD1 and MPMD2 model references that contained only one fixation in the model area. This presumably eliminates from analysis any block moves in which jw made several fixations in preparation for a future PD block move. This reduced the mean durations for all subjects, but the decrease was most dramatic for jw. All four subjects spent less time in the first model reference than in the second, though the differences were only significant for eb and sc. The comparison between MPD and PMD model references was less clear; all four subjects spent more time in the model area during the PMD sequence than during the MPD sequence (opposite our expectations), though the difference only reached significance for subject eb.

Figure 2.14 Average trial duration (time to duplicate the eight block pattern) for fourteen subjects.

Figure 2.15 Average trial duration as a function of the number of model references per block.

Table 2.3 Mean time [msec(s.e.m.)] in model area for the first and second MPMD model references, MPD, and PMD sequences. a) Average of all model references. b) Average of MPMD1 & MPMD2 model references with only one fixation.

a) Average of all model references

    Model     eb         sc         jw         mh
    MPMD1     385(23)    362(16)    599(37)    335(24)
    MPMD2     423(15)    424(12)    472(19)    355(19)
    MPD       346(45)    416(25)    469(31)    325(17)
    PMD       447(15)    472(19)    470(22)    356(39)

b) Average of MPMD1 & MPMD2 model references with a single fixation.

    Model       eb          sc          jw          mh
    MPMD1       292(14)     326(34)     389(29)     282(23)
    MPMD2       418(24)     390(16)     405(24)     315(18)
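
The size of the MPMD1/MPMD2 differences in Table 2.3 b) can be put in perspective with a little arithmetic. The sketch below computes, for each subject, the difference between the two mean durations and a combined standard error for that difference. It is only an illustration: the statistical test actually used in the analysis is not stated here, treating the two sets of model references as independent samples is an assumption, and the names TABLE_B and difference are hypothetical. The ratio of difference to standard error should therefore not be expected to reproduce the reported significance levels.

```python
import math

# Table 2.3 b): mean and s.e.m. (msec) for model references with a single fixation.
TABLE_B = {
    "eb": {"MPMD1": (292, 14), "MPMD2": (418, 24)},
    "sc": {"MPMD1": (326, 34), "MPMD2": (390, 16)},
    "jw": {"MPMD1": (389, 29), "MPMD2": (405, 24)},
    "mh": {"MPMD1": (282, 23), "MPMD2": (315, 18)},
}


def difference(subject):
    """Return (MPMD2 - MPMD1, standard error of that difference) in msec."""
    mean1, sem1 = TABLE_B[subject]["MPMD1"]
    mean2, sem2 = TABLE_B[subject]["MPMD2"]
    diff = mean2 - mean1
    # Assumption: the two samples are independent, so their variances add.
    se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)
    return diff, se_diff


if __name__ == "__main__":
    for subject in TABLE_B:
        diff, se_diff = difference(subject)
        print(f"{subject}: MPMD2 - MPMD1 = {diff} msec "
              f"(s.e. of difference {se_diff:.1f} msec)")
```

The difference is positive for all four subjects, which matches the observation that every subject spent less time in the first single-fixation model reference than in the second.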


These experiments support and generalize the results of the earlier block-copying experiments performed on a CRT with a mouse [Ballard, Hayhoe, Li, & Whitehead 1992]. Because subjects are now free to make unconstrained movements while performing the task, the experiments avoid concerns about the unnatural constraints imposed by other kinds of eyetracking devices, and about the use of a computer mouse in place of natural hand movements. It is now possible to investigate complex, natural behaviors.

In performing the block-copying task, subjects make very frequent references to the model area; the modal strategy includes two looks to the model area for every block copied. The average number of model references was greater than 1.5, and all but one subject averaged more than one look to the model for every block copied. Subjects chose this strategy even though they are capable of memorizing multiple-block sub-patterns; in a preliminary experiment in which subjects manipulated colored block patterns on a Macintosh CRT using a mouse, subjects were allowed to inspect the model pattern for a variable length of time, then constructed the duplicate from memory after the model was removed [Ballard, Hayhoe, Li, & Whitehead 1992; Ballard, Hayhoe, & Pelz 1995]. Subjects could copy patterns of up to four blocks with few errors. So while subjects are capable of remembering multi-block sub-patterns, they choose not to operate at the maximum capacity of working memory when free to select their own strategy. Instead, they seek to minimize reliance on working memory by acquiring information incrementally during the task.

Subjects used frequent eye movements to serialize the complex task into a series of simpler subtasks, each placing only a small load on working memory. Rather than work from a rich internal representation built up over multiple fixations, subjects chose instead to work from moment-by-moment representations, gathered 'just-in-time' from the most recent fixation. In the extreme, individual features of objects appear to be gathered separately. The most plausible interpretation of the role of the two model fixations is that color is loaded in the first and used to pick up a block of the appropriate color in the resource area, and that position is loaded in the second and used to place the block in the workspace. The fact that this 'extreme' case was the modal response observed in the experiment provides an important insight into the manner of representation employed by humans under natural conditions. It is important to distinguish this experiment from the class of experiments designed to find the limits of humans' visual memory. We know from those experiments that we are capable of maintaining several objects in memory simultaneously [Miller 1956, Baddeley 1986]; this experiment shows that subjects choose to operate far from that limit when performing a task that allows visual reference to the relevant information.

Performance in the task was not totally 'memoryless', however. Strategies did shift over the course of a trial. Over 90% of the first blocks were copied using >=MPMD strategies (i.e., >MPMD or MPMD block moves), but that figure fell to only 30% for the eighth block. The average number of model references per block copied fell from 2.0 for the first block to 1.2 for the last block, so subjects did not approach each block move from 'scratch' using only a single variable to store color and position information. So while it is clear that subjects seek to minimize memory use with frequent fixations, some information is retained across fixations. Subjects' performance in the task also suggests that eye movements should be considered as more than unfortunate side-effects of the limited foveal region. Eye movements serve a cognitive role in performing the complex task, serializing the task into simple components, each with minimal memory demands. Perceptual and motor events are ordered and executed by marking elements in the scene with fixations.

The reluctance to use working memory to capacity can be explained if such memory is expensive to use with respect to the cost of the serializing strategy. The tradeoff between eye movements and working memory can be visualized with the help of schematic timelines illustrating working memory load over time. Model fixations are used to bind values from the scene into variables in working memory. Figure 2.16 is a schematic illustration of a strategy that relies heavily on working memory. At the beginning of the task, the subject inspects the model and loads the color and position of the three blocks forming the model's top row into working memory. We can think of this action as "binding the values from the fixation point into working memory." Several variables are needed in the strategy shown in Figure 2.16, and they must retain their contents for varying durations. In this example the position of block #3 must be held until the third block is finally placed in the workspace. Using working memory in this way to 'load up' information from the model minimizes the number of eye movements to the model area. After completing the three blocks the subject could fixate the second row of the model and bind that set of variables into working memory. This series of actions would be repeated until the entire model was duplicated. This strategy would appear as an initial block with one or more fixations (e.g., >MPMD, MPMD, MPD, or PMD) followed by two PD block moves, a sequence that was not observed in these experiments.

Figure 2.16 Working memory load as a function of time for a subject who executes three blocks 'from memory.'

Figure 2.17 illustrates a strategy that does not rely on working memory to maintain information about blocks other than the immediate block being copied. In this example, each block is fixated just before it is copied, requiring only that block's color and position to be bound to variables in working memory. Both the number of variables and the duration for which they must be held are reduced by trading off more frequent model fixations for a lower working memory load. In this case, the eyes must be moved to the model area for each block copied, but only two variables are required. Subjects are apparently using fixation as a "pointer" to elements in the environment to gain the same representational economies that machine vision systems have shown by using deictic pointers [see Agre and Chapman 1987; Ballard 1989, 1991; Brooks 1991]. This use of fixation is central to the tradeoff between eye movements and working memory load. Figure 2.17 illustrates the MPD strategy, in which each block move is preceded by a model fixation. Figure 2.18 shows the minimum memory strategy; each block is fixated twice -- once to load its color, then a second time to load its position. Such a strategy could use just one deictic variable by postponing the second model fixation until after a block of the needed color was picked up from the resource area. This MPMD strategy relies the least on working memory, requiring only a single variable, but requires twice as many model references as the MPD strategy shown in Figure 2.17. In spite of the number of eye movements required to copy the model pattern using the MPMD strategy, it was the most common strategy. Analysis of the time subjects spent in the model area during the two model references in the MPMD strategy is consistent with the interpretation that subjects are gathering different information during the two references to the model.
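
The tradeoff described by the three schematic timelines can be made concrete with a toy simulation. The sketch below is an illustration only, not a model fitted to the data: it encodes each strategy as a list of hypothetical events ("fixate" for a model reference, "bind" to store a value in working memory, "use" to consume and release it) and reports the number of model references and the peak number of variables held while copying three blocks. It assumes that the 'load up' strategy needs a single model reference per row and that a variable is freed as soon as it is used; the function names are invented for this example.

```python
def load_up(n_blocks=3):
    """One model inspection binds color and position of every block (cf. Figure 2.16)."""
    events = [("fixate", None)]
    for b in range(1, n_blocks + 1):
        events += [("bind", f"color{b}"), ("bind", f"pos{b}")]
    for b in range(1, n_blocks + 1):
        # Pick up using the color, then drop using the position.
        events += [("use", f"color{b}"), ("use", f"pos{b}")]
    return events


def mpd(n_blocks=3):
    """One model reference per block binds both color and position (cf. Figure 2.17)."""
    events = []
    for b in range(1, n_blocks + 1):
        events += [("fixate", None),
                   ("bind", f"color{b}"), ("bind", f"pos{b}"),
                   ("use", f"color{b}"), ("use", f"pos{b}")]
    return events


def mpmd(n_blocks=3):
    """Two model references per block, one feature each (cf. Figure 2.18)."""
    events = []
    for b in range(1, n_blocks + 1):
        events += [("fixate", None), ("bind", f"color{b}"), ("use", f"color{b}"),
                   ("fixate", None), ("bind", f"pos{b}"), ("use", f"pos{b}")]
    return events


def summarize(events):
    """Return (number of model references, peak number of variables held)."""
    held = set()
    peak = 0
    fixations = 0
    for action, variable in events:
        if action == "fixate":
            fixations += 1
        elif action == "bind":
            held.add(variable)
            peak = max(peak, len(held))
        elif action == "use":
            held.remove(variable)
    return fixations, peak


if __name__ == "__main__":
    for name, strategy in [("load-up", load_up), ("MPD", mpd), ("MPMD", mpmd)]:
        fixations, peak = summarize(strategy())
        print(f"{name}: {fixations} model references, peak load {peak} variables")
```

Under these assumptions the three strategies trade one quantity against the other: a single model reference with six variables held at once, three references with two variables, and six references with one variable.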

These results show that subjects choose to refer to the external world, acquiring (and re-acquiring) information just as it is needed, rather than relying on working memory to hold even relatively simple information relevant to the task. The frequency of eye movements used to minimize working memory suggests that working memory load is expensive relative to eye movements. In the next chapter we will explore to what degree subjects trade off eye movements and working memory, and how the balance between eye movements and working memory can be manipulated by task constraints.

Figure 2.17 More frequent model fixations reduce the working memory load.

Figure 2.18 Working memory load is minimized in the MPMD strategy. Only the color or position of a single block is held at any time.

By: Jeff B. Pelz
Center for Imaging Science, Rochester Institute of Technology
Department of Brain and Cognitive Sciences, University of Rochester