
Visual Representations in a Natural Visuo-motor Task

Jeff B. Pelz
Center for Imaging Science, Rochester Institute of Technology
Department of Brain and Cognitive Sciences, University of Rochester
1995

Chapter 6: General Discussion

The goal of this dissertation was to explore visual representations and visuo-motor coordination in the context of complex, ongoing behavior. This required the development of a new laboratory facility capable of monitoring eye, head, and hand movements at a fine time-scale. The task selected was a block-copying paradigm that reflects essential perceptual, motor, and cognitive operations. The task involves important aspects of natural behavior, yet is sufficiently constrained that the computations necessary to perform each subtask can be made explicit. While we have known since Yarbus' [1967] work that eye movements reflect cognitive events, little progress has been made in understanding this relationship [Viviani 1990]. Because of the design of the block-copying task used in these experiments, however, fixations are tied closely to the perceptual, cognitive, and motor subtasks that constitute subjects' overall behavior in the task. When subjects fixate the model area after dropping a block in the workspace, then pick up a block from the resource area, we can infer that the model fixation served to acquire the color and/or position of a block in the model pattern. Similarly, fixations in the resource and workspace areas accompanying block pickups and drops can be assumed to assist in targeting hand movements.

The development of the new laboratory opened up what has until now been largely unexplored territory. The ability to monitor natural behavior at a fine time-scale, along with the advantages of the block-copying task, provides new insight into the way humans utilize visual information, working memory, and visuo-motor coordination in the performance of natural tasks.

The first observation was that subjects make very frequent eye movements, returning to inspect the model pattern again and again while copying the eight colored blocks. Eye movements were used to serialize the task into simpler subtasks, which were executed sequentially. The constraints of the task and the subjects' common, stereotyped behavior led to a relatively small number of strategies used to copy each block. Trials were analyzed, and individual block moves were categorized into one of five 'strategies,' labeled in terms of the series of fixations executed while copying the block. The modal strategy was the "Model-Pickup-Model-Drop" (MPMD) strategy, in which the subject looked first to the model, then to the resource area (to guide the block pickup), returned gaze to the model, and finally on to the workspace to guide the block drop. This strategy requires two references to the model area for each block copied. The number of model references averaged 1.5 per block, though the value was not constant over the course of the eight-block trial. It was higher for the first block (2.0 looks/block) and lower for the last (1.1), indicating that some representation was built over the course of the trial. It is important to note, however, that the change was not dramatic, and the average number did not fall below 1.0, even for the last block.

Because subjects were given no direction on how to perform the task, other than to complete the copy as quickly as possible without making errors, it is noteworthy that subjects chose to complete the task by referring to the model so frequently. It is interesting to return to Table 1.1 (page 10) and note that no subjects used the alternative strategy of first locating and identifying several objects in the scene (quadrant IV in Table 1.1), then moving a number of blocks without fixating the model again.

The subjects' use of temporary, task-specific visual representations suggests that vision may be much more 'top-down' than was previously thought. This thesis challenges the idea that the visual system's task is to gather information for integration into a high-fidelity, general-purpose representation of the environment without regard to the immediate task. In the classical view of visual perception (also embraced by traditional computer vision approaches [e.g., Marr 1982]), planning and cognition were performed by referencing the internal representation. The frequent eye movements used by subjects in these experiments suggest that in real tasks, humans apparently maintain only sparse, transient representations of task-relevant information, in concert with dynamic deictic markers that refer to elements in the environment. Given subjects' preference for acquiring color and position information in separate model references, it appears that even representations of task-relevant information may be minimal and short-lived. Note that the retinal images during the first and second model fixations in an MPMD sequence are virtually identical, but the information held in working memory (e.g., the position or color of a block) is determined by the 'step' in the sequential program being executed. Whatever internal representation may have been built during the first model fixation was sparse enough (and/or had decayed enough) to require the second model reference. The inference that different information is gathered in successive model references was supported by an analysis of the fixation times in the first and second model references in MPMD sequences: in moves executed with only one model fixation in each model reference, the first model fixation (color) was shorter than the second (position).
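To make the categorization concrete, the following sketch shows one way block moves could be reduced to strategy labels and looks-per-block statistics. The M/P/D encoding follows the fixation labels used in Figure 6.1; the trial data, the function names, and the single-reference 'MPD' variant are invented for illustration and are not the analysis code used in these experiments.

# Illustrative sketch (hypothetical, not the dissertation's analysis code):
# categorize each block move by the series of fixations executed while
# copying the block. Encoding assumed here: M = model fixation,
# P = resource fixation with pickup, D = workspace fixation with drop.

def strategy_label(fixations):
    """Collapse a block move's fixation sequence into a strategy label,
    e.g. ['M', 'P', 'M', 'D'] -> 'MPMD'."""
    return "".join(fixations)

def model_references_per_block(moves):
    """Average number of model ('M') references per block copied."""
    return sum(move.count("M") for move in moves) / len(moves)

# A hypothetical eight-block trial mixing the modal MPMD strategy with a
# single-reference MPD variant; this mix happens to reproduce the
# observed 1.5 looks/block average.
trial = [strategy_label(list(s)) for s in
         ["MPMD", "MPD", "MPMD", "MPD", "MPMD", "MPD", "MPMD", "MPD"]]
print(model_references_per_block(trial))  # 1.5 looks/block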

Another series of experiments led to the observation that the trade-off between frequent eye movements and working memory load is flexible, and that subjects are capable of dynamically adjusting the balance based on task demands. One set of experiments manipulated task demands in two ways: by increasing the cost of model references, and by reducing the information content of the model pattern. In the first case, the cost of frequent model references was increased by placing a greater distance between the model, resource, and workspace areas. The 'far' configuration slowed performance in general, but strategies containing multiple model references were affected the most. Subjects modified strategy use in the 'far' condition, reducing reliance on frequent model references. The average number of looks per block copied fell, but even when model references required large eye, head, and torso movements, the average number of model references per block did not fall below 1.0, indicating that subjects still chose to make frequent eye movements rather than copy multi-block patterns 'from memory.'

Task demands were also manipulated by reducing the information content of the model pattern. In one experiment, the complexity of the model was reduced by using monochrome model patterns, leaving only information about the blocks' positions. In the second experiment, a 'linear' model pattern retained color information, but the positions of all blocks were fully determined once the first block was placed in the workspace. In both cases the relative frequency of strategies containing two or more model references fell, leading to a reduction in the average number of looks per block, though again the average value never fell below one model reference per block copied. Subjects' behavior in the 'monochrome' and 'linear' conditions supports the interpretation that the frequent model references in the control condition are used to acquire color and position information separately. If those frequent model references were merely artifacts of the block-copying task, we would not expect a reduction in model complexity to affect subjects' strategies.

Task demands were further manipulated by requiring the subject to perform a secondary task while copying the block pattern. When a concurrent cognitive load was imposed, subjects relied even more heavily on frequent eye movements to the model and less on working memory. When the cognitive load was combined with the 'far' configuration, subjects struck a balance between the opposing constraints: the number of references subjects made to the model area under the 'far-attend' condition was midway between the observed performance in the control and 'attend' conditions. Taken together, the results of the experiments that manipulated the cost of model references or the complexity of the model pattern demonstrate that subjects use multiple model references to acquire the color and position of the model pattern, and are capable of adjusting the trade-off between frequent eye movements and working memory based on immediate task constraints.

A novel feature of this dissertation was the development of a laboratory facility to monitor eye, head, and hand movements while subjects performed natural tasks. The facility developed for these experiments also allows the study of how those movements are coordinated in natural behavior. An important observation was that some subjects programmed independent eye and head movements, dissociating their spatial and temporal trajectories. This observation is inconsistent with current models of eye and head movements, which postulate a common gaze shift goal sent in parallel to eye and head motor systems [e.g., Guitton 1992, van der Steen 1992]. Recent reports by Land [1992] and Kowler et al. [1992] have provided experimental support for such coupling between eye and head trajectories. While the experiments reported in this dissertation revealed wide variability between subjects, there is no doubt that humans are capable of executing independent eye and head movements to different targets. This demonstrates once again the importance of understanding the task-dependency of complex behaviors. Land [1992] probably did not observe any dissociation between eye and head movements because his task required only relatively slow, horizontal gaze changes; Kowler et al. [1992] studied reading and tasks requiring similar eye and head movements. While stressing the 'natural tendency' of the eye and head to move together, they noted that on some occasions the eye and head moved in opposite directions. In the present study, the two-dimensional nature of the task, along with the time pressure under which subjects worked, led to a situation where performance could be optimized by dissociating eye and head trajectories. While subjects exhibited regular, rhythmic patterns of eye, head, and hand movements while performing the block-copying task, significant asymmetries were observed, demonstrating the task-dependence of head and hand movements as well. The temporal coordination of eye and head movements and of eye and hand movements, as well as head and hand dwell-times, differed when gaze was moved into or away from the resource area. In this case, the constraints of the subtask being performed affected subjects' motor actions.

Performing the block-copying task involves a complex set of perceptual, cognitive, and motor primitives. The planning and execution of those primitives to create coherent behavior requires the allocation of limited central resources. The secondary task used in the cognitive load conditions requires some of the same resources. When subjects performed the block-copying task with a concurrent cognitive load, several aspects of eye/head performance were affected. The amplitude of head movements and the peak cross-correlation were both affected, but of particular interest is the change in head targeting. The added cognitive load led to dramatic changes in some subjects' head trajectories. While some subjects often executed independently targeted eye and head movements in the control condition, the dissociation of eye and head movements fell dramatically under the cognitive load condition.

One of the goals of this thesis was to explore the possible cognitive role of fixations. The experiments have demonstrated that fixations indeed play a crucial cognitive role in perception. A critical aspect of this role is their use in binding task-relevant information to variables in working memory. This conception of working memory as the set of currently active marked variables leads to a simple interpretation of the tradeoffs between working memory and eye movements, in which fixation can be seen as the choice of an external rather than an internal marker. These experiments also suggest that another aspect of the cognitive role of eye movements is in indexing the execution of sequential programs. Subjects' behavior in performing the block-copying task can be understood as the successive application of the what, where, and manipulation primitive actions described in Table 1.1 on page 10. Figure 6.1 illustrates a sequential program executed as a subject performs an MPMD block move. What, where, and manipulation primitives are successively applied to gather information from the scene and guide movements. Each step in the program is indexed by fixations in the model, resource, and workspace areas. This demonstrates how a small number of primitives used in a simple control program can be generalized to create more complex behaviors.

This interpretation of the cognitive role of eye movements may also offer insights into the classic division of visual pathways into dorsal and ventral streams [Mishkin, Ungerleider, & Macko 1983]. Positioning and binding of the dynamic markers may be performed by the dorsal stream to parietal cortex, while the ventral stream is used to acquire features (e.g., color and position) from the marked locations. Visual search would require the participation of both systems.

Interpreting eye movements and brain computations in terms of binding variables in behavioral programs blurs the distinction between perception and cognition, which have traditionally been thought of as different domains. Historically we have been accustomed to thinking of the job of perception as creating rich, task-independent descriptions of the world which are then re-accessed by cognition [e.g., Marr 1982]. These experiments suggest that the role of perception may be much simpler, since it need only create descriptions that are relevant to the immediate task. To the extent that manipulations on a given block are largely independent of the information acquired in previous views, performance in this task suggests that an elaborate scene description is unnecessary and that unattended information receives only minimal processing. In addition, since color and location information appear to be acquired separately, even in the attended regions the perceptual representation may be quite minimal. These observations support the suggestion made previously that only minimal information about a scene is represented at any given time, and that the scene can be used as a kind of "external" memory [O'Regan and Levy-Schoen 1983; O'Regan 1992; Irwin 1991; Irwin 1992]. A related suggestion has also been made by Nakayama [1990].

The question of the complexity of the internal representation is often tied to the mechanisms responsible for visual stability and to the ability to integrate information across eye movements. If a high-fidelity, pictorial representation were in fact built up from information gathered over several eye movements, as suggested by McConkie & Rayner [1976], then we would have access to a stable representation of the environment. The converse argument (that because the world appears stable across eye movements, we must have access to a rich internal representation) is not necessarily true. If only task-relevant information is integrated across eye movements, then even sparse representations can contribute to visual stability.

These results suggest a new interpretation of the limitations of human working memory as well. Rather than viewing the capacity limit as a fundamental constraint on the brain [e.g., Just & Carpenter 1976], we can view it as an inevitable consequence of an efficient system that uses deictic variables to preserve only those products of the brain's computations that are necessary for the ongoing task. In natural tasks performed in complex environments, it may be that retaining information from previous fixations would make tasks more difficult. When task-relevant information exists in the environment, it is simpler to reference the external data with a small number of markers serving as pointers than to load all the information into working memory. Shifting a single pointer then takes the place of 'clearing' a complex data set and replacing it with a new set.
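The pointer analogy can be made concrete with a small sketch. Everything here (the Scene and WorkingMemory types, the slot count, the example pattern) is a hypothetical illustration of deictic reference, not a model taken from the dissertation: re-binding a marker changes what a slot points to in the external scene, while the scene's contents are never copied wholesale into memory.

# Hypothetical illustration of deictic markers as pointers into the scene.
# Working memory holds a few marker slots; each binds to a location in the
# external world, and information is read through a marker only when the
# current subtask needs it.

class Scene:
    """External world: the full model pattern lives out here, not in memory."""
    def __init__(self, blocks):
        self.blocks = blocks                # {location: color}

    def read_color(self, location):
        return self.blocks[location]        # a 'what' access via fixation

class WorkingMemory:
    """A small, fixed number of deictic marker slots."""
    def __init__(self, n_slots=3):
        self.slots = [None] * n_slots

    def bind(self, slot, location):
        # Shifting a single pointer replaces 'clearing' a complex data set
        # and reloading it: only the binding changes, not stored contents.
        self.slots[slot] = location

scene = Scene({(0, 0): "red", (0, 1): "blue", (1, 0): "green"})
wm = WorkingMemory()
wm.bind(0, (0, 0))                          # fixate the first model block
print(scene.read_color(wm.slots[0]))        # 'red' -- read only when needed
wm.bind(0, (0, 1))                          # re-bind the marker to the next block
print(scene.read_color(wm.slots[0]))        # 'blue'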

Being able to keep "14 +/- 2" items in working memory instead of "7 +/- 2" would enhance performance in experiments designed to determine such limits, but it would not necessarily be a benefit when performing natural behaviors. The limited number of variables is only a handicap if entire tasks have to be completed from memory; in that case, working memory may be overburdened. So while we experience the limitations of working memory when trying to remember two unfamiliar phone numbers simultaneously, we are able to copy a pattern of eight colored blocks, cook a meal, and navigate an expressway on-ramp with relative ease. Yet the state-space required to complete any one of these actions is greater than that required to remember 14 digits, and would far exceed the capacity of working memory if all the relevant information had to be maintained internally at the same time; copying the block pattern 'from memory,' for example, would require holding the color and position of all eight blocks, roughly sixteen items, at once. Serializing the tasks with frequent eye movements (i.e., fixating relevant features only when that information is necessary) simplifies the tasks by keeping the instantaneous state-space to a minimum. The cost of searching alternatives and learning new behaviors scales exponentially with the number of markers [Ballard, Hayhoe, & Pook 1995], so there is great pressure to limit the number of markers and to find behavioral programs that operate with a minimum number of them. Having to operate with a small number of markers is less restrictive than it appears when we note that the effective capacity of working memory is limited not by the number of markers themselves but by the amount of information that can be 'pointed to' with them. There is also a serial vs. parallel tradeoff: instantaneous state-space requirements can be minimized by breaking tasks down into subtasks that can be executed sequentially, reusing markers freed up after the previous step.

This dissertation provides support for a different approach to studying visual processing, in which vision is viewed as more top-down than previously supposed. In such an approach, the task takes on particular importance, because the observed behavior cannot be divorced from the immediate task(s). Acknowledging the task-specific nature of vision presents a challenge: how can the results of any experiment be generalized beyond the task used in that experiment? If complex behaviors are viewed as sequential programs made up of a relatively small number of simple primitives, then experiments designed to identify those primitives are the first step toward understanding their application in a wide variety of visual behaviors.

Operation                       Primitive
-----------------------------   ------------------------
Select target in model          where
Shift gaze to model             eye & head movements (M)
Get color                       what
Select target in resource       where
Shift gaze to resource          eye & head movements (P)
Move hand to fixation point     manipulation
Pickup                          manipulation
Select target in model          where
Shift gaze to model             eye & head movements (M)
Get location                    what
Select target in workspace      where
Shift gaze to workspace         eye & head movements (D)
Move hand to fixation point     manipulation
Drop                            manipulation

Figure 6.1 Sequential program of 'what,' 'where,' & 'manipulation' primitives representing an MPMD block move sequence.
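Written out as code, the program in Figure 6.1 might look like the sketch below. The Scene, Gaze, and Hand types and all primitive names are hypothetical stand-ins for the 'what,' 'where,' and 'manipulation' primitives of Table 1.1; the sketch illustrates the sequential structure, not an implementation from the experiments. Note that each 'what' result (color, then location) is consumed by the immediately following steps and then abandoned, so no more than one acquired item is in play at any step.

# A minimal executable sketch of the MPMD program in Figure 6.1.
# All types and primitive names are hypothetical stand-ins.

class Scene:
    def __init__(self):
        self.next_model_block = {"color": "red", "location": (0, 1)}

    def select_target(self, area, **constraints):      # 'where' primitive
        return (area, constraints)

    def get_color(self, fixation):                     # 'what' primitive
        return self.next_model_block["color"]

    def get_location(self, fixation):                  # 'what' primitive
        return self.next_model_block["location"]

class Gaze:
    def shift_to(self, target):                        # eye & head movements
        self.fixation = target

class Hand:
    def move_to(self, fixation):                       # 'manipulation'
        self.position = fixation

    def pickup(self):                                  # 'manipulation'
        return "block"

    def drop(self, block):                             # 'manipulation'
        pass

def copy_block_mpmd(scene, gaze, hand):
    gaze.shift_to(scene.select_target("model"))                  # where; M fixation
    color = scene.get_color(gaze.fixation)                       # what: color only

    gaze.shift_to(scene.select_target("resource", color=color))  # where; P fixation
    hand.move_to(gaze.fixation)                                  # manipulation
    block = hand.pickup()                                        # manipulation

    gaze.shift_to(scene.select_target("model"))                  # where; M fixation
    location = scene.get_location(gaze.fixation)                 # what: position only

    gaze.shift_to(scene.select_target("workspace",
                                      location=location))        # where; D fixation
    hand.move_to(gaze.fixation)                                  # manipulation
    hand.drop(block)                                             # manipulation

copy_block_mpmd(Scene(), Gaze(), Hand())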

