Visual Representations... Chapter 1 (J. B. Pelz)

Visual Representations in a Natural Visuo-motor Task

Chap. 1: Introduction

Visual perception is an extended process that naturally occurs in the context of complex behaviors in dynamic, feature-rich environments with literally thousands of potential fixation targets. Yet its study has been limited in large part to the perception of simple, static stimuli in reduced environments with the subject holding fixation, or perhaps making simple eye movements. This is due in part to limitations in instrumentation, and in part to a belief that complex systems can be understood in terms of simpler component processes. Instrumentation now exists that allows the study of vision in more natural settings. Moving beyond the investigation of small subsystems presents several challenges, but in return we are able to ask new kinds of questions about vision in its normal behavioral context. Studying vision in this context (and in the associated complex environments) compels us to consider the nature of the internal representations used by humans in the performance of complex, real-world tasks. It is not yet clear what type of internal representation exists, nor the manner in which it is created. At one extreme, successive fixations made over a period of time could be integrated into a high-fidelity, general-purpose representation of the immediate environment. At the other extreme, transitory task-specific representations could be computed for each fixation, with little carry-over between fixations. Such moment-by-moment representations do not match our conscious perception, but there is evidence that subjects do not have access to the kind of high-fidelity internal representations that introspection might lead one to postulate, and the 'internal store' doctrine has been questioned by several investigators.
Gibson [1966], O'Regan [1992], and others have proposed that the same subjective impression of an internal replica of the visual environment could be achieved through active perception of the environment by treating that environment as an 'external store,' though little evidence has accrued to date to support the proposal.

The study of how internal representations are formed and utilized leads to questions about how working memory is used in performing complex, multi-step tasks. The nature of internal representations is intimately linked to the study of short-term, or working, memory. Baddeley defines working memory as "... a system for the temporary holding and manipulation of information during the performance of a range of cognitive tasks such as comprehension, learning, and reasoning." [Baddeley 1986, p. 34] Rich internal representations require substantial working memory, a resource that we know to be limited. The complex visuo-motor tasks that we perform daily also rely on working memory. Most studies of working memory have focused on determining its upper bounds, so while we know a great deal about the limits of memory systems, we know less about how memory is actually utilized in natural tasks, and we have little understanding of the computational constraints those limits place on the system as a whole.

Recent advances in instrumentation now make it possible to perform experiments in which subjects perform complex tasks under natural conditions. In the past, most research on visual perception and eye movements has been limited to the examination of single events. Experiments on the oculomotor system have focused on understanding the mechanical properties and limitations of the isolated components rather than their combined contribution to the strategy of perception. The study of visual processes has been segregated from the study of motor processes like eye movements, though they must be intimately linked. Whether the question concerned how much information can be gained in a single glimpse or the mechanics of eye movements, the research was limited to the small building blocks of visual perception. But visual perception is not just what happens in 200 msec; it operates continuously. Studying isolated eye movements made in reaction to reduced stimuli forced eye movements into the reflex mode. With the exception of reading, we know very little about how visual behaviors unfold over time, or about the sequence of eye movements observers use when they are free to choose their own strategies in the context of performing natural tasks. Making accurate measurements of eye movements is difficult, and until recently it required that the head be immobilized. Kowler [1990] described some early measurement techniques in which the experimenter "photographed a droplet of mercury placed on the limbus. Translations of the head were minimized by having subjects lie on a stone slab with their heads wedged tightly inside a rigid iron frame." While such measures may now seem extreme, there is evidence that even requiring a subject to stay on a biteboard alters oculomotor performance [Kowler et al. 1992; Collewijn et al. 1992].
Another result of the limitations of previous eye movement monitors is that we know very little about how the eye and head work together in natural gaze changes.

Eye movements are integral to visual perception; if we are to understand vision, we will need to understand the role that eye movements play. If we no longer consider eye movements as reflexive reactions to the environment, we can begin to investigate the processing underlying eye movements. The work of Yarbus [1967] was very important in expanding the view of the role of eye movements as externally visible reflections of cognitive events. In one of Yarbus' classic experiments on eye movements during perception of complex objects, he monitored subjects' eye movements as they viewed Repin's painting, "The Unexpected Visitor." Before viewing, the subjects were instructed to perform one of seven tasks: 1) free examination of the picture, 2) estimate the material circumstances of the family in the picture, 3) give the ages of the people, 4) surmise what the family had been doing before the arrival of the "unexpected visitor," 5) remember the clothes worn by the people, 6) remember the position of the people and objects in the room, and 7) estimate how long the "unexpected visitor" had been away from the family. The pattern of eye movements and fixations varied dramatically with different instructions to the subjects. Figure 1.1 shows the scan paths for a single observer viewing the painting under each of the seven instructions.

Yarbus concluded that "... the distribution of the points of fixation on an object, the order in which the observer's attention moves from one point of fixation to another, the duration of the fixations, the distinctive cyclic pattern of examination, and so on are determined by the nature of the object and the problem facing the observer at the moment of perception." [Yarbus 1967, p. 196] The fact that eye movement patterns are not determined by the stimulus alone, but are dependent on the task being performed suggests that eye movements are an integral part of perception and not simply a mechanism evolved to deal with the 'foveal compromise' (the uneven distribution of photoreceptors across the retina that allows both high resolution and a wide field of view).

While Yarbus' work demonstrated the coupling of eye movements and perception, it is not clear how his results relate to more normal conditions. Yarbus' subjects were required to view the painting for three minutes after each instruction. While there were

Figure 1.1 Eye movement records for one subject while viewing the painting "An Unexpected Visitor." [Yarbus 1967, Fig. 109].

clear differences in the eye movements recorded under the different instructions, this time scale is more normally associated with cognitive processes than perceptual processes. We are left to guess the underlying perceptual and cognitive processes tied to any individual fixation, and since the subject had no control over trial duration and no task to perform we cannot even know which fixation clusters occurred while the subject waited patiently for the trial to end. The "Unexpected Visitor" experiment demonstrated that high-level cognitive processes affect fixations, but such experiments do not allow any inferences to be made about the purpose of any fixation or series of fixations. Such inferences are only possible if the subject is performing a task where the computations necessary for task performance can be made explicit while maintaining natural behavior.

The issues surrounding the nature of visual representations also arise in the context of computer vision. Implicitly or explicitly, the design of computationally-based artificial vision systems has traditionally been driven by a desire to model the subjective experience of human vision. This approach dominated work in machine vision for a decade [see e.g., Marr 1982, Feldman 1985, Levine 1985]. In this strategy, the entire scene is analyzed, building models of progressively higher dimensionality, eventually reaching the "3-D Model Representation," an approach that fits with our intuition about what it feels like to 'see.' While observers may be conscious of 'paying attention' to various aspects of a scene, it is usually assumed that attention is a cognitive mechanism applied to an at least partially analyzed perceptual representation. Computational models have focused on algorithms for reaching high-level descriptions of the scene from various cues (e.g., the shape-from-X and structure-from-motion algorithms), or a semantic representation of the scene [e.g., From Pixels to Predicates, Pentland 1986]. In general, these problems have proven to be very difficult except in very restricted, artificial environments. The approach common to all these systems is a full-scale assault on the sensor data before specifying the immediate action to be planned or the overall task to be performed. But the visual information necessary for any given task is not uniformly distributed in space and time, so a vision system that applies its resources uniformly, without regard to the immediate task, is inherently inefficient. For example, given the task of identifying a person standing behind a chain link fence, such a system would devote the same computational resources to processing the chain-link fence as to the face behind it.

In the last decade, some workers in the field of machine vision systems have found that allowing camera movement may simplify some of the representational problems. The task-dependent nature of biological vision systems served as a model for several early proponents of 'active vision' [Aloimonos et al. 1987, Bajcsy 1988, Ballard 1989, Brooks 1991]. The study of attention and eye movements was crucial in much of that work, and early papers on active vision paid homage to Yarbus' [1967] work which demonstrated the task-dependent character of human eye movements. The idea that optimal behaviors for collecting visual information are dependent on the particular task is central to the advantages offered by the 'active vision' systems that have been implemented in computer vision. The early proponents of active vision shifted their efforts away from attempts at general-purpose "image understanding," searching instead for behaviors of the agent that could efficiently probe the environment for information where and when it was needed.

Physical motion of the camera provides important advantages to active vision systems. Control over (and knowledge of) the camera's location constrains the images in a manner that allows conclusions to be drawn that are impossible without those constraints. For example, Ballard and Ozcandarli [1988] demonstrated a motion parallax system that computed the depth of objects based on their motion relative to the fixation point. The ability to fixate a single point while translating dramatically reduced the computational resources necessary to determine object distance. Systems using two cameras have shown computational economies by using disparity information derived from the two images.
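The economy of fixation-relative depth can be illustrated with a toy small-angle model (my own sketch, not Ballard and Ozcandarli's implementation): if the camera translates laterally while counter-rotating to keep the fixation point centered, the residual image velocity of any other point depends only on the difference between its inverse depth and the inverse depth of the fixation point, so a single measured velocity yields depth directly.

```python
def parallax_image_velocity(Z, Z_fix, T, f=1.0):
    """Image velocity of a point at depth Z while the camera translates
    laterally at speed T and rotates to keep a point at depth Z_fix
    centered (small-angle approximation; f is focal length)."""
    return f * T * (1.0 / Z - 1.0 / Z_fix)

def depth_from_parallax(v, Z_fix, T, f=1.0):
    """Invert the relation: recover depth from one measured image velocity."""
    return 1.0 / (v / (f * T) + 1.0 / Z_fix)
```

Note that the sign alone already gives depth ordering relative to the fixation point: points nearer than fixation move one way in the image, points farther move the other way, with no search over correspondences required.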

The crucial aspect of these models is that they allow frequent access to the sensory input during the problem-solving process [Brooks 1991; Agre and Chapman 1987; Ballard 1989, 1991]. Agre and Chapman [1987], building on work by Ullman [1984], introduced the term 'deictic primitives' to refer to transient pointers used to mark aspects of a scene (e.g., color or shape). Such aspects are dynamically referred to by indicating that part of the scene with a special marker. The term 'deictic' comes from linguistics, where a class of pronouns (e.g., 'this,' 'that,' and 'those') is termed 'deictic' (from the Greek deiktikos, to show or point). These deictic words result in significant representational economies in language, allowing one to point and say "I want those" instead of "I want to purchase the pair of size 10, black, high-topped, canvas athletic shoes in the third row of boxes above the floor and 4 boxes to the left of the door."

The power of computer vision systems using deictic markers comes from their use of a relatively small number of these markers, binding them to areas of a scene only as long as they are relevant to the immediate task, then moving them to another area. Memory requirements for systems using deictic strategies are also dramatically lower than for non-deictic systems. Because only a small portion of the visual scene is represented at any time, there is no need to maintain a large internal store. If information from a particular region of the scene must be referred to again later, it is only necessary to store the location (or feature vector) of the marker, rather than all the information to be found at that location. Table 1.1 illustrates the simplification allowed by systems that do not rely on representations in which the positions and properties of all objects in the scene must be represented in viewer-centered coordinates. Quadrant IV represents the goal of 'traditional' image understanding algorithms: given a (static) scene, locate and identify all objects in the scene. This requires that all known models be applied to all areas of the image -- a task that has proven very difficult. Deictic variables can reduce the computational complexity by marking portions of a scene, simplifying the problem into a series of two kinds of tasks: "what" and "where." Quadrant II represents a "what" task: given a single image location, determine the identity of the marked object from among many possible models. Quadrant III is the "where" task: given a single internal model, search the image space to determine the location of the marked model. This simplification leads to dramatically faster algorithms for each of the specialized tasks [Swain and Ballard 1991; Swain et al. 1992; Ballard and Rao 1994]. Once an item's identity and location are known, it can be manipulated (physically or symbolically).
So the organization of visual computation into WHAT/WHERE modules may have a basis in complexity. Trying to match a large number of image segments to a large number of models at once may be too difficult. The problem can be made tractable by serializing the problem into sequences of 'what' and 'where' functions.

Table 1.1 Organization of visual computation into WHAT/WHERE modules.

                  One Model                      Many Models
   One Image   I. Manipulate an object       II. "What?"  Identify an
   Part        whose identity and location   object whose location is
               are known.                    known.
   Many Image  III. "Where?"  Locate a       IV. Locate and identify
   Parts       known object in the scene.    all objects in the scene.
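The complexity argument can be made concrete with a toy sketch (the feature vectors, object names, and matching rule here are my own illustrative assumptions, not taken from the systems cited): exhaustive Quadrant IV matching costs on the order of (models x image regions) comparisons, while each serialized "what" query scans only the models, and each "where" query only the regions.

```python
import random

random.seed(0)

def feat():
    """A random 8-dimensional feature vector standing in for image content."""
    return [random.random() for _ in range(8)]

models = {name: feat() for name in ["cup", "block", "phone", "pen"]}
scene = {loc: feat() for loc in range(50)}          # one vector per image region
scene[17] = [x + 0.01 for x in models["block"]]     # plant a known object

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def what(location):
    """Quadrant II: identify the object at one marked location
    (scans only the model set -- |models| comparisons)."""
    return min(models, key=lambda m: dist(models[m], scene[location]))

def where(model_name):
    """Quadrant III: locate one known model in the scene
    (scans only the image regions -- |scene| comparisons)."""
    return min(scene, key=lambda loc: dist(scene[loc], models[model_name]))
```

Chaining the two queries replaces the 4 x 50 = 200 comparisons of the exhaustive case with two passes of 4 and 50 comparisons, and the deictic marker need only retain a location or feature vector between passes.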

These deictic representations, which allow localization and interaction with respect to the current fixation point rather than in an absolute, camera-centered reference frame, are particularly valuable when vision is used to control actions, i.e., in visuo-motor tasks. Fixation-based, exocentric reference frames permit closed-loop 'servoing' actions, eliminating complex reference frame transformations and making the systems less 'brittle' in the face of errors. Traditional systems that rely on visual localization with respect to egocentric reference frames are intolerant of even relatively small errors in localization and/or representation of the motor agent's position.
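The contrast between closed-loop servoing and a one-shot egocentric plan can be sketched in a few lines (a toy illustration with a made-up gain and calibration error, not a model of any particular system): the servo loop re-measures the hand-to-target error on every iteration, so a miscalibrated step is corrected on the next pass, while the open-loop move inherits its calibration error permanently.

```python
def servo_to_target(hand, target, gain=0.5, tol=1e-3, max_steps=100):
    """Closed-loop 'servoing': re-measure the hand-to-target error in the
    current (fixation-centered) frame each step and correct a fraction of
    it, so residual errors shrink on later iterations."""
    for _ in range(max_steps):
        error = [t - h for t, h in zip(target, hand)]
        if sum(e * e for e in error) ** 0.5 < tol:
            break
        hand = [h + gain * e for h, e in zip(hand, error)]
    return hand

def open_loop_move(hand, target, miscalibration=0.05):
    """One-shot egocentric plan: a single move computed from a slightly
    miscalibrated transform; the residual error is never observed or fixed."""
    return [h + (1 + miscalibration) * (t - h) for h, t in zip(hand, target)]
```

Under this toy model the servo converges to within tolerance despite never knowing its calibration exactly, whereas the open-loop move lands a fixed fraction off target, which is the 'brittleness' referred to above.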

Since the utility of deictic systems has been demonstrated in computational systems, we are led to ask whether similar representational economies may be exploited in biological vision systems. Humans have limited working memory, and the eyes offer a natural implementation of deictic strategies, one that might allow complex tasks to be performed without elaborate internal representations being held in memory. This immediately raises the question of whether humans in fact use eye movements in this way during natural behaviors. One of the goals of this thesis is to explore the applicability of the computational idea of deictic representations to human vision.

However, care must be taken in selecting a task for study. Because we do not have any direct access to the strategies employed by humans in most complex tasks, one can only infer those strategies from subjects' behavior. In free viewing of an image or scene, subjects make a series of fixations, separated by saccadic eye movements. The resulting 'scanpaths' have been studied in attempts to infer the underlying cognitive goals of the viewers [e.g., Noton and Stark, 1971]. Without knowledge of the intentions and pre-conceptions of the subject (knowledge probably not available even to the subject in many complex tasks), the value of scanpaths in discovering cognitive strategies is limited. While there is some similarity in the scanpaths of observers viewing the same scene with the same instructions, there is also significant variability within and between subjects. Viviani [1990] presents these objections to attempts to discover cognitive strategies from eye movement records, cautioning that such an attempt "presupposes, however, the possibility of controlling the input stimuli and subject's intentions to an extent that is seldom, if ever, attainable in the case of free visual exploration. One must therefore, settle for evidence derived from situations with more constraints on the sequence of movements than are actually present in real-life scanning." [Viviani, 1990, p. 380] The "constraints on the sequence of movements" is meant quite literally: Viviani goes on to describe single and double-step paradigms that are considered sufficiently constrained.

But there is another alternative: Instead of arbitrarily constraining the task to reduced stimuli and providing explicit instructions on the pattern of eye movements permitted (e.g., as in a traditional double-step paradigm), it is possible to design a task which constrains fixations to 'useful' regions in a complex field rather than to a small number of permitted points. Ideally the task would also constrain the number of 'useful' strategies likely to be exhibited. Such a task would fall between the underconstrained case of "free viewing," and the overly restricted case where one or two targets are to be fixated in a prescribed manner. Tasks such as mental rotation and mental arithmetic have been used in attempts to understand internal representations of objects [e.g., Just and Carpenter 1976]. Such tasks require complex cognitive computation, but have little or no externally observable behaviors other than task duration. On the other hand, a task that requires complex eye movements tied to relatively simple cognitive processing, where fixations could be reliably related to cognitive processes based on the subject's progress in the task, could provide a useful way to examine visual behavior in complex tasks. This thesis explores such a task involving a sequence of visual, motor and memory components in copying a pattern of colored blocks.

To do this, we need to monitor subjects' eye, head, and hand movements in an unconstrained situation. Head-free eyetrackers now permit the measurement of eye and head movements at a fine time-scale, allowing us to see how humans use vision to gather visual information and guide motor movements during real-world tasks. Simultaneously monitoring the eyes, head, and hand allows the coordination of perceptual and motor behavior to be examined as well. The work by Agre and Chapman [1987] and Whitehead and Ballard [1990] demonstrated the advantages of deictic representations using computer simulations of a robot solving problems by manipulating blocks in a 2-dimensional space. These simulated block-manipulation tasks provided the inspiration to investigate human performance in a related block-copying task and to study humans' use of such deictic strategies in natural behavior. To investigate these questions, a series of experiments is presented that examines human performance in an extended, natural block-copying task. The task under study has several key features: It is an extended task, allowing the study of natural, ongoing behavior rather than isolated movements. Unlike meta-tasks such as mental rotation and mental arithmetic, the block-copying task is concrete: there is no ambiguity about the task to be performed or the subject's progress in the task. While the task is clearly defined, it has a loose trial-like structure corresponding to the copying of each block, and no explicit instructions need be given regarding eye, head, or hand movements. Subjects simply perform the task in whatever manner they choose, using natural eye, head, and hand movements.

The experiments presented in this dissertation demonstrate the importance of studying vision in the context of ongoing behavior. Vision cannot be considered a process that operates independently of ongoing tasks; the experiments show that vision is highly task-dependent. Investigating natural behavior required the development of a new laboratory facility with the capability to monitor several aspects of complex behavior. The component systems for monitoring eye, head, and hand movements, and the subsystems' integration into the new facility, are discussed in Chapter 2. The block-copying paradigm on which all the experiments are based is then described, and fundamental properties of subjects' performance are discussed. The main result is that subjects use frequent eye movements to serialize an extended, complex task. Serializing a task in this manner reduces the instantaneous load on working memory by gathering the information necessary to perform individual subtasks only as it is needed. This 'just-in-time' perceptual strategy apparently uses fixation as a form of deictic marker to simplify multi-step tasks, reducing memory load.

Chapter 3 explores the tradeoff between frequent fixations and working memory load. Experiments manipulating the relative cost of frequent fixations and the amount of visual information required to complete the task show that subjects adjust this tradeoff depending on immediate task constraints. The new instrumentation also makes it possible to study the coordination of the eye, head, and hand during complex tasks. Chapter 4 examines the coordination of the eye and head subsystems, and their coordination with the hand. The experiments show that both the performance of the individual subsystems and their coordination exhibit the same task-dependence found in the studies of gaze in the earlier chapters.

The results of all the experiments support the interpretation that subjects serialize complex tasks into a number of subtasks that are executed sequentially by a system with limited central resources. Chapter 5 presents a series of experiments in which subjects had to perform a secondary task in addition to the primary block-copying task to probe the limits of the central resource. Even a seemingly unrelated verbal shadowing task caused significant interference with performance of the primary, visual/spatial block-copying task. The implications of these experiments are discussed in Chapter 6.


"Visual Representations in a Natural Visuo-motor Task"

By: Jeff B. Pelz
Center for Imaging Science, Rochester Institute of Technology
Department of Brain and Cognitive Sciences, University of Rochester