A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input


A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

Mateusz Malinowski, Mario Fritz
Max Planck Institute for Informatics, Saarbrücken, Germany
{mmalinow,mfritz}@mpi-inf.mpg.de

Abstract: We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions by a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with a range of answers such as counts, object classes, instances, and lists of them. The system is directly trained from question-answer pairs. We establish a first benchmark for this task that can be seen as a modern attempt at a visual Turing test.

1 Introduction

As vision techniques like segmentation and object recognition begin to mature, there has been an increasing interest in broadening the scope of research to full scene understanding. But what is meant by "understanding" of a scene and how do we measure the degree of "understanding"? Most often "understanding" refers to a correct labeling of pixels, regions or bounding boxes in terms of semantic annotations. All predictions made by such methods inevitably come with uncertainties attached, due to limitations in features or data or even the inherent ambiguity of the visual input.

Equally strong progress has been made on the language side, where methods have been proposed that can learn to answer questions solely from question-answer pairs [1]. These methods operate on a set of facts given to the system, which is referred to as a world. Based on that knowledge, the answer is inferred by marginalizing over multiple interpretations of the question. However, the correctness of the facts is a core assumption.

We would like to unite those two research directions by addressing a question-answering task based on real-world images. To combine the probabilistic output of state-of-the-art scene segmentation algorithms with symbolic reasoning, we propose a Bayesian formulation that marginalizes over multiple possible worlds that correspond to different interpretations of the scene.

To date, we are lacking a substantial dataset that serves as a benchmark for question answering on real-world images. Such a test has high demands on "understanding" the visual input and tests a whole chain of perception, language understanding and deduction. This very much relates to the "AI-dream" of building a Turing test for vision. While we are still not ready to test our vision system in the completely unconstrained settings that were envisioned in the early days of AI, we argue that a question-answering task on complex indoor scenes is a timely step in this direction.

Contributions: In this paper we combine automatic, semantic segmentations of real-world scenes with symbolic reasoning about questions in a Bayesian framework, by proposing a multi-world approach for automatic question answering.

We introduce a novel dataset of more than 12,000 question-answer pairs on RGBD images produced by humans, as a modern approach to a visual Turing test. We benchmark our approach on this new challenge and show the advantages of our multi-world approach. Furthermore, we provide additional insights regarding the challenges that lie ahead of us by factoring out sources of error from different components.

2 Related work

Semantic parsers: Our work is mainly inspired by [1], which learns a semantic representation for the question-answering task solely based on questions and answers in natural language. Although the architecture learns the mapping from weak supervision, it achieves results comparable to semantic parsers that rely on manual annotations of logical forms ([2], [3]). In contrast to our work, [1] has never used the semantic parser to connect natural language to a perceived world.

Language and perception: Previous work [4, 5] has proposed models for the language grounding problem with the goal of connecting the meaning of natural language sentences to a perceived world. Both methods use images as the representation of the physical world, but concentrate on a constrained domain with images consisting of very few objects. For instance, [5] considers only two mugs, a monitor and a table in their dataset, whereas [4] examines objects such as blocks, plastic food, and building bricks. In contrast, our work focuses on a diverse collection of real-world indoor RGBD images [6], with many more objects in the scene and more complex spatial relationships between them. Moreover, our paper considers complex questions, beyond the scope of [4] and [5], and reasoning across different images using only textual question-answer pairs for training. This imposes additional challenges on question-answering engines, such as scalability of the semantic parser, good scene representation, dealing with uncertainty in language and perception, efficient inference, and spatial reasoning. Although others [7, 8] propose interesting alternatives for learning the language binding, it is unclear whether such approaches can be used to answer questions.

Integrated systems that execute commands: Others [9, 10, 11, 12, 13] focus on the task of learning a representation of natural language in the restricted setting of executing commands. In such a scenario, the integrated systems execute commands given natural language input with the goal of using them in navigation. In our work, we aim for a less restrictive scenario with a question-answering system in mind. For instance, the user may ask our architecture about counting and colors ('How many green tables are in the image?'), negations ('Which images do not have tables?') and superlatives ('What is the largest object in the image?').

Probabilistic databases: Similarly to [14], which reduces the Named Entity Recognition problem to an inference problem in a probabilistic database, we sample multiple worlds based on the uncertainty introduced by the semantic segmentation algorithm that we apply to the visual input.

3 Method

Our method answers questions about images by combining natural language input with output from visual scene analysis in a probabilistic framework, as illustrated in Figure 1. In the single-world approach, we generate a single perceived world W based on segmentations, i.e. a unique interpretation of the visual scene.
In contrast, our multi-world approach integrates over many latent worlds W, and hence takes different interpretations of the scene and question into account.

Single-world approach for the question answering problem: We build on recent progress on end-to-end question answering systems that are solely trained on question-answer pairs (Q, A) [1]. The top part of Figure 1 outlines how we build on [1] by modeling the logical forms associated with a question as a latent variable T given a single world W. More formally, the task of predicting an answer A given a question Q and a world W is performed by computing the following posterior, which marginalizes over the latent logical forms (semantic trees in [1]) T:

P(A | Q, W) := \sum_{T} P(A | T, W) \, P(T | Q).   (1)

P(A | T, W) corresponds to the denotation of a logical form T on the world W. In this setting, the answer is unique given the logical form and the world: P(A | T, W) = 1[A \in \sigma_W(T)], with the evaluation function \sigma_W, which evaluates a logical form on the world W. Following [1], we use DCS trees, which yield the following recursive evaluation function:

\sigma_W(T) := \bigcap_{j=1}^{d} \{ v : v \in \sigma_W(p), \; t \in \sigma_W(T_j), \; R_j(v, t) \}

where T := \langle p, (T_1, R_1), (T_2, R_2), \ldots, (T_d, R_d) \rangle is the semantic tree with a predicate p associated with the current node, its subtrees T_1, T_2, \ldots, T_d, and relations R_j that define the relationship between the current node and a subtree T_j.
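To make the marginalization in Eq. (1) concrete, here is a minimal sketch (our illustration, not the implementation of [1]): candidate_trees, parse_prob and evaluate are assumed stand-ins for the parser's candidate logical forms, the distribution P(T | Q), and the evaluation function σ_W.

```python
from collections import defaultdict

def answer_posterior(question, world, candidate_trees, parse_prob, evaluate):
    """Eq. (1): P(A | Q, W) = sum_T P(A | T, W) P(T | Q).

    candidate_trees : hypothetical list of logical forms T proposed for the question
    parse_prob(T, Q): P(T | Q), e.g. a normalized log-linear score
    evaluate(T, W)  : sigma_W(T), the denotation of T on the world W (a set)
    """
    posterior = defaultdict(float)
    for tree in candidate_trees:
        p_tree = parse_prob(tree, question)
        # The answer is deterministic given T and W (P(A | T, W) = 1[A in sigma_W(T)]),
        # so the whole parse probability of the tree is routed to its denotation.
        answer = frozenset(evaluate(tree, world))
        posterior[answer] += p_tree
    return dict(posterior)
```

The single-world prediction is then simply the highest-scoring entry of this dictionary.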

[Figure 1: Overview of our approach to question answering with multiple latent worlds, in contrast to the single-world approach. Top: single-world pipeline (question Q, semantic parsing to logical form T, semantic evaluation on world W, answer A). Middle: scene analysis produces facts such as sofa(1, brown, image 1, X, Y, Z), table(1, brown, image 1, X, Y, Z), chair(1, brown, image 4, X, Y, Z), ... Bottom: multi-world pipeline, where the segmentation S induces latent worlds W.]

In the predictions, we use a log-linear distribution P(T | Q) ∝ exp(θ^T φ(Q, T)) over the logical forms, with a feature vector φ measuring the compatibility between Q and T and parameters θ learnt from training data. Every component φ_j is the number of times that a specific feature template occurs in (Q, T). We use the same templates as [1]: string triggers a predicate, string is under a relation, string is under a trace predicate, two predicates are linked via a relation, and a predicate has a child. The model learns by alternating between searching over a restricted space of valid trees and gradient descent updates of the model parameters θ. We use the Datalog inference engine to produce the answers from the latent logical forms. Linguistic phenomena such as superlatives and negations are handled by the logical forms and the inference engine. For a detailed exposition, we refer the reader to [1].
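The log-linear scoring of logical forms can be sketched as a softmax over feature counts; the count_templates helper and the template names are assumptions made for illustration, not the feature extractor of [1].

```python
import math

def tree_distribution(question, trees, theta, count_templates):
    """P(T | Q) ∝ exp(theta · phi(Q, T)), normalized over the given candidate trees.

    count_templates(question, tree) -> {template_name: count}, i.e. how often each
    feature template (e.g. 'string triggers a predicate') fires in the pair (Q, T).
    """
    def score(tree):
        phi = count_templates(question, tree)
        return sum(theta.get(name, 0.0) * count for name, count in phi.items())

    scores = [score(t) for t in trees]
    max_score = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    z = sum(exps)
    return {tree: e / z for tree, e in zip(trees, exps)}
```

Training then alternates between searching for high-scoring valid trees under this distribution and gradient updates of theta, as described above.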
Question answering on real-world images based on a perceived world: Similar to [5], we extend the work of [1] to operate on what we call a perceived world W. This still corresponds to the single-world approach in our overview (Figure 1). However, our world is now populated with "facts" derived from automatic, semantic image segmentations S. For this purpose, we build the world by running a state-of-the-art semantic segmentation algorithm [15] over the images and collect the recognized information about objects, such as object class, 3D position, and color [16] (Figure 1, middle part).

[Figure 2: (a) Sampled worlds, where only segments of the class 'person' are shown. In clockwise order: original picture, most confident world, and three possible worlds (gray-scale values denote the class confidence). Although at first glance the most confident world seems a reasonable choice, our experiments show the opposite: we can benefit from imperfect but multiple worlds. (b) Object coordinates (original image and X, Y, Z coordinates in clockwise order), which represent the spatial location of the objects better than image coordinates.]

Table 1: Predicates defining spatial relations between A and B. Auxiliary relations define the actual spatial relations. The Y axis points downwards, the functions X_min, X_max, ... take the appropriate values from the spatial_loc tuple, and ε is a 'small' amount. Symmetrical relations such as rightOf, below, behind, etc. can readily be defined in terms of the other relations (e.g. below(A,B) = above(B,A)).

  auxiliary relations:
    closeAbove(A,B)     := above(A,B) and (Y_min(B) < Y_max(A) + ε)
    closeLeftOf(A,B)    := leftOf(A,B) and (X_min(B) < X_max(A) + ε)
    closeInFrontOf(A,B) := inFrontOf(A,B) and (Z_min(B) < Z_max(A) + ε)
    X_aux(A,B)          := X_mean(A) < X_max(B) and X_min(B) < X_mean(A)
    Z_aux(A,B)          := Z_mean(A) < Z_max(B) and Z_min(B) < Z_mean(A)
    h_aux(A,B)          := closeAbove(A,B) or closeBelow(A,B)
    v_aux(A,B)          := closeLeftOf(A,B) or closeRightOf(A,B)
    d_aux(A,B)          := closeInFrontOf(A,B) or closeBehind(A,B)
  spatial relations:
    leftOf(A,B)         := X_mean(A) < X_mean(B)
    above(A,B)          := Y_mean(A) < Y_mean(B)
    inFrontOf(A,B)      := Z_mean(A) < Z_mean(B)
    on(A,B)             := closeAbove(A,B) and Z_aux(A,B) and X_aux(A,B)
    close(A,B)          := h_aux(A,B) or v_aux(A,B) or d_aux(A,B)

Every object hypothesis is therefore represented as an n-tuple predicate(instance_id, image_id, color, spatial_loc), where predicate ∈ {bag, bed, books, ...}, instance_id is the object's id, image_id is the id of the image containing the object, color is the estimated color of the object [16], and spatial_loc is the object's position in the image. The latter is represented as (X_min, X_mean, X_max, Y_min, Y_mean, Y_max, Z_min, Z_mean, Z_max) and defines the minimal, mean, and maximal location of the object along the X, Y, Z axes. To obtain the coordinates we fit axis-parallel cuboids to the cropped 3D objects based on the semantic segmentation. Note that the coordinate system is aligned with the direction of gravity [15]. As shown in Figure 2b, this is a more meaningful representation of the object's coordinates than simple image coordinates. The complete schema will be documented together with the code release.

We realize that the skilled use of spatial relations is a complex task and that grounding spatial relations is a research thread of its own (e.g. [17], [18] and [19]). For our purposes, we focus on the predefined relations shown in Table 1, while their association, as well as that of the object classes, is still dealt with within the question-answering architecture.
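To make the object schema and a few of the relations in Table 1 concrete, here is a minimal sketch in Python (our own illustration, not the released code); the ObjFact container, its field names, and the value of ε are assumptions.

```python
from dataclasses import dataclass

EPS = 0.05  # the 'small' amount ε from Table 1; the actual value is an assumption

@dataclass
class ObjFact:
    """One object hypothesis: predicate(instance_id, image_id, color, spatial_loc)."""
    predicate: str      # e.g. 'chair'
    instance_id: int
    image_id: str
    color: str
    # spatial_loc: min/mean/max along each axis of the fitted axis-parallel cuboid
    x_min: float; x_mean: float; x_max: float
    y_min: float; y_mean: float; y_max: float   # Y axis points downwards
    z_min: float; z_mean: float; z_max: float

def left_of(a: ObjFact, b: ObjFact) -> bool:
    return a.x_mean < b.x_mean

def above(a: ObjFact, b: ObjFact) -> bool:
    return a.y_mean < b.y_mean                  # smaller Y means higher up

def close_above(a: ObjFact, b: ObjFact) -> bool:
    return above(a, b) and (b.y_min < a.y_max + EPS)

def x_aux(a: ObjFact, b: ObjFact) -> bool:
    return a.x_mean < b.x_max and b.x_min < a.x_mean

def z_aux(a: ObjFact, b: ObjFact) -> bool:
    return a.z_mean < b.z_max and b.z_min < a.z_mean

def on(a: ObjFact, b: ObjFact) -> bool:
    return close_above(a, b) and z_aux(a, b) and x_aux(a, b)
```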
Multi-worlds approach for combining uncertain visual perception and symbolic reasoning: Up to now we have considered the output of the semantic segmentation as "hard facts", and hence ignored uncertainty in the class labeling. Every such labeling of the segments corresponds to a different interpretation of the scene, i.e. a different perceived world. Drawing on ideas from probabilistic databases [14], we propose a multi-world approach (Figure 1, lower part) that marginalizes over multiple possible worlds, i.e. multiple interpretations of a visual scene, derived from the segmentation S. The posterior over the answer A given the question Q and the semantic segmentation S of the image therefore marginalizes over the latent worlds W and logical forms T:

P(A | Q, S) = \sum_{W} \sum_{T} P(A | W, T) \, P(W | S) \, P(T | Q).   (2)

The semantic segmentation of the image is a set of segments s_i with associated probabilities p_{ij} over the C object categories c_j. More precisely, S = {(s_1, L_1), (s_2, L_2), ..., (s_k, L_k)}, where L_i = {(c_j, p_{ij})}_{j=1}^{C}, p_{ij} = P(s_i = c_j), and k is the number of segments of the given image. Let \hat{S}_f = {(s_1, c_{f(1)}), (s_2, c_{f(2)}), ..., (s_k, c_{f(k)})} be an assignment of categories to the segments of the image according to a binding function f ∈ F, where F is the set of functions from {1, ..., k} to {1, ..., C}. With such notation, for a fixed binding function f, a world W is a set of tuples consistent with \hat{S}_f, and we define P(W | S) = \prod_{i=1}^{k} p_{i, f(i)}. Hence we have as many possible worlds as binding functions, that is C^k. Eq. 2 quickly becomes intractable for the k and C seen in practice, wherefore we use a sampling strategy that draws a finite sample \tilde{W} = (W_1, W_2, ..., W_N) from P(· | S), under the assumption that for each segment s_i every object category c_j is drawn independently according to p_{ij}. A few sampled perceived worlds are shown in Figure 2a.

Regarding computational efficiency, computing \sum_{T} P(A | W_i, T) P(T | Q) can be done independently for every W_i, and therefore in parallel without any need for synchronization. Since for small N the computational cost of summing up the computed probabilities is marginal, the overall cost is about the same as a single inference, modulo parallelism. The presented multi-world approach to question answering on real-world scenes is still an end-to-end architecture that is trained solely on question-answer pairs.
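A minimal sketch of the sampling strategy behind Eq. (2), assuming a hypothetical build_world helper that turns a sampled category assignment into a set of facts and a single_world_posterior callable like the Eq. (1) sketch above; this is an illustration, not the authors' code.

```python
import random

def sample_world(segments, build_world):
    """Draw one world from P(W | S): each segment's category is sampled
    independently from its class distribution {(c_j, p_ij)}."""
    assignment = {}
    for seg_id, class_probs in segments.items():       # class_probs: {category: p_ij}
        cats, probs = zip(*class_probs.items())
        assignment[seg_id] = random.choices(cats, weights=probs, k=1)[0]
    return build_world(assignment)                      # hypothetical: assignment -> facts

def multi_world_posterior(question, segments, build_world,
                          single_world_posterior, n_worlds=25):
    """Monte Carlo estimate of Eq. (2): average the single-world posteriors
    P(A | Q, W_i) over N sampled worlds."""
    posterior = {}
    for _ in range(n_worlds):
        world = sample_world(segments, build_world)
        for ans, p in single_world_posterior(question, world).items():
            posterior[ans] = posterior.get(ans, 0.0) + p / n_worlds
    return posterior
```

Since each sampled world is processed independently, the loop parallelizes trivially, which matches the efficiency argument above.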

[Figure 3: NYU-Depth V2 dataset: image, Z axis, ground truth and predicted semantic segmentations.]

Table 2: Synthetic question-answer pairs. The questions can be about individual images or about sets of images.

  Individual:
    counting            | How many {object} are in {image_id}?                   | How many cabinets are in image1?
    counting and colors | How many {color} {object} are in {image_id}?           | How many gray cabinets are in image1?
    room type           | Which type of the room is depicted in {image_id}?      | Which type of the room is depicted in image1?
    superlatives        | What is the largest {object} in {image_id}?            | What is the largest object in image1?
  Set:
    counting and colors | How many {color} {object}?                             | How many black bags?
    negations type 1    | Which images do not have {object}?                     | Which images do not have sofa?
    negations type 2    | Which images are not {room type}?                      | Which images are not bedroom?
    negations type 3    | Which images have {object} but do not have a {object}? | Which images have desk but do not have a lamp?

Implementation and Scalability: For worlds containing many facts and spatial relations, the induction step becomes computationally demanding, as it considers all pairs of facts (we have about 4 million predicates in the worst case). Therefore we use a batch-based approximation in such situations. Every image induces a set of facts that we call a batch of facts. For every test image, we find the k nearest neighbors in the space of training batches, with a boolean variant of TF.IDF to measure similarity [20]. This is equivalent to building a training world from the k images with the most similar content to the perceived world of the test image. We use k = 3 and 25 worlds in our experiments. The dataset and the source code can be found on our website: https://www.d2.mpi-inf.mpg.de/visual-turing-challenge
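One plausible reading of the boolean TF.IDF batch retrieval is sketched below (an assumption on our part; [20] describes the weighting scheme, not this exact code): each batch is reduced to the set of predicates it mentions, weighted by inverse document frequency, and ranked by cosine similarity.

```python
import math

def boolean_tfidf_vector(batch_predicates, idf):
    """Boolean TF.IDF: a predicate contributes its IDF weight once if present."""
    return {p: idf[p] for p in set(batch_predicates) if p in idf}

def nearest_training_batches(test_batch, training_batches, k=3):
    """Return the k training batches (per-image fact sets) most similar to the test batch."""
    n = len(training_batches)
    df = {}
    for batch in training_batches:
        for p in set(batch):
            df[p] = df.get(p, 0) + 1
    idf = {p: math.log(n / df_p) for p, df_p in df.items()}

    q = boolean_tfidf_vector(test_batch, idf)
    q_norm = math.sqrt(sum(w * w for w in q.values())) or 1.0

    def cosine(batch):
        v = boolean_tfidf_vector(batch, idf)
        dot = sum(q.get(p, 0.0) * w for p, w in v.items())
        v_norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return dot / (q_norm * v_norm)

    return sorted(training_batches, key=cosine, reverse=True)[:k]
```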
4 Experiments

4.1 DAtaset for QUestion Answering on Real-world images (DAQUAR)

Images and Semantic Segmentations: Our new dataset for question answering is built on top of the NYU-Depth V2 dataset [6]. NYU-Depth V2 contains 1449 RGBD images together with annotated semantic segmentations (Figure 3), where every pixel is labeled into some object class with a confidence score. Originally, 894 classes are considered. According to [15], we preprocess the data to obtain canonical views of the scenes and use the X, Y, Z coordinates from the depth sensor to define the spatial placement of the objects in 3D. To investigate the impact of uncertainty in the visual analysis of the scenes, we also employ computer vision techniques for automatic semantic segmentation. We use a state-of-the-art scene analysis method [15] which maps every pixel into 40 classes: 37 informative object classes as well as 'other structure', 'other furniture' and 'other prop'. We ignore the latter three. We use the same data split as [15]: 795 training and 654 test images. To use our spatial representation on the image content, we fit 3D cuboids to the segmentations.

New dataset of questions and answers: In the spirit of a visual Turing test, we collect question-answer pairs from human annotators for the NYU dataset. In our work, we consider two types of annotations: synthetic and human. The synthetic question-answer pairs are automatically generated based on the templates shown in Table 2; these templates are instantiated with facts from the database. To collect human question-answer pairs we ask 5 in-house participants to provide questions and answers. They were instructed to give valid answers that are either basic colors [16], numbers or objects (894 categories), or sets of those. Besides the answers, we do not impose any constraints on the questions. We also do not correct the questions, as we believe that semantic parsers should be robust to human errors. This yields 12468 human question-answer pairs in total; we use 6794 for training and 5674 for testing, about 9 pairs per image on average (8.63, 8.75)*.

*Our notation (x, y) denotes mean x and trimean y. We use Tukey's trimean (Q_1 + 2 Q_2 + Q_3)/4, where Q_j denotes the j-th quartile [21]. This measure combines the benefits of both the median (robustness to extremes) and the empirical mean (attention to the hinge values).
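For completeness, the trimean used in the footnote above can be computed as follows (a direct rendering of the formula; the quartile convention of statistics.quantiles is an implementation detail).

```python
import statistics

def trimean(values):
    """Tukey's trimean: (Q1 + 2*Q2 + Q3) / 4, combining the robustness of the
    median with sensitivity to the hinge values."""
    q1, q2, q3 = statistics.quantiles(values, n=4)   # the three quartiles
    return (q1 + 2 * q2 + q3) / 4

# e.g. reporting (statistics.mean(pairs_per_image), trimean(pairs_per_image))
# corresponds to the (mean, trimean) pairs used in the text.
```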

The database exhibits some biases showing that humans tend to focus on a few prominent objects. For instance, we have more than 400 occurrences of table and chair in the answers. On average, an object category occurs (14.25, 4) times in the training set and (22.48, 5.75) times in total. Figure 4 shows example question-answer pairs together with the corresponding images, illustrating some of the challenges captured in this dataset.

Performance Measure: While the quality of an answer that the system produces can be measured in terms of accuracy w.r.t. the ground truth (correct/wrong), we propose, inspired by work on Fuzzy Sets [22], a soft measure based on the WUP score [23], which we call the WUPS (WUP Set) score. As the number of classes grows, the semantic boundaries between them become more fuzzy. For example, the concepts 'carton' and 'box' have similar meanings, and 'cup' and 'cup of coffee' are almost indistinguishable. Therefore we seek a metric that measures the quality of an answer and penalizes naive solutions where the architecture outputs too many or too few answers.

Standard accuracy is defined as \frac{1}{N} \sum_{i=1}^{N} 1\{A^i = T^i\} \cdot 100, where A^i and T^i are the i-th answer and ground truth respectively. Since both may include more than one object, it is beneficial to represent them as sets of objects T = \{t_1, t_2, \ldots\}. From this point of view we have, for every i \in \{1, 2, \ldots, N\}:

1\{A^i = T^i\} = 1\{A^i \subseteq T^i \wedge T^i \subseteq A^i\} = \min\{ 1\{A^i \subseteq T^i\}, \; 1\{T^i \subseteq A^i\} \}   (3)

= \min\{ \prod_{a \in A^i} 1\{a \in T^i\}, \; \prod_{t \in T^i} 1\{t \in A^i\} \} \approx \min\{ \prod_{a \in A^i} \mu(a \in T^i), \; \prod_{t \in T^i} \mu(t \in A^i) \}   (4)

We use a soft equivalent of the intersection operator in Eq. 3, and a set membership measure \mu with the properties \mu(x \in X) = 1 if x \in X, \mu(x \in X) = \max_{y \in X} \mu(x = y), and \mu(x = y) \in [0, 1], in Eq. 4, with equality whenever \mu = 1. For \mu we use a variant of the Wu-Palmer similarity [23, 24]. WUP(a, b) calculates similarity based on the depth of the two words a and b in the taxonomy [25, 26], and we define the WUPS score:

WUPS(A, T) = \frac{1}{N} \sum_{i=1}^{N} \min\{ \prod_{a \in A^i} \max_{t \in T^i} WUP(a, t), \; \prod_{t \in T^i} \max_{a \in A^i} WUP(a, t) \} \cdot 100   (5)

Empirically, we have found that in our task a WUP score of around 0.9 is required for precise answers. Therefore we down-weight WUP(a, b) by one order of magnitude (0.1 · WUP) whenever WUP(a, b) < t for a threshold t. We plot a curve over thresholds t ranging from 0 to 1 (Figure 5): "WUPS at 0" refers to the most forgiving measure without any down-weighting, while "WUPS at 1.0" corresponds to plain accuracy. Figure 5 thus benchmarks architectures by requiring answers with precision ranging from low to high. Here we show some examples of the pure WUP score to give an intuition about its range: WUP(curtain, blinds) = 0.94, WUP(carton, box) = 0.94, WUP(stove, fire extinguisher) = 0.82.
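A minimal sketch of the WUPS score in Eq. (5) with the thresholded down-weighting described above; the wup(a, b) similarity is left abstract (a WordNet-based Wu-Palmer similarity would supply it), so treat this as an illustration rather than the official evaluation script.

```python
def wups_score(answers, truths, wup, threshold=0.9):
    """WUPS (Eq. 5): for each QA pair, take the minimum of two soft set-inclusion
    products, then average over the dataset and scale to percent.

    answers, truths: lists of answer sets (one set per question)
    wup(a, b)      : word similarity in [0, 1]; scores below `threshold` are
                     down-weighted by an order of magnitude (0.1 * WUP).
    """
    def mu(a, b):
        s = wup(a, b)
        return s if s >= threshold else 0.1 * s

    def soft_inclusion(xs, ys):
        prod = 1.0
        for x in xs:
            prod *= max((mu(x, y) for y in ys), default=0.0)
        return prod

    total = 0.0
    for A, T in zip(answers, truths):
        total += min(soft_inclusion(A, T), soft_inclusion(T, A))
    return 100.0 * total / len(answers)
```

With threshold=0 nothing is down-weighted ("WUPS at 0"), while raising the threshold towards 1.0 approaches the plain-accuracy regime described above.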

4.2 Quantitative results

We perform a series of experiments to highlight particular challenges like uncertain segmentations, unknown true logical forms, and some linguistic phenomena, as well as to show the advantages of our proposed multi-world approach. In particular, we distinguish between experiments on synthetic question-answer pairs (SynthQA) based on templates and those collected by annotators (HumanQA), between automatic scene segmentation (AutoSeg) with a computer vision algorithm [15] and human segmentations (HumanSeg) based on the ground-truth annotations in the NYU dataset, as well as between single-world (single) and multi-world (multi) approaches.

4.2.1 Synthetic question-answer pairs (SynthQA)

Based on human segmentations (HumanSeg, 37 classes) (1st and 2nd rows in Table 3): This setting uses automatically generated questions (based on the templates shown in Table 2) and human segmentations. We have generated 20 training and 40 test question-answer pairs per template category, in total 140 training and 280 test pairs (as an exception, negations of types 1 and 2 have 10 training and 20 test examples each). This experiment shows how the architecture generalizes across similar types of questions, provided that we have human annotations of the image segments. We have further removed negations of type 3 from the experiments as they turned out to be particularly computationally demanding. Performance thereby increases from 56% to 59.9%, with about 80% training accuracy. Since some incorrect derivations give correct answers, the semantic parser learns wrong associations. Other difficulties stem from the limited training data and unseen object categories during training.

Based on automatic segmentations (AutoSeg, 37 classes, single) (3rd row in Table 3): This tests the architecture on uncertain facts obtained from the automatic semantic segmentation [15], where the most likely object labels are used to create a single world. Here we experience a severe drop in performance, from 59.9% to 11.25%, by switching from human to automatic segmentation. Note that only 37 classes are available to us. This result suggests that the vision part is a serious bottleneck of the whole architecture.

Based on automatic segmentations using the multi-world approach (AutoSeg, 37 classes, multi) (4th row in Table 3): This shows the benefit of using our multiple-worlds approach to predict the answer. Here we recover part of the lost performance by an explicit treatment of the uncertainty in the segmentations. Performance increases from 11.25% to 13.75%.

4.3 Human question-answer pairs (HumanQA)

Based on human segmentations, 894 classes (HumanSeg, 894 classes) (1st row in Table 4): We switch to human-generated question-answer pairs. The increase in complexity is twofold. First, the human annotations exhibit more variation than the synthetic approach based on templates. Second, the questions are typically longer and include more spatially related objects. Figure 4 shows a few samples from our dataset that highlight challenges including complex and nested spatial references and the use of reference frames. We achieve an accuracy of 7.86% in this scenario. As argued above, we also evaluate the experiments on the human data under the softer WUPS scores at different thresholds (Table 4 and Figure 5). In order to put these numbers in perspective, we also show performance numbers for two simple methods: predicting the most popular answer yields 4.4% accuracy, and our untrained architecture gives 0.18% and 1.3% accuracy and WUPS (at 0.9), respectively.

Based on human segmentations, 37 classes (HumanSeg, 37 classes) (2nd row in Table 4): This uses human segmentations and human question-answer pairs. Since only 37 classes are supported by our automatic segmentation algorithm, we run on a subset of the whole dataset. We choose the 25 test images, yielding a total of 286 question-answer pairs for the following experiments. This yields 12.47% and 15.89% accuracy and WUPS at 0.9, respectively.

Based on automatic segmentations (AutoSeg, 37 classes) (3rd row in Table 4): Switching from human segmentations to automatic ones again yields a drop, from 12.47% to 9.69% accuracy, and we observe a similar trend for the whole spectrum of the WUPS scores.

Based on automatic segmentations using the multi-world approach (AutoSeg, 37 classes, multi) (4th row in Table 4): Similarly to the synthetic experiments, our proposed multi-world approach yields an improvement across all the measures that we investigate.

Human baseline (5th and 6th rows in Table 4, for 894 and 37 classes): This shows human predictions on our dataset. We ask independent annotators to provide answers to the questions we have collected. They are instructed to answer with a number, basic colors [16], or objects (from 37 or 894 categories), or sets of those. This performance gives a practical upper bound for question-answering algorithms, with an accuracy of 60.27% for the 37-class case and 50.20% for the 894-class case. We also ask the annotators to compare the answers of the AutoSeg single-world approach with the HumanSeg single-world and AutoSeg multi-world methods. We use a two-sided binomial test to check whether the difference in preferences is statistically significant. As a result, AutoSeg single-world is the least preferred method, with a p-value below 0.01 in both cases. Hence the human preferences are aligned with our accuracy measures in Table 4.
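As an illustration of the significance test mentioned above, here is a small, self-contained exact two-sided binomial test under the null hypothesis that two methods are preferred equally often (our own sketch of a standard test, not the authors' analysis script).

```python
from math import comb

def two_sided_binomial_p(successes, trials, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes that
    are at most as likely as the observed count under Binomial(trials, p)."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    return sum(prob for prob in pmf if prob <= observed + 1e-12)

# e.g. if one method is preferred in 80 of 100 pairwise comparisons:
# two_sided_binomial_p(80, 100) is far below 0.01
```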
4.4 Qualitative results

We choose the examples in Fig. 6 to illustrate different failure cases, including the last example where all methods fail. Since our multi-world approach generates different sets of facts about the perceived worlds, we observe a trend towards a better representation of high-level concepts like 'counting' (leftmost example in the figure) as well as of language associations. A substantial part of the incorrect answers is attributed to missing segments, e.g. the missed pillow detection in the third example in Fig. 6.

5 Summary

We propose a system and a dataset for question answering about real-world scenes that is reminiscent of a visual Turing test. Despite the complexity of uncertain visual perception, language understanding and program induction, our results indicate promising progress in this direction. We bring together ideas from automatic scene analysis and semantic parsing with symbolic reasoning, and combine them under a multi-world approach. As we have mature techniques in machine learning, computer vision, natural language processing and deduction at our disposal, it seems timely to bring these disciplines together on this open challenge.

[Figure 4: Examples of human-generated question-answer pairs illustrating the associated challenges. Sample pairs include QA: (What is behind the table?, window); QA: (What is in front of toilet?, door); QA: (what is beneath the candle holder?, decorative plate); QA: (what is in front of the wall divider?, cabinet); QA: (How many drawers are there?, 8); QA1: (How many doors are in the image?, 1) vs. QA2: (How many doors are in the image?, 5); QA: (What is the shape of the green chair?, horse shaped); QA: (what is behind the table?, sofa); QA: (how many lights are on?, 6); QA: (What is the object on the counter in the corner?, microwave); QA: (How many doors are open?, 1); QA: (Where is oven?, on the right side of refrigerator); Q: what is at the back side of the sofas? The accompanying notes point out challenges such as severely occluded or truncated objects, annotators using different names for the same object ('night stand', 'stool', 'cabinet'), observer-centric vs. object-centric reference frames, closely related spatial relations ('beneath' vs. 'below'), additional properties used to disambiguate references ('wall divider', 'corner'), common-sense amodal completion (inferring the 8th drawer from context), differing interpretations of 'door' leading to different counts, object states ('open', 'light on') not well captured by current vision techniques, abstract shape descriptions ('horse shaped'), and the growing role of reference resolution and pragmatics in cluttered scenes. Notation: 'A' - answer, 'Q' - question, 'QA' - question-answer pair. The last two examples (bottom-right column) are from the extended dataset not used in our experiments.]

[Figure 5: WUPS scores for different thresholds. The curves compare HumanSeg (Single, 894), HumanSeg (Single, 37), AutoSeg (Single, 37), AutoSeg (Multi, 37), and the Human Baselines (894 and 37 classes) over thresholds from 0 to 1.]

Table 3: Accuracy results for the experiments with synthetic question-answer pairs (SynthQA).

  Segmentation | World(s)           | # classes | Accuracy
  HumanSeg     | Single with Neg. 3 | 37        | 56.0%
  HumanSeg     | Single             | 37        | 59.5%
  AutoSeg      | Single             | 37        | 11.25%
  AutoSeg      | Multi              | 37        | 13.75%

Table 4: Accuracy and WUPS scores for the experiments with human question-answer pairs (HumanQA). We show WUPS scores at two opposite ends of the WUPS spectrum.

  Segmentation   | World(s) | # classes | Accuracy | WUPS at 0.9 | WUPS at 0
  HumanSeg       | Single   | 894       | 7.86%    | 11.86%      | 38.79%
  HumanSeg       | Single   | 37        | 12.47%   | 16.49%      | 50.28%
  AutoSeg        | Single   | 37        | 9.69%    | 14.73%      | 48.57%
  AutoSeg        | Multi    | 37        | 12.73%   | 18.10%      | 51.47%
  Human Baseline |          | 894       | 50.20%   | 50.82%      | 67.27%
  Human Baseline |          | 37        | 60.27%   | 61.04%      | 78.96%

[Figure 6: Questions and predicted answers for eight example questions ('What is on the right side of the table?', 'What is the object on the chair?', 'What is in front of television?', 'What is behind the television?', 'What is on the right side of cabinet?', 'What is on the wall?', 'How many chairs are at the table?', 'How many red chairs are there?'), chosen to illustrate different failure cases. Notation: 'Q' - question, 'H' - architecture based on human segmentation, 'M' - architecture with multiple worlds, 'C' - most confident architecture, '()' - no answer. Red color denotes the correct answer.]

References

[1] Liang, P., Jordan, M.I., Klein, D.: Learning dependency-based compositional semantics. Computational Linguistics (2013)
[2] Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., Steedman, M.: Inducing probabilistic CCG grammars from logical form with higher-order unification. In: EMNLP (2010)
[3] Zettlemoyer, L.S., Collins, M.: Online learning of relaxed CCG grammars for parsing to logical form. In: EMNLP-CoNLL (2007)
[4] Matuszek, C., Fitzgerald, N., Zettlemoyer, L., Bo, L., Fox, D.: A joint model of language and perception for grounded attribute learning. In: ICML (2012)
[5] Krishnamurthy, J., Kollar, T.: Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL (2013)
[6] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV (2012)
[7] Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: CVPR (2014)
[8] Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS (2014)
[9] Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Experimental Robotics (2013)
[10] Levit, M., Roy, D.: Interpretation of spatial language in a map navigation task. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (2007)
[11] Vogel, A., Jurafsky, D.: Learning to follow navigational directions. In: ACL (2010)
[12] Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S.J., Roy, N.: Understanding natural language commands for robotic navigation and mobile manipulation. In: AAAI (2011)
[13] Kruijff, G.J.M., Zender, H., Jensfelt, P., Christensen, H.I.: Situated dialogue and spatial organization: What, where... and why. IJARS (2007)
[14] Wick, M., McCallum, A., Miklau, G.: Scalable probabilistic databases with factor graphs and MCMC. In: VLDB (2010)
[15] Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: CVPR (2013)
[16] Van De Weijer, J., Schmid, C., Verbeek, J.: Learning color names from real-world images. In: CVPR (2007)
[17] Regier, T., Carlson, L.A.: Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General (2001)
[18] Lan, T., Yang, W., Wang, Y., Mori, G.: Image retrieval with structured object queries using latent ranking SVM. In: ECCV (2012)
[19] Guadarrama, S., Riano, L., Golland, D., Gouhring, D., Jia, Y., Klein, D., Abbeel, P., Darrell, T.: Grounding spatial relations for human-robot interaction. In: IROS (2013)
[20] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
[21] Tukey, J.W.: Exploratory Data Analysis. (1977)
[22] Zadeh, L.A.: Fuzzy sets. Information and Control (1965)
[23] Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: ACL (1994)
[24] Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Darrell, T., Saenko, K.: YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: ICCV (2013)
[25] Miller, G.A.: WordNet: A lexical database for English. CACM (1995)
[26] Fellbaum, C.: WordNet. Wiley Online Library (1999)
