# A Multi World Approach to Question Answering about Real World Scenes based on Uncertain Input

| Group | Predicate | Definition |
|---|---|---|
| auxiliary relations | closeAbove(A,B) | above(A,B) and ($Y_{min}(B) < Y_{max}(A) + \epsilon$) |
| auxiliary relations | closeLeftOf(A,B) | leftOf(A,B) and ($X_{min}(B) < X_{max}(A) + \epsilon$) |
| auxiliary relations | closeInFrontOf(A,B) | inFrontOf(A,B) and ($Z_{min}(B) < Z_{max}(A) + \epsilon$) |
| auxiliary relations | $X_{aux}(A,B)$ | $X_{mean}(A) < X_{max}(B)$ and $X_{min}(B) < X_{mean}(A)$ |
| auxiliary relations | $Z_{aux}(A,B)$ | $Z_{mean}(A) < Z_{max}(B)$ and $Z_{min}(B) < Z_{mean}(A)$ |
| auxiliary relations | $h_{aux}(A,B)$ | closeAbove(A,B) or closeBelow(A,B) |
| auxiliary relations | $v_{aux}(A,B)$ | closeLeftOf(A,B) or closeRightOf(A,B) |
| auxiliary relations | $d_{aux}(A,B)$ | closeInFrontOf(A,B) or closeBehind(A,B) |
| spatial | leftOf(A,B) | $X_{mean}(A) < X_{mean}(B)$ |
| spatial | above(A,B) | $Y_{mean}(A) < Y_{mean}(B)$ |
| spatial | inFrontOf(A,B) | $Z_{mean}(A) < Z_{mean}(B)$ |
| spatial | on(A,B) | closeAbove(A,B) and $Z_{aux}(A,B)$ and $X_{aux}(A,B)$ |
| spatial | close(A,B) | $h_{aux}(A,B)$ or $v_{aux}(A,B)$ or $d_{aux}(A,B)$ |

Table 1: Predicates defining spatial relations between $A$ and $B$. Auxiliary relations define actual spatial relations. The $Y$ axis points downwards, functions $X_{min}, X_{max}, \ldots$ take appropriate values from the tuple predicate, and $\epsilon$ is a 'small' amount. Symmetrical relations such as rightOf, below, behind, etc. can readily be defined in terms of other relations (i.e. below(A,B) = above(B,A)).

color [16] (Figure 1 - middle part). Every object hypothesis is therefore represented as an n-tuple: predicate(instance_id, image_id, color, spatial_loc), where predicate $\in$ {bag, bed, books, ...}, instance_id is the object's id, image_id is id of the image containing the object, color is estimated color of the object [16], and spatial_loc is the object's position in the image. Latter is represented as $(X_{min}, X_{max}, X_{mean}, Y_{min}, Y_{max}, Y_{mean}, Z_{min}, Z_{max}, Z_{mean})$ and defines minimal, maximal, and mean location of the object along $X, Y, Z$ axes. To obtain the coordinates we fit axis parallel cuboids to the cropped 3d objects based on the semantic segmentation.
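The predicates of Table 1 translate almost directly into code. The following is a minimal sketch, not the authors' implementation: the class name `Obj`, the function names, the value of `EPS`, and the example coordinates are all assumptions made for illustration, and closeBelow, closeRightOf and closeBehind are assumed to follow the same argument-swapping convention the caption gives for below.

```python
from dataclasses import dataclass

EPS = 0.1  # the 'small' amount epsilon; this particular value is an assumption


@dataclass(frozen=True)
class Obj:
    """Object hypothesis: predicate(instance_id, image_id, color, spatial_loc)."""
    predicate: str
    instance_id: int
    image_id: int
    color: str
    x_min: float
    x_max: float
    x_mean: float
    y_min: float
    y_max: float
    y_mean: float
    z_min: float
    z_max: float
    z_mean: float


# Spatial relations based on mean locations (the Y axis points downwards).
def left_of(a, b): return a.x_mean < b.x_mean
def above(a, b): return a.y_mean < b.y_mean
def in_front_of(a, b): return a.z_mean < b.z_mean
def below(a, b): return above(b, a)  # symmetric relation, as in the caption

# Auxiliary relations.
def close_above(a, b): return above(a, b) and b.y_min < a.y_max + EPS
def close_left_of(a, b): return left_of(a, b) and b.x_min < a.x_max + EPS
def close_in_front_of(a, b): return in_front_of(a, b) and b.z_min < a.z_max + EPS
def x_aux(a, b): return b.x_min < a.x_mean < b.x_max
def z_aux(a, b): return b.z_min < a.z_mean < b.z_max
# Assumption: closeBelow(A,B) = closeAbove(B,A), and likewise for the others.
def h_aux(a, b): return close_above(a, b) or close_above(b, a)
def v_aux(a, b): return close_left_of(a, b) or close_left_of(b, a)
def d_aux(a, b): return close_in_front_of(a, b) or close_in_front_of(b, a)

# Composite spatial relations.
def on(a, b): return close_above(a, b) and z_aux(a, b) and x_aux(a, b)
def close(a, b): return h_aux(a, b) or v_aux(a, b) or d_aux(a, b)


# Hypothetical example: a cup resting on a table.
table = Obj("table", 1, 7, "brown",
            x_min=0.0, x_max=2.0, x_mean=1.0,
            y_min=1.0, y_max=1.8, y_mean=1.4,
            z_min=1.0, z_max=2.0, z_mean=1.5)
cup = Obj("cup", 2, 7, "white",
          x_min=0.9, x_max=1.1, x_mean=1.0,
          y_min=0.9, y_max=1.0, y_mean=0.95,
          z_min=1.4, z_max=1.6, z_mean=1.5)

print(on(cup, table), on(table, cup), close(cup, table))
```

Because the Y axis points downwards, the cup's smaller `y_mean` makes it "above" the table, and `on` additionally requires the cup's mean X and Z to fall inside the table's extent.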
Note that the $X, Y, Z$ coordinate system is aligned with the direction of gravity [15]. As shown in Figure 2b, this is a more meaningful representation of the object's coordinates than simple image coordinates. The complete schema will be documented together with the code release.

We realize that the skilled use of spatial relations is a complex task and grounding spatial relations is a research thread on its own (e.g. [17], [18] and [19]). For our purposes, we focus on the predefined relations shown in Table 1, while the association of them as well as of the object classes is still dealt with within the question answering architecture.

## Multi-worlds approach for combining uncertain visual perception and symbolic reasoning

Up to now we have considered the output of the semantic segmentation as "hard facts", and hence ignored uncertainty in the class labeling. Every such labeling of the segments corresponds to a different interpretation of the scene - a different perceived world. Drawing on ideas from probabilistic databases [14], we propose a multi-world approach (Figure 1 - lower part) that marginalizes over multiple possible worlds $\mathcal{W}$ - multiple interpretations of a visual scene - derived from the segmentation $S$. Therefore the posterior over the answer $A$ given question $Q$ and semantic segmentation $S$ of the image marginalizes over the latent worlds $\mathcal{W}$ and logical forms $\mathcal{T}$:

$$
P(A \mid Q, S) = \sum_{\mathcal{W}} \sum_{\mathcal{T}} P(A \mid \mathcal{W}, \mathcal{T}) \, P(\mathcal{W} \mid S) \, P(\mathcal{T} \mid Q) \tag{2}
$$

The semantic segmentation of the image is a set of segments $s_i$ with the associated probabilities $p_{ij}$ over the $C$ object categories $c_j$. More precisely $S = \{(s_1, L_1), (s_2, L_2), \ldots, (s_k, L_k)\}$ where $L_i = \{(c_j, p_{ij})\}_{j=1}^{C}$, $P(s_i = c_j) = p_{ij}$, and $k$ is the number of segments of a given image. Let $\hat{S}_f = \{(s_1, c_{f(1)}), (s_2, c_{f(2)}), \ldots, (s_k, c_{f(k)})\}$ be an assignment of the categories into segments of the image according to the binding function $f \in \mathcal{F} = \{1, \ldots, C\}^{\{1, \ldots, k\}}$.
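To make the notation concrete, the sketch below represents a toy segmentation $S$ as a list of (segment, probability vector) pairs and applies a binding function $f$ to obtain the hard assignment $\hat{S}_f$. The category names, probabilities, and function names are hypothetical, and indices are 0-based rather than 1-based.

```python
from itertools import product

# Hypothetical toy input: C = 3 categories, k = 2 segments.
CATEGORIES = ["table", "chair", "bed"]
S = [
    ("s1", [0.7, 0.2, 0.1]),  # L_1: p_1j over the categories
    ("s2", [0.1, 0.6, 0.3]),  # L_2: p_2j over the categories
]


def apply_binding(S, f, categories):
    """Return S_hat_f = [(s_i, c_f(i))] for a binding f given as a tuple of category indices."""
    return [(seg, categories[f[i]]) for i, (seg, _) in enumerate(S)]


def all_bindings(k, C):
    """Enumerate every binding function f: {0..k-1} -> {0..C-1}."""
    return product(range(C), repeat=k)


bindings = list(all_bindings(len(S), len(CATEGORIES)))
print(len(bindings))                       # number of binding functions, C ** k
print(apply_binding(S, (0, 1), CATEGORIES))
```

Exhaustive enumeration is only feasible for toy sizes like this one, since the number of bindings grows as $C^k$.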
With such notation, for a fixed binding function $f$, a world $\mathcal{W}$ is a set of tuples consistent with $\hat{S}_f$, and we define $P(\mathcal{W} \mid S) = \prod_{i} p_{(i, f(i))}$. Hence we have as many possible worlds as binding functions, that is $C^k$. Eq. 2 quickly becomes intractable for the $k$ and $C$ seen in practice, so we use a sampling strategy that draws a finite sample $\vec{\mathcal{W}} = (\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_N)$ from $P(\cdot \mid S)$ under the assumption that for each segment $s_i$ every object's category $c_j$ is drawn independently according to $p_{ij}$. A few sampled perceived worlds are shown in Figure 2a.

Regarding the computational efficiency, computing $\sum_{\mathcal{T}} P(A \mid \mathcal{W}_i, \mathcal{T}) \, P(\mathcal{T} \mid Q)$ can be done independently for every $\mathcal{W}_i$, and therefore in parallel without any need for synchronization. Since for small $N$ the computational cost of summing up the computed probabilities is marginal, the overall cost is about the same as a single inference modulo parallelism. The presented multi-world approach to question answering on real-world scenes is still an end-to-end architecture that is trained solely on the question-answer pairs.
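The sampling strategy can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the toy segmentation, the helper names, and in particular `count_chairs`, which stands in for the inner sum over logical forms $\sum_{\mathcal{T}} P(A \mid \mathcal{W}_i, \mathcal{T}) P(\mathcal{T} \mid Q)$, are hypothetical. The sketch also assumes that the sampled worlds are combined by a uniform average over the $N$ samples, the standard Monte Carlo estimator.

```python
import random
from collections import Counter
from math import prod

# Hypothetical toy segmentation: C = 3 categories, k = 2 segments.
CATEGORIES = ["table", "chair", "bed"]
S = [("s1", [0.7, 0.2, 0.1]), ("s2", [0.1, 0.6, 0.3])]


def world_probability(S, f):
    """P(W | S) = product over segments i of p_(i, f(i))."""
    return prod(probs[f[i]] for i, (_, probs) in enumerate(S))


def sample_world(S, rng):
    """Draw one binding f, sampling each segment's category independently from p_ij."""
    return tuple(rng.choices(range(len(probs)), weights=probs, k=1)[0]
                 for _, probs in S)


def count_chairs(f):
    """Stand-in for sum_T P(A | W_i, T) P(T | Q): answers 'how many chairs?' deterministically."""
    n = sum(1 for j in f if CATEGORIES[j] == "chair")
    return {n: 1.0}


def estimate_posterior(S, answer_given_world, n_samples, rng):
    """Approximate P(A | Q, S) by averaging per-world answer distributions."""
    totals = Counter()
    for _ in range(n_samples):
        f = sample_world(S, rng)  # each iteration is independent of the others
        for answer, p in answer_given_world(f).items():
            totals[answer] += p / n_samples
    return dict(totals)


posterior = estimate_posterior(S, count_chairs, 20000, random.Random(0))
print(posterior)
```

Each loop iteration touches only its own sampled world, so the per-world computations could be distributed across workers and summed at the end, which is the parallelism the text refers to.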

## References

[1] Liang, P., Jordan, M.I., Klein, D.: Learning dependency-based compositional semantics. Computational Linguistics (2013)

[2] Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., Steedman, M.: Inducing probabilistic CCG grammars from logical form with higher-order unification. In: EMNLP. (2010)

[3] Zettlemoyer, L.S., Collins, M.: Online learning of relaxed CCG grammars for parsing to logical form. In: EMNLP-CoNLL-2007. (2007)

[4] Matuszek, C., Fitzgerald, N., Zettlemoyer, L., Bo, L., Fox, D.: A joint model of language and perception for grounded attribute learning. In: ICML. (2012)

[5] Krishnamurthy, J., Kollar, T.: Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL (2013)

[6] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. (2012)

[7] Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: CVPR. (2014)

[8] Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS. (2014)

[9] Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Experimental Robotics. (2013)

[10] Levit, M., Roy, D.: Interpretation of spatial language in a map navigation task. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on (2007)

[11] Vogel, A., Jurafsky, D.: Learning to follow navigational directions. In: ACL. (2010)

[12] Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S.J., Roy, N.: Understanding natural language commands for robotic navigation and mobile manipulation. In: AAAI. (2011)

[13] Kruijff, G.J.M., Zender, H., Jensfelt, P., Christensen, H.I.: Situated dialogue and spatial organization: What, where... and why. IJARS (2007)

[14] Wick, M., McCallum, A., Miklau, G.: Scalable probabilistic databases with factor graphs and MCMC. In: VLDB. (2010)

[15] Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: CVPR. (2013)

[16] Van De Weijer, J., Schmid, C., Verbeek, J.: Learning color names from real-world images. In: CVPR. (2007)

[17] Regier, T., Carlson, L.A.: Grounding spatial language in perception: an empirical and computational investigation. Journal of Experimental Psychology: General (2001)

[18] Lan, T., Yang, W., Wang, Y., Mori, G.: Image retrieval with structured object queries using latent ranking SVM. In: ECCV. (2012)

[19] Guadarrama, S., Riano, L., Golland, D., Göhring, D., Jia, Y., Klein, D., Abbeel, P., Darrell, T.: Grounding spatial relations for human-robot interaction. In: IROS. (2013)

[20] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. Cambridge University Press, Cambridge (2008)

[21] Tukey, J.W.: Exploratory data analysis. (1977)

[22] Zadeh, L.A.: Fuzzy sets. Information and Control (1965)

[23] Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: ACL. (1994)

[24] Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: ICCV. (2013)

[25] Miller, G.A.: WordNet: a lexical database for English. CACM (1995)

[26] Fellbaum, C.: WordNet. Wiley Online Library (1999)

