Devi Parikh

Transcript

1 Forcing Vision and Language Models to Not Just Talk But Also Actually See. Devi Parikh. Slide credit: Devi Parikh


4 VizWiz Challenge Entry: 10:50 am to 11:20 am

5 “Color College Avenue”, Blacksburg, VA, May 2012

6 People coloring a street in rural Virginia. “Color College Avenue”, Blacksburg, VA, May 2012

7 It was a great event! It brought families out, and the whole community together. “Color College Avenue”, Blacksburg, VA, May 2012

8 Q: What are they coloring the street with? A: Chalk. “Color College Avenue”, Blacksburg, VA, May 2012

9 AI: What a nice picture! What event was this? User: “Color College Avenue”. It was a lot of fun! AI: I am sure it was! Do they do this every year? User: I wish they would. I don’t think they’ve organized it again since 2012. ... “Color College Avenue”, Blacksburg, VA, May 2012

10 Applications

11 VQA Challenge @ CVPR16, 17, 18. State-of-the-art improved from 54% to 72%.

12 Success Cases. Q: What room is the cat located in? GT A: kitchen. Machine A: kitchen. Q: What is the woman holding? GT A: laptop. Machine A: laptop. Q: Is this a casino? GT A: yes. Machine A: yes. Q: Is it going to rain soon? GT A: no. Machine A: no. Slide credit: Aishwarya Agrawal

13 Models affected by language priors

14 Test Sample: Q: What color are the safety cones? GT Ans: green. Predicted Ans: orange. Nearest Neighbor Training Samples.

15 Test Sample: Q: What color are the safety cones? GT Ans: green. Predicted Ans: orange. Nearest Neighbor Training Samples.

16 Test Sample: Q: What color are the safety cones? GT Ans: green. Predicted Ans: orange. Nearest Neighbor Training Samples: Q: What color are the cones? GT Ans: orange.

17 Test Sample: Q: What color are the safety cones? GT Ans: green. Predicted Ans: orange. Nearest Neighbor Training Samples: Q: What color are the cones? GT Ans: orange. Q: What color is the cone? GT Ans: orange.

18 Test Sample: Q: What color are the safety cones? GT Ans: green. Predicted Ans: orange. Nearest Neighbor Training Samples: Q: What color are the cones? GT Ans: orange. Q: What color are the cones? GT Ans: orange. Q: What color is the cone? GT Ans: orange.

19 GT Ans: yes

20 GT Ans: 6

21 GT Ans: yes

22 Q: What does the red sign say? Predicted Ans: stop. Correct Response / Incorrect Responses.

23 Q: How many zebras? Predicted Ans: 2. Correct Response / Incorrect Responses.

24 Q: What covers the ground? Predicted Ans: snow. All Correct Responses.
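The failure mode in these slides can be reproduced with a purely "blind" baseline that never looks at the image. A minimal sketch (the two-word question-type heuristic and all names are illustrative, not from the talk): it predicts the most frequent training answer for each question type, which is exactly the language prior the models are exploiting.

```python
from collections import Counter, defaultdict

def train_prior_baseline(questions, answers):
    """Learn the most frequent answer per crude 'question type'
    (here: the first two words of the question)."""
    by_type = defaultdict(Counter)
    for q, a in zip(questions, answers):
        qtype = " ".join(q.lower().split()[:2])
        by_type[qtype][a] += 1
    return {t: c.most_common(1)[0][0] for t, c in by_type.items()}

def predict(prior, question, default="yes"):
    """Answer using only the question text; the image is never consulted."""
    qtype = " ".join(question.lower().split()[:2])
    return prior.get(qtype, default)

train_q = ["What color are the cones?", "What color is the cone?",
           "Is it raining?", "Is it sunny?"]
train_a = ["orange", "orange", "no", "yes"]
prior = train_prior_baseline(train_q, train_a)
# The baseline answers "orange" even if the cones in the image are green:
print(predict(prior, "What color are the safety cones?"))  # orange
```

This is why the model above predicts "orange" for green safety cones: its nearest training neighbors all answer "orange".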

25 How do we force vision+language models to also look?

26 Outline
• Datasets
  – E.g., VQA v2.0
• Evaluation metrics
  – E.g., # pairs of images correctly answered
• Novel problem spaces
  – E.g., changing priors (VQA-CP), novel object captioning, robust captioning, visual coreference resolution
• Inductive biases in models
  – E.g., GVQA, Neural Baby Talk
• Human-in-the-loop evaluation
  – E.g., accuracy is not the final metric, performance of the human-AI team is

27 Outline
• Datasets
  – E.g., VQA v2.0
• Evaluation metrics
  – E.g., # pairs of images correctly answered
• Novel problem spaces
  – E.g., changing priors (VQA-CP), novel object captioning, robust captioning, visual coreference resolution
• Inductive biases in models
  – E.g., GVQA, Neural Baby Talk
• Human-in-the-loop evaluation
  – E.g., accuracy is not the final metric, performance of the human-AI team is

28 Balancing the VQA dataset. What game is this? Tennis.

29 Balancing the VQA dataset

30 Balancing the VQA dataset

31 Balancing the VQA dataset

32 Balancing the VQA dataset

33 Balancing the VQA dataset

34 Balancing the VQA dataset

35 Balancing the VQA dataset

36 VQA v2.0
• More balanced than VQA v1.0
  – Entropy of answers increases by 56%
• Bigger than VQA v1.0
  – ~1.8 times the image-question pairs
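The balance claim can be quantified with the Shannon entropy of the answer distribution. A small sketch on toy data (illustrative, not the actual VQA answer distributions): a skewed yes/no split has lower entropy than a balanced one, which is the direction of the 56% increase reported above.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of an answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

skewed = ["yes"] * 9 + ["no"]        # unbalanced: one dominant answer
balanced = ["yes"] * 5 + ["no"] * 5  # balanced via complementary scenes
print(answer_entropy(skewed))    # ~0.469 bits
print(answer_entropy(balanced))  # 1.0 bit
```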

37 Benchmarking SOTA VQA models
• SOTA VQA models drop in performance by 7-8%
  – Gain 1-2% back when re-trained on the balanced dataset
• By answer type
  – Biggest drop in performance is in yes/no (10-12%)
  – Biggest improvement from re-training is in yes/no (3-4%) and number (2-3%)

38 Trends (chart; values shown include 0.15%, 1.51%, 7.03%, 3.5%)

39 VQA v2.0. 2nd and 3rd VQA Challenges @ CVPR17, 18.

40 Removing Language Priors. Slide credit: Yash Goyal and Peng Zhang

41 Removing Language Priors. Question: Is the girl walking the bike? Complementary scenes: one with Answer: Yes, one with Answer: No.

42 Outline
• Datasets
  – E.g., VQA v2.0
• Evaluation metrics
  – E.g., # pairs of images correctly answered
• Novel problem spaces
  – E.g., changing priors (VQA-CP), novel object captioning, robust captioning, visual coreference resolution
• Inductive biases in models
  – E.g., GVQA, Neural Baby Talk
• Human-in-the-loop evaluation
  – E.g., accuracy is not the final metric, performance of the human-AI team is

43 Classifying a pair of complementary scenes
Model                          | Unbalanced training set | Balanced training set
Blind (no image features)      | 0                       | 0
Holistic image features        | 3.20                    | 23.13
Attention-based image features | 9.84                    | 34.73
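The pair-based evaluation referenced in the outline can be sketched in a few lines (names and data illustrative): a complementary pair counts as correct only when both images receive the right answer, so a blind model that gives the same answer to both scenes of a pair can never score on it.

```python
def pair_accuracy(pairs):
    """Fraction of complementary-scene pairs where BOTH answers are correct.
    Each pair is ((pred_1, gt_1), (pred_2, gt_2))."""
    correct = sum(1 for (p1, g1), (p2, g2) in pairs
                  if p1 == g1 and p2 == g2)
    return correct / len(pairs)

pairs = [
    (("yes", "yes"), ("no", "no")),   # both sides right: credited
    (("yes", "yes"), ("yes", "no")),  # blind same-answer guess fails one side
]
print(pair_accuracy(pairs))  # 0.5
```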

44 Outline
• Datasets
  – E.g., VQA v2.0
• Evaluation metrics
  – E.g., # pairs of images correctly answered
• Novel problem spaces
  – E.g., changing priors (VQA-CP), novel object captioning, robust captioning, visual coreference resolution
• Inductive biases in models
  – E.g., GVQA, Neural Baby Talk
• Human-in-the-loop evaluation
  – E.g., accuracy is not the final metric, performance of the human-AI team is

45 (A related) problem with existing setup. Train: Q: What color is the dog? A: White. Test: Q: What color is the dog? A: Black. Training prior over answers (white, red, blue, green, yellow). Prediction: white.

46 (A related) problem with existing setup. Train: Q: What color is the dog? A: White. Test: Q: What color is the dog? A: Black. Training prior over answers (white, red, blue, green, yellow). Prediction: white.

47 (A related) problem with existing setup. Train: Q: What color is the dog? A: White. Test: Q: What color is the dog? A: Black. Training prior over answers (white, red, blue, green, yellow). Prediction: white.

48 (A related) problem with existing setup. Train: Q: Is the person wearing shorts? A: No. Test: Q: Is the person wearing shorts? A: Yes. Training prior over answers (no, female, woman). Prediction: no.

49 (A related) problem with existing setup
• Similar priors in train and test
• Memorization does not hurt as much
• Problematic for benchmarking progress

50 Meet VQA-CP!
• Visual Question Answering under Changing Priors
• A new split of the VQA v1.0 dataset (Antol et al., ICCV 2015)

51 VQA-CP Train Split / VQA-CP Test Split

52 Performance of VQA models on VQA-CP
• 31% drop (Antol et al., ICCV15)
• 25% drop (Andreas et al., CVPR16)
• 29% drop (Yang et al., CVPR16)
• 27% drop (Fukui et al., EMNLP16)
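The idea behind a changing-priors split can be sketched simply. This is a simplified illustration, not the paper's exact re-splitting algorithm: group samples by (question type, answer) and place each group entirely in train or entirely in test, so the answer prior a model memorizes at training time is wrong at test time.

```python
from collections import defaultdict

def changing_priors_split(samples):
    """Re-split (question, answer) pairs so each (question-type, answer)
    group lands wholly in train or wholly in test. A toy sketch of the
    VQA-CP idea; question type = first two words of the question."""
    groups = defaultdict(list)
    for q, a in samples:
        qtype = " ".join(q.lower().split()[:2])
        groups[(qtype, a)].append((q, a))
    train, test = [], []
    # Alternate groups between splits so each split sees the type,
    # but never the same (type, answer) combination.
    for i, (_, items) in enumerate(sorted(groups.items())):
        (train if i % 2 == 0 else test).extend(items)
    return train, test

samples = [("What color is the dog?", "white"),
           ("What color is the cat?", "black"),
           ("Is the person walking?", "yes"),
           ("Is the dog running?", "no")]
train, test = changing_priors_split(samples)
```

Under this toy split, a model that memorizes "what color → white" from train is guaranteed to be wrong on the test side.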

53 Outline
• Datasets
  – E.g., VQA v2.0
• Evaluation metrics
  – E.g., # pairs of images correctly answered
• Novel problem spaces
  – E.g., changing priors (VQA-CP), novel object captioning, robust captioning, visual coreference resolution
• Inductive biases in models
  – E.g., GVQA, Neural Baby Talk
• Human-in-the-loop evaluation
  – E.g., accuracy is not the final metric, performance of the human-AI team is

54 Grounded VQA (GVQA). Architecture diagram: the image (I) is encoded with VGG features and passed through a Visual Concept Classifier (VCC, two attention hops) that predicts visual concepts grouped into clusters (objects such as bus, car, cone; colors such as red, yellow, green; numbers such as 2, 3, 5). For non-yes/no questions, a Question Extractor and an Answer Cluster Predictor (ACP, LSTM-based, over object/color/number clusters) feed an Answer Predictor (AP) over 998 VQA answers. For yes/no questions, a Concept Extractor (CE, GloVe and POS-tag based) feeds a Visual Verifier (VV) that outputs yes/no.

55 Grounded VQA (GVQA): architecture diagram (repeated).

56 Grounded VQA (GVQA): architecture diagram (repeated).

57 Grounded VQA (GVQA): architecture diagram (repeated).
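The two-branch decomposition can be caricatured in a few lines. A hypothetical sketch (the starter-word heuristic and function names are mine, not the paper's): yes/no questions are routed to a verifier branch, everything else to an answer-predictor branch, which is the inductive bias GVQA builds in.

```python
def route_question(question):
    """Toy GVQA-style routing: yes/no questions go to a visual verifier,
    all other questions to an answer predictor over concept clusters.
    The leading-word heuristic is a simplification for illustration."""
    yes_no_starters = ("is", "are", "does", "do", "was", "were", "can")
    first = question.lower().split()[0]
    return "visual_verifier" if first in yes_no_starters else "answer_predictor"

print(route_question("Is this a casino?"))        # visual_verifier
print(route_question("What color is the cone?"))  # answer_predictor
```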

58 Visual Coreference Resolution in Visual Dialog using Neural Module Networks

59 Neural Baby Talk

60 Framework (diagram). A spectrum from more grounded to more natural captioning: a detector-based pipeline (DPM + CRF) yields grounded but template-like captions (“This is a photograph of one dog and one cake. ...”); a plain CNN-RNN captioner yields natural but weakly grounded captions (“A dog is sitting on a couch with a toy.”); Neural Baby Talk sits in between, with an RNN generating a caption template whose slots are filled from regions found by an object detector (dog, cake, table, chair, tie, puppy). Slide credit: Jiasen Lu
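The slot-filling idea above can be sketched minimally (a hypothetical simplification of the pipeline; the `<region>` token and all names are illustrative): the language model emits a template with slots for visual words, and a detector's labels fill the slots.

```python
def fill_template(template, detections):
    """Toy Neural-Baby-Talk-style slot filling: replace each '<region>'
    slot in the template with the next detected object label.
    Assumes one detection per slot."""
    it = iter(detections)
    return " ".join(next(it) if tok == "<region>" else tok
                    for tok in template)

template = ["A", "<region>", "is", "sitting", "at", "a", "<region>",
            "with", "a", "<region>", "."]
detections = ["dog", "table", "cake"]
print(fill_template(template, detections))
# A dog is sitting at a table with a cake .
```

Because the visual words come from a detector rather than the language model's vocabulary statistics, the caption stays grounded in the image, and swapping in novel detected objects (as in the examples that follow) changes the caption accordingly.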

61 A close up of a stuffed animal on a plate.

62 A person is sitting at a table with a sandwich.

63 A teddy bear sitting on a table with a plate of food.

64 A Mr. Ted sitting at a table with a pie and a cup of coffee.

65 Outline
• Datasets
  – E.g., VQA v2.0
• Evaluation metrics
  – E.g., # pairs of images correctly answered
• Novel problem spaces
  – E.g., changing priors (VQA-CP), novel object captioning, robust captioning, visual coreference resolution
• Inductive biases in models
  – E.g., GVQA, Neural Baby Talk
• Human-in-the-loop evaluation
  – E.g., accuracy is not the final metric, performance of the human-AI team is


68 Image Guessing Game
• Compared two bots
• One did better at this game in bot-bot evaluation
• The trend did not generalize when evaluated with humans*
* caveats apply

69 Summary
• VQA is a rich problem space
  – Vision, language, attention, reasoning, external knowledge, HCI, ...
• Challenges
  – Grounding and generalization
• Forcing models to look
  – Counter via datasets
    • E.g., VQA v2.0, VQA-CP
  – Counter via evaluation metrics
    • E.g., # pairs of images correctly answered
  – Counter via novel problem spaces
    • E.g., novel object captioning, robust captioning, visual coreference resolution
  – Counter via inductive biases in models
    • E.g., GVQA, Neural Baby Talk
  – Counter via human-in-the-loop evaluation
    • E.g., accuracy is not the final metric, performance of the human-AI team is

70 Thank you.