Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items

Kota Yamaguchi (Stony Brook University, Stony Brook, NY, USA), M. Hadi Kiapour (UNC at Chapel Hill, Chapel Hill, NC, USA), Tamara L. Berg (UNC at Chapel Hill, Chapel Hill, NC, USA)
[email protected] [email protected] [email protected]

Abstract

Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on the fly from retrieved examples, and transferred parse masks (paper doll item transfer) from retrieved examples. Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy.

In this paper, we take a data driven approach to clothing parsing. We first collect a large, complex, real world collection of outfit pictures from a social network focused on fashion, Chictopia. Using a very small set of hand parsed images in combination with the text tags associated with each image in the collection, we can parse our large database accurately. Now, given a query image without any associated text, we can predict an accurate parse by retrieving similar outfits from our parsed collection, building local models from retrieved clothing items, and transferring inferred clothing items from the retrieved samples to the query image. Final iterative smoothing produces our end result. In each of these steps we take advantage of the relationship between clothing and body pose to constrain prediction and produce a more accurate parse. We call this approach paper doll parsing because it essentially transfers predictions from retrieved samples to the query, like laying paper cutouts of clothing items onto a paper doll. Consistencies in dressing make this retrieval based effort possible.

1. Introduction
Introduction cutouts of clothing items onto a paper doll. Consistencies Clothing choices vary widely across the global popu- in dressing make this retrieval based effort possible. lation. For example, one person’s style may lean toward In particular, we propose a retrieval based approach to preppy while another’s trends toward goth. However, there clothing parsing that combines: are commonalities. For instance, walking through a col- • Pre-trained global models of clothing items. lege campus you might notice student after student consis- Local models of clothing items learned on the fly from • tently wearing combinations of jeans, t-shirts, sweatshirts, retrieved examples. and sneakers. Or, you might observe those who have just stumbled out of bed and are wandering to class looking di- Parse mask predictions transferred from retrieved ex- • sheveled in their pajamas. Even hipsters who purport to be amples to the query image. independent in their thinking and dress, tend to wear similar • Iterative label smoothing. outfits consisting of variations on tight-fitting jeans, button Clothing recognition is a challenging and societally im- down shirts, and thick plastic glasses. In some cases, style portant problem – global sales for clothing total over a choices can be a strong cue for visual recognition. hundred billion dollars, much of which is conducted on- In addition to style variation, individual clothing items line. This is reflected in the growing interest in clothing also display many different appearance characteristics. As related recognition papers [11, 10, 25, 16, 26, 7, 2, 4], per- a concrete example, shirts have an incredibly wide range of haps boosted by recent advances in pose estimation [27, 3]. appearances based on cut, color, material, and pattern. This Many of these papers have focused on specific aspects of can make identifying part of an outfit as a shirt very chal- clothing recognition such as predicting attributes of cloth- lenging. 
Luckily, for any particular choice of these param- ing [7, 2, 4], outfit recommendation [15], or identifying as- , blue and white checked button down, there are eters, e.g. pects of socio-identity through clothing [18, 20]. many shirts with similar appearance. It is this visual simi- We attack the problem of clothing parsing, assigning a larity and the existence of some consistency in style choices semantic label to each pixel in the image where labels can discussed above that we exploit in our system. 1

be selected from background, skin, hair, or from a large set of clothing items (e.g. boots, tights, sweater). Effective solutions to clothing parsing could enable useful end-user applications such as pose independent clothing retrieval [26] or street to shop applications [16]. This problem is closely related to the general image parsing problem, which has been approached successfully using related non-parametric methods [21, 14, 22]. However, we view the clothing parsing problem as suitable for specialized exploration because it deals with people, a category that has obvious significance. The clothing parsing problem is also special in that one can take advantage of body pose estimates during parsing, and we do so in all parts of our method.

Previous state of the art on clothing parsing [26] performed quite well on the constrained parsing problem, where test images are parsed given user provided tags indicating depicted clothing items. However, it was less effective at unconstrained clothing parsing, where test images are parsed in the absence of any textual information. We provide an approach to unconstrained clothing parsing that performs much better than previous state of the art, boosting overall image labeling performance from 77% to 84%, and performance of labeling foreground pixels (those actually on the body) from 23% to 40%, an increase of 74% over the previous accuracy.

Figure 1: Parsing pipeline.
Figure 2: Spatial descriptors for style representation.
Figure 3: Tag prediction PR-plot.

2. Dataset

This paper uses the Fashionista dataset provided in [26] and an expansion called the Paper Doll dataset, which we collected for this paper. The Fashionista dataset provides 685 fully parsed images that we use for supervised training and performance evaluation: 456 for training and 229 for testing. The training samples are used for learning feature transforms, building global clothing models, and adjusting parameters. The testing samples are reserved for evaluation.

The Paper Doll dataset is a large collection of tagged fashion pictures. We collected over 1 million pictures from Chictopia with associated metadata tags denoting characteristics such as color, clothing item, or occasion. Since the Fashionista dataset also uses Chictopia, we automatically exclude any duplicate pictures from the Paper Doll dataset. From the remaining pictures, we select those tagged with at least one clothing item and run a full-body pose detector [27], keeping those that have a person detection. This results in 339,797 pictures weakly annotated with clothing items and estimated pose. Though the annotations are not always complete – users often do not label all depicted items, especially small items or accessories – it is rare to find images where an annotated tag is not present. We use the Paper Doll dataset for style retrieval.

3. Approach overview

For a query image, our approach consists of two steps:
1. Retrieve similar images from the parsed database.
2. Use retrieved images and tags to parse the query.
Figure 1 depicts the overall parsing pipeline.

3.1. Low-level features

We first run a pose estimator [27] and normalize the full-body bounding box to a fixed size. The pose estimator is trained using the Fashionista training split and negative samples from the INRIA dataset. During parsing, we compute the parse in this fixed frame size, then warp it back to the original image, assuming regions outside the bounding box are background.

Our methods draw from a number of dense feature types (each parsing method uses some subset):
• RGB: RGB color of the pixel.
• Lab: L*a*b* color of the pixel.
• MR8: Maximum Response Filters [23].
• Gradients: Image gradients at the pixel.
• HOG: HOG descriptor at the pixel.
• Boundary Distance: Negative log-distance from the boundary of an image.
• Pose Distance: Negative log-distance from 14 body joints and any body limbs.

Whenever we use a statistical model built upon these features, we first normalize the features by subtracting their mean and dividing by 3 standard deviations for each dimension.

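The feature normalization described above, together with the squared-feature expansion used with the logistic regression models, can be sketched as follows. This is a minimal illustration with assumed array shapes, not the authors' Matlab code; the small constant guarding zero-variance dimensions is our assumption:

```python
import numpy as np

def normalize_features(feats):
    """Per-dimension normalization: subtract the mean and divide by
    3 standard deviations (feats has shape (num_pixels, num_dims))."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (3.0 * sigma + 1e-12)  # guard zero-variance dims

def expand_for_logreg(x):
    """Normalized features, their squares, and a constant bias term,
    giving 2N + 1 parameters for an N-dimensional feature vector."""
    return np.concatenate([x, x ** 2, [1.0]])
```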
Also, when we use logistic regression [8], we use these normalized features and their squares, along with a constant bias. So, for an N-dimensional feature vector, we always learn 2N + 1 parameters.

4. Style retrieval

Our goal in retrieving similar pictures is two-fold: a) to predict depicted clothing items, and b) to obtain information helpful for parsing clothing items.

4.1. Style descriptor

We design a descriptor for style retrieval that is useful for finding styles with similar appearance. For an image, we obtain a set of 24 key points interpolated from the 27 pose-estimated body joints. These key points are used to extract part-specific spatial descriptors: a mean-std pooling of normalized dense features in 4-by-4 cells in a 32-by-32 patch around each key point. That is, for each cell in the patch, we compute the mean and standard deviation of the normalized features (Figure 2 illustrates). The features included in this descriptor are RGB, Lab, MR8, HOG, Boundary Distance, and Skin-hair Detection.

Skin-hair Detection is computed using logistic regression for skin, hair, background, and clothing at each pixel. For its input, we use RGB, Lab, MR8, HOG, Boundary Distance, and Pose Distance. Note that we do not include Pose Distance as a feature in the style descriptor, but instead use Skin-hair Detection to indirectly include pose-dependent information in the representation, since the purpose of the style descriptor is to find similar styles independent of pose.

For each key point, we compute the above spatial descriptors and concatenate them to describe the overall style, resulting in a 39,168 dimensional vector for an image. For efficiency of retrieval, we use PCA for dimensionality reduction to a 441 dimensional representation. We use the Fashionista training split to build the Skin-hair detector and also to train the PCA model.

4.2. Retrieval

We use L2-distance over the style descriptors to find the K nearest neighbors (KNN) in the Paper Doll dataset. For efficiency, we build a KD-tree [24] to index samples. In this paper, we fix K = 25 for all the experiments. Figure 4 shows two examples of nearest neighbor retrievals.

4.3. Tag prediction

The retrieved samples are first used to predict clothing items potentially present in a query image. The purpose of tag prediction is to obtain a set of tags that might be relevant to the query, while eliminating definitely irrelevant items from consideration. Later stages can remove spuriously predicted tags, but tags removed at this stage can never be predicted. Therefore, we wish to obtain the best possible predictive performance in the high recall regime.

Tag prediction is based on a simple voting approach from KNN. Each tag in the retrieved samples provides a vote weighted by the inverse of its distance from the query, which forms a confidence for the presence of that item. We threshold this confidence to predict the presence of an item. We experimentally selected this simple KNN prediction instead of other models because KNN works well for the high-recall prediction task. Figure 3 shows the performance of linear classification vs KNN at K = 10 and 25. While linear classification (clothing item classifiers trained on subsets of body parts, e.g. pants on lower body key points) works well in the low-recall high-precision regime, KNN outperforms it in the high-recall range. KNN at K = 25 also outperforms K = 10.

Since the goal here is only to eliminate obviously irrelevant items while keeping most potentially relevant items, we tune the threshold to give 0.5 recall on the Fashionista training split. Due to the skewed item distribution in the Fashionista dataset, we use the same threshold for all items to avoid over-fitting the predictive model. In the parsing stage, we always include background, skin, and hair, in addition to the predicted clothing tags.

5. Clothing parsing

Following tag prediction, we start to parse the image in a per-pixel fashion. Parsing has two major phases:
1. Compute pixel-level confidence from three methods: global parse, nearest neighbor parse, and transferred parse.
2. Apply iterative label smoothing to get a final parse.
Figure 5 illustrates outputs from each parsing stage.

5.1. Pixel confidence

Let us denote by y_i the clothing item label at pixel i. The first step in parsing is to compute a confidence score of assigning clothing item l to y_i. We model this scoring function S as the mixture of three confidence functions:

S(y_i | x_i, D) ≡ S_global(y_i | x_i, D)^{λ_1} · S_nearest(y_i | x_i, D)^{λ_2} · S_transfer(y_i | x_i, D)^{λ_3},   (1)

where x_i denotes pixel features, Λ ≡ [λ_1, λ_2, λ_3] are mixing parameters, and D is a set of nearest neighbor samples.

5.1.1 Global parse

The first term in our model is a global clothing likelihood, trained for each clothing item on the hand parsed Fashionista training split.

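The inverse-distance KNN tag voting of Section 4.3 can be sketched as below. Only the inverse-distance weighting and confidence thresholding are from the paper; the epsilon guard and the exact data layout are our assumptions:

```python
from collections import defaultdict

def predict_tags(neighbors, threshold):
    """Each retrieved sample votes for its tags, weighted by the inverse
    of its style-descriptor distance; tags above threshold are predicted.
    neighbors: list of (distance, tags) pairs from KNN retrieval."""
    votes = defaultdict(float)
    for dist, tags in neighbors:
        weight = 1.0 / (dist + 1e-8)  # guard against zero distance
        for tag in set(tags):
            votes[tag] += weight
    return {tag for tag, v in votes.items() if v >= threshold}
```

In the paper, the threshold is tuned to reach 0.5 recall on the Fashionista training split, and background, skin, and hair are always included on top of the predicted tags.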
This is modeled as a logistic regression that computes the likelihood of a label assignment to each pixel for a given set of possible clothing items:

S_global(y_i = l | x_i, D) ≡ P(y_i = l | x_i, θ^g_l) · 1[l ∈ τ(D)],   (2)

where P is logistic regression given feature x_i and model parameter θ^g_l, 1[·] is an indicator function, and τ(D) is the set of predicted tags from nearest neighbor retrieval. We use RGB, Lab, MR8, HOG, and Pose Distance as features. Any unpredicted items receive zero probability.

The model parameter θ^g_l is trained on the Fashionista training split. For training each θ^g_l, we select negative pixel samples only from those images having at least one positive pixel. That is, the model gives localization probability given that a label l is present in the picture. This could potentially increase confusion between similar item types, such as blazer and jacket, since they usually do not appear together, in favor of better localization accuracy. We chose to rely on the tag prediction τ to resolve such confusion.

Because of the tremendous number of pixels in the dataset, we subsample pixels to train each of the logistic regression models. During subsampling, we try to sample pixels so that the resulting label distribution is close to uniform in each image, preventing learned models from only predicting large items.

Figure 4: Retrieval examples. The leftmost column shows query images with ground truth item annotation; the rest are retrieved images with associated tags in the top 25. Notice that retrieved samples sometimes have missing item tags.

Figure 5: Parsing outputs at each step. Labels are MAP assignments of the scoring functions.

5.1.2 Nearest neighbor parse

The second term in our model is also a logistic regression, but trained only on the retrieved nearest neighbor (NN) images. Here we learn a local appearance model for each clothing item based on examples that are similar to the query, e.g. blazers that look similar to the query blazer because they were retrieved via style similarity. These local models are much better models for the query image than those trained globally (because blazers in general can take on a huge range of appearances).
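A minimal sketch of the tag-gated likelihood in Equation (2): a per-item logistic model scores the pixel, and items outside the predicted tag set get zero probability. The theta dictionary of (weights, bias) pairs is a hypothetical stand-in for the trained parameters, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def s_global(x, theta, label, predicted_tags):
    """Eq. (2): logistic likelihood gated by the predicted tags.
    theta maps each item label to its one-vs-all (weights, bias)."""
    if label not in predicted_tags:
        return 0.0  # unpredicted items receive zero probability
    w, b = theta[label]
    return float(sigmoid(np.dot(w, x) + b))
```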

S_nearest(y_i = l | x_i, D) ≡ P(y_i = l | x_i, θ^n_l) · 1[l ∈ τ(D)].   (3)

The model parameter θ^n_l is locally learned from the retrieved samples D, using RGB, Lab, Gradient, MR8, Boundary Distance, and Pose Distance. In this step, predicted pixel-level annotations from the retrieved samples (computed during pre-processing, detailed in Section 5.3) are used to learn local appearance models. NN models are trained using any pixel (with subsampling) in the retrieved samples in a one-vs-all fashion.

Figure 6: Transferred parse. Likelihoods in nearest neighbors are transferred to the input via dense matching.

5.1.3 Transferred parse

The third term in our model is obtained by transferring the parse mask likelihoods estimated by the global parse S_global from the retrieved images to the query image (Figure 6 visualizes an example). This approach is similar in spirit to approaches for general segmentation that transfer likelihoods using over-segmentation and matching [1, 13, 17], but here, because we are performing segmentation on people, we can take advantage of pose estimates during transfer.

In our approach, we find dense correspondence based on super-pixels instead of pixels (e.g., [21]) to overcome the difficulty in naively transferring deformable, often occluded clothing items pixel-wise. Our approach first computes an over-segmentation of both query and retrieved images using a fast and simple segmentation algorithm [9], then finds corresponding pairs of super-pixels between the query and each retrieved image based on pose and appearance:
1. For each super-pixel in the query, find the 5 nearest super-pixels in each retrieved image using L2 Pose Distance.
2. Compute a concatenation of bag-of-words features from RGB, Lab, MR8, and Gradient for each of those super-pixels.
3. Pick the closest super-pixel from each retrieved image using L2 distance on the bag-of-words feature.

Let us denote the super-pixel of pixel i with s_i, the selected corresponding super-pixel from image r with s_{i,r}, and the bag-of-words features of super-pixel s with h(s). Then, our transferred parse is computed as:

S_transfer(y_i | x_i, D) ≡ (1/Z) Σ_{r ∈ D} M(y_i, s_{i,r}) / (1 + ‖h(s_i) − h(s_{i,r})‖),   (4)

where we define:

M(y_i, s_{i,r}) ≡ (1/|s_{i,r}|) Σ_{j ∈ s_{i,r}} P(y_j = l | x_j, θ^g_l) · 1[l ∈ τ(r)],   (5)

which is the mean of the global parse over the super-pixel in a retrieved image. Here we denote the set of tags of image r with τ(r), and Z is a normalization constant.

5.1.4 Combined confidence

After computing our three confidence scores, we combine them with parameter Λ to get the final pixel confidence S, as described in Equation 1. We choose the best mixing parameter such that MAP assignment of pixel labels gives the best foreground accuracy on the Fashionista training split, by solving the following optimization (over foreground pixels F):

max_Λ Σ_{i ∈ F} 1[ ỹ_i = argmax_{y_i} S(y_i | x_i) ],   (6)

where ỹ_i is the ground truth annotation of pixel i. The nearest neighbor samples D are dropped from the notation for simplicity. We use a simplex search algorithm to solve for the optimum parameter, starting from uniform values. In our experiment, we obtained Λ = (0.41, 0.18, 0.39).

We exclude background pixels from this optimization because of the skew in the label distribution: background pixels in the Fashionista dataset represent 77% of total pixels, which tends to direct the optimizer to find meaningless local optima, i.e., predicting everything as background.

5.2. Iterative label smoothing

The combined confidence gives a rough estimate of item localization. However, it does not respect boundaries of actual clothing items, since it is computed per-pixel. Therefore, we introduce an iterative smoothing stage that considers all pixels together to provide a smooth parse of an image. Following the approach of [19], we formulate this smoothing problem by considering the joint labeling of pixels Y ≡ {y_i} and item appearance models Θ ≡ {θ^s_l}, where θ^s_l is a model for a label l. The goal is to find the optimal joint assignment Y* and item models Θ* for a given image.

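The mixture of Equation (1) and the per-pixel MAP assignment can be sketched with dense score arrays. The default weights below are the Λ reported in the text; the epsilon floor for zero-probability items is our assumption:

```python
import numpy as np

def combined_confidence(s_global, s_nearest, s_transfer, lam=(0.41, 0.18, 0.39)):
    """Eq. (1): weighted geometric mixture of the three per-pixel,
    per-label score arrays (each of shape (num_pixels, num_labels))."""
    eps = 1e-12  # floor so zero-probability items do not break the powers
    return (np.maximum(s_global, eps) ** lam[0]
            * np.maximum(s_nearest, eps) ** lam[1]
            * np.maximum(s_transfer, eps) ** lam[2])

def map_assignment(scores):
    """MAP label index per pixel, as used to initialize smoothing."""
    return scores.argmax(axis=1)
```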
We start by initializing the current predicted parsing Ŷ_0 with the MAP assignment under the combined confidence S. Then, we treat Ŷ_0 as training data to build initial image-specific item models Θ̂_0 (logistic regressions). For these models, we only use RGB, Lab, and Boundary Distance, since otherwise the models easily over-fit. Also, we use a higher regularization parameter for training instead of finding the best cross-validation parameter, assuming the initial training labels Ŷ_0 are noisy.

After obtaining Ŷ_0 and Θ̂_0, we solve for the optimal assignment Ŷ_t at the current step t with the following optimization:

Ŷ_t = argmax_Y Π_i Φ(y_i | x_i, S, Θ̂_t) · Π_{(i,j) ∈ V} Ψ(y_i, y_j | x_i, x_j),   (7)

where we define:

Φ(y_i | x_i, S, Θ̂_t) ≡ S(y_i | x_i)^λ · P(y_i | x_i, θ̂^s_l)^{1−λ},   (8)
Ψ(y_i, y_j | x_i, x_j) ≡ exp{ −γ e^{−β ‖x_i − x_j‖²} · 1[y_i ≠ y_j] }.   (9)

Here, V is a set of neighboring pixel pairs, and λ, β, γ are the parameters of the model, which we experimentally determined in this paper. We use the graph-cut algorithm [6, 5, 12] to find the optimal solution.

With the updated estimate of the labels Ŷ_t, we train the logistic regressions Θ̂_t and repeat until the algorithm converges. Note that this iterative approach is not guaranteed to converge. We terminate the iteration when 10 iterations pass, when the number of changes in label assignment is less than 100, or when the ratio of the change is smaller than 5%.

5.3. Offline processing

Our retrieval techniques require the large Paper Doll dataset to be pre-processed (parsed), for building nearest neighbor models on the fly from retrieved samples and for transferring parse masks. Therefore, we estimate a clothing parse for each sample in the 339K image dataset, making use of pose estimates and the tags associated with the image by the photo owner. This parse makes use of the global clothing models (constrained to the tags associated with the image by the photo owner) and the iterative smoothing parts of our approach.

Although these training images are tagged, there are often clothing items missing in the annotation. This will lead iterative smoothing to mark foreground regions as background. To prevent this, we add an unknown item label with uniform probability and initialize Ŷ_0 together with the global clothing model at all samples. This effectively prevents the final estimated labeling Ŷ from marking missing items with incorrect labels.

Offline processing of the Paper Doll dataset took a few days with our Matlab implementation in a distributed environment. For an unseen query image, our full parsing pipeline takes 20 to 40 seconds, including pose estimation. The major computational bottlenecks are pose estimation and iterative smoothing.

6. Experimental results

We evaluate parsing performance on the 229 testing samples from the Fashionista dataset. The task is to predict a label for every pixel, where labels represent a set of 56 different categories – a very large and challenging variety of clothing items.

Performance is measured in terms of standard metrics: accuracy, average precision, average recall, and average F-1 over pixels. In addition, we also include foreground accuracy (see Equation 6) as a measure of how accurately each method parses foreground regions (those pixels on the body, not on the background). Note that the average measures are over non-empty labels after calculating pixel-based performance for each, since some labels are not present in the test set. Since there are some empty predictions, F-1 does not necessarily match the geometric mean of average precision and recall.

Table 1 summarizes the predictive performance of our parsing method, including a breakdown of how well the intermediate parsing steps perform. For comparison, we include the performance of the previous state of the art on clothing parsing [26]. Our approach outperforms the previous method in overall accuracy (84.68% vs 77.45%). It also provides a huge boost in foreground accuracy. The previous approach provides 23.11% foreground accuracy, while we obtain 40.20%. We also obtain much higher precision (33.34% vs 10.53%) without much decrease in recall (15.35% vs 17.2%).

Figure 7 shows examples from our parsing method, with ground truth annotation and the method of [26]. We observe that our method produces a parse that respects the actual item boundary, even if some items are incorrectly labeled, e.g., predicting pants as jeans, or jacket as blazer. However, often these confusions are due to high similarity in appearance between items, and sometimes due to non-exclusivity in item types, i.e., jeans are a type of pants.

Figure 8 plots F-1 scores for non-empty items (items predicted on the test set) comparing the method of [26] with our method. Our model outperforms the prior work on many items, especially major foreground items such as coat, skirt, jeans, shorts, or dress. This results in a significant boost in foreground accuracy and perceptually better parsing results.

Though our method is successful at foreground prediction overall, there are a few drawbacks to our approach. By design, our style descriptor is aimed at representing whole outfit style rather than specific details of the outfit. Consequently, small items like accessories tend to be less weighted during retrieval and are therefore poorly predicted during parsing. However, prediction of small items is inherently extremely challenging because they provide limited appearance information.

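In graph-cut form, the pairwise term of Equation (9) in Section 5.2 becomes a contrast-sensitive Potts edge cost (its negative log): neighboring pixels with similar features pay a high penalty for taking different labels. A sketch with assumed scalar parameters, not the paper's implementation:

```python
import numpy as np

def pairwise_cost(y_i, y_j, x_i, x_j, beta=1.0, gamma=1.0):
    """Edge cost between neighboring pixels: zero when the labels agree,
    gamma * exp(-beta * ||x_i - x_j||^2) when they disagree."""
    if y_i == y_j:
        return 0.0
    d2 = float(np.sum((x_i - x_j) ** 2))
    return gamma * np.exp(-beta * d2)
```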
Another issue for future work is the prevention of conflicting items from being predicted for the same image, such as dress and skirt, or boots and shoes, which tend not to be worn together. Our iterative smoothing effectively reduces such confusion, but the parsing result sometimes contains one item split into two conflicting items. One way to resolve this would be to enforce constraints on the overall combination of predicted items, but this leads to a difficult optimization problem and we leave it as future work.

Lastly, we find it difficult to predict items with skin-like color or coarsely textured items (similar to issues reported in [26]). Because of the variation in lighting conditions in pictures, it is very hard to distinguish between actual skin and clothing items that look like skin, e.g. slim khaki pants. Also, it is very challenging to differentiate, for example, between bold stripes and a belt using low-level image features. These cases will require higher-level knowledge about outfits to correctly parse.

Table 1: Parsing performance for final and intermediate results (MAP assignments at each step).

Method               | Accuracy | Avg. precision | Avg. recall | Avg. F-1 | F.g. accuracy
CRF [26]             | 77.45    | 10.53          | 17.20       | 10.35    | 23.11
1. Global parse      | 79.63    | 18.59          | 15.18       | 12.98    | 35.88
2. NN parse          | 80.73    | 21.45          | 14.73       | 12.84    | 38.18
3. Transferred parse | 83.06    | 31.47          | 12.24       | 11.85    | 33.20
4. Combined (1+2+3)  | 83.01    | 25.84          | 15.53       | 14.22    | 39.55
5. Our final parse   | 84.68    | 33.34          | 15.35       | 14.87    | 40.20

Figure 7: Parsing examples. Our method sometimes confuses similar items, but gives overall perceptually better results.

7. Conclusion

We describe a clothing parsing method based on nearest neighbor style retrieval. Our system combines: global parse models, nearest neighbor parse models, and transferred parse predictions.

Experimental evaluation shows successful results, demonstrating a significant boost of overall accuracy and especially foreground parsing accuracy over previous work. It is our future work to resolve the confusion between very similar items and to incorporate higher-level knowledge about outfits.

Figure 8: F-1 score of non-empty items. We observe significant performance gains, especially for large items.

Acknowledgements: Research supported by Google Faculty Award, "Seeing Social: Exploiting Computer Vision in Online Communities", and NSF Career Award #1054133.

References
[1] E. Borenstein and J. Malik. Shape guided object segmentation. In CVPR, volume 1, pages 969–976, 2006.
[2] L. Bossard, M. Dantone, C. Leistner, C. Wengert, T. Quack, and L. Van Gool. Apparel classification with style. In ACCV, pages 1–14, 2012.
[3] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In ECCV, 2010.
[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV, pages 1543–1550, 2011.
[5] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[7] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In ECCV, pages 609–623, 2012.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004.
[10] A. C. Gallagher and T. Chen. Clothing cosegmentation for recognizing people. In CVPR, pages 1–8, 2008.
[11] B. Hasan and D. Hogg. Segmentation using deformable spatial priors with application to clothing. In BMVC, 2010.
[12] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[13] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. IJCV, 77(1-3):259–289, 2008.
[14] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–994, 2011.
[15] S. Liu, J. Feng, Z. Song, T. Zhang, H. Lu, C. Xu, and S. Yan. Hi, magic closet, tell me what to wear! In ACM International Conference on Multimedia, pages 619–628, 2012.
[16] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In CVPR, pages 3330–3337, 2012.
[17] M. Marszałek and C. Schmid. Accurate object recognition with shape masks. IJCV, 97(2):191–209, 2012.
[18] A. C. Murillo, I. S. Kwak, L. Bourdev, D. Kriegman, and S. Belongie. Urban tribes: Analyzing group photos from a social perspective. In CVPR Workshops, pages 28–35, 2012.
[19] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, pages 1–15, 2006.
[20] Z. Song, M. Wang, X.-S. Hua, and S. Yan. Predicting occupation via human clothing and contexts. In ICCV, pages 1084–1091, 2011.
[21] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365, 2010.
[22] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013.
[23] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. IJCV, 62(1-2):61–81, 2005.
[24] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[25] N. Wang and H. Ai. Who blocks who: Simultaneous clothing segmentation for grouping images. In ICCV, pages 1535–1542, 2011.
[26] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in fashion photographs. In CVPR, pages 3570–3577, 2012.
[27] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, pages 1385–1392, 2011.
