YOLOv3: An Incremental Improvement

Joseph Redmon, Ali Farhadi
University of Washington

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 $AP_{50}$ in 51 ms on a Titan X, compared to 57.5 $AP_{50}$ in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

[Figure 1: plot of COCO AP against inference time (ms) for YOLOv3, RetinaNet-50, RetinaNet-101 and the labelled methods B to G. The values in the plot legend are:]

| Method | mAP | time (ms) |
|---|---|---|
| [B] SSD321 | 28.0 | 61 |
| [C] DSSD321 | 28.0 | 85 |
| [D] R-FCN | 29.9 | 85 |
| [E] SSD513 | 31.2 | 125 |
| [F] DSSD513 | 33.2 | 156 |
| [G] FPN FRCN | 36.2 | 172 |
| RetinaNet-50-500 | 32.5 | 73 |
| RetinaNet-101-500 | 34.4 | 90 |
| RetinaNet-101-800 | 37.8 | 198 |
| YOLOv3-320 | 28.2 | 22 |
| YOLOv3-416 | 31.0 | 29 |
| YOLOv3-608 | 33.0 | 51 |

Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn't do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people's research a little.

Actually, that's what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don't have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don't need intros, y'all know why we're here. So the end of this introduction will signpost for the rest of the paper. First we'll tell you what the deal is with YOLOv3. Then we'll tell you how we do. We'll also tell you about some things we tried that didn't work. Finally we'll contemplate what this all means.

2. The Deal

So here's the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that's better than the other ones. We'll just take you through the whole system from scratch so you can understand it all.

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior
is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

[Figure 2: diagram of a grid cell at offset $(c_x, c_y)$, a dashed prior box of size $p_w \times p_h$, and the predicted box with center $(b_x, b_y)$ and size $b_w \times b_h$, annotated with $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$.]

Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (e.g. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is $N \times N \times [3 * (4 + 1 + 80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326).

2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it... wait for it... Darknet-53!

| Repeat | Type | Filters | Size | Output |
|---|---|---|---|---|
| | Convolutional | 32 | 3 × 3 | 256 × 256 |
| | Convolutional | 64 | 3 × 3 / 2 | 128 × 128 |
| 1× | Convolutional | 32 | 1 × 1 | |
| | Convolutional | 64 | 3 × 3 | |
| | Residual | | | 128 × 128 |
| | Convolutional | 128 | 3 × 3 / 2 | 64 × 64 |
| 2× | Convolutional | 64 | 1 × 1 | |
| | Convolutional | 128 | 3 × 3 | |
| | Residual | | | 64 × 64 |
| | Convolutional | 256 | 3 × 3 / 2 | 32 × 32 |
| 8× | Convolutional | 128 | 1 × 1 | |
| | Convolutional | 256 | 3 × 3 | |
| | Residual | | | 32 × 32 |
| | Convolutional | 512 | 3 × 3 / 2 | 16 × 16 |
| 8× | Convolutional | 256 | 1 × 1 | |
| | Convolutional | 512 | 3 × 3 | |
| | Residual | | | 16 × 16 |
| | Convolutional | 1024 | 3 × 3 / 2 | 8 × 8 |
| 4× | Convolutional | 512 | 1 × 1 | |
| | Convolutional | 1024 | 3 × 3 | |
| | Residual | | | 8 × 8 |
| | Avgpool | | Global | |
| | Connected | | 1000 | |
| | Softmax | | | |

Table 1. Darknet-53.
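To make the box parameterization of Section 2.1 and the tensor depth of Section 2.3 concrete, here is a minimal Python sketch. It is not the Darknet implementation. The function names (`decode_box`, `encode_box`, `output_depth`) are made up for illustration, and it assumes that $c_x, c_y$ are measured in grid cells while the priors $p_w, p_h$ are in pixels.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Apply b_x = sigma(t_x) + c_x, b_y = sigma(t_y) + c_y,
    b_w = p_w * exp(t_w), b_h = p_h * exp(t_h)."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))


def encode_box(b, cell, prior):
    """Invert the equations above to get the regression targets.
    Only valid when the box center lies strictly inside the cell."""
    bx, by, bw, bh = b
    cx, cy = cell
    pw, ph = prior
    ox, oy = bx - cx, by - cy  # in-cell offsets, each in (0, 1)
    return (math.log(ox / (1.0 - ox)), math.log(oy / (1.0 - oy)),
            math.log(bw / pw), math.log(bh / ph))


def output_depth(boxes_per_scale: int = 3, num_classes: int = 80) -> int:
    """Channels per grid cell: boxes * (4 offsets + 1 objectness + classes)."""
    return boxes_per_scale * (4 + 1 + num_classes)


# All-zero predictions put the center mid-cell and reproduce the prior size.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(116.0, 90.0))
targets = encode_box((3.25, 4.75, 58.0, 180.0), cell=(3, 4), prior=(116.0, 90.0))
depth = output_depth()
print(box, targets, depth)
```

With the COCO settings from the text (3 boxes per scale, 80 classes) the depth works out to 3 * 85 = 255 channels per grid cell.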
This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

| Backbone | Top-1 | Top-5 | Bn Ops | BFLOP/s | FPS |
|---|---|---|---|---|---|
| Darknet-19 [15] | 74.1 | 91.8 | 7.29 | 1246 | 171 |
| ResNet-101 [5] | 77.1 | 93.7 | 19.7 | 1039 | 53 |
| ResNet-152 [5] | 77.6 | 93.8 | 29.4 | 1090 | 37 |
| Darknet-53 | 77.2 | 93.8 | 18.7 | 1457 | 78 |

Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.

Each network is trained with identical settings and tested at 256 × 256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That's mostly because ResNets have just way too many layers and aren't very efficient.

2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].

3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.

However, when we look at the "old" detection metric of mAP at IOU = .5 (or $AP_{50}$ in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.

In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high $AP_S$ performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the $AP_{50}$ metric (see figure 3) we see YOLOv3 has significant benefits over other detection systems. Namely, it's faster and better.

4. Things We Tried That Didn't Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn't work. Here's the stuff we can remember.

Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn't work very well.

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP.

Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren't totally sure.

| Method | backbone | AP | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|---|
| Two-stage methods | | | | | | | |
| Faster R-CNN+++ [5] | ResNet-101-C4 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9 |
| Faster R-CNN w FPN [8] | ResNet-101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Faster R-CNN by G-RMI [6] | Inception-ResNet-v2 [21] | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0 |
| Faster R-CNN w TDM [20] | Inception-ResNet-v2-TDM | 36.8 | 57.7 | 39.2 | 16.2 | 39.8 | 52.1 |
| One-stage methods | | | | | | | |
| YOLOv2 [15] | DarkNet-19 [15] | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 |
| SSD513 [11, 3] | ResNet-101-SSD | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| DSSD513 [3] | ResNet-101-DSSD | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 |
| RetinaNet [9] | ResNet-101-FPN | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 |
| RetinaNet [9] | ResNeXt-101-FPN | 40.8 | 61.1 | 44.1 | 24.1 | 44.2 | 51.2 |
| YOLOv3 608 × 608 | Darknet-53 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 41.9 |

Table 3. I'm seriously just stealing all these tables from [9] they take soooo long to make from scratch. Ok, YOLOv3 is doing alright. Keep in mind that RetinaNet has like 3.8× longer to process an image. YOLOv3 is much better than SSD variants and comparable to state-of-the-art models on the $AP_{50}$ metric.
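The difference between the $AP_{50}$ and $AP_{75}$ columns of Table 3 comes down to how much overlap a detection needs before it counts as correct. Below is a small sketch of that check, assuming corner-format boxes `(x_min, y_min, x_max, y_max)`. The helper names `iou` and `is_match` and the example boxes are illustrative, not taken from the COCO evaluation code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float) -> bool:
    """A detection counts as correct if its IOU reaches the threshold."""
    return iou(pred, truth) >= threshold


# A 100 x 100 ground truth box and a prediction shifted 15 px to the right.
truth = (0.0, 0.0, 100.0, 100.0)
shifted = (15.0, 0.0, 115.0, 100.0)
overlap = iou(truth, shifted)  # 8500 / 11500

print(f"IOU = {overlap:.3f}")
print("counts at .5 IOU: ", is_match(shifted, truth, 0.5))
print("counts at .75 IOU:", is_match(shifted, truth, 0.75))
```

A box that is off by 15% of its width still passes at .5 IOU but fails at .75. That is the kind of slightly misaligned box the text says YOLOv3 tends to produce.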
[Figure 3: plot of COCO mAP-50 against inference time (ms) for YOLOv3, RetinaNet-50, RetinaNet-101 and the labelled methods B to G. The values in the plot legend are:]

| Method | mAP-50 | time (ms) |
|---|---|---|
| [B] SSD321 | 45.4 | 61 |
| [C] DSSD321 | 46.1 | 85 |
| [D] R-FCN | 51.9 | 85 |
| [E] SSD513 | 50.4 | 125 |
| [F] DSSD513 | 53.3 | 156 |
| [G] FPN FRCN | 59.1 | 172 |
| RetinaNet-50-500 | 50.9 | 73 |
| RetinaNet-101-500 | 53.1 | 90 |
| RetinaNet-101-800 | 57.5 | 198 |
| YOLOv3-320 | 51.5 | 22 |
| YOLOv3-416 | 55.3 | 29 |
| YOLOv3-608 | 57.9 | 51 |

Figure 3. Again adapted from the [9], this time displaying speed/accuracy tradeoff on the mAP at .5 IOU metric. You can tell YOLOv3 is good because it's very high and far to the left. Can you cite your own paper? Guess who's going to try, this guy → [16]. Oh, I forgot, we also fix a data loading bug in YOLOv2, that helped by like 2 mAP. Just sneaking this in here to not throw off layout.

Dual IOU thresholds and truth assignment. Faster R-CNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example, by [.3 − .7] it is ignored, less than .3 for all ground truth objects it is a negative example. We tried a similar strategy but couldn't get good results.

We quite like our current formulation, it seems to be at a local optimum at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.

5. What This All Means

YOLOv3 is a good detector. It's fast, it's accurate. It's not as great on the COCO average AP between .5 and .95 IOU metric. But it's very good on the old detection metric of .5 IOU.

Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: "A full discussion of evaluation metrics will be added once the evaluation server is complete". Russakovsky et al report that humans have a hard time distinguishing an IOU of .3 from .5! "Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult." [18] If humans have a hard time telling the difference, how much does it matter?

But maybe a better question is: "What are we going to do with these detectors now that we have them?" A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won't be used to harvest your personal information and sell it to... wait, you're saying that's exactly what it will be used for?? Oh.

Well the other people heavily funding vision research are the military and they've never done anything horrible like killing lots of people with new technology oh wait...¹

I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.

In closing, do not @ me. (Because I finally quit Twitter).

¹ The author is funded by the Office of Naval Research and Google.
References

[1] Analogy. Wikipedia, Mar 2018.
[2] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[3] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[4] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. Iqa: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316, 2017.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors.
[7] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
[8] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[12] I. Newton. Philosophiae naturalis principia mathematica. William Dawson & Sons Ltd., London, 1687.
[13] J. Parham, J. Crall, C. Stewart, T. Berger-Wolf, and D. Rubenstein. Animal population censusing at scale with citizen science and photographic identification. 2017.
[14] J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
[15] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6517–6525. IEEE, 2017.
[16] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv, 2018.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[18] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: human-machine collaboration for object annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2131, 2015.
[19] M. Scott. Smart camera gimbal bot scanlime:027, Dec 2017.
[20] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
[21] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. 2017.
[Figure 4: two charts with zero-based axes, mAP 50 against execution time (ms) and mAP 50 against FPS, each comparing YOLOv3 with "All the other slow ones".]

Figure 4. Zero-axis charts are probably more intellectually honest... and we can still screw with the variables to make ourselves look good!

Rebuttal

We would like to thank the Reddit commenters, labmates, emailers, and passing shouts in the hallway for their lovely, heartfelt words. If you, like me, are reviewing for ICCV then we know you probably have 37 other papers you could be reading that you'll invariably put off until the last week and then have some legend in the field email you about how you really should finish those reviews except it won't entirely be clear what they're saying and maybe they're from the future? Anyway, this paper won't have become what it will in time be without all the work your past selves will have done also in the past but only a little bit further forward, not like all the way until now forward. And if you tweeted about it I wouldn't know. Just sayin.

Reviewer #2 AKA Dan Grossman (lol blinding who does that) insists that I point out here that our graphs have not one but two non-zero origins. You're absolutely right Dan, that's because it looks way better than admitting to ourselves that we're all just here battling over 2-3% mAP. But here are the requested graphs. I threw in one with FPS too because we look just like super good when we plot on FPS.

Reviewer #4 AKA JudasAdventus on Reddit writes "Entertaining read but the arguments against the MSCOCO metrics seem a bit weak". Well, I always knew you would be the one to turn on me Judas. You know how when you work on a project and it only comes out alright so you have to figure out some way to justify how what you did actually was pretty cool? I was basically trying to do that and I lashed out at the COCO metrics a little bit. But now that I've staked out this hill I may as well die on it.

See here's the thing, mAP is already sort of broken so an update to it should maybe address some of the issues with it or at least justify why the updated version is better in some way. And that's the big thing I took issue with was the lack of justification. For PASCAL VOC, the IOU threshold was "set deliberately low to account for inaccuracies in bounding boxes in the ground truth data" [2]. Does COCO have better labelling than VOC? This is definitely possible since COCO has segmentation masks maybe the labels are more trustworthy and thus we aren't as worried about inaccuracy. But again, my problem was the lack of justification.

The COCO metric emphasizes better bounding boxes but that emphasis must mean it de-emphasizes something else, in this case classification accuracy. Is there a good reason to think that more precise bounding boxes are more important than better classification? A misclassified example is much more obvious than a bounding box that is slightly shifted.

mAP is already screwed up because all that matters is per-class rank ordering. For example, if your test set only has these two images then according to mAP two detectors that produce these results are JUST AS GOOD:

[Figure 5: the same two images with the box labels produced by each detector.
Detector #1: Bird: 99%, Person: 99%, Person: 99%, Camel: 99%, Dog: 99%, Horse: 99%.
Detector #2: Bird: 90%, Dog: 45%, Horse: 52%, Person: 11%, Person: 42%, Camel: 10%, Horse: 60%, Bird: 89%, Dog: 48%, Horse: 70%, Bird: 75%.]

Figure 5. These two hypothetical detectors are perfect according to mAP over these two images. They are both perfect. Totally equal.

Now this is OBVIOUSLY an over-exaggeration of the problems with mAP but I guess my newly retconned point is that there are such obvious discrepancies between what people in the "real world" would care about and our current metrics that I think if we're going to come up with new metrics we should focus on these discrepancies. Also, like, it's already mean average precision, what do we even call the COCO metric, average mean average precision?

Here's a proposal, what people actually care about is given an image and a detector, how well will the detector find and classify objects in the image. What about getting rid of the per-class AP and just doing a global average precision? Or doing an AP calculation per-image and averaging over that?

Boxes are stupid anyway though, I'm probably a true believer in masks except I can't get YOLO to learn them.
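The rank-ordering argument and the global AP proposal can be checked with a few lines of Python. This is a rough sketch under loud assumptions. The confidences are the ones printed in Figure 5, but which boxes count as true positives is a guess (the top-scoring box of each class, and both Person boxes). AP here is the plain non-interpolated average of precision at each true positive, not the exact VOC or COCO procedure. `global_ap` is one possible reading of the proposal, not a defined metric.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence, is_true_positive) -- TP flags are assumed


def average_precision(dets: List[Detection], n_gt: int) -> float:
    """Non-interpolated AP: mean precision at the rank of each true positive."""
    ranked = sorted(dets, key=lambda d: d[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            hits += 1
            total += hits / rank
    return total / n_gt if n_gt else 0.0


def mean_ap(per_class: Dict[str, List[Detection]], gt: Dict[str, int]) -> float:
    """Per-class AP averaged over classes (the usual mAP)."""
    aps = [average_precision(per_class.get(c, []), n) for c, n in gt.items()]
    return sum(aps) / len(aps)


def global_ap(per_class: Dict[str, List[Detection]], gt: Dict[str, int]) -> float:
    """One AP over all detections pooled together, ignoring class boundaries."""
    pooled = [d for dets in per_class.values() for d in dets]
    return average_precision(pooled, sum(gt.values()))


# Assumed ground truth over the two images: six objects in five classes.
gt_counts = {"bird": 1, "person": 2, "camel": 1, "dog": 1, "horse": 1}

detector_1 = {c: [(0.99, True)] * n for c, n in gt_counts.items()}
detector_2 = {
    "bird": [(0.90, True), (0.89, False), (0.75, False)],
    "person": [(0.42, True), (0.11, True)],
    "camel": [(0.10, True)],
    "dog": [(0.48, True), (0.45, False)],
    "horse": [(0.70, True), (0.60, False), (0.52, False)],
}

for name, det in (("Detector #1", detector_1), ("Detector #2", detector_2)):
    print(name, "mAP:", mean_ap(det, gt_counts), "global AP:", round(global_ap(det, gt_counts), 3))
```

Under these assumptions both detectors score a perfect mAP. Pooling the detections drops Detector #2 well below Detector #1, because its low-confidence true positives now rank behind false positives from other classes.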