Last Updated: October 2020. This article was originally written by Michał Maj with further contributions from the Appsilon team.
YOLO (“You Only Look Once”) is an effective real-time object recognition algorithm, first described in the seminal 2015 paper by Joseph Redmon et al. In this article we introduce the concept of object detection, the YOLO algorithm itself, and one of the algorithm’s open source implementations: Darknet. To learn more about PP-YOLO (or PaddlePaddle YOLO), which is an improvement on YOLOv4, read our explanation of why PP-YOLO is faster than YOLOv4.
Image classification is one of the many exciting applications of convolutional neural networks. Aside from simple image classification, there are plenty of fascinating problems in computer vision, with object detection being one of the most interesting. Object detection is commonly associated with self-driving cars, where systems blend computer vision, LIDAR and other technologies to generate a multidimensional representation of the road with all its participants. It is also widely used in video surveillance, especially in crowd monitoring to prevent terrorist attacks, to count people for general statistics, or to analyze customer experience by tracking walking paths within shopping centers.
To explore the concept of object detection it is useful to begin with image classification. From there, the tasks build on one another in levels of increasing complexity.
Image classification (1) aims at assigning an image to one of a number of different categories (e.g. car, dog, cat, human, etc.), essentially answering the question “What is in this picture?”. One image has only one category assigned to it.
Object localization (2) then allows us to locate our object in the image, so our question changes to “What is it and where is it?”.
In a real-life scenario, we need to go beyond locating just one object and locate multiple objects in one image. For example, a self-driving car has to find the location of other cars, traffic lights, signs and humans, and take appropriate action based on this information.
Object detection (3) provides the tools for doing just that – finding all the objects in an image and drawing so-called bounding boxes around them. There are also situations where we want to find the exact boundaries of our objects, in a process called instance segmentation, but this is a topic for another post.
Want to get started with Image Classification? Read Getting Started With Image Classification: fastai, ResNet, MobileNet, and More.
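There are a few different algorithms for object detection, and they can be split into two groups: two-stage algorithms, which first select regions of interest in an image and then classify those regions with a convolutional neural network (for example the R-CNN family), and single-stage algorithms such as YOLO, which predict classes and bounding boxes for the whole image in one pass.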
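To understand the YOLO algorithm, it is necessary to establish what is actually being predicted. Ultimately, we aim to predict the class of an object and the bounding box specifying the object's location. Each bounding box can be described using four descriptors: the center of the bounding box (bx, by), its width (bw), its height (bh), and a value c corresponding to the class of the object.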
In addition, we have to predict the pc value, which is the probability that there is an object in the bounding box.
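As a rough illustration of what a single prediction might look like in code, here is a minimal Python sketch. The field names bx, by, bw, bh and c, the `BoxPrediction` class, and the choice of coordinates normalized to the range 0 to 1 are assumptions made for this example, not Darknet's actual data layout.

```python
from dataclasses import dataclass


@dataclass
class BoxPrediction:
    """One predicted box (hypothetical layout, for illustration only)."""
    pc: float  # probability that the box contains an object
    bx: float  # x coordinate of the box center (assumed normalized to 0..1)
    by: float  # y coordinate of the box center (assumed normalized to 0..1)
    bw: float  # box width (assumed normalized to 0..1)
    bh: float  # box height (assumed normalized to 0..1)
    c: int     # index of the predicted class

    def corners(self):
        """Convert the center/size form to (x_min, y_min, x_max, y_max)."""
        half_w, half_h = self.bw / 2, self.bh / 2
        return (self.bx - half_w, self.by - half_h,
                self.bx + half_w, self.by + half_h)


pred = BoxPrediction(pc=0.9, bx=0.5, by=0.5, bw=0.2, bh=0.4, c=2)
print(pred.corners())
```

The corner form is convenient when drawing boxes or measuring how much two boxes overlap.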
As we mentioned above, when working with the YOLO algorithm we are not searching for interesting regions in our image that could potentially contain an object.
Instead, we are splitting our image into cells, typically using a 19×19 grid. Each cell is responsible for predicting 5 bounding boxes (in case there is more than one object in this cell). This gives 19 × 19 × 5 = 1805 bounding boxes for one image.
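The grid arithmetic can be checked with a few lines of Python. This is only a sketch: the helper names `total_boxes` and `cell_for_point` are hypothetical, and the mapping assumes point coordinates normalized to the range 0 to 1, with the cell that contains a point (for instance an object's center) being the one associated with it.

```python
GRID_SIZE = 19       # cells per side, as in the text
BOXES_PER_CELL = 5   # bounding boxes predicted by each cell


def total_boxes(grid_size=GRID_SIZE, boxes_per_cell=BOXES_PER_CELL):
    """Number of bounding boxes predicted for one image."""
    return grid_size * grid_size * boxes_per_cell


def cell_for_point(x, y, grid_size=GRID_SIZE):
    """Map a normalized point (x, y) in 0..1 to its (row, col) grid cell."""
    # min() keeps points lying exactly on the right/bottom edge inside the grid
    col = min(int(x * grid_size), grid_size - 1)
    row = min(int(y * grid_size), grid_size - 1)
    return row, col


print(total_boxes())             # 1805
print(cell_for_point(0.5, 0.5))  # (9, 9)
```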
Most of these cells and bounding boxes will not contain an object. The pc value is therefore used to discard boxes with a low object probability. Among the remaining boxes, those that overlap heavily with a higher-scoring box are then removed in a process called non-max suppression.
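The filtering step can be sketched in plain Python as follows. This is a simplified, class-agnostic version for illustration, not Darknet's implementation; the threshold values (0.6 and 0.5) and the function names are assumptions, and boxes are assumed to be given as (x_min, y_min, x_max, y_max) tuples. Overlap is measured with intersection over union (IoU), the shared area divided by the combined area.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return indices of the boxes that survive both filtering steps."""
    # Step 1: drop boxes whose object probability (pc) is too low
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    while candidates:
        # Step 2: keep the most confident box ...
        best = candidates.pop(0)
        kept.append(best)
        # ... and discard remaining boxes that overlap it too much
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Here the second box is suppressed because it overlaps the first one heavily, and the last box is dropped because its score is below the threshold. In practice this step is usually run separately for each class.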
Interested in Convolutional Neural Networks? Read our Introduction to Convolutional Neural Networks.
There are a few different implementations of the YOLO algorithm on the web. Darknet is one such open source neural network framework (PyTorch implementations, one of them with some extra fast.ai functionality, and a Keras implementation are also available). Darknet is written in C and CUDA, which makes it fast and allows computations to run on a GPU, which is essential for real-time predictions.
Installation is simple and requires running just 3 commands (to use a GPU, set GPU=1 in the Makefile after cloning the repository and before running make). For more details, see the installation instructions on the Darknet website.
git clone https://github.com/pjreddie/darknet
cd darknet
make
After installation, we can use a pre-trained model or build a new one from scratch. For example, here’s how you can detect objects in your image using a model pre-trained on the COCO dataset (the yolov3.weights file has to be downloaded separately first):
./darknet detect cfg/yolov3.cfg yolov3.weights data/my_image.jpg
The algorithm deals well even with object representations.
If you want to see more, go to the Darknet website.
You don’t have to build your Machine Learning model from scratch. In fact, it’s usually better not to. Read our Introduction to Transfer Learning to find out why.