YOLO (You Only Look Once) is a state-of-the-art deep learning algorithm for detecting objects in images and video in real time. The words "real time" set YOLO apart from algorithms like R-CNN, DPM, Fast R-CNN, etc. Humans can identify objects easily, but it is a whole new game when it comes to teaching a machine to do the same.
Traditionally, models like DPM (Deformable Parts Models) or R-CNN have been used to detect objects. DPM runs at about 0.07 FPS, taking roughly 14 seconds to detect the objects in an image; R-CNN runs at about 0.05 FPS, taking roughly 20 seconds per image. Now imagine you are designing a self-driving car application and need a model that can detect objects while the car travels at 60 mph on a freeway. At 60 mph a car covers 88 feet per second, so if you use R-CNN and 20 seconds pass between seeing an image and detecting the cars and people in it, the car will have travelled about 1,760 feet (a third of a mile) by the time the detections arrive. That is far too late, since the gap between cars on a freeway is often only tens of feet, and could cause collisions. That is where the need to detect objects in real time comes in, and where YOLO shines. YOLO can process 45 FPS, detecting the objects in a frame in about 22 ms, which corresponds to the car traveling roughly 2 feet at 60 mph. This is pretty impressive, and it is why YOLO is so appealing for real-time object detection use cases.
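The latency-to-distance arithmetic above is easy to check. A minimal sketch, using the figures quoted in this paragraph (60 mph, ~20 s per image for R-CNN, ~22 ms per frame for YOLO):

```python
# How far a car travels while the detector is still processing one frame.
MPH_TO_FPS = 5280 / 3600  # feet per second in one mile per hour

def distance_traveled_ft(speed_mph: float, latency_s: float) -> float:
    """Feet covered during the detector's per-frame latency."""
    return speed_mph * MPH_TO_FPS * latency_s

print(round(distance_traveled_ft(60, 20.0)))    # R-CNN: ~1760 ft
print(round(distance_traveled_ft(60, 0.022), 1))  # YOLO: ~1.9 ft
```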
The table below compares various object detection models. mAP indicates mean average precision (a measure of accuracy).
The speed primarily comes from the fact that YOLO looks at the image only once, whereas other algorithms examine many candidate regions and end up looking at the image several hundred times. Because YOLO sees the whole image at once, it also captures context well and can identify correlations among the objects in an image. YOLO has also been found to generalize well to artwork.
The original paper on YOLO can be found here.
The YOLO algorithm starts by dividing the image into an S×S grid.
Each cell in the grid is then asked to predict whether it contains an object, along with a confidence score and bounding-rectangle coordinates.
The actual type of object does not matter at this stage. Here is an example of a cell that covers a car.
There are also cells that do not cover any object. They still predict bounding rectangles, but their confidence scores will be very low.
When all cells repeat this process, we get an image with multiple rectangles.
Here each rectangle predicts 5 attributes:
x: x-coordinate of the rectangle's center
y: y-coordinate of the rectangle's center
w: width of the rectangle
h: height of the rectangle
C: confidence score indicating that the cell contains an object
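To make the grid of 5-attribute predictions concrete, here is a minimal sketch in pure Python. The structure and names are hypothetical and simplified (the real YOLO predicts B boxes plus class scores per cell), but it shows what each cell carries:

```python
# Hypothetical, simplified S×S grid: one [x, y, w, h, confidence]
# prediction per cell (real YOLO predicts B boxes + class scores).
S = 7  # grid size used in the original YOLO paper

def empty_grid(s: int):
    """Build an s×s grid where each cell holds [x, y, w, h, C]."""
    return [[[0.0, 0.0, 0.0, 0.0, 0.0] for _ in range(s)] for _ in range(s)]

grid = empty_grid(S)
# A cell covering a car might predict a box with high confidence...
grid[3][4] = [0.5, 0.6, 0.4, 0.3, 0.9]
# ...while background cells keep near-zero confidence.
print(len(grid), len(grid[0]), grid[3][4][4])
```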
The next step is to identify the type of each object (dog, bicycle, car). Here we compute the probability of the object's class given that the cell already contains an object, i.e., P(class | object).
Then we just multiply P(object) by P(class | object) to get the final score for each box. We apply thresholding to remove bounding boxes with low scores, and Non-Maximum Suppression (NMS) to remove overlapping boxes. The final output looks like this.
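The filtering step can be sketched in a few lines of pure Python. The helper names here are hypothetical, and production code would typically call a library routine such as OpenCV's cv2.dnn.NMSBoxes instead, but this shows the score-threshold-then-suppress logic:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_boxes(boxes, p_object, p_class, score_thresh=0.5, iou_thresh=0.4):
    """Keep boxes whose final score P(object) * P(class|object) passes the
    threshold, then greedily suppress lower-scoring overlapping boxes."""
    scored = [(b, po * pc) for b, po, pc in zip(boxes, p_object, p_class)
              if po * pc >= score_thresh]
    scored.sort(key=lambda t: t[1], reverse=True)
    kept = []
    for box, score in scored:
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept
```

For example, two heavily overlapping boxes with final scores 0.81 and 0.72 collapse to the single higher-scoring box, while a box whose score falls below the threshold is dropped outright.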
The entire process is illustrated in the picture below.
The code starts off by loading an existing model config and weights from Darknet. It then captures the video stream using OpenCV's VideoCapture method.
It then processes the video frame by frame. For each frame, it normalizes pixel intensity values by a factor of 255, resizes the image to 416x416, and feeds it to the YOLOv3 model, which predicts bounding boxes and confidence scores. It then applies thresholding and Non-Maximum Suppression to keep only the relevant bounding boxes, and writes the output to another video file. It repeats this process until all frames in the video are processed.
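The per-frame decoding can be sketched without the full OpenCV pipeline. A hypothetical helper, assuming (as with YOLOv3's Darknet-style output) that each detection row holds normalized center coordinates and size followed by an objectness score and per-class scores:

```python
def decode_box(row, frame_w, frame_h):
    """Convert one detection row (cx, cy, w, h, objectness, class scores...)
    with coordinates normalized to [0, 1] into a pixel-space box
    (x, y, w, h), plus the best class id and its final score."""
    cx, cy, w, h, objectness = row[:5]
    class_scores = row[5:]
    class_id = max(range(len(class_scores)), key=lambda i: class_scores[i])
    score = objectness * class_scores[class_id]  # P(object) * P(class|object)
    x = int((cx - w / 2) * frame_w)
    y = int((cy - h / 2) * frame_h)
    return (x, y, int(w * frame_w), int(h * frame_h)), class_id, score

# Example: a detection centered mid-frame covering half a 416x416 image,
# most confident about class 2.
box, cls, score = decode_box([0.5, 0.5, 0.5, 0.5, 0.9, 0.1, 0.2, 0.7],
                             frame_w=416, frame_h=416)
print(box, cls, round(score, 2))
```

In the real pipeline this decoding runs on every output row of every frame, and the surviving boxes are drawn onto the frame before it is written to the output video.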
If you would like a step-by-step walkthrough of the entire process and the code, please sign up for my video course here.
About the Author, Evergreen Technologies:
Active in teaching online courses in computer vision, natural language processing, and SaaS system development
Over 20 years of experience in Fortune 500 companies
LinkedIn: @evergreenllc2020
Over 22,000 students in 145 countries