Concepts in YOLO

Apr 20, 2023

YOLO is built on the principle its name stands for: You Only Look Once. Unlike other object detectors of the time, which relied on the sliding window principle, it looks at the image a single time.

Introduction: Object Detection Algorithm
Input: Image. Output: Bounding box + the class that the bounding box belongs to

So let us start by understanding the new concepts that were introduced while developing the YOLO architecture.

Concept 1: Sliding Window Principle

In this approach we slide boxes of different shapes across the image and, at each position, predict whether the box encloses an object.
The bounding box sizes are fixed in advance and the image is looked at many times, which is expensive. So then came YOLO (You Only Look Once).
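To make the cost of this approach concrete, here is a minimal sketch of how the candidate windows could be enumerated. The image size, box shapes and stride are made-up values for illustration, and `sliding_windows` is a hypothetical helper, not part of any YOLO codebase.

```python
def sliding_windows(image_w, image_h, box_w, box_h, stride):
    """Yield (x_min, y_min, x_max, y_max) for every position of one fixed-size box."""
    for y in range(0, image_h - box_h + 1, stride):
        for x in range(0, image_w - box_w + 1, stride):
            yield (x, y, x + box_w, y + box_h)


# Hypothetical 8x8 image scanned with two fixed box shapes (width, height).
box_shapes = [(4, 4), (8, 4)]
windows = [
    window
    for (box_w, box_h) in box_shapes
    for window in sliding_windows(8, 8, box_w, box_h, stride=2)
]

# Every one of these windows would need its own "is there an object?" prediction.
print(len(windows))
```

Even on this tiny image the classifier would have to run once per window, and the count grows quickly with more box shapes or a smaller stride.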

Concept 2: YOLO
Instead of running multiple bounding boxes across the image, divide the image into a grid of cells and identify the object in each grid cell.

Problem: What happens when your grid cell does not enclose the object completely?

Fig 2: Dividing Image into multiple grid cells

Concept 3: Bounding Box
Instead of reporting the grid cell's own coordinates as the bounding box, predict the box as a midpoint located relative to the grid cell plus a height and a width (x, y, h, w).
This way the encoded box can completely enclose the object, even when the object extends beyond the cell.

Fig 3 : Bounding Box generation based on midpoint, height and width offset
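As a rough illustration of this encoding, the sketch below turns a pixel-space box into a grid cell plus (x, y, h, w). It assumes the convention commonly used with YOLO: the midpoint is expressed as an offset inside the cell that contains it, and height and width are fractions of the image size. The function name `encode_box`, the 448x448 image and the 7x7 grid are assumptions chosen for the example.

```python
def encode_box(box, image_w, image_h, grid_size):
    """Encode a pixel box (x_min, y_min, x_max, y_max) relative to a grid.

    Returns (row, col, (x, y, h, w)), where x and y are the midpoint offsets
    inside the chosen cell and h and w are fractions of the image size.
    """
    x_min, y_min, x_max, y_max = box
    # Midpoint of the box, normalised to the range [0, 1].
    cx = (x_min + x_max) / 2 / image_w
    cy = (y_min + y_max) / 2 / image_h
    # The grid cell that contains the midpoint (clamped to the last cell).
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    # Offset of the midpoint inside that cell, also in [0, 1].
    x = cx * grid_size - col
    y = cy * grid_size - row
    h = (y_max - y_min) / image_h
    w = (x_max - x_min) / image_w
    return row, col, (x, y, h, w)


# Hypothetical box in a 448x448 image divided into a 7x7 grid.
row, col, (x, y, h, w) = encode_box((100, 200, 300, 360), 448, 448, grid_size=7)
print(row, col, round(x, 3), round(y, 3), round(h, 3), round(w, 3))
```

Note that the width here (about 0.45 of the image) is far larger than a single cell (1/7 of the image), which is exactly what lets one cell describe an object that spills into its neighbours.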

So an object, in our case a car, can overlap multiple grid cells; the cell that contains the object's midpoint is the one that predicts it.

Problem: What happens when two objects are part of the same grid cell? What do you do then?

Fig 4: Multiple classes overlapping in the same grid cell

Concept 3: Anchor Boxes
The practical limitation of handling multiple classes within one grid cell in YOLOv1 gave rise to anchor boxes, which were introduced in YOLO9000 (YOLOv2). Essentially, each grid cell has multiple anchor boxes and each anchor box can predict one class.

Fig 6: Final Prediction after incorporating Anchor Boxes
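One common way to decide which anchor box takes which object is to compare shapes: the object goes to the anchor whose width and height overlap it best. The sketch below illustrates that idea only; the two anchors, the object sizes and the helper names `shape_iou` and `best_anchor` are assumptions for the example, not values from any YOLO release.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same centre, so only (width, height) matter."""
    intersection = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - intersection
    return intersection / union


def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape overlaps the given box the most."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_wh, anchors[i]))


# Hypothetical anchors as (width, height) fractions of the image: one tall, one wide.
anchors = [(0.2, 0.6), (0.6, 0.3)]

# Two objects whose midpoints fall in the same grid cell.
person_wh = (0.15, 0.5)   # tall and narrow
car_wh = (0.5, 0.25)      # wide and short

person_anchor = best_anchor(person_wh, anchors)
car_anchor = best_anchor(car_wh, anchors)
print(person_anchor, car_anchor)
```

Because the person and the car land on different anchors, the same grid cell can report both, each with its own box and class.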

Problem: Now what happens when too many boxes are generated for the same object? How do you make sure you keep only the most appropriate box at test time, when there are no ground truths to compare against?

Fig 7: Multiple boxes being predicted for the same object

Concept 4 : Non Maximum Suppression
Non-maximum suppression makes sure we do not output multiple overlapping boxes for the same object as predictions. It is not specific to one release: the original YOLO already used it as a post-processing step.

Each bounding box now comes with the probability with which it was predicted and its class. We sort the bounding boxes by probability in descending order, take the box with the highest probability, calculate its IOU with each of the remaining boxes, and suppress every box whose IOU with it is above a threshold. We then repeat with the next highest remaining box until none are left. Non-max means that you output your maximal-probability predictions but suppress the close-by ones that are non-maximal.
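The procedure can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code of any particular YOLO version: the `iou` and `nms` helpers, the 0.5 threshold and the example boxes are assumptions, and in practice the suppression is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive non-maximum suppression."""
    # Indices sorted by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate boxes on one object and one box on a different object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IOU of roughly 0.82, so it is suppressed, while the third box does not overlap at all and is kept.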

Once we have generated the bounding boxes and predicted a class for each of them, how do we calculate metrics for an object detection algorithm?

Metrics for an object detection algorithm combine image classification metrics with the accuracy of the predicted bounding boxes.
To measure how accurately your model predicted the bounding boxes and their classes, there is a metric called mAP, which stands for mean average precision.

Concept 5 : mAP - Mean Average Precision
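Each predicted bounding box is matched to the ground truth box with which it has the largest IOU value (above the threshold); if several predictions compete for the same ground truth box, the matching is done greedily, in descending order of probability. Matched predictions count as true positives and the rest as false positives, which gives a precision-recall curve and, from its area, the average precision for each class. This can be repeated for different IOU thresholds. The mean of the average precision over all classes is the mean average precision.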
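Here is a simplified sketch of that calculation for a single image and a single IOU threshold. It assumes each prediction is greedily matched to the best still-unmatched ground truth box and that average precision is the area under the interpolated precision-recall curve; real benchmarks differ in the details. The helper names and the example boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(predictions, ground_truths, iou_threshold=0.5):
    """AP for one class. predictions: list of (score, box); ground_truths: list of boxes."""
    if not ground_truths:
        return 0.0
    matched = set()
    hits = []  # 1 for a true positive, 0 for a false positive, in score order
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_overlap, best_index = 0.0, None
        for index, gt_box in enumerate(ground_truths):
            overlap = iou(box, gt_box)
            if index not in matched and overlap >= iou_threshold and overlap > best_overlap:
                best_overlap, best_index = overlap, index
        if best_index is None:
            hits.append(0)
        else:
            matched.add(best_index)
            hits.append(1)
    # Precision and recall after each prediction.
    precisions, recalls, true_positives = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truths))
    # Interpolate: precision at a recall level is the best precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the interpolated precision-recall curve.
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Hypothetical data: two cars (one missed at first by a stray box) and one person.
car_ap = average_precision(
    [(0.9, (12, 12, 52, 52)), (0.8, (200, 200, 240, 240)), (0.7, (100, 100, 140, 140))],
    [(10, 10, 50, 50), (100, 100, 140, 140)],
)
person_ap = average_precision([(0.6, (0, 0, 10, 10))], [(0, 0, 10, 10)])
mean_ap = (car_ap + person_ap) / 2
print(round(car_ap, 3), round(person_ap, 3), round(mean_ap, 3))
```

The stray car prediction with score 0.8 is a false positive, which pulls the car AP below 1, and the mAP is simply the mean of the two per-class values.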

Written by Yamini Lakshmi Narasimhan
