Intel RealSense depth camera D415 project – people counter

A friend passed me an Intel RealSense camera and asked me to explore its power and capabilities. Initially I was thinking of using it for 3D scanning or facial recognition projects, but I think a ‘people counting and social distancing automation robot’ is more practical 🤙. So let’s begin to code and build. 🤖

The 3 pictures below show the Z-axis (distance) measurement capability of the D415. You have to zoom in to see the reading (the distance reading is pointed to by the arrow).

Using the Depth Quality Tool from the Intel SDK, the Depth Stream window gives us the distance and the IR stream gives us the pixel pointed to by the mouse. Both provide an ROI window, which can be controlled from the side menu bar.
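The same kind of ROI reading can be reproduced in Python. Below is a minimal sketch, not the tool's own method: roi_mean_distance is a helper name I made up, it averages the raw depth values inside a rectangle and skips zeros (which the camera uses for "no depth data"), and the default scale of 1 mm per raw unit is an assumption that the camera can override. grab_depth_frame needs pyrealsense2, numpy and a connected camera, so treat it as untested glue code.

```python
def roi_mean_distance(depth_raw, x0, y0, x1, y1, depth_scale=0.001):
    """Mean distance in metres over the ROI [x0, x1) x [y0, y1).

    depth_raw is a 2D grid (rows of raw 16-bit depth values).
    A raw value of 0 means "no depth data" and is skipped.
    depth_scale converts raw units to metres (assumed 1 mm per unit).
    """
    valid = [
        int(depth_raw[y][x])  # int() avoids 16-bit overflow when summing
        for y in range(y0, y1)
        for x in range(x0, x1)
        if depth_raw[y][x] > 0
    ]
    if not valid:
        return None  # nothing measurable inside the ROI
    return sum(valid) / len(valid) * depth_scale


def grab_depth_frame(width=640, height=480, fps=30):
    """Grab one raw depth frame and the camera's depth scale.

    Requires pyrealsense2, numpy and a connected RealSense camera.
    """
    import numpy as np
    import pyrealsense2 as rs

    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_stream(rs.stream.depth, width, height, rs.format.z16, fps)
    profile = pipeline.start(config)
    try:
        scale = profile.get_device().first_depth_sensor().get_depth_scale()
        frames = pipeline.wait_for_frames()
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        return depth, scale
    finally:
        pipeline.stop()
```

With a camera attached, something like depth, scale = grab_depth_frame() followed by roi_mean_distance(depth, 300, 220, 340, 260, scale) should give the average distance of a 40 x 40 pixel window near the centre of the image.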

To interface the D415 with Python OpenCV, certain libraries have to be installed:

  • OpenCV
  • pyrealsense2

Download CMake.

Ok, let’s start to build the system for counting people.

Using Intel D415 depth camera to build a people counter

In this post, we will use the Python OpenCV deep learning module called DNN (Deep Neural Networks). DNN is not an end-to-end deep learning framework: it cannot train a network, because it has no backpropagation, so no learning takes place.

The input is fed into a pre-trained model, which produces the results. This is referred to as inference, and only the forward pass takes place.

The DNN process:

The following Python code uses OpenCV DNN to identify objects in a streamed video.

import cv2
import numpy as np

cam = cv2.VideoCapture(2)
if not cam.isOpened():
	# nothing to process without a camera, so stop here
	raise SystemExit("No video Stream")

cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)

all_rows = open(r'C:\Users\Ls\Documents\ACCS\blog post\synset_words.txt').read().strip().split("\n")
classes = [r[r.find(' ')+1:] for r in all_rows]
net = cv2.dnn.readNetFromCaffe(r'C:\Users\Ls\Documents\ACCS\blog post\bvlc_googlenet.prototxt',r'C:\Users\Ls\Documents\ACCS\blog post\bvlc_googlenet.caffemodel')

while True:
	ret, frame = cam.read()  # capture a frame from the video stream
	if not ret:
		break
	blob = cv2.dnn.blobFromImage(frame, 1, (224, 224))
	net.setInput(blob)  # hand the blob to the network before the forward pass
	outp = net.forward()
	# show the five most probable classes, one per text row
	for r, i in enumerate(np.argsort(outp[0])[::-1][:5]):
		txt = ' "%s" probability "%.3f" ' % (classes[i], outp[0][i]*100)
		cv2.putText(frame, txt, (0, 25 + 40*r), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,0,0), 2)
	cv2.imshow('Frame', frame)
	# loop will be broken when 'q' is pressed on the keyboard
	if cv2.waitKey(10) & 0xFF == ord('q'):
		break

cam.release()
cv2.destroyAllWindows()

Below is a screenshot of the video with the objects identified and their probabilities.

  • cowboy hat 31.337% – not really correct, I’m wearing a cap
  • shower cap 28.293% – cap is correct, but not for the shower
  • electric fan 13.634% – correct
  • strainer 3.507% and solar dish 1.478% – I think the window grill does look like a big strainer 😉 and the grill with the black window glass does look like a solar panel 😀
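A ranking like the one above is just a sort of the network's output vector. Here is a small pure-Python sketch of that step, assuming each line of the labels file starts with an ID followed by a space and then the human-readable name. The function names, IDs, labels and probabilities below are made up for illustration.

```python
def parse_label_line(line):
    """Drop the leading ID: 'n00000001 cat' -> 'cat'."""
    return line[line.find(' ') + 1:]


def top_k(probabilities, labels, k=5):
    """Return the k most probable (label, percent) pairs, best first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i],
                   reverse=True)
    return [(labels[i], probabilities[i] * 100) for i in order[:k]]


# Invented example data, not real model output
lines = ["n00000001 cat", "n00000002 dog, hound", "n00000003 fan"]
labels = [parse_label_line(line) for line in lines]
for name, percent in top_k([0.2, 0.7, 0.1], labels, k=2):
    print('"%s" probability "%.3f"' % (name, percent))
```

Note that the top five are always printed, however low the scores are, which is why a "strainer" at 3.5% still makes the list.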

Now we will use YOLO (You Only Look Once) version 3. Where the GoogLeNet example above loaded a Caffe model, YOLOv3 is loaded from its Darknet configuration and weights files.

You can learn more about YOLO and understand it better on its official website.

The Python code can be downloaded from GitHub.

# Usage:
# python <script name>.py --video=<path to video file>
# python <script name>.py --image=<path to image file>

import argparse
import os.path
import sys

import cv2
import numpy as np

# Initialize the parameters
confThreshold = 0.5  #Confidence threshold
nmsThreshold = 0.4   #Non-maximum suppression threshold

parser = argparse.ArgumentParser(description='Object Detection using YOLO in OPENCV')
parser.add_argument('--image', help='Path to image file.')
parser.add_argument('--video', help='Path to video file.')
args = parser.parse_args()

# Load names of classes from coco

classes = open(r"C:\Users\Linawati\Documents\ACCS\blog post\coco.names").read().strip().split('\n')

net = cv2.dnn.readNetFromDarknet(r"C:\Users\Linawati\Documents\ACCS\blog post\yolov3.cfg", r"C:\Users\Linawati\Documents\ACCS\blog post\yolov3.weights")

# Get the names of the output layers
def getOutputsNames(net):
    # Get the names of all the layers in the network
    layersNames = net.getLayerNames()
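    # Get the names of the output layers, i.e. the layers with unconnected outputs.
    # Older OpenCV releases return an Nx1 array here, newer ones a flat array; flatten() handles both.
    return [layersNames[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]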

# Draw the predicted bounding box
def drawPred(classId, conf, left, top, right, bottom):
    # Draw a bounding box.
    cv2.rectangle(frame, (left, top), (right, bottom), (255, 178, 50), 3)

    label = '%.2f' % conf

    # Get the label for the class name and its confidence
    if classes:
        assert(classId < len(classes))
        label = '%s:%s' % (classes[classId], label)

    #Display the label at the top of the bounding box
    labelSize, baseLine = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
    top = max(top, labelSize[1])
    cv2.rectangle(frame, (left, top - round(1.5*labelSize[1])), (left + round(1.5*labelSize[0]), top + baseLine), (255, 255, 255), cv2.FILLED)
    cv2.putText(frame, label, (left, top), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0,0,0), 1)

# Remove the bounding boxes with low confidence using non-maxima suppression
def postprocess(frame, outp):
    frameHeight = frame.shape[0]
    frameWidth = frame.shape[1]

    # Scan through all the bounding boxes output from the network and keep only the
    # ones with high confidence scores. Assign the box's class label as the class with the highest score.
    classIds = []
    confidences = []
    boxes = []
    for out in outp:
        for detection in out:
            scores = detection[5:]
            classId = np.argmax(scores)
            confidence = scores[classId]
            if confidence > confThreshold:
                center_x = int(detection[0] * frameWidth)
                center_y = int(detection[1] * frameHeight)
                width = int(detection[2] * frameWidth)
                height = int(detection[3] * frameHeight)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                classIds.append(classId)
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])

    # Perform non maximum suppression to eliminate redundant overlapping boxes with
    # lower confidences.
    indices = cv2.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)
    # NMSBoxes returns an Nx1 array in older OpenCV releases and a flat array in newer ones
    for i in np.array(indices).flatten():
        box = boxes[i]
        left = box[0]
        top = box[1]
        width = box[2]
        height = box[3]
        drawPred(classIds[i], confidences[i], left, top, left + width, top + height)

outputFile = "YOLOv3_output.avi"
if (args.image):
    # Open the image file
    if not os.path.isfile(args.image):
        print("Input image file ", args.image, " doesn't exist")
        sys.exit(1)
    cap = cv2.VideoCapture(args.image)
    outputFile = args.image[:-4]+'_YOLOv3_output.jpg'
elif (args.video):
    # Open the video file
    if not os.path.isfile(args.video):
        print("Input video file ", args.video, " doesn't exist")
        sys.exit(1)
    cap = cv2.VideoCapture(args.video)
    outputFile = args.video[:-4]+'_YOLOv3_output.avi'
else:
    # Webcam input
    cap = cv2.VideoCapture(0)

# Get the video writer initialized to save the output video
if (not args.image):
    vid_writer = cv2.VideoWriter(outputFile, cv2.VideoWriter_fourcc('M','J','P','G'), 30, (round(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),round(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))))

while cv2.waitKey(1) < 0:

    # Get a frame from the video
    hasFrame, frame = cap.read()

    # Stop if end of video
    if not hasFrame:
        print("File with YOLOv3 output is here :  ", outputFile)
        cap.release()
        break

    # Create a 4D blob from a frame.
    blob = cv2.dnn.blobFromImage(frame, 1/255, (416,416), [0,0,0], 1, crop=False)

    # Sets the input to the network
    net.setInput(blob)

    # Runs the forward pass to get output of the output layers
    outp = net.forward(getOutputsNames(net))

    # Remove the bounding boxes with low confidence
    postprocess(frame, outp)

    # Write the frame with the detection boxes
    if (args.image):
        cv2.imwrite(outputFile, frame.astype(np.uint8))
    else:
        vid_writer.write(frame.astype(np.uint8))

    cv2.imshow('Image', frame)
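The script above hands the overlapping boxes to cv2.dnn.NMSBoxes and only gets a list of surviving indices back. To show the idea behind non-maximum suppression, here is a pure-Python sketch of the usual greedy approach: keep the highest-scoring box, drop every box that overlaps it too much, and repeat. It is an illustration, not OpenCV's actual implementation. The function names and the example boxes are made up, and the boxes use the same [left, top, width, height] layout as the script.

```python
def iou(a, b):
    """Intersection over union of two [left, top, width, height] boxes."""
    inter_w = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold, iou_threshold):
    """Return the indices of the boxes to keep, best score first."""
    # Only boxes above the score threshold take part, strongest first
    candidates = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap an already kept box too much
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Invented boxes: two near-duplicates, one separate box, one weak box
boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [300, 300, 50, 50], [0, 0, 10, 10]]
scores = [0.9, 0.8, 0.7, 0.3]
print(greedy_nms(boxes, scores, 0.5, 0.4))
```

With the thresholds used in the script (0.5 and 0.4), the near-duplicate second box and the weak fourth box are dropped, which is why each object ends up with a single rectangle on screen.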

I added another branch to the if/elif/else statement (the final else, marked "Webcam input") because I am using a live video stream as the input. The first picture below uses the webcam (with cap = cv2.VideoCapture(0)) and the second uses the Intel D415 camera.

Using the webcam video stream as the input. Yes, I am a person, and the number (not shown clearly) should be 1.00 (100%), the probability.

I changed the argument for VideoCapture to 2 (cap = cv2.VideoCapture(2)), which on my machine selects the USB camera (Intel D415).

So obviously the D415 resolution is much better, and the algorithm identifies 3 objects: person (100% probability), cell phone (91% probability) and chair (57% probability). Pretty impressive.

The hardware: computer and Intel D415 depth camera.

Ok, so the final step is to count the number of persons and tabulate the result somewhere like Excel, a CSV sheet or even the cloud.
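As a rough idea of where this is heading, here is a minimal sketch of the counting and CSV part using only the standard library. It assumes the class IDs passed in are the ones that survived non-maximum suppression in postprocess, and that the class list names people "person" as coco.names does. The function names and the timestamp,count row layout are my own choices, not a finished design.

```python
import csv
import io
from datetime import datetime


def count_people(class_ids, class_names):
    """Count how many detections belong to the 'person' class."""
    return sum(1 for cid in class_ids if class_names[cid] == "person")


def write_count_row(out, timestamp, count):
    """Append one 'timestamp,count' row to an open text file or buffer."""
    csv.writer(out).writerow([timestamp.isoformat(timespec="seconds"), count])


# Invented example: three detections in one frame, two of them people
names = ["person", "bicycle", "chair"]
buffer = io.StringIO()
write_count_row(buffer, datetime(2020, 1, 1, 12, 0, 0), count_people([0, 2, 0], names))
print(buffer.getvalue())
```

For a real log, the buffer would be replaced by a file opened with open(path, "a", newline=""), and Excel can open the resulting CSV directly.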

To be continued... very soon.