Blob Tracking with Compute Shaders in Unity
- TOOLS: C#, HLSL, RealSense SDK 2.0
- RESOURCES:
  - Unity Compute Shaders Documentation
  - Intel RealSense SDK 2.0 wrapper for Unity
  - HLSL
Overview
I've been developing computer vision software solutions for over 10 years now, and recently, my curiosity led me to wonder whether it would be possible to implement blob detection and tracking in a compute shader.
In the past, when working in C#, I've often found libraries like OpenCV to be a bit slow and memory-intensive. Whilst more performant languages like C++ might work more effectively with image-processing libraries, I wanted to prototype a solution that would allow me to implement an effective people-tracking system specifically for use in Unity.
Process
In essence, my main goal was to develop a fast and efficient method for tracking the positions of multiple people or objects in a space, which could then be used by clients interested in creating tracked interactive installations. I wanted the solution to be scalable, so that sensors could be added or removed to adapt the system to spaces of different sizes. In terms of performance, my main aim was to find a solution that hit the sweet spot between speed, accuracy and reliability.
Having experimented with compute shaders in Unity since around late 2022, I've learnt how they can be used to offload the more processing-intensive tasks in an application from the CPU to the GPU. Compute shaders essentially allow tasks to be dispatched to the GPU, which has the benefit of large-scale parallel processing. GPUs are also far better suited than CPUs to processing image-based data, and since a large part of computer vision is rooted in processing images, a compute shader seemed like a logical way of optimising and speeding up people-tracking algorithms.
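To give a concrete sense of what that dispatch looks like on the C# side, here is a minimal sketch that binds a render texture to a compute shader kernel and dispatches one thread per pixel. The shader, kernel and property names are placeholders rather than the ones used in this project.

```csharp
using UnityEngine;

// Minimal sketch of dispatching image-processing work to the GPU from C#.
// "blobShader", its "CSMain" kernel and the "_Source" property are
// hypothetical names, not the project's actual shaders.
public class ComputeDispatchExample : MonoBehaviour
{
    [SerializeField] private ComputeShader blobShader;     // assigned in the Inspector
    [SerializeField] private RenderTexture sourceTexture;  // image to be processed

    private int kernel;

    void Start()
    {
        // Look up the kernel by name and bind the input texture to it.
        kernel = blobShader.FindKernel("CSMain");
        blobShader.SetTexture(kernel, "_Source", sourceTexture);
    }

    void Update()
    {
        // One GPU thread per pixel, grouped to match an assumed
        // [numthreads(8, 8, 1)] declaration inside the compute shader.
        int groupsX = Mathf.CeilToInt(sourceTexture.width / 8f);
        int groupsY = Mathf.CeilToInt(sourceTexture.height / 8f);
        blobShader.Dispatch(kernel, groupsX, groupsY, 1);
    }
}
```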
In terms of the software design, I first made use of the RealSense SDK 2.0 wrapper for Unity to create enough RealSense device objects to match the number of physical cameras I had. In my case, I have two RealSense D455 cameras, which I physically positioned and rotated to look at the area that I wanted to track. I then matched the physical camera positions and rotations on the camera objects in the software, and configured them to stream and visualise point clouds.
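The sketch below shows the general idea of mirroring the physical layout in the scene: one anchor object per camera, posed to match the real-world placement, with the wrapper's point-cloud prefab instantiated beneath it. The prefab is referenced only as a generic GameObject here, and the poses are placeholder values rather than my measured ones.

```csharp
using UnityEngine;

// Sketch of mirroring the physical sensor layout inside Unity. The point-cloud
// prefab comes from the RealSense wrapper and is treated as a generic prefab
// here; the positions and rotations below are placeholder values.
public class SensorRigSetup : MonoBehaviour
{
    [SerializeField] private GameObject pointCloudPrefab; // RealSense wrapper prefab, assigned in the Inspector

    void Start()
    {
        // Placeholder poses for the two physical D455 cameras, expressed in
        // the Unity scene's coordinate space.
        CreateSensorAnchor("D455_A", new Vector3(-1.2f, 2.1f, 0f), Quaternion.Euler(35f, 30f, 0f));
        CreateSensorAnchor("D455_B", new Vector3(1.2f, 2.1f, 0f), Quaternion.Euler(35f, -30f, 0f));
    }

    void CreateSensorAnchor(string sensorName, Vector3 position, Quaternion rotation)
    {
        // Parent anchor whose transform matches the physical camera placement...
        var anchor = new GameObject(sensorName);
        anchor.transform.SetPositionAndRotation(position, rotation);

        // ...with the streaming/visualisation prefab parented to it, so its
        // point cloud lands in the right place in the shared tracking space.
        Instantiate(pointCloudPrefab, anchor.transform);
    }
}
```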
Following a bit of calibration, in which I adjusted the physical and virtual positions of the cameras so that the respective point clouds overlapped and aligned with one another, I then created a third virtual camera, set to use orthographic projection, which was positioned and rotated to observe the portion of the point cloud that I wanted to track. By adjusting this virtual camera's orthographic size and its near and far planes, I was able to define the region of interest that I wanted to track within.
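A minimal sketch of that observer camera is shown below, using standard Unity camera properties; the size and clip distances are placeholders that would come out of the calibration step, and the render texture target anticipates the next step.

```csharp
using UnityEngine;

// Sketch of the third, orthographic "observer" camera whose view volume
// defines the tracked region of interest. All numbers are placeholders.
public class RegionOfInterestCamera : MonoBehaviour
{
    [SerializeField] private Camera roiCamera;
    [SerializeField] private RenderTexture trackingTexture; // consumed by the blob tracker

    void Start()
    {
        roiCamera.orthographic = true;

        // Half-height of the view volume in world units; combined with the
        // aspect ratio, this sets the footprint of the tracked area.
        roiCamera.orthographicSize = 1.0f;

        // The near and far planes clip the point cloud to a slab, so only the
        // depth range of interest contributes to the tracking image.
        roiCamera.nearClipPlane = 0.1f;
        roiCamera.farClipPlane = 2.5f;

        // Render to a texture rather than the screen.
        roiCamera.targetTexture = trackingTexture;
    }
}
```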
By sending the output of the virtual camera to a render texture, I was then able to use a combination of C# classes and HLSL shaders to blob track the resulting image. In the example videos, I'm also doing some processing to handle the smoothing of positions and to maintain the persistence of blob instances where data may not be so consistent, for example, when an object or person appears at the edges of an image or where they move through the overlaps between point clouds.
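The smoothing and persistence pass lives on the C# side; the sketch below shows one way it could work, assuming the GPU step has already produced a list of raw blob centroids for the current frame. The matching radius, smoothing factor and time-out are illustrative values rather than the ones used in the demos.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Sketch of CPU-side smoothing and persistence for blob positions.
// Assumes raw centroids have already been read back from the GPU each frame.
public class BlobTracker
{
    private class TrackedBlob
    {
        public Vector2 Position;
        public float SecondsSinceSeen;
    }

    private readonly List<TrackedBlob> blobs = new List<TrackedBlob>();

    private const float MatchRadius = 0.25f; // max distance to link a detection to an existing blob
    private const float Smoothing = 0.35f;   // 0 = frozen, 1 = no smoothing
    private const float MaxLostTime = 0.5f;  // keep a blob alive briefly when detections drop out

    public void Step(List<Vector2> detections, float deltaTime)
    {
        foreach (var blob in blobs)
            blob.SecondsSinceSeen += deltaTime;

        foreach (var detection in detections)
        {
            // Greedy nearest-neighbour match against existing blobs.
            TrackedBlob nearest = null;
            float nearestDistance = MatchRadius;
            foreach (var blob in blobs)
            {
                float distance = Vector2.Distance(blob.Position, detection);
                if (distance < nearestDistance)
                {
                    nearest = blob;
                    nearestDistance = distance;
                }
            }

            if (nearest != null)
            {
                // Exponential smoothing keeps positions steady near the edges
                // of the image and the seams between point clouds.
                nearest.Position = Vector2.Lerp(nearest.Position, detection, Smoothing);
                nearest.SecondsSinceSeen = 0f;
            }
            else
            {
                blobs.Add(new TrackedBlob { Position = detection });
            }
        }

        // Persistence: only drop blobs that have gone unseen for a while.
        blobs.RemoveAll(b => b.SecondsSinceSeen > MaxLostTime);
    }

    public IEnumerable<Vector2> Positions()
    {
        foreach (var blob in blobs)
            yield return blob.Position;
    }
}
```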
The results of the prototype tracking system can be seen in the videos at the top of this page. Of the three examples, the first video demonstrates a very rough concept for how the blob tracking could be used to track people in a space, alongside projected visuals which react to their movement. In the second, you can see a demo of a person being tracked whilst walking around a volume of approximately 3 x 2 x 2 m, observed by two depth cameras. And finally, in the third video, you can see two hands being tracked whilst moving around a smaller-scale setup, in which the tracked volume is roughly 40 x 30 x 30 cm and a single depth camera is used. Across all of the demos, the framerate fluctuates between 45 and 70 fps, the blobs stay mostly on target, and the tracking seems relatively robust, which I'm really pleased with.
Considerations
- In its current form, the prototype is still a bit experimental and would benefit from some further refinement and testing in a larger (and more obstacle-free!) space in which I can get all of the settings finely tuned.
- I'm also interested to see how it performs on different system specifications. In my past experimentation with compute shaders, I've found that they run surprisingly well on machines that don't have a particularly high specification, so I'd be interested to work out the operating tolerances for this particular implementation.
- Whilst it may be an obvious point, when using multiple cameras, calibration is key. In the demos that used more than one camera, I had to make sure that both cameras were able to see at least one object in the scene that I could use to then adjust their position and rotation to ensure a good join between point clouds.
- One of the features that I have considered implementing in future is some form of auto-rotation setting which makes use of the built-in Inertial Measurement Unit (IMU) that many of the RealSense cameras have. I'm wondering whether the IMU could be used to get an initial rotation for each depth camera, which could then be fine-tuned, just to take some of the initial work out of the calibration process (see the sketch after this list).
- Since the virtual camera's frustum defines the region of interest for tracking, it's also worth noting that adjustments to its position, rotation and view volume would allow for tracking in a wide range of use cases. For instance, cameras could be arranged to track movement or gestures on vertical surfaces as well as horizontal ones, and could even be used to create hot spots within a room that trigger content when passed through.
- Taking the idea of creating interactive hot spots further, it would be feasible to create an interactive experience in which multiple people have to collaborate by standing or moving through specific areas in space at the same time to trigger different visuals or outcomes.
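To sketch the IMU idea mentioned above: when a depth camera is stationary, its accelerometer effectively measures gravity, which is enough to recover an initial pitch and roll (though not yaw). The helper below is hypothetical; it leaves out how the accelerometer sample is read from the RealSense SDK and glosses over the axis-convention conversion between the camera and Unity.

```csharp
using UnityEngine;

// Hypothetical helper for the auto-rotation idea: estimate a depth camera's
// initial tilt from a single accelerometer (gravity) reading. Reading the
// sample from the RealSense SDK, and converting it from the camera's axis
// convention into Unity's, are left out of this sketch.
public static class ImuOrientationEstimate
{
    public static Quaternion EstimateTilt(Vector3 accelInCameraSpace)
    {
        // When the camera is still, the accelerometer measures gravity, so the
        // normalised reading tells us which way "down" points in camera space.
        Vector3 downInCameraSpace = accelInCameraSpace.normalized;

        // Rotation that maps the camera's measured "down" onto world down,
        // giving a starting pitch/roll for calibration. Yaw cannot be
        // recovered from gravity alone and still needs manual alignment.
        return Quaternion.FromToRotation(downInCameraSpace, Vector3.down);
    }
}

// Usage (placeholder sample): sensorAnchor.rotation = ImuOrientationEstimate.EstimateTilt(accelSample);
```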