Just wanted to post a quick update on a subsystem for Bubbles the robot dog. I have been working on and off on getting a monocular depth model working (and running fast enough) within Docker for real-time depth estimation.
High-level, RGBD is a normal RGB color image plus a depth channel; useful for generating a point cloud, semantic understanding of 3D space, etc., to help a robot navigate within its environment. Depth is most conventionally computed from a stereoscopic view using the overlapping pixels between two cameras. This requires two cameras and limits the depth information to the overlap between the two images. But wouldn’t it be great if there were a way to get depth from a single, cheap, off-the-shelf camera? Enter monocular depth, a technique to infer the depth channel using a single camera (and ideally a single image frame from that camera). This would let us plop a little camera onto a robot and still get depth.
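As a concrete picture of what the depth channel buys us, here is a minimal sketch of back-projecting an RGB image plus a metric depth map into a colored point cloud using a pinhole camera model. The intrinsics (fx, fy, cx, cy) are placeholders you would pull from your own camera calibration.

```python
import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project a per-pixel depth map (in meters) into an Nx6 array of
    XYZ + RGB points using a simple pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx   # pixel column -> camera X
    y = (v - cy) * z / fy   # pixel row    -> camera Y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0          # drop pixels with no depth
    return np.hstack([points[valid], colors[valid]])
```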
But what about lidar? Yes, lidar could give a dense point cloud for navigation. In fact, 2D lidar may be a more straightforward approach to generating a 2D navigable map. However, we would lose the other interesting part of images: color (and, even more interesting, a colored point cloud). Monocular depth may give us depth from cheap camera hardware and still provide useful color information, if we can generate the estimate.
I have hand-waved the depth estimation so far, but it's time to operationalize it. Depth estimation is going to be a compute-heavy process, taking features in the RGB image and producing a depth value per pixel. So while our camera hardware may be cheap, the GPU needed to crunch the numbers won’t be. Further, the depth estimation function is not going to be a straightforward calculation; we will need to train a model or use an existing one. For our purposes we will definitely be using a model.
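To make "using a model" concrete, here is a minimal sketch of running an off-the-shelf pretrained monocular depth model on a single frame. MiDaS via torch.hub is used purely as an illustration (it's one of the easiest to pull down); the file name is a placeholder, and other models would slot in similarly.

```python
import cv2
import torch

# Load a small pretrained monocular depth model via torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

# Read one frame and run a single forward pass.
img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    batch = transform(img)                 # resize + normalize for the network
    prediction = midas(batch)              # relative (inverse) depth map
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()                            # back to the input resolution
```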
We will need our model to meet a few requirements.
- Low latency
  - We need the full chain, from image capture on hardware, through streaming the image into the model, to the depth estimate output, to be real-time. Our robot’s movements are slow, so 1 Hz or better is probably fine.
- Reasonably accurate
  - Our robot is going to use the output depth estimate to produce a 2D local map for collision avoidance. The intra-frame metric distance space must be consistent for us to find a path to navigate through. The model’s output doesn’t have to be in real meters; we can calibrate a mapping between its internal units and real-world units (see the sketch after this list).
- Stable
  - If we are going to use the depth estimate for SLAM, we need inter-frame continuity of the metric space used in the estimate. If our model tells us we are one meter away from a wall, and after we move half a meter closer the wall is suddenly four meters away, then our depth prediction is meaningless to us.
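On the accuracy requirement: one simple way to calibrate model units to meters is a least-squares scale-and-shift fit against a few measured distances (tape measure to a wall at a few ranges). A minimal sketch with made-up numbers, assuming the model's output is roughly affine in true depth (many models actually predict inverse depth, in which case you'd do the fit in that space instead):

```python
import numpy as np

def fit_scale_shift(pred, meters):
    """Least-squares fit of meters ~ a * pred + b, where `pred` holds the
    model's depth values at pixels whose real-world distance we measured."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, meters, rcond=None)
    return a, b

# Made-up example: the model said 0.8, 1.6, 2.5 (internal units) at 1 m, 2 m, 3 m.
a, b = fit_scale_shift(np.array([0.8, 1.6, 2.5]), np.array([1.0, 2.0, 3.0]))
new_estimate_m = a * 1.2 + b   # convert a fresh prediction to approximate meters
```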
Enter PatchFusion. This model claims to be stable and accurate but has some latency concerns. I wanted to do a quick test to prove it out. Below are my results (durations averaged over three inference passes, except for the CPU run, since it took so long).
| Device | Patch Mode | Batch Size | Compute Time (s) |
|--------|------------|------------|------------------|
| CPU    | r1         | 1          | 446              |
| GPU    | r1         | 1          | 12               |
| GPU    | r1         | 2          | 7                |
| GPU    | r1         | 4          | 5                |
| GPU    | r1         | 5          | 6                |
| GPU    | r2         | 4          | 5                |
| GPU    | m1         | 4          | 3                |
| GPU    | m2         | 4          | 6                |
Hardware is an AMD Ryzen 5 5600X 6-Core @ 3.70 GHz with 32 GB RAM and an NVIDIA RTX 3070 GPU.
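For context on how numbers like the table above get measured, here is a rough sketch of a timing loop; `model` and `batch` stand in for whatever you loaded, and this is not PatchFusion's actual benchmarking script.

```python
import time
import torch

def time_inference(model, batch, passes=3):
    """Average wall-clock seconds over a few forward passes, synchronizing the
    GPU so we measure the full compute rather than just the kernel launch."""
    durations = []
    for _ in range(passes):
        start = time.perf_counter()
        with torch.no_grad():
            _ = model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```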
My not-very-scientific test is telling me this won’t work. The best I was able to get was 3 seconds using m1 mode, which produces an output with distinct artifacts that isn’t useful.

Tiling artifacts are present. The resulting depth estimate will not be useful for real-time navigation even if we could run inference faster than 3 seconds.
Versus the slower m2 mode:

Much better inference, but it costs about 5 seconds.
Or r1 mode:

r1 is a variant of the m2 mode with a single patch based on the m2 offset rules. Again, about 5 seconds.
For reference, our input image was the group photo of my robots taken years ago.

Group of robots (with Bubbles on the right).
PatchFusion is a cool model, but given my hardware config and real-time inference requirement, it just doesn’t cut it. It’s not even worth testing the other two requirements (accuracy and stability).
On to the next model.