The thing about PatchFusion – Monocular Depth Estimation


Just wanted to post a quick update on a subsystem for Bubbles the robot dog. I have been working on and off on getting a monocular depth model running (and running fast enough) inside Docker for real-time depth estimation.

At a high level, RGBD is a normal RGB color image plus a depth channel; it is useful for generating a point cloud, building a semantic understanding of 3D space, and generally helping a robot navigate its environment. Depth is most conventionally computed from a stereoscopic pair of overlapping images. That requires two cameras and limits the depth information to the overlap between the two views. But wouldn’t it be great if there were a way to get depth from a single, cheap, off-the-shelf camera? Enter monocular depth estimation, a technique to infer the depth channel from a single camera (and ideally a single image frame). This would let us plop a little camera onto a robot and still get depth.
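As a quick aside on what we do with RGBD once we have it: below is a minimal back-projection sketch, assuming a simple pinhole camera model with hypothetical intrinsics fx, fy, cx, cy. This is standard geometry, not anything specific to PatchFusion.

```python
import numpy as np

def depth_to_colored_points(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth map (H x W, meters) and its RGB image into
    an N x 6 array of (x, y, z, r, g, b) points using a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid coordinates
    z = depth
    x = (u - cx) * z / fx                           # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0                        # drop pixels with no depth
    return np.hstack([points[valid], colors[valid]])
```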

But what about lidar? Yes, lidar could give a dense point cloud for navigation. In fact, 2D lidar may be a more straightforward way to generate a 2D navigable map. However, we lose the other interesting part of images: color (and, even more interesting, a colored point cloud). Monocular depth can give us depth from cheap camera hardware plus useful color information, if we can generate the estimate.

So far I have hand-waved the depth estimation; time to operationalize it. Depth estimation is a compute-heavy process that turns features in the RGB image into a depth value per pixel. So while our camera hardware may be cheap, the GPU needed to crunch the numbers won’t be. Further, the depth estimation function is not a straightforward calculation: we will need to train a model or use an existing one. For our purposes we will definitely be using an existing model.

We will need our model to meet a few requirements.

  1. Low latency
    • We need the full chain, from image capture on hardware to streaming the image into the model to the depth estimate output, to run in real time. Our robot’s movements are slow, so 1 Hz or better is probably fine.
  2. Reasonably accurate
    • Our robot is going to use the output depth estimate to produce a 2D local map for collision avoidance. Intra-frame metric distances must be consistent for us to find a navigable path. The model’s output doesn’t have to be in real meters; we can calibrate a mapping between its internal units and real-world units (see the sketch after this list).
  3. Stable
    • If we are going to use the depth estimate for SLAM, we need inter-frame continuity of the metric space used in the estimate. If our model tells us we are one meter away from a wall, and after we move half a meter closer the wall is suddenly four meters away, then our depth prediction is meaningless.
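As a concrete example of requirement 2, here is a minimal calibration sketch. The readings and distances below are made up for illustration; the idea is just to fit a linear scale and offset from the model’s internal units to meters and apply it to every frame.

```python
import numpy as np

# Hypothetical calibration data: model depth readings at targets whose
# true distances were measured with a tape measure (values are made up).
model_units = np.array([0.8, 1.6, 2.5, 3.3])
true_meters = np.array([0.5, 1.0, 1.5, 2.0])

# Fit a linear scale/offset mapping model units to meters.
scale, offset = np.polyfit(model_units, true_meters, deg=1)

def to_meters(depth_map):
    """Convert a raw model depth map into approximate meters."""
    return scale * depth_map + offset
```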

Enter PatchFusion. This model claims to be stable and accurate, but I had concerns about its latency. I wanted to do a quick test to prove it out.

Latency Test

Below are my results (each duration is the average of three inference passes, except the CPU run, which was too slow to repeat).

Device   Patch Size   Batch Size   Compute Time (s)
CPU      r1           1            446
GPU      r1           1            12
GPU      r1           2            7
GPU      r1           4            5
GPU      r1           5            6
GPU      r2           4            5
GPU      m1           4            3
GPU      m2           4            6

Hardware is an AMD Ryzen 5 5600X 6-core @ 3.70 GHz with 32 GB RAM and an NVIDIA RTX 3070 GPU.
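The timing itself was nothing fancy; a sketch of the loop is below. Here run_fn stands in for a hypothetical wrapper around whatever inference entry point you use; it is not PatchFusion’s actual API.

```python
import time

def average_inference_time(run_fn, image, passes=3):
    """Average wall-clock duration of run_fn(image) over `passes` runs."""
    durations = []
    for _ in range(passes):
        start = time.perf_counter()
        run_fn(image)  # e.g. a wrapper around the PatchFusion inference script
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```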

My not-very-scientific test is telling me this won’t work. The best I was able to get was 3 seconds using m1 mode, which produces output with distinct artifacts that isn’t useful.

A tiling artifact is present. The resulting depth estimate will not be useful for real-time navigation even if we could run inference faster than 3 seconds.

Versus the slower m2 mode

Much better inference but costs 5 seconds.

Or r1 mode

r1 is a variant of m2 mode with a single patch based on m2’s offset rules. Again, about 5 seconds.

For reference, the input image was the group photo of my robots taken years ago.

Group of robots (with Bubbles on the right).

Accuracy Test

To benchmark accuracy, I took an image on the ESP32-Camera with an object at a known distance. Obviously a single sample at a single distance is not a robust measure, but it hints at what to expect. A better way to do this would be to use a calibrated stereoscopic camera to provide ground truth for a full depth map.
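The check itself is just comparing the (calibrated) prediction inside a hand-picked region covering the object against the tape measure, roughly like the sketch below. The function and ROI are illustrative, not from any library.

```python
import numpy as np

def single_point_error(depth_meters, roi, known_distance_m):
    """Compare the calibrated depth inside a hand-picked region of interest
    against a tape-measured distance. roi = (row0, row1, col0, col1)."""
    r0, r1, c0, c1 = roi
    predicted = np.median(depth_meters[r0:r1, c0:c1])  # median resists edge pixels
    return predicted - known_distance_m
```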

Stability Test

Lastly, to measure stability, we can use the previous known-distance image and a second image of the same scene. The estimates from these two frames should be similar. We can select pixel patches and compute the average discrepancy; we would expect a stable model to have low discrepancy at all patches. Again, this isn’t a statistically valid test (we should really have many image pairs under different conditions), but that’s simply too much work for a side project.
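A sketch of that patch comparison, assuming both depth maps are already in the same units and the scene hasn’t moved between frames (the helper below is illustrative, not part of PatchFusion):

```python
import numpy as np

def patch_discrepancy(depth_a, depth_b, corners, size=16):
    """Mean absolute difference in average depth between two frames over
    hand-picked size x size patches; corners is a list of (row, col)."""
    diffs = []
    for r, c in corners:
        a = depth_a[r:r + size, c:c + size]
        b = depth_b[r:r + size, c:c + size]
        diffs.append(abs(a.mean() - b.mean()))
    return float(np.mean(diffs))
```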

Conclusion

PatchFusion is a cool model, but given my hardware config and real-time inference requirements it just doesn’t cut it.

On to the next model.

