Measuring distances is a frequent challenge when developing a Computer Vision based system or application. You may want to measure the size of a detected object, the distance between some elements of the scene, or even how fast a tracked object is moving.
A simple and common approach to this problem is to select a reference object whose size is known in advance, and then calculate any distance in the scene from its relation to this object. This is an easy and fast solution, and you can follow this PyImageSearch tutorial to learn how to do it with OpenCV. But that ease comes with a major drawback: the method only works in constrained scenarios where the objects to be measured and their reference lie at the same viewing angle, plane, and distance from the camera. If the reference is far away from the object you want to measure, perspective distortion will produce a wrong measurement. It's intuitive: you just can't measure the Eiffel Tower with your finger.
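The reference-object idea boils down to a "pixels-per-metric" ratio. The sketch below illustrates the arithmetic only; the pixel widths and the 2.5 cm coin are made-up example values, and in practice the pixel widths would come from a detector such as OpenCV contours:

```python
# "Pixels-per-metric" sketch: measure an unknown object from a reference
# of known size in the same image. All widths here are illustrative.

def pixels_per_metric(ref_pixel_width: float, ref_real_width: float) -> float:
    """How many pixels correspond to one real-world unit of length."""
    return ref_pixel_width / ref_real_width

def real_size(object_pixel_width: float, ppm: float) -> float:
    """Convert a pixel measurement into real-world units."""
    return object_pixel_width / ppm

# A coin with a known 2.5 cm diameter spans 150 px in the image:
ppm = pixels_per_metric(150.0, 2.5)   # 60 px per cm
# An object next to it spans 420 px:
print(real_size(420.0, ppm))          # 7.0 cm
```

This only holds while both objects sit in the same plane at the same distance from the camera, which is exactly the constraint discussed above.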
In situations where multiple measurements have to be made in a fixed scene, 3D reconstruction can be used. By scanning your environment from multiple angles and positions and using the proper software, a 3D model can be created with enough detail to measure distances between any points in the scene with great accuracy.
3D reconstruction is the process of capturing real shape and dimensions, in this case from a set of 2D images taken with an ordinary RGB phone camera. Computer Vision algorithms are able to construct a plausible 3D geometry that explains the images by running a pipeline consisting of two major steps: Structure from Motion, which recovers the camera poses and a sparse point cloud, and Multi-View Stereo, which densifies that cloud into a textured mesh.
This article explains the process of building a 3D model from 2D images without any prior knowledge of image processing. We test available tools for 3D reconstruction to understand their current quality and to check object-measurement accuracy. To learn more about the theory behind the pipeline and how each step works, you can start by reading the AliceVision website, which gives a good summary and links to the literature on each step. Image processing is quite complex to understand but, luckily, proper software has been developed and published, and the entire process of 3D reconstruction no longer requires complex code, just a few button clicks.
There are open-source options for 3D scene reconstruction, like Meshroom and COLMAP, as well as paid ones like PhotoModeler and RealityCapture. Meshroom was chosen for this post because it is open-source, has an intuitive interface, is fast enough, and requires no tinkering to get a good initial result (one without real-size measures).
The reconstruction of a 3D scene with real measures requires a few extra steps, since the reconstructed scene itself is dimensionless; real size is set by scaling the workspace. The way to get absolutely accurate scaling is to place target markers with known distances between them in the scene photos, then scale the 3D model until it matches those distances. The scaling process can be done in Meshroom or using external software like Blender.
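The scaling itself is a single uniform factor: known real distance divided by the same distance measured in model units, applied to every vertex. A minimal sketch (the marker coordinates below are hypothetical model-space values, not taken from any real reconstruction):

```python
import numpy as np

# Hypothetical positions of two markers as recovered in the dimensionless
# model, and the known real-world distance between them (metres).
marker_a = np.array([0.12, 0.05, 1.30])
marker_b = np.array([0.12, 0.05, 2.10])
known_distance_m = 1.0  # the markers were physically placed 1 m apart

measured = np.linalg.norm(marker_b - marker_a)  # distance in model units
scale = known_distance_m / measured             # uniform scale factor

# Multiplying every vertex by the factor brings the model to real size:
vertices = np.array([[0.0, 0.0, 0.0], [0.12, 0.05, 1.30]])
vertices_scaled = vertices * scale
```

Meshroom's "SfMTransform" node does the equivalent of this (plus a rigid alignment) when given the markers' real positions.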
Before you start taking pictures from your scene, there are some considerations to take into account:
For the camera:
For the scene:
CCTAG stands for Concentric Circles Target. As the name suggests, it consists of a target made of concentric circles with varying radii. The circles form an image with three black rings whose size pattern determines the marker's ID.
AliceVision provides a repo with code to generate your own markers, along with some premade samples (more information on how to generate the markers can be found here). This matters because you want markers big enough to be correctly captured by the camera without occupying too large a portion of the surface. The markers should be printed and placed on a flat surface in the scene.
As for the actual placing of the markers, they should be arranged so that the scene is contained within them. Finally, it is crucial to know the correct ID of each marker placed in the scene and its position (the marker-generation script provides the option of printing the ID and a crosshair at the center of the marker). In this case, I placed them in a square with a side length of 1 metre.
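Keeping track of marker IDs and their positions can be as simple as a lookup table. The IDs and coordinate frame below are hypothetical, illustrating a 1 m square layout on the floor with the origin at one corner:

```python
from math import dist, sqrt

# Hypothetical CCTag IDs mapped to their (x, y, z) positions in metres,
# for four markers laid out as a 1 m square on the floor (z = 0).
marker_positions = {
    0: (0.0, 0.0, 0.0),
    1: (1.0, 0.0, 0.0),
    2: (1.0, 1.0, 0.0),
    3: (0.0, 1.0, 0.0),
}

# Sanity check: the diagonal of a 1 m square is sqrt(2) ≈ 1.414 m.
assert abs(dist(marker_positions[0], marker_positions[2]) - sqrt(2)) < 1e-9
```

These are the positions you will later type into the "SfMTransform" node, so double-check the IDs against the printed crosshair labels.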
Once the camera and scene setup is complete, it's time to capture our images! For this process it is recommended to walk around the scene taking photos from different angles to get as many details as possible. There isn't really a recommended number of photos because it depends on the scene, but you probably won't get a very good outcome with fewer than 40 photos.
After the photos are taken, it is important to move the files to the computer without compromising their metadata. Meshroom uses the picture’s metadata to get important information about the camera (like focal length) that is crucial for a correct output.
When launching the program you will meet the following interface:
On the left side of the interface is the panel where you drop the photos of your scene, and at the bottom there's a sequence of rectangles connected by white lines that represents the 3D reconstruction pipeline. After importing your images, it will look like this:
Notice the green circle in the top-left corner of every thumbnail? It means the images' metadata was enough to get all the information needed about the camera (an iPhone XS in this case) to make the reconstruction. Depending on the camera and its metadata, the circle's color can also be yellow (probably still fine) or red (not fine: there will be errors, and in that case you should try another camera or use AnalogExif to edit the photos' metadata).
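Before importing, you can check whether your photos actually carry the focal-length metadata Meshroom relies on. A quick sketch, assuming Pillow is installed (file paths are placeholders):

```python
# Check a photo's EXIF for the focal length Meshroom needs.
# Assumes the Pillow library is available.
from PIL import Image
from PIL.ExifTags import TAGS

def focal_length(path):
    """Return the EXIF focal length in mm, or None if it is missing."""
    exif = Image.open(path).getexif()
    for tag_id, value in exif.items():
        if TAGS.get(tag_id) == "FocalLength":
            return float(value)
    return None  # missing metadata: expect a yellow/red circle in Meshroom

# Usage sketch:
# print(focal_length("IMG_0001.jpg"))
```

If this returns `None` for your photos, the transfer step probably stripped the metadata (messaging apps and some cloud uploads do this), and you should copy the originals directly from the phone instead.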
After adding the photos and saving your project, you should be able to run the 3D reconstruction pipeline as-is by clicking the green "Start" button and get a good 3D model, though one without real-life measures. I recommend doing this and making sure it works before changing the pipeline. You can use the official sample dataset if errors come up and you need to know whether the problem is in the images or somewhere else.
The default image processing pipeline output may not have real-life size but it certainly has the same distance relations as the real scene. That means all we need to do is apply a scale transformation to the model to match a set of known distances. This is done by detecting the CCTAGs in the images, adding the “SfMTransform” node to the pipeline and manually setting the position of each marker in this node.
To detect the CCTAG3 markers, we need to modify the "Describer Types" field of the "FeatureExtraction", "FeatureMatching" and "StructureFromMotion" nodes by selecting "cctag3" along with the default "sift", like this:
Adding the "akaze" describer along with "sift" and "cctag3" may also improve the 3D reconstruction.
Now that CCTAG marker detection is part of the pipeline, we need to add a node that uses that information to apply a transformation taking our model to real-life size. This is done by adding the node called "SfMTransform" (to add a node, right-click on an empty space and type its name) in this manner:
Now it’s only a matter of modifying the “SfMTransform” node to look like this:
Now everything needed is set; just hit the green "Start" button and the process will begin.
After all the nodes of the pipeline are finished, you can explore the model with textures in Meshroom by double-clicking the “Texturing” node:
If the result at this stage is not good enough you can increase the “Describer Quality” and “Describer Density” in the “FeatureExtraction” node. If it’s still not good enough, adding more/better photos may help.
In this particular case, it can be seen that the legs of the chair could not be reconstructed. This is due to the patternless nature of the material's paint coating, along with the legs' intricate geometry. The same lack of surface features is why the floor, back, and seat of the chair could be decently reconstructed but not the white wall around them.
Now for the last part, we need to verify that the measurements are correct. To do this, we open our OBJ model (found in the project's save directory inside the "Texturing" folder) in Blender and use the measuring tool:
According to Blender’s measurements, the chair’s back has a height of 26.5 cm and the seat has a width of 43.9 cm.
The real measurements are 27 cm and 45 cm respectively. That's pretty accurate!
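If you prefer scripting over Blender's GUI, the same check can be done by parsing the OBJ file's vertex lines and measuring between two chosen vertices. A minimal sketch; the filename and vertex indices in the usage comment are hypothetical, and in practice you would pick the indices by inspecting the model:

```python
from math import dist

def load_obj_vertices(path):
    """Collect the 'v x y z' lines of a Wavefront OBJ file."""
    vertices = []
    with open(path) as f:
        for line in f:
            if line.startswith("v "):
                _, x, y, z = line.split()[:4]
                vertices.append((float(x), float(y), float(z)))
    return vertices

# Usage sketch: distance between two vertices picked on the chair's back.
# vertices = load_obj_vertices("texturedMesh.obj")
# print(dist(vertices[120], vertices[458]))  # hypothetical indices, metres
```

Since the "SfMTransform" step already scaled the model, the distances come out directly in metres.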
Based on the SfM results, we can perform camera localization and retrieve the motion of an animated camera in the reconstructed scene using the "CameraLocalization" node. This is very useful for doing texture reprojection in other software as part of a texture clean-up pipeline. It could also be used to leverage Meshroom as a 3D camera tracker.
What if I want to find the position of an object that is not part of the original scene? With a known position and orientation of a camera in the 3D scene along with its intrinsic parameters, a 3D world coordinate can be associated with every pixel from an image taken by that camera. By setting some constraints like object size (small compared to the scene) and position (object is resting on a surface instead of floating in the middle of the scene), modern 2D object detection pipelines can be leveraged to achieve a 3D real-world scale object detection system. By doing this, applications like detecting cars and measuring distance or even speed between them could be developed.
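Under the pinhole camera model, the "object resting on a surface" constraint makes this concrete: a pixel defines a viewing ray, and intersecting that ray with the ground plane gives a unique 3D point. A sketch of the geometry, assuming a standard world-to-camera convention (x_cam = R·x_world + t); the camera numbers in the test-style usage are invented for illustration:

```python
import numpy as np

def pixel_to_ground(u, v, K, R, t, plane_z=0.0):
    """
    Back-project pixel (u, v) onto the horizontal world plane z = plane_z.
    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    """
    # Viewing-ray direction in camera coordinates, rotated into the world
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_world = R.T @ d_cam
    origin = -R.T @ t                        # camera center in world frame
    s = (plane_z - origin[2]) / d_world[2]   # ray parameter at the plane
    return origin + s * d_world

# Illustrative setup: a camera 2 m above the origin, looking straight down.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0,   0.0,  1.0]])
R = np.diag([1.0, -1.0, -1.0])   # 180-degree flip: camera z points down
t = -R @ np.array([0.0, 0.0, 2.0])
```

With this setup, the principal point (50, 50) back-projects to the world origin, and pixels away from the center land proportionally farther out on the floor. With Meshroom's recovered poses instead of these invented ones, the same function places detections from a 2D object detector into the metric scene.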
How accurate can 3D reconstruction be? We wanted to check how much detail this technique could retrieve. We took 70 images of a small object, and the results were impressive. Every detail on the surface was correctly reconstructed in the 3D model, proving that the real enemy of 3D digitization is plain surfaces of uniform color, like walls, rather than small, detail-rich Maori carved totems.
You can check the generated 3D model at totem.eidos.ai. We loaded it in Three.js so you can rotate it in your browser and appreciate the captured detail.