Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image

7/4/2021, 7:11:00 AM


Scene understanding and 3D shape modeling from a single image


Semantic reconstruction of indoor scenes

Scene understanding
Object reconstruction

Brief history of scene reconstruction

Early works focus on room layout estimation
With the advance of CNNs, object pose estimation
Instead of bounding boxes, shape retrieval methods w/ 3D models
General shape representation: point cloud, patches, primitives w/ post-processing
Voxel-grid representation: computationally intensive and time-consuming
Object mesh reconstruction from a template

Previous works

Scene understanding w/o shape details of indoor objects (instead, 3D bounding box)
Scene-level reconstruction w/ object shapes under contextual knowledge
Depth or voxel representation (c.f., voxel = 3D pixel)
Mesh-retrieval methods w/ 3D model retrieval module
Object-wise mesh reconstruction (e.g., Mesh R-CNN)

Achievement of this paper

Jointly reconstruct the room layout and objects (bounding boxes & meshes)
Spatial occupancy of object meshes helps 3D object detection
In turn, object detection helps object-centric reconstruction
First end-to-end solution to authors' knowledge
Coordinates of reconstructed meshes are differentiable (c.f., voxel grids are not),
which enables joint learning
Density-aware topology modifier in object mesh generation
the attention mechanism & multilateral relations between objects


"box-in-the-box" manner with 3 modules: LEN, ODN, MGN
First, produce 2D object bounding boxes with Faster R-CNN (NIPS'15)
Layout Estimation Network (LEN): camera pose and the layout bounding box
3D Object Detection Network (ODN): 3D bounding box in camera system
Mesh Generation Network (MGN): the mesh geometry in object-centric system
Construct the full scene mesh by embedding the outputs of all three networks (scaled & transformed), with joint training
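The "scaled & transformed" step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the object-centric mesh lies in a unit cube centered at the origin, that the box orientation is a single yaw angle about the vertical (y) axis, and the function name `place_mesh_in_box` is hypothetical.

```python
import math

def place_mesh_in_box(vertices, box_center, box_size, yaw):
    """Map object-centric vertices (assumed in [-0.5, 0.5]^3) into a 3D box.

    yaw is a rotation about the y axis in radians (assumed convention).
    """
    cx, cy, cz = box_center
    sx, sy, sz = box_size
    cos_t, sin_t = math.cos(yaw), math.sin(yaw)
    placed = []
    for x, y, z in vertices:
        # 1) scale the canonical mesh to the box dimensions
        x, y, z = x * sx, y * sy, z * sz
        # 2) rotate about the vertical axis
        xr = cos_t * x + sin_t * z
        zr = -sin_t * x + cos_t * z
        # 3) translate to the box center
        placed.append((xr + cx, y + cy, zr + cz))
    return placed
```

For example, `place_mesh_in_box([(0.5, 0.5, 0.5)], (1, 2, 3), (2, 2, 2), 0.0)` moves the cube corner to `(2.0, 3.0, 4.0)`.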

Detection mechanism

For each indoor object, 3D center C
C is represented by its 2D projection c on the image plane together with its distance d to the camera center
K = the camera intrinsic matrix
R = the camera pose (rotation)
Beta, Gamma = pitch, roll of the camera
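As we read the paper, the parameterization above amounts to C = R^{-1}(Beta, Gamma) · d · K^{-1}[c, 1]^T / ||K^{-1}[c, 1]^T||. The sketch below illustrates that back-projection under stated assumptions: a zero-skew pinhole K given by `fx, fy, cx, cy`, and R passed in as a ready-made 3x3 rotation (how it is built from pitch and roll depends on an axis convention not recorded in these notes). The function name is hypothetical.

```python
import math

def backproject_center(c2d, d, fx, fy, cx, cy, R):
    """Recover a 3D center from its 2D projection and distance to the camera.

    c2d: (u, v) pixel coordinates of the projected center
    d:   distance from the camera center to the 3D center
    R:   3x3 rotation as nested lists (assumed to map world to camera)
    """
    u, v = c2d
    # K^{-1} [u, v, 1]^T for a zero-skew pinhole camera
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    norm = math.sqrt(sum(r * r for r in ray))
    # Point at distance d along the normalized viewing ray (camera frame)
    p_cam = [d * r / norm for r in ray]
    # R^{-1} = R^T for a rotation matrix
    return [sum(R[row][col] * p_cam[row] for row in range(3)) for col in range(3)]
```

With an identity rotation, a center projected onto the principal point at distance 2 comes back as `[0.0, 0.0, 2.0]`.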


ODN exploits multi-lateral relations between objects
take all in-room objects into account in predicting its bounding box
extract the appearance feature with ResNet-34
calculate relational feature with the object relation module (CVPR'18)
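A rough sketch of the idea behind the relational feature: each object's feature is replaced by an attention-weighted sum over the features of all objects in the room. This is a deliberately simplified, appearance-only version. The actual object relation module also uses learned projections and a geometry term from the 2D boxes, both omitted here, and the function name is hypothetical.

```python
import math

def relational_features(features):
    """Attention-weighted mix of all objects' appearance features.

    features: list of N vectors (lists of floats) of equal length.
    Simplification: no learned query/key/value projections, no geometry weights.
    """
    dim = len(features[0])
    out = []
    for query in features:
        # Scaled dot-product similarity between this object and every object
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        # Softmax over all objects (shifted by the max for numerical stability)
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the other objects' features
        out.append([sum(w * f[k] for w, f in zip(weights, features))
                    for k in range(dim)])
    return out
```

In the notes' pipeline, the input vectors would be the ResNet-34 appearance features of the detected objects.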


LEN: similar architecture to ODN, w/o relational feature and w/ 2 FC layers
Image→ResNet→MLP (2 FC layers)


MGN overcomes the limitation of the Topology Modification Network, TMN (ICCV'19)
TMN approximates object shapes by deforming a template and modifying its topology with a predefined distance threshold
Hard to give a general threshold for different scales of object meshes
a large shape variance among different categories
complex background and occlusions
Adaptive manner based on the local density of the ground truth, rather than a fixed distance
Topology modification: whether to keep a face or not
p: a point on their reconstructed mesh
q: its nearest neighbor on the ground-truth
Cut mesh edges for topology modification instead of faces
Category code provides shape priors
Edge classifier = same architecture as the shape decoder, with the last layer replaced by an FC layer
Boundary refinement module (ICCV'19)
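The density-based criterion can be illustrated with a small sketch. Our reading of the rule is that a point p on the reconstructed mesh is kept when its distance to its nearest ground-truth point q is below the local density D(q), taken as the largest pairwise distance among the neighbors of q. The k-nearest-neighbor definition of the neighborhood (including q itself) and both function names are assumptions made for illustration.

```python
import math

def local_density(q, gt_points, k=3):
    """D(q): largest pairwise distance among the k nearest ground-truth
    neighbors of q (q itself is included, as an assumption)."""
    neighbors = sorted(gt_points, key=lambda g: math.dist(q, g))[:k]
    return max(math.dist(a, b) for a in neighbors for b in neighbors)

def keep_point(p, gt_points, k=3):
    """Keep p if it is closer to the ground truth than the local spacing there."""
    q = min(gt_points, key=lambda g: math.dist(p, g))
    return math.dist(p, q) < local_density(q, gt_points, k)
```

Because the threshold D(q) scales with the local point spacing, the same rule adapts to large and small objects, which a single fixed distance threshold cannot do.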

Joint learning

End-to-End training
Individual loss for ODN, LEN and MGN
Last two terms: joint losses
Cooperative loss from "Cooperative holistic scene understanding" (NIPS'18)
Global loss as the partial Chamfer distance
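$S_i$: ground-truth surface of the $i$-th object
$M_i$: reconstructed mesh of the $i$-th object
$p, q$: a point on $M_i$ and a point on $S_i$, respectively
$N$: # of objects; $|S_i|$: # of points on $S_i$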
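A minimal sketch of the partial (one-directional) Chamfer distance: for every ground-truth surface point, take the distance to its nearest point on the reconstructed mesh, average over the points of each object, then average over objects. Representing each mesh by a list of sampled points and using the squared Euclidean distance are assumptions here, and the function name is hypothetical.

```python
def partial_chamfer(gt_surfaces, mesh_points):
    """One-directional Chamfer distance from ground truth to reconstruction.

    gt_surfaces: list (one entry per object) of lists of 3D ground-truth points
    mesh_points: list (one entry per object) of lists of 3D points on the mesh
    """
    total = 0.0
    for surface, mesh in zip(gt_surfaces, mesh_points):
        # For each ground-truth point q, squared distance to its nearest mesh point p
        per_object = sum(
            min(sum((a - b) ** 2 for a, b in zip(p, q)) for p in mesh)
            for q in surface
        )
        total += per_object / len(surface)  # average over points of this object
    return total / len(gt_surfaces)         # average over objects
```

It is "partial" because only the ground-truth-to-mesh direction is measured, which suits a partial ground-truth point cloud that covers just the visible side of an object.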

3D Datasets

SUN RGB-D: 10,335 real indoor images with labeled 3D layout, object bounding boxes and coarse point cloud
Pix3D: 395 furniture models with 9 categories (10,069 images)


2D detector trained on the COCO dataset and fine-tuned on SUN RGB-D
Template sphere: 2562 vertices with unit radius
First train ODN, LEN on SUN RGB-D, and MGN on Pix3D individually
Then, jointly train all networks combining Pix3D into SUN RGB-D

Qualitative analysis

Indoor furniture is often overlaid with miscellaneous backgrounds
Mesh R-CNN generates meshes from low-resolution voxel grids ($24^3$ voxels)
TMN improves, but distance threshold does not show consistent adaptability
c.f., Original SUN RGB-D does not contain ground-truth object meshes for training

Quantitative analysis

w/ 4 aspects
layout estimation -> SUN RGB-D
camera pose prediction -> SUN RGB-D
3D object detection -> SUN RGB-D
object and scene mesh reconstruction -> Pix3D
test with ablation (w/o joint training)
layout, camera pose
IoU for 3D layout and mean absolute error for pitch & roll
3D object detection
3D bounding box IoU with mean average precision (mAP) for 3D object detection
Considered a true positive when the IoU is greater than 0.15
Improved for two reasons
Global loss $L_g$ involves geometry constraint (physical rationality)
Multi-lateral relational features benefit the 3D detection in predicting spatial occupancy
Object Pose Prediction
Mesh Reconstruction
evaluated with $L_g$ in Eq. 3, where the loss is calculated with the average distance from the point cloud of each object to its nearest neighbor on the reconstructed mesh
Local density method keeps small-scale topology
Cutting edges is more robust in avoiding incorrect mesh modification
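The true-positive criterion used for 3D object detection (IoU greater than 0.15) can be sketched as below. To keep the example short it assumes axis-aligned boxes given as (min corner, max corner). The benchmark boxes are oriented, so this is an illustration of the criterion rather than the evaluation code, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner)."""
    (x0, y0, z0), (x1, y1, z1) = box
    return (x1 - x0) * (y1 - y0) * (z1 - z0)

def iou_3d_axis_aligned(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(box_a[0], box_a[1], box_b[0], box_b[1]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union

def is_true_positive(pred_box, gt_box, threshold=0.15):
    """A detection counts as correct when its 3D IoU exceeds the threshold."""
    return iou_3d_axis_aligned(pred_box, gt_box) > threshold
```

Two 2x2x2 cubes offset by one unit along every axis overlap in a single unit cube, giving an IoU of 1/15, which falls below the 0.15 threshold.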

Ablation analysis

Joint training consistently improves regardless of relational features
Cooperative loss & global loss both show effect
Cooperative loss on Layout Estimation
Global loss on object detection and scene reconstruction
Relational features help to improve 3D object detection
In turn, this reduces the loss in scene mesh reconstruction
Object alignment significantly affects mesh reconstruction