Total3DUnderstanding
• Scene understanding and 3D shape modeling from a single image
Background
Semantic reconstruction of indoor scenes
• Scene understanding
• Object reconstruction
Brief history of scene reconstruction
• Early works focus on room layout estimation
• With the advance of CNNs, object pose estimation
• Instead of bounding boxes, shape retrieval methods w/ 3D models
• General shape representations: point clouds, patches, primitives w/ post-processing
• Voxel-grid representation: computationally intensive and time-consuming
• Object mesh reconstruction from a template
Previous works
• Scene understanding w/o shape details of indoor objects (3D bounding boxes instead)
• Scene-level reconstruction w/ object shapes under contextual knowledge
  ◦ Depth or voxel representations (c.f., voxel = 3D pixel)
  ◦ Mesh-retrieval methods w/ a 3D model retrieval module
  ◦ Object-wise mesh reconstruction (e.g., Mesh R-CNN)
Achievement of this paper
• Jointly reconstructs the room layout and objects (bounding boxes & meshes)
  ◦ Spatial occupancy of object meshes helps 3D object detection
  ◦ In turn, object detection helps object-centric reconstruction
• First end-to-end solution, to the authors' knowledge
  ◦ Coordinates of reconstructed meshes are differentiable (c.f., voxel grids are not)
  ◦ Achieved by jointly learning:
    • a density-aware topology modifier in object mesh generation
    • an attention mechanism & multilateral relations between objects
Architecture
• "Box-in-the-box" manner with 3 modules: LEN, ODN, MGN
• Produces object bounding boxes with Faster R-CNN (NIPS'15)
• Layout Estimation Network (LEN): camera pose and the layout bounding box
• 3D Object Detection Network (ODN): 3D bounding boxes in the camera system
• Mesh Generation Network (MGN): mesh geometry in the object-centric system
• Constructs the full scene mesh by embedding the outputs (scaled & transformed) with joint training
Detection mechanism
• For an indoor object, the 3D center C is represented by its 2D projection c on the image plane together with its distance d to the camera center:
  C = R⁻¹(β, γ) · d · K⁻¹[c, 1]ᵀ / ‖K⁻¹[c, 1]ᵀ‖
• K: the camera intrinsic matrix
• R(β, γ): the camera pose
• β, γ: pitch and roll
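A minimal NumPy sketch of this back-projection; the function name and argument layout are my own, and in practice c would be the detected 2D box center plus a learned offset:

```python
import numpy as np

def recover_3d_center(c, d, K, R):
    """Back-project a 2D object center into a 3D center.

    c : (2,) projected object center on the image plane (pixels)
    d : scalar distance from the camera center to the 3D center
    K : (3, 3) camera intrinsic matrix
    R : (3, 3) camera rotation R(beta, gamma) built from pitch and roll
    """
    ray = np.linalg.inv(K) @ np.array([c[0], c[1], 1.0])  # viewing ray
    ray /= np.linalg.norm(ray)                            # unit length
    return np.linalg.inv(R) @ (d * ray)                   # rotate into the world system
```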
ODN
• Multi-lateral relations between objects (a minimal attention sketch follows below)
  ◦ Takes all in-room objects into account when predicting each bounding box
• Extracts the appearance feature with ResNet-34
• Calculates the relational feature with the object relation module (CVPR'18)
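A single-head attention sketch in the spirit of the object relation module; the real module (Hu et al., CVPR'18) additionally injects geometric weights computed from box coordinates, omitted here:

```python
import torch.nn as nn

class RelationModule(nn.Module):
    """Single-head attention over per-object appearance features (sketch)."""

    def __init__(self, dim=2048, key_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, key_dim)
        self.k = nn.Linear(dim, key_dim)
        self.v = nn.Linear(dim, dim)
        self.scale = key_dim ** -0.5

    def forward(self, feats):                    # feats: (N objects, dim)
        attn = (self.q(feats) @ self.k(feats).T) * self.scale
        attn = attn.softmax(dim=-1)              # each object attends to all others
        return feats + attn @ self.v(feats)      # residual relational feature
```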
LEN
• Similar architecture to ODN, w/o the relational feature and w/ 2 FC layers
  ◦ Image → ResNet → MLP (2 FC layers), sketched below
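A sketch of that pipeline, assuming the 512-d pooled ResNet-34 feature; the hidden width and output dimension (pitch/roll plus layout-box parameters) are illustrative:

```python
import torch.nn as nn
from torchvision.models import resnet34

class LayoutEstimationNet(nn.Module):
    """Image -> ResNet -> 2-FC-layer MLP head (illustrative sizes)."""

    def __init__(self, out_dim=2 + 7):           # 2 camera angles + 7 layout-box params
        super().__init__()
        backbone = resnet34(weights=None)
        backbone.fc = nn.Identity()              # keep the 512-d pooled feature
        self.backbone = backbone
        self.head = nn.Sequential(               # the two FC layers from the notes
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        return self.head(self.backbone(img))
```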
MGN
• Overcomes a limit of the Topology Modification Network, TMN (ICCV'19)
  ◦ TMN approximates object shapes by deforming a template and modifying its topology with a predefined distance threshold
• Hard to give a general threshold for different scales of object meshes
  ◦ Large shape variance among different categories
  ◦ Complex backgrounds and occlusions
• Adaptive manner based on the local density of the ground truth, rather than an absolute distance (see the sketch after this list)
  ◦ Topology modification: whether to retain a face or not
    • p: a point on the reconstructed mesh
    • q: its nearest neighbor on the ground truth
• Cuts mesh edges for topology modification instead of faces
• Category code provides shape priors
• Edge classifier: similar architecture to the shape decoder, w/ a last FC layer
• Boundary refinement module (ICCV'19)
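A sketch of density-aware edge pruning under my own assumptions: the local density at a ground-truth point is taken as the distance to its nearest ground-truth neighbor, a stand-in for the paper's exact criterion:

```python
import numpy as np
from scipy.spatial import cKDTree

def edges_to_keep(verts, edges, gt_points):
    """Keep an edge only if both endpoints lie within the local
    ground-truth density around their nearest ground-truth neighbors.

    verts     : (V, 3) reconstructed mesh vertices
    edges     : (E, 2) vertex-index pairs
    gt_points : (M, 3) ground-truth point cloud
    """
    tree = cKDTree(gt_points)
    d_pq, idx = tree.query(verts, k=1)        # the |p - q| term per vertex
    # Local density proxy: distance from each GT point to its nearest GT
    # neighbor (k=2 because the closest hit is the point itself).
    density = tree.query(gt_points, k=2)[0][:, 1]
    ok = d_pq <= density[idx]                 # vertex-level keep decision
    return ok[edges[:, 0]] & ok[edges[:, 1]]  # edge kept iff both ends kept
```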
Joint learning
• End-to-end training
• Individual losses for ODN, LEN, and MGN
• Last two terms: joint losses
  ◦ Cooperative loss from "Cooperative Holistic Scene Understanding" (NIPS'18)
  ◦ Global loss: the partial Chamfer distance (Eq. 3)
    ℓ_g = Σ_{k=1..N} (1/|Sₖ|) Σ_{q ∈ Sₖ} min_{p ∈ Mₖ} ‖q − p‖²
    • Sₖ: ground-truth surface (point cloud) of the k-th object
    • Mₖ: reconstructed mesh of the k-th object
    • q: a point on Sₖ; p: a point on Mₖ
    • N: # of objects
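A per-object sketch of this partial Chamfer term in PyTorch; summing it over the N objects gives the global loss (point counts and sampling are assumptions):

```python
import torch

def partial_chamfer(gt_points, mesh_points):
    """One-directional Chamfer term for a single object: average squared
    distance from each ground-truth point to its nearest point sampled
    on the reconstructed mesh.

    gt_points   : (M, 3) points from the ground-truth surface S_k
    mesh_points : (P, 3) points sampled on the reconstructed mesh M_k
    """
    d2 = torch.cdist(gt_points, mesh_points) ** 2  # (M, P) pairwise squared distances
    return d2.min(dim=1).values.mean()             # nearest-mesh-point average
```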
3D Datasets
SUN RGB-D: https://rgbd.cs.princeton.edu/
• 10,335 real indoor images with labeled 3D layouts, object bounding boxes, and coarse point clouds
Pix3D: http://pix3d.csail.mit.edu/
• 395 furniture models in 9 categories (10,069 images)
Experiment
• 2D detector trained on the COCO dataset and fine-tuned on SUN RGB-D
• Template sphere: 2,562 vertices with unit radius (see the check below)
• First train ODN and LEN on SUN RGB-D, and MGN on Pix3D, individually
• Then jointly train all networks, combining Pix3D into SUN RGB-D
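The 2,562-vertex count matches a unit icosphere at subdivision level 4 (V = 10·4ⁿ + 2); a quick check with trimesh, assuming an icosphere template (the paper may generate its sphere mesh differently):

```python
import trimesh

# Subdivision level 4 gives 10 * 4**4 + 2 = 2562 vertices.
sphere = trimesh.creation.icosphere(subdivisions=4, radius=1.0)
assert sphere.vertices.shape[0] == 2562
```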
Qualitative analysis
• Indoor furniture is overlaid with miscellaneous backgrounds
• Mesh R-CNN generates meshes from low-resolution voxel grids
• TMN improves on this, but its distance threshold does not show consistent adaptability
c.f., the original SUN RGB-D does not contain ground-truth object meshes for training
Quantitative analysis
• w/ 4 aspects:
  ◦ Layout estimation -> SUN RGB-D
  ◦ Camera pose prediction -> SUN RGB-D
  ◦ 3D object detection -> SUN RGB-D
  ◦ Object and scene mesh reconstruction -> Pix3D
• Tested with ablation (w/o joint training)
Layout, camera pose
• IoU for the 3D layout and mean absolute error for pitch & roll
3D object detection
• 3D bounding box IoU with mean average precision (mAP) for 3D object detection
  ◦ Considered a true positive when IoU is greater than 0.15 (toy check below)
• Improved for 2 reasons:
  ◦ The global loss involves a geometry constraint (physical rationality)
  ◦ Multi-lateral relational features benefit 3D detection in predicting spatial occupancy
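A toy version of that true-positive test under an axis-aligned box assumption (actual SUN RGB-D boxes are oriented, so the real evaluation computes oriented-box IoU):

```python
import numpy as np

def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])           # intersection lower corner
    hi = np.minimum(box_a[1], box_b[1])           # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if boxes are disjoint
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)

def is_true_positive(pred, gt, thresh=0.15):
    return aabb_iou_3d(pred, gt) > thresh
```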
Object Pose Prediction
Mesh Reconstruction
Evaluated with the partial Chamfer distance in Eq. 3, where the loss is calculated as the average distance from the point cloud of each object to its nearest neighbor on the reconstructed mesh
• The local-density method keeps small-scale topology
• Cutting edges is more robust in avoiding incorrect mesh modification
Ablation analysis
• Joint training consistently improves results regardless of relational features
  ◦ Cooperative loss & global loss both show an effect
  ◦ Cooperative loss on layout estimation
  ◦ Global loss on object detection and scene reconstruction
• Relational features help to improve 3D object detection
  ◦ In turn, this reduces the loss in scene mesh reconstruction
• Object alignment significantly affects mesh reconstruction