EventCap: Monocular 3D Capture of High-Speed Human Motions using an Event Camera

2021/07/04 06:38
Injung Hwang

Background: 3D Human Motion Capture

Multi-view infra-based methods → Multi-view markerless methods → Single-view (monocular) methods with a depth camera → Multiple event camera methods → Monocular event camera methods (with the help of deep learning)

Infra-based motion capture

Sensor-based motion capture
Marker-based motion capture

Limitation of infra-based motion capture

Costly and only usable indoors
Intrusive: users must wear marker suits

Markerless image-based method

Overcomes the limitations of infra-based motion capture methods

Multi-Camera system

Limitation of multi-camera method (including monocular depth method)

Synchronizing and calibrating multiple cameras is challenging
Capturing fast motion at high frame rate requires a large amount of data
The high frame rate camera systems are crucial for tracking fast motions
The high frame rate leads to excessive amounts of raw data and large bandwidth requirement
(e.g., capturing RGB stream of VGA resolution at 1000 fps from a single view for a minute = 51.5GB)
→ Markerless methods record only raw pixel data rather than compact marker metadata, which inevitably produces large data volumes
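As a sanity check on that figure (assuming uncompressed 3-byte RGB pixels, which is my assumption, not stated in the paper), the quoted 51.5 GB turns out to be exactly the raw byte count expressed in binary gigabytes (GiB):

```python
# Back-of-the-envelope check of the raw-data claim:
# VGA RGB (640x480, 3 bytes/pixel) at 1000 fps for 60 seconds.
width, height, bytes_per_pixel = 640, 480, 3
fps, seconds = 1000, 60

total_bytes = width * height * bytes_per_pixel * fps * seconds
gib = total_bytes / 2**30  # binary gigabytes (GiB)

print(f"{gib:.1f} GiB")  # ~51.5, matching the figure quoted above
```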

Event Camera

Bio-inspired sensors
Measure the changes of logarithmic brightness independently at each pixel
Provide an async event stream at microsecond resolution
An event occurs at a pixel when the logarithmic brightness change exceeds a threshold
Besides the event stream, an intensity image is provided at a lower frame rate
Paradigm-shifting device
High dynamic range
Absence of motion blur
Low power consumption
These characteristics make event cameras well suited to tracking fast-moving objects

Brief Introduction

First monocular approach for event camera-based 3D human motion capture
A novel hybrid asynchronous batch-based optimization algorithm
To tackle the challenges of low signal-to-noise ratio (SNR), drifting, and difficult initialization
Propose an evaluation dataset for event camera-based fast human motion capture and provide high-quality motion capture results at 1000 fps

EventCap Method

Capturing high-speed human motion in 3D with a single event camera
High temporal resolution is required

Pipeline Overview

Pre-processing step: reconstruct a template mesh of the actor
Stage 1: Sparse event trajectory generation (between two adjacent intensity images)
Extract the async spatio-temporal info from the event stream
Stage 2: Hybrid pose batch optimization
Optimize the skeletal motion at 1000 fps using the event trajectories and CNN-based body joint detections
Stage 3: Event-based refinement of the captured skeletal motion

Template Mesh Acquisition

First Stage: Async Event Trajectory Generation

Track the photometric 2D features in an asynchronous manner
Results in the sparse event trajectories { T(h) }
Takes advantage of an existing event-based feature tracking technique

Intensity Image Sharpening

The previous technique relies on sharp intensity images for gradient calculation
Intensity images suffer from severe motion blur due to fast motion
First adopt the event-based double integral (EDI) model to sharpen the images
By using latent images instead of original blurry images
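A heavily simplified per-pixel sketch of the EDI idea (not the paper's implementation; the contrast threshold `c` and the uniform time sampling are assumptions): the blurry value is the exposure-time average of latent intensities, and events relate each latent intensity to the one at reference time `f` via an exponential of the signed event count, so the latent value can be divided out.

```python
import math

def edi_latent(blurry, event_times, polarities, c, t0, t1, f, n=100):
    """Recover the latent intensity at time f from a blurry pixel value.
    Model: L(t) = L(f) * exp(c * E(t)), where E(t) is the signed count of
    events between f and t, and blurry = mean of L(t) over exposure [t0, t1].
    Hence L(f) = blurry / mean(exp(c * E(t)))."""
    def E(t):
        s = 0.0
        for t_e, p in zip(event_times, polarities):
            if f <= t_e < t:
                s += p
            elif t <= t_e < f:
                s -= p
        return s

    samples = [t0 + (t1 - t0) * i / (n - 1) for i in range(n)]
    denom = sum(math.exp(c * E(t)) for t in samples) / n
    return blurry / denom

# A positive (brightening) event mid-exposure implies the start-of-exposure
# latent image was darker than the blurry average.
sharp = edi_latent(0.5, [0.6], [1], 0.3, 0.0, 1.0, 0.0)
```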

Forward and Backward Alignment

The feature tracking can drift over time
Apply the feature tracking method both forward and backward to reduce drifting
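One plausible way to combine the two passes (a sketch under my own assumption of linear time-weighted blending, not necessarily the paper's exact scheme): weight the forward track near the first image and the backward track near the second, so each end is anchored to a sharp intensity image.

```python
def blend_tracks(forward, backward):
    """Blend forward- and backward-tracked 2D positions of one feature.
    Both inputs are lists of (x, y) at the same time stamps; the result
    matches the forward track at the start and the backward track at the
    end, reducing accumulated drift in the middle."""
    n = len(forward)
    blended = []
    for i, (fw, bw) in enumerate(zip(forward, backward)):
        w = i / (n - 1)  # 0 at the first frame, 1 at the last
        blended.append(((1 - w) * fw[0] + w * bw[0],
                        (1 - w) * fw[1] + w * bw[1]))
    return blended

fw = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # forward track (drifts slightly)
bw = [(0.2, 0.0), (1.1, 0.0), (2.0, 0.0)]  # backward track, anchored at the end
print(blend_tracks(fw, bw))
```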

Trajectory Slicing

Evenly slice the continuous event trajectories at each millisecond time stamp
In order to achieve motion capture at the desired tracking frame rate
f = 0, 1, ..., N : indices of the desired tracking frames
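The slicing step can be sketched as resampling each trajectory at evenly spaced millisecond stamps (linear interpolation between samples is my assumption for illustration):

```python
def slice_trajectory(trajectory, step_ms=1.0):
    """Resample a continuous feature trajectory, given as (timestamp_ms, x, y)
    samples, at evenly spaced millisecond stamps so that every tracking frame
    gets a 2D feature position."""
    sliced = []
    t0, t1 = trajectory[0][0], trajectory[-1][0]
    t = t0
    idx = 0
    while t <= t1:
        # Advance to the segment containing time t.
        while trajectory[idx + 1][0] < t:
            idx += 1
        (ta, xa, ya), (tb, xb, yb) = trajectory[idx], trajectory[idx + 1]
        a = (t - ta) / (tb - ta)  # linear interpolation weight
        sliced.append((t, xa + a * (xb - xa), ya + a * (yb - ya)))
        t += step_ms
    return sliced

traj = [(0.0, 10.0, 5.0), (2.0, 14.0, 5.0), (4.0, 14.0, 9.0)]
print(slice_trajectory(traj))  # positions at t = 0, 1, 2, 3, 4 ms
```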

Second Stage: Hybrid Pose Batch Optimization

Jointly optimize all the skeleton poses S = { S_f } for all the tracking frames in a batch
Leverages the hybrid input modality from the event camera
Leverage not only the event feature correspondences, but also the CNN-based 2D and 3D pose estimates
Tackle the drifting due to the accumulation of tracking errors and inherent depth ambiguities of the monocular setting

Event Correspondence Term

Exploits the async spatio-temporal motion info encoded in the event stream
p_* denotes the event correspondences extracted from the sliced trajectories (see Fig. 3)

2D and 3D Detection Term

This term encourages the posed skeleton to match the 2D and 3D body joint detections obtained by CNNs from the intensity images
Uses OpenPose for 2D and VNect for 3D joint detection

Temporal Stabilization Term

Since only the moving body parts can trigger events, the non-moving body parts are not constrained by their energy function
Introduces a temporal stabilization constraint for the non-moving body parts
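Putting the terms together, the batch objective can be sketched as a weighted sum over all tracking frames (the weights and the scalar "pose" abstraction here are illustrative assumptions, not the paper's values):

```python
def batch_energy(poses, e_cor, e_2d, e_3d, e_temp,
                 w_cor=1.0, w_2d=1.0, w_3d=1.0, w_temp=0.1):
    """Hybrid batch objective sketch: a weighted sum of the
    event-correspondence, 2D detection, 3D detection, and
    temporal-stabilization terms over all poses S_f."""
    return (w_cor * sum(e_cor(S) for S in poses)
            + w_2d * sum(e_2d(S) for S in poses)
            + w_3d * sum(e_3d(S) for S in poses)
            # Stabilization penalizes change between consecutive frames,
            # constraining body parts that trigger no events.
            + w_temp * sum(e_temp(poses[f], poses[f - 1])
                           for f in range(1, len(poses))))

# Toy usage with scalar "poses" and quadratic residuals.
energy = batch_energy([0.0, 1.0],
                      e_cor=lambda s: s,
                      e_2d=lambda s: 0.0,
                      e_3d=lambda s: 0.0,
                      e_temp=lambda a, b: (a - b) ** 2)
```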

Third Stage: Event-Based Pose Refinement

Most of the events have a strong correlation with the actor's silhouette (moving edges)
Estimation refinement in an Iterative Closest Point (ICP) manner
The stabilization term mentioned above is applied so that the refined pose stays close to its initial position
s_b : b-th boundary pixel
v_b : corresponding 3D position on the mesh
u_b : corresponding target 2D position of the closest event

Closest Event Search

How to find the closest event
Based on the temporal and spatial distance between s_b and an event e = (u, t, p)
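A minimal sketch of the search (the particular combined distance, squared terms weighted by `lam`, is my assumption; the paper defines its own metric):

```python
def closest_event(s_b, t_ref, events, lam=1.0):
    """Find the event closest to boundary pixel s_b = (x, y) at reference
    time t_ref. Each event is ((u_x, u_y), t, polarity); the score combines
    squared spatial distance with lam-weighted squared temporal distance."""
    def dist(e):
        (ux, uy), t, _p = e
        spatial = (ux - s_b[0]) ** 2 + (uy - s_b[1]) ** 2
        temporal = (t - t_ref) ** 2
        return spatial + lam * temporal
    return min(events, key=dist)

events = [((10.0, 5.0), 0.8, 1), ((11.0, 5.0), 1.0, -1), ((30.0, 2.0), 1.0, 1)]
# The spatially coincident but slightly earlier event wins over the
# simultaneous but farther ones.
print(closest_event((10.0, 5.0), 1.0, events))
```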

Experimental Results

A PC with 3.6GHz Intel Xeon E5-1620 and 16GB RAM

EventCap Dataset

Propose a new benchmark dataset for monocular event-based 3D motion capture
Consists of 12 sequences of 6 actors performing different activities
karate, dancing, javelin throwing, boxing, and other fast non-linear motions
Captured with a DAVIS240C event camera
Produces an event stream and a low-frame-rate intensity image stream (between 7 and 25 fps) at 240 × 180
For reference, also capture with a Sony RX0 camera
Produces high frame rate (between 250 and 1000 fps) RGB videos at 1920 × 1080
For a quantitative evaluation, one sequence is also tracked with a multi-view markerless motion capture system at 100 fps
For qualitative evaluation, reconstruct latent images at 1000 fps from the event stream using the EDI model
The reconstructed poses overlay precisely on the latent images and are plausible in 3D
Even in many extreme cases, such as an actor in a black ninja suit in the dark

Ablation Study

Batch optimization benefits from the integration of CNN-based 2D and 3D pose estimation and the event trajectories
Significantly improves the accuracy and alleviates the drifting problem
Fig. 7 shows errors to the ground truth 3D joint positions

Influence of the template mesh accuracy

Comparison to Baselines

To the authors' knowledge, this is the first monocular event-based 3D motion capture method
Compare to existing monocular RGB-based approaches
HMR, MonoPerfCap
HMR_all and Mono_all on all latent images reconstructed at 1000 fps
HMR_linear and Mono_linear only on the raw intensity images, with linear upsampling of the results
HMR_all and Mono_all suffer from inferior tracking results due to the accumulated error of the reconstructed latent images
Mono_linear and HMR_linear fail to track the high-frequency motions
HMR_refer and Mono_refer on the high frame rate reference RGB images
For a fair comparison, downsample the reference images to the same resolution as the intensity images from the event camera

Discussion and Conclusion

Not able to handle topology change and severe (self-)occlusion
Requires a stable capture background
Cannot handle the challenging scenarios (moving camera, sudden lighting changes)
First approach for markerless 3D human motion capture using a single event camera
Batch optimization makes full usage of hybrid image and event streams