EventCap: Monocular 3D Capture of High-Speed Human Motions using an Event Camera

2021/07/04 06:38
Injung Hwang

Background: 3D Human Motion Capture

Multi-view infra-based methods → Multi-view markerless methods → Single-view (monocular) methods with a depth camera → Multiple event camera methods → Monocular event camera methods (with the help of deep learning)

Infra-based motion capture

Sensor-based motion capture
Marker-based motion capture

Limitation of infra-based motion capture

Costly and only usable indoors
Intrusive: users must wear marker suits

Markerless image-based method

Overcomes the limitations of infra-based motion capture methods

Multi-Camera system

Limitation of multi-camera method (including monocular depth method)

Synchronizing and calibrating multiple cameras is challenging
Capturing fast motion at high frame rate requires a large amount of data
The high frame rate camera systems are crucial for tracking fast motions
The high frame rate leads to excessive amounts of raw data and large bandwidth requirement
(e.g., capturing RGB stream of VGA resolution at 1000 fps from a single view for a minute = 51.5GB)
→ Markerless methods record only raw pixel data rather than compact marker metadata, which inevitably produces large data volumes
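As a sanity check on that figure (assuming uncompressed 3-byte RGB pixels, which is my assumption, not stated in the paper), the quoted 51.5 GB turns out to be exactly the raw byte count expressed in binary gigabytes (GiB):

```python
# Back-of-the-envelope check of the raw-data claim:
# VGA RGB (640x480, 3 bytes/pixel) at 1000 fps for 60 seconds.
width, height, bytes_per_pixel = 640, 480, 3
fps, seconds = 1000, 60

total_bytes = width * height * bytes_per_pixel * fps * seconds
gib = total_bytes / 2**30  # binary gigabytes (GiB)

print(f"{gib:.1f} GiB")  # ~51.5, matching the figure quoted above
```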

Event Camera

Bio-inspired sensors
Measure the changes of logarithmic brightness independently at each pixel
Provide an async event stream at microsecond resolution
An event occurs at a pixel when the logarithmic brightness change exceeds a threshold
Besides the event stream, an intensity image is provided at a lower frame rate
Paradigm-shifting device
High dynamic range
Absence of motion blur
Low power consumption
These characteristics make event cameras well suited to tracking fast-moving objects

Brief Introduction

First monocular approach for event camera-based 3D human motion capture
A novel hybrid asynchronous batch-based optimization algorithm
To tackle the challenges of low signal-to-noise ratio (SNR), drifting, and difficult initialization
Propose an evaluation dataset for event camera-based fast human motion capture and provide high-quality motion capture results at 1000 fps

EventCap Method

Capturing high-speed human motion in 3D with a single event camera
High temporal resolution is required

Pipeline Overview

Pre-processing step: reconstruct a template mesh of the actor
Stage 1: Sparse event trajectory generation (between two adjacent intensity images)
Extract the async spatio-temporal info from the event stream
Stage 2: Hybrid pose batch optimization
Optimize the skeletal motion at 1000 fps using the event trajectories and CNN-based body joint detections
Stage 3: Event-based refinement of the captured skeletal motion

Template Mesh Acquisition

First Stage: Async Event Trajectory Generation

Track the photometric 2D features in an asynchronous manner
Results in the sparse event trajectories { T(h) }
Takes advantage of an existing event-based feature tracking technique

Intensity Image Sharpening

The previous technique relies on sharp intensity images for gradient calculation
Intensity images suffer from severe motion blur due to fast motion
First adopt the event-based double integral (EDI) model to sharpen the images
By using latent images instead of original blurry images
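A heavily simplified per-pixel sketch of the EDI idea (not the paper's implementation; the contrast threshold `c` and the uniform time sampling are assumptions): the blurry value is the exposure-time average of latent intensities, and events relate each latent intensity to the one at reference time `f` via an exponential of the signed event count, so the latent value can be divided out.

```python
import math

def edi_latent(blurry, event_times, polarities, c, t0, t1, f, n=100):
    """Recover the latent intensity at time f from a blurry pixel value.
    Model: L(t) = L(f) * exp(c * E(t)), where E(t) is the signed count of
    events between f and t, and blurry = mean of L(t) over exposure [t0, t1].
    Hence L(f) = blurry / mean(exp(c * E(t)))."""
    def E(t):
        s = 0.0
        for t_e, p in zip(event_times, polarities):
            if f <= t_e < t:
                s += p
            elif t <= t_e < f:
                s -= p
        return s

    samples = [t0 + (t1 - t0) * i / (n - 1) for i in range(n)]
    denom = sum(math.exp(c * E(t)) for t in samples) / n
    return blurry / denom

# A positive (brightening) event mid-exposure implies the start-of-exposure
# latent image was darker than the blurry average.
sharp = edi_latent(0.5, [0.6], [1], 0.3, 0.0, 1.0, 0.0)
```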

Forward and Backward Alignment

The feature tracking can drift over time
Apply the feature tracking method both forward and backward to reduce drifting
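One plausible way to combine the two passes (a sketch under my own assumption of linear time-weighted blending, not necessarily the paper's exact scheme): weight the forward track near the first image and the backward track near the second, so each end is anchored to a sharp intensity image.

```python
def blend_tracks(forward, backward):
    """Blend forward- and backward-tracked 2D positions of one feature.
    Both inputs are lists of (x, y) at the same time stamps; the result
    matches the forward track at the start and the backward track at the
    end, reducing accumulated drift in the middle."""
    n = len(forward)
    blended = []
    for i, (fw, bw) in enumerate(zip(forward, backward)):
        w = i / (n - 1)  # 0 at the first frame, 1 at the last
        blended.append(((1 - w) * fw[0] + w * bw[0],
                        (1 - w) * fw[1] + w * bw[1]))
    return blended

fw = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # forward track (drifts slightly)
bw = [(0.2, 0.0), (1.1, 0.0), (2.0, 0.0)]  # backward track, anchored at the end
print(blend_tracks(fw, bw))
```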

Trajectory Slicing

Evenly slice the continuous event trajectories at each millisecond time stamp
In order to achieve motion capture at the desired tracking frame rate
f = 0, 1, ..., N : indices of the desired tracking frames
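The slicing step can be sketched as resampling each trajectory at evenly spaced millisecond stamps (linear interpolation between samples is my assumption for illustration):

```python
def slice_trajectory(trajectory, step_ms=1.0):
    """Resample a continuous feature trajectory, given as (timestamp_ms, x, y)
    samples, at evenly spaced millisecond stamps so that every tracking frame
    gets a 2D feature position."""
    sliced = []
    t0, t1 = trajectory[0][0], trajectory[-1][0]
    t = t0
    idx = 0
    while t <= t1:
        # Advance to the segment containing time t.
        while trajectory[idx + 1][0] < t:
            idx += 1
        (ta, xa, ya), (tb, xb, yb) = trajectory[idx], trajectory[idx + 1]
        a = (t - ta) / (tb - ta)  # linear interpolation weight
        sliced.append((t, xa + a * (xb - xa), ya + a * (yb - ya)))
        t += step_ms
    return sliced

traj = [(0.0, 10.0, 5.0), (2.0, 14.0, 5.0), (4.0, 14.0, 9.0)]
print(slice_trajectory(traj))  # positions at t = 0, 1, 2, 3, 4 ms
```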

Second Stage: Hybrid Pose Batch Optimization

Jointly optimize all the skeleton poses S = { S_f } for all the tracking frames in a batch
Leverages the hybrid input modality from the event camera
Leverage not only the event feature correspondences, but also the CNN-based 2D and 3D pose estimates
Tackle the drifting due to the accumulation of tracking errors and inherent depth ambiguities of the monocular setting

Event Correspondence Term

Exploits the async spatio-temporal motion info encoded in the event stream
p_* denotes the event correspondences extracted from the sliced trajectories (see Fig. 3)

2D and 3D Detection Term

This term encourages the posed skeleton to match the 2D and 3D body joint detections obtained by CNNs from the intensity images
Uses OpenPose for 2D and VNect for 3D joint detection

Temporal Stabilization Term

Since only the moving body parts can trigger events, the non-moving body parts are not constrained by their energy function
Introduces a temporal stabilization constraint for the non-moving body parts
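Putting the terms together, the batch objective can be sketched as a weighted sum over all tracking frames (the weights and the scalar "pose" abstraction here are illustrative assumptions, not the paper's values):

```python
def batch_energy(poses, e_cor, e_2d, e_3d, e_temp,
                 w_cor=1.0, w_2d=1.0, w_3d=1.0, w_temp=0.1):
    """Hybrid batch objective sketch: a weighted sum of the
    event-correspondence, 2D detection, 3D detection, and
    temporal-stabilization terms over all poses S_f."""
    return (w_cor * sum(e_cor(S) for S in poses)
            + w_2d * sum(e_2d(S) for S in poses)
            + w_3d * sum(e_3d(S) for S in poses)
            # Stabilization penalizes change between consecutive frames,
            # constraining body parts that trigger no events.
            + w_temp * sum(e_temp(poses[f], poses[f - 1])
                           for f in range(1, len(poses))))

# Toy usage with scalar "poses" and quadratic residuals.
energy = batch_energy([0.0, 1.0],
                      e_cor=lambda s: s,
                      e_2d=lambda s: 0.0,
                      e_3d=lambda s: 0.0,
                      e_temp=lambda a, b: (a - b) ** 2)
```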

Third Stage: Event-Based Pose Refinement

Most of the events have a strong correlation with the actor's silhouette (moving edges)
Estimation refinement in an Iterative Closest Point (ICP) manner
The stabilization term mentioned above is applied so that the refined pose stays close to its initial position
s_b : b-th boundary pixel
v_b : corresponding 3D position on the mesh
u_b : corresponding target 2D position of the closest event

Closest Event Search

How to find the closest event
Based on the temporal and spatial distance between s_b and an event e = (u, t, p)
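A minimal sketch of the search (the particular combined distance, squared terms weighted by `lam`, is my assumption; the paper defines its own metric):

```python
def closest_event(s_b, t_ref, events, lam=1.0):
    """Find the event closest to boundary pixel s_b = (x, y) at reference
    time t_ref. Each event is ((u_x, u_y), t, polarity); the score combines
    squared spatial distance with lam-weighted squared temporal distance."""
    def dist(e):
        (ux, uy), t, _p = e
        spatial = (ux - s_b[0]) ** 2 + (uy - s_b[1]) ** 2
        temporal = (t - t_ref) ** 2
        return spatial + lam * temporal
    return min(events, key=dist)

events = [((10.0, 5.0), 0.8, 1), ((11.0, 5.0), 1.0, -1), ((30.0, 2.0), 1.0, 1)]
# The spatially coincident but slightly earlier event wins over the
# simultaneous but farther ones.
print(closest_event((10.0, 5.0), 1.0, events))
```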

Experimental Results

A PC with 3.6GHz Intel Xeon E5-1620 and 16GB RAM

EventCap Dataset

Propose a new benchmark dataset for monocular event-based 3D motion capture
Consists of 12 sequences of 6 actors performing different activities
karate, dancing, javelin throwing, boxing, and other fast non-linear motions
Captured with a DAVIS240C event camera
Produces an event stream and a low-frame-rate intensity image stream (between 7 and 25 fps) at 240 × 180
For reference, also capture with a Sony RX0 camera
Produces high frame rate (between 250 and 1000 fps) RGB videos at 1920 × 1080
For a quantitative evaluation, one sequence is also tracked with a multi-view markerless motion capture system at 100 fps
For qualitative evaluation, reconstruct latent images at 1000 fps from the event stream using the EDI model
The reconstructed poses overlay precisely on the latent images and are plausible in 3D
Even in many extreme cases, such as an actor in a black ninja suit in the dark

Ablation Study

Batch optimization benefits from the integration of CNN-based 2D and 3D pose estimation and the event trajectories
Significantly improves the accuracy and alleviates the drifting problem
Fig. 7 shows errors to the ground truth 3D joint positions

Influence of the template mesh accuracy

Comparison to Baselines

To the authors' knowledge, this is the first monocular event-based 3D motion capture method
Compare to existing monocular RGB-based approaches
HMR, MonoPerfCap
HMR_all and Mono_all on all latent images reconstructed at 1000 fps
HMR_linear and Mono_linear only on the raw intensity images, with linear upsampling of the results
HMR_all and Mono_all suffer from inferior tracking results due to the accumulated error of the reconstructed latent images
Mono_linear and HMR_linear fail to track the high-frequency motions
HMR_refer and Mono_refer on the high frame rate reference RGB images
For a fair comparison, downsample the reference images to the same resolution as the intensity images from the event camera

Discussion and Conclusion

Not able to handle topology change and severe (self-)occlusion
Requires a stable capture background
Cannot handle the challenging scenarios (moving camera, sudden lighting changes)
First approach for markerless 3D human motion capture using a single event camera
Batch optimization makes full usage of hybrid image and event streams