
Introduction

Digital capture of dynamic events has assumed great importance recently [5]. Sensors for 3D data, including multicamera systems and laser range scanners, are common today. Some of them are suitable for real-time capture of the shape and appearance of dynamic events, as in previously existing multiview tele-immersive capture systems. The $ 2\frac{1}{2}$D model of an aligned depth map and image, called a Depth Image, has been popular in Image Based Modeling and Rendering (IBMR). Time-varying depth and image sequences, called Depth Movies, extend IBMR to dynamic events. Applications of such systems include virtual-space tele-conferencing, remote 3D immersion, and 3D entertainment. Tele-immersive environments are emerging as the next generation of communication medium, allowing remote users more effective interaction and collaboration in joint activities at research, academic and social levels. Tele-immersion creates the illusion that the user is in the same physical space as the other remote participants, though they may be miles apart.

Figure 1: Doo-Young Dataset Depth Movie from ETH-Z. [image omitted]

Depth Movies are time-varying sequences of depth maps and texture, as shown in figure 1. A depth movie consists of three channels of data: (a) the I Channel contains the sequence of images, with $ I_k[i, j]$ giving the colour at pixel $ (i, j)$ at time instant $ k$; (b) the D Channel contains the sequence of depth maps, with $ D_k[i, j]$ giving the depth or distance at pixel $ (i, j)$ at time instant $ k$; and (c) the C Channel contains the sequence of time-varying calibration parameters that map the depths to 3D coordinates in a common reference frame, with $ C_k$ giving the $ 3\times 4$ calibration matrix. Continuous, hole-free capture requires multiple depth movies to cover the event from all directions. In this paper, we study dynamic events involving a single human actor, a common case for entertainment and tele-conferencing. We assume that the event is captured using $ m$ depth movies of length $ n$. That is, the $ 2\frac{1}{2}$D structure of the event is available from $ m$ different positions or views for $ n$ uniformly sampled time instants or "frames".
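
To make the C Channel's role concrete, the sketch below back-projects one depth map into the common reference frame. It is a minimal illustration in Python (the paper specifies no code): the function name is ours, and it assumes $ D_k[i, j]$ stores projective depth with respect to $ C_k$, i.e. $ \lambda (j, i, 1)^T = C_k (X, Y, Z, 1)^T$; capture systems that store ray lengths or z-depths need a different normalisation.

    import numpy as np

    def depth_to_points(D_k, C_k):
        # Split the 3x4 calibration matrix: C_k = [M | p4].
        M, p4 = C_k[:, :3], C_k[:, 3]
        h, w = D_k.shape
        # Homogeneous pixel coordinates (column, row, 1) for every pixel.
        jj, ii = np.meshgrid(np.arange(w), np.arange(h))
        pix = np.stack([jj.ravel(), ii.ravel(), np.ones(h * w)])
        # Invert  lambda * p = M X + p4   =>   X = M^{-1} (lambda p - p4).
        pts = np.linalg.solve(M, D_k.ravel() * pix - p4[:, None])
        return pts.reshape(3, h, w)   # per-pixel (X, Y, Z) in the common frame

Applying this to each $ D_k$ with its $ C_k$ places all $ m$ views in one coordinate frame, which is what makes the cross-view redundancy discussed below exploitable.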

Depth movies are bulky and call for effective representations and compression methods. Compression is essential as the captured event is often transmitted to a remote location for 3D playback. In such a setting, the server at the capture site is linked over a network to a client at the rendering site. Video compression can be applied effectively to the I Channel; compression of static and dynamic lightfields has been studied before [8,9,14]. The C Channel contains lightweight, slowly varying data, and a standard data compression scheme can compress it effectively. The D Channel, however, contains time-varying depth maps from multiple sources. Effective compression of such data is important and is the focus of this paper. The D Channel is a video of depth values, so it may appear that video compression schemes like MPEG would work well on it. Critical differences between images and depth maps make this undesirable. Video compression is psycho-visually motivated and gives less emphasis to the high-frequency components. In depth maps, however, the high-frequency regions represent occlusion boundaries; they are critical [6], especially if the depths are used for rendering.
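
A tiny numeric illustration of this point (ours, not part of the paper's pipeline): smoothing a depth discontinuity the way a psycho-visually tuned coder smooths an image edge produces depth samples that lie on neither surface.

    import numpy as np

    # One scanline across an occlusion boundary:
    # foreground at 1 m, background at 3 m.
    row = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])

    # A 3-tap box filter stands in for the low-pass effect of lossy coding.
    blurred = np.convolve(row, np.ones(3) / 3, mode="valid")
    print(blurred)   # [1.0, 1.667, 2.333, 3.0]

The samples at 1.67 m and 2.33 m belong to neither the foreground nor the background; rendered, they appear as phantom geometry hanging off the silhouette, which is why depth maps need compression that preserves high-frequency content.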

Depths from different viewpoints contain redundant information derived from the common geometric structure of the scene. We exploit this spatial redundancy, in addition to the temporal redundancy, for effective compression of depth movies. The geometric structure is approximated using a light-weight, parametric proxy model for each time instant. We use an articulated human model as the parametric proxy, with the joint angles serving as its parameters. The time-varying parameters approximate the underlying geometric structure of the action in a viewpoint-independent manner. The depths from each viewpoint are replaced with the residue, i.e., the difference from the depth due to the proxy model. The residues are small in range and are correlated spatially and temporally. We show different ways to encode them for different quality/size trade-offs.
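
A minimal sketch of the residue idea, assuming a depth map proxy_D_k obtained by posing the articulated proxy with frame $ k$'s joint angles and rendering it through the same calibration $ C_k$ (how that rendering is produced is outside this sketch):

    import numpy as np

    def encode_residue(D_k, proxy_D_k):
        # proxy_D_k: the posed proxy rendered as a depth map from the
        # same view as D_k.
        return D_k - proxy_D_k      # small in range, correlated in i, j and k

    def decode_depth(residue_k, proxy_D_k):
        # The client re-renders the proxy from the transmitted joint angles
        # and adds the decoded residue to recover the captured depths.
        return proxy_D_k + residue_k

Only the compact proxy parameters and the residues need to travel over the network; the small dynamic range and correlation of the residues are what the encodings in section 3 exploit for the quality/size trade-offs.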

We discuss related work in section 2. Section 3 presents our work on proxy-based compression of depth movies, followed by results in section 4.

