Depth Movies are time-varying sequences of depth maps and texture, as shown in figure 1. A depth movie consists of three channels of data: (a) the I Channel contains the sequence of images, with I_t(x, y) giving the colour at pixel (x, y) at time instant t; (b) the D Channel contains the sequence of depth maps, with D_t(x, y) giving the depth or distance at pixel (x, y) at time instant t; and (c) the C Channel gives the sequence of time-varying calibration parameters that map the depths into 3D coordinates in a common reference frame, with C_t giving the calibration matrix at time t. Continuous, hole-free capture requires multiple depth movies to cover the event from all directions. In this paper, we study dynamic events involving a single human actor, a common case for entertainment and tele-conferencing. We assume that the event is captured using N depth movies of length T. That is, the 3D structure of the event is available from N different positions or views for T uniformly sampled time instants or ``frames''.
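The three-channel layout above can be sketched in code. This is a minimal illustration, not the paper's implementation: the array shapes, the sample intrinsic matrix K, and the backproject helper are all hypothetical, standing in for the calibration that maps a pixel and its depth into 3D under a simple pinhole model.

```python
import numpy as np

# Hypothetical sizes for illustration: N views, T frames, H x W pixels.
N, T, H, W = 4, 8, 120, 160

# I Channel: I[v, t] is the colour image from view v at time instant t.
I = np.zeros((N, T, H, W, 3), dtype=np.uint8)
# D Channel: D[v, t] is the depth map (distance per pixel) for view v at time t.
D = np.ones((N, T, H, W), dtype=np.float32)

# C Channel stand-in: a pinhole intrinsic matrix; together with the camera
# pose it maps depths into 3D coordinates in a common reference frame.
K = np.array([[100.0,   0.0, W / 2],
              [  0.0, 100.0, H / 2],
              [  0.0,   0.0,   1.0]])

def backproject(d, x, y, K):
    """Lift pixel (x, y) with depth d into 3D camera coordinates:
    X = d * K^-1 @ [x, y, 1]  (pinhole model)."""
    return d * np.linalg.solve(K, np.array([x, y, 1.0]))

# Back-project the principal point at unit depth; lands on the optical axis.
p = backproject(D[0, 0, 60, 80], 80.0, 60.0, K)
```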
Depth movies are bulky and call for effective representations and compression methods. Compression is essential as the captured event is often transmitted to a remote location for 3D playback. In such a setting, the server at the capture site is linked over a network to a client at the rendering site. Video compression can be used effectively on the I Channel. Compression of static and dynamic lightfields has been studied before [8,9,14]. The C Channel contains a small amount of slowly varying data, and a standard data compression scheme can compress it effectively. The D Channel, however, contains time-varying depth maps from multiple sources. Effective compression of such data is important and is the focus of this paper. The D Channel is a video of depth values, so it may appear that video compression schemes like MPEG would work well on it. Critical differences between images and depth maps make this undesirable: video compression is psycho-visually motivated and de-emphasizes high-frequency components, but in depth maps the high-frequency regions represent occlusion boundaries, which are critical [6], especially if the depths are used for rendering.
Depths from different viewpoints contain redundant information derived from the common geometric structure of the scene. We exploit this spatial redundancy, in addition to the temporal redundancy, for effective compression of depth movies. The geometric structure is approximated using a lightweight, parametric proxy model for each time instant. We use an articulated human model as the parametric proxy, with the joint angles serving as its parameters. The time-varying parameters approximate the underlying geometric structure of the action in a viewpoint-independent manner. The depths from each viewpoint are replaced with the residue, i.e., the difference from the depth due to the proxy model. The residues are small in range and are correlated spatially and temporally. We show different ways to encode them for different quality/size trade-offs.
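The residue idea can be sketched as follows. This is an illustrative toy, not the paper's encoders: the uniform quantization step, the function names, and the synthetic data are assumptions, meant only to show that once the proxy depth is subtracted, the remaining signal has a small range and is cheap to encode.

```python
import numpy as np

def encode_residue(depth, proxy_depth, step=4):
    """Replace a captured depth map with its residue from the depth rendered
    off the proxy model, then uniformly quantize it. The residue spans a much
    smaller range than raw depth, so it quantizes and entropy-codes well."""
    residue = depth.astype(np.int32) - proxy_depth.astype(np.int32)
    return np.round(residue / step).astype(np.int16)

def decode_residue(q, proxy_depth, step=4):
    """Reconstruct depth at the client from the shared proxy plus residue."""
    return proxy_depth.astype(np.int32) + q.astype(np.int32) * step

# Toy data: the proxy approximates the true depth to within +/-10 units.
rng = np.random.default_rng(0)
proxy = rng.integers(500, 1000, size=(4, 4))
true_depth = proxy + rng.integers(-10, 11, size=(4, 4))

q = encode_residue(true_depth, proxy)
rec = decode_residue(q, proxy)
err = int(np.abs(rec - true_depth).max())  # bounded by step / 2
```

Coarsening the quantization step shrinks the encoded residues at the cost of larger reconstruction error, which is one axis of the quality/size trade-off mentioned above.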
We present related work in section 2. Section 3 presents our work on proxy-based compression of depth movies, followed by results in section 4.