On June 6, 2023, Apple Vision Pro, which had garnered significant attention in the global tech community, was officially unveiled at the Apple Worldwide Developers Conference (WWDC). Apple claimed that hardware encoding and decoding support for the MV-HEVC coding standard would significantly enhance both the subjective and objective quality of 3D video. This announcement sparked a flurry of searches as many developers sought to understand what MV-HEVC is and how it differs from traditional HEVC-based 3D encoding.
Currently, commonly used 3D video imaging technologies include holographic projection, glasses-free 3D screens, and stereoscopic movie display technology.
At present, VR headsets and stereoscopic movies are the most common channels through which 3D video content is consumed, and both rely on left- and right-viewpoint images for encoding, transmission, and display. However, a significant amount of 3D video content is encoded not with specialized multi-view coding standards but with general-purpose ones: the typical method is to merge the left and right viewpoint images into a single frame in a side-by-side (SBS) format and then encode the combined sequence.
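As a concrete illustration, the sketch below packs two viewpoint videos side by side with FFmpeg's hstack filter and encodes the result with libx265; the encoder sees only a single, wider video. The input and output file names are hypothetical placeholders.

```python
import subprocess

# Frame-packing approach described above: stitch the left and right
# viewpoints side by side, then compress the packed sequence with a
# general-purpose HEVC encoder (libx265).
subprocess.run([
    "ffmpeg",
    "-i", "left_view.mp4",    # left viewpoint (main view)
    "-i", "right_view.mp4",   # right viewpoint (auxiliary view)
    "-filter_complex", "[0:v][1:v]hstack=inputs=2",  # side-by-side packing
    "-c:v", "libx265",        # encode the packed frames as ordinary HEVC
    "sbs_packed.mp4",
], check=True)
```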
Using the HEVC encoder as an example: it has no mechanism for searching out similarities between different camera views. Within a packed frame, the left and right halves cannot be predicted from each other, and because the motion estimation search range is limited, inter-frame prediction cannot reach across viewpoints either. Eliminating the redundancy between the left and right viewpoints of a 3D video can therefore significantly improve coding efficiency.
In response to these characteristics of 3D video, especially multi-viewpoint stitched 3D video, the Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) was established, and in 2014 it published MV-HEVC, the multi-view extension of the HEVC standard. Figure 1 shows the motion vectors used for inter-frame prediction in right-viewpoint frames encoded with MV-HEVC. It can be observed that a large number of inter-view reference modes are used for the right viewpoint, effectively eliminating the redundancy between viewpoints.
Figure 1: Schematic of MV-HEVC 3D Video Coding Right Viewpoint Bitstream Analysis (Green lines with IL labels indicate inter-viewpoint references)
In the MV-HEVC (Multiview HEVC) standard, a new syntax element called LayerId (nuh_layer_id in the specification) is introduced in the NALU header. It indicates the viewpoint to which the frame (or slice) encapsulated in that NALU belongs. In 3D video, LayerId 0 typically denotes the left viewpoint (main viewpoint), while LayerId 1 denotes the right viewpoint (auxiliary viewpoint). The set of frames that share the same POC but carry different LayerIds is referred to as an Access Unit (AU). Main-viewpoint frames follow the reference rules of the base HEVC standard, while each auxiliary-viewpoint frame is encoded with one additional inter-view reference frame on top of base HEVC: the frame with the same POC in the main viewpoint. This reference structure is what enables inter-viewpoint referencing.
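The layer number can be read directly from the two-byte HEVC NALU header. A minimal parsing sketch:

```python
def parse_hevc_nalu_header(nalu: bytes) -> dict:
    """Decode the two-byte HEVC NALU header.

    Bit layout (16 bits): forbidden_zero_bit (1) | nal_unit_type (6) |
    nuh_layer_id (6) | nuh_temporal_id_plus1 (3).
    """
    b0, b1 = nalu[0], nalu[1]
    return {
        "nal_unit_type": (b0 >> 1) & 0x3F,
        # nuh_layer_id straddles the byte boundary: 1 bit from byte 0,
        # 5 bits from byte 1. In two-view 3D video, 0 = left/main view,
        # 1 = right/auxiliary view.
        "nuh_layer_id": ((b0 & 0x01) << 5) | (b1 >> 3),
        "nuh_temporal_id_plus1": b1 & 0x07,
    }

# Example: a TRAIL_R slice header with nuh_layer_id == 1 (auxiliary view).
print(parse_hevc_nalu_header(bytes([0x02, 0x09])))
```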
Frames belonging to different LayerIds can share the same POC (Picture Order Count). In addition, a frame with a higher LayerId can reference frames with a smaller LayerId in the same AU (Access Unit), as shown in Figure 2; a conceptual sketch follows the figure. This referencing relationship is what allows multi-layer video coding to achieve more efficient compression.
Figure 2: Illustration of Reference in MV-HEVC Dual-Viewpoint Coding
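The reference selection rule can be summarized in pseudocode. The sketch below is purely conceptual (the frame objects and their .poc and .layer_id attributes are invented for illustration, not taken from any real decoder): an auxiliary-view frame gets the usual HEVC temporal references plus the same-POC frame from the lower layer.

```python
def build_reference_lists(frame, decoded_frames):
    """Conceptual sketch of MV-HEVC reference selection (not a real decoder)."""
    # Base-HEVC behavior: temporal references come from the same layer.
    temporal_refs = [f for f in decoded_frames
                     if f.layer_id == frame.layer_id and f.poc != frame.poc]

    inter_view_refs = []
    if frame.layer_id > 0:
        # Auxiliary view: additionally reference the lower-layer frame of
        # the same Access Unit, i.e. same POC but smaller layer_id.
        inter_view_refs = [f for f in decoded_frames
                           if f.poc == frame.poc and f.layer_id < frame.layer_id]

    return temporal_refs + inter_view_refs
```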
Due to the introduction of Layers, new syntax elements need to be incorporated. Additionally, different LayerIds can theoretically serve as separate video outputs, which means they require their own SPS and PPS configurations. To address these issues, MV-HEVC extends the VPS, introduces a new Profile Tier Level, and modifies parts of the PPS and SPS syntax. Considering that a significant amount of parameter content (such as frame dimensions, chroma sampling, etc.) in SPS and PPS is redundant across different viewpoints, MV-HEVC makes special provisions for the SPS and PPS syntax referenced by frames with a LayerId other than 0, eliminating this redundant information.
To eliminate information redundancy between viewpoints, MV-HEVC extends inter-frame prediction across layers; this is referred to as the inter-layer (Inter-Layer) prediction mode.
The introduction of the Inter-Layer mode brings new challenges, such as the following scenario:
TMVP (Temporal Motion Vector Prediction) is an inter-frame prediction technique in HEVC that takes the motion vector of the collocated block in a previously coded frame and scales it according to POC distances, as shown in the diagram below:
Figure 3: Schematic of Temporal Motion Vector Prediction (TMVP) Mode
The scaled and corrected motion vector (MV) is given by:

curMV = tb / td * colMV

where tb is the POC distance between the current frame and its reference frame, and td is the POC distance between the collocated frame and its reference frame.
However, with the introduction of the inter-layer mode, the reference frame and the current frame can share the same POC number, so tb and td can both become zero. This can result in division-by-zero errors, or in scaling to a zero vector, rendering the scaling meaningless.
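A simplified sketch of the scaling (the specification uses a fixed-point formulation rather than a plain division) makes this failure mode explicit:

```python
def scale_tmvp(col_mv, cur_poc, cur_ref_poc, col_poc, col_ref_poc):
    """Simplified TMVP scaling sketch; col_mv is an (x, y) tuple."""
    tb = cur_poc - cur_ref_poc  # current frame -> its reference
    td = col_poc - col_ref_poc  # collocated frame -> its reference
    if td == 0 or tb == 0:
        # With inter-layer references, the current frame and its reference
        # share the same POC, so tb or td collapses to zero: naive scaling
        # would divide by zero or always yield the zero vector.
        return None
    return (col_mv[0] * tb // td, col_mv[1] * tb // td)
```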
To address this issue, MV-HEVC specifies that all inter-layer reference frames are marked as long-term reference frames, and that a motion vector pointing to a long-term reference can only be predicted from another motion vector that also points to a long-term reference (and such vectors are not POC-scaled). By separating inter-layer prediction from ordinary temporal prediction in this way, the errors described above are avoided.
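The resulting validity check for a TMVP candidate reduces to a single comparison, sketched below (function and parameter names are illustrative):

```python
def tmvp_candidate_is_valid(cur_ref_is_long_term: bool,
                            col_ref_is_long_term: bool) -> bool:
    # An MV pointing to a long-term reference (which, in MV-HEVC, includes
    # every inter-layer reference) may only predict an MV that also points
    # to a long-term reference, and vice versa; such MVs bypass POC-based
    # scaling entirely.
    return cur_ref_is_long_term == col_ref_is_long_term
```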
Unlike the sparse I-frame placement common in internet applications, the JCT-3V common test conditions, which target broadcast applications, typically use an I-frame interval of 20-30 frames to evaluate the bitrate savings of a coding tool. Since layer 1 in MV-HEVC contains no I-frames, relying instead on inter-view predicted P-frames, commercial encoders differ significantly from the reference software in the number of reference frames they use. As a result, the compression gains MV-HEVC achieves with the reference software will be significantly higher than its benefits in internet services, and it is therefore necessary to implement and measure MV-HEVC on a commercial encoder.
Currently, the Apple Vision Pro chip supports hardware decoding of MV-HEVC bitstreams through firmware-level optimizations. On the software side, a business team can add MV-HEVC extension support to an in-house HEVC decoder and integrate it with FFmpeg, allowing users to decode MV-HEVC 3D video streams by invoking that decoder through FFmpeg.
MV-HEVC (Multi-View High Efficiency Video Coding) is an advanced video coding technology defined as an extension of HEVC (High Efficiency Video Coding). HEVC, also known as H.265, is a video compression standard that offers higher compression efficiency and better video quality than its predecessor, H.264/AVC.
MV-HEVC is specifically designed for encoding multi-view videos, i.e., videos captured from different angles and viewpoints. This encoding technique significantly reduces the bitrate and storage requirements for multi-view videos while maintaining high quality. MV-HEVC has broad application prospects in areas such as 3D video, panoramic video, virtual reality (VR), and augmented reality (AR).