Let me say up front what packed bitstream is not: it is not a way to save space. In fact, "unpacking" a packed bitstream video makes the file slightly smaller (by just a few KB).
Frame (or VOP) Types
Before we can talk about packed bitstream, we need to talk about frame types. In MPEG-4 talk, each encoded image of the movie is called a VOP (Video Object Plane). A VOP is usually equivalent to a frame. Here's the explanation from Wikipedia: "In MPEG-4, VOP refers to a 'Video Object Plane' which is effectively a Video frame." (We'll see later that packed bitstream is exactly the case where a VOP is not a frame.) Here are the VOP types (I'll use "frame" and "VOP" interchangeably, since at this point we can consider them the same):
- I-VOP (Intra): A frame encoded like a still image (much like a JPEG image). Why aren't all frames encoded like that? It's not efficient. Here's why:
- P-VOP (Predicted): A frame that encodes only the changes from the previous frame (we'll be more accurate shortly). In most cases, a video frame is very similar to its preceding frame, so it makes sense to encode a frame by saying "this frame is exactly like the previous frame, except for the following changes". P-VOPs are much more efficient than I-VOPs.
- B-VOP (Bi-directional) - we'll discuss these in a moment.
In theory, a video could consist of one I-VOP followed only by P-VOPs. In practice this is a bad idea, mainly for the following reason: if you want to jump to a specific frame, you have to decode all the frames from the previous I-VOP up to that frame, since the frame itself doesn't contain enough information to be displayed. Decoding many frames takes a long time, so it's important that I-VOPs are not too far apart. Two comments on this. First, while DVD is MPEG-2 and not MPEG-4, it also has I-frames and P-frames (and B-frames), and each chapter mark always points to an I-frame. Makes sense, right? Second, many Xvid-encoded movies have long scenes with an I-VOP only every 300 frames. This is very good for overall compression, but seeking takes quite some time.
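To put a number on the seeking cost, here is a minimal sketch. It assumes a simplified stream with only I-VOPs and P-VOPs and an I-VOP at every multiple of a fixed interval; the function name and the fixed interval are my own illustration, not something a real encoder guarantees.

```python
def frames_to_decode(target, keyframe_interval):
    """Number of frames a player must decode before it can show `target`.

    Assumption: I-VOPs sit at frames 0, keyframe_interval, 2*keyframe_interval, ...
    and every other frame is a P-VOP that depends on the frame before it.
    """
    last_keyframe = (target // keyframe_interval) * keyframe_interval
    # Decode the I-VOP, every P-VOP after it, and finally the target itself.
    return target - last_keyframe + 1


for target in (300, 450, 599):
    print(target, frames_to_decode(target, 300))
```

With an I-VOP every 300 frames, jumping to frame 599 means decoding 300 frames just to show one.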
What about B-VOPs? B-VOPs are encoded based on the previous I-VOP or P-VOP, and also on the next I-VOP or P-VOP. The MPEG group found that in many cases, describing a frame as the sum of changes from the previous and next frames is more efficient than describing just the changes from a previous frame. Why? I don't know, but the 'E' in MPEG stands for 'Experts', so they know what they're talking about. To complete the picture, I'll just say that P-VOPs are encoded from the previous I-VOP or P-VOP, but never from a previous B-VOP. This means that B-VOPs never affect the quality of any frames but themselves, so you can compress them as much as you want and only that one frame is affected. That's not true for I-VOPs or P-VOPs.
Frame Type Example
Let's say we want to encode a sequence of 10 frames. We might end up with the following sequence of frames (or VOPs), numbered 0 through 9: I B B P B B P B B P
The first frame (frame 0) uses an I-VOP. Frame 3 is encoded based on frame 0, frame 6 is encoded based on frame 3, etc. Frames 1 and 2 are encoded based on frames 0 and 3. Frames 4 and 5 are encoded based on frames 3 and 6, etc. Makes sense, right?
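The dependencies described above can be worked out mechanically. This is only a sketch: the string "IBBPBBPBBP" is my reading of the 10-frame example, and the function assumes the sequence starts with an I-VOP and that every B-VOP has a later I-VOP or P-VOP to lean on.

```python
def reference_frames(types):
    """Map each frame index to the indices of the frames it is encoded from.

    `types` is a display-order string such as "IBBPBBPBBP".
    """
    # Only I-VOPs and P-VOPs can serve as references ("anchors").
    anchors = [i for i, t in enumerate(types) if t in "IP"]
    refs = {}
    for i, t in enumerate(types):
        if t == "I":
            refs[i] = []  # encoded like a still image
        elif t == "P":
            refs[i] = [max(a for a in anchors if a < i)]
        else:  # B-VOP: previous anchor and next anchor
            refs[i] = [
                max(a for a in anchors if a < i),
                min(a for a in anchors if a > i),
            ]
    return refs


SEQUENCE = "IBBPBBPBBP"
for frame, refs in reference_frames(SEQUENCE).items():
    print(frame, SEQUENCE[frame], refs)
```

Note that no B-VOP index ever shows up in anyone's reference list, which is why B-VOPs can be compressed harder without hurting other frames.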
Playback Issues
This is all fine, but how can the player decode a B-VOP if the P-VOP it's based on comes later in the stream? What if we have many more than 2 consecutive B-VOPs before the P-VOP they're based on? The answer is simple: we need to rearrange the order of the frames as they appear in the stream (or in the file). Here's how this is done in non-packed bitstream files (stream order): I P B B P B B P B B
The P-VOP precedes the B-VOPs that depend on it. Of course, it's marked in a special way so that it will not be shown in stream order. There's still a problem: if the player shows frame 0 as soon as it reads it, it will have nothing to show when it reads the next frame. It has to read two more frames before reaching the B-VOP it should show next (frame 2). The solution is to insert a playback delay (frame numbers below are positions in the stream):
- When the player reads and decodes frame 0 (I), it displays nothing.
- When the player reads and decodes frame 1 (P), it displays frame 0.
- When the player reads and decodes frame 2 (B), it displays frame 2.
- When the player reads and decodes frame 3 (B), it displays frame 3.
- When the player reads and decodes frame 4 (P), it displays frame 1.
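Here is a small simulation of the reordering and the one-frame delay. It's a toy model, not a real decoder: the labels carry the display position (so "P3" is the P-VOP shown fourth, even though it is second in the stream), and both function names are made up for this sketch.

```python
def to_stream_order(types):
    """Reorder display-order VOPs so each I/P precedes the B-VOPs that need it."""
    stream, pending_b = [], []
    for i, t in enumerate(types):
        label = f"{t}{i}"  # type plus display position, e.g. "P3"
        if t == "B":
            pending_b.append(label)  # must wait for its future reference
        else:
            stream.append(label)
            stream.extend(pending_b)
            pending_b = []
    stream.extend(pending_b)  # trailing B-VOPs; not expected in this example
    return stream


def play_with_delay(stream):
    """Return (read, shown) pairs: one per frame read, plus a final flush."""
    held = None  # the I/P-VOP decoded but not yet displayed
    log = []
    for vop in stream:
        if vop[0] == "B":
            log.append((vop, vop))   # B-VOPs are shown as soon as they are decoded
        else:
            log.append((vop, held))  # show the held reference, hold the new one
            held = vop
    log.append((None, held))         # end of stream: show the last held frame
    return log


stream = to_stream_order("IBBPBBPBBP")
print(stream)
for read, shown in play_with_delay(stream):
    print(f"read {read}, show {shown}")
```

The first line of the log shows nothing (that's the delay), and after that the frames come out in display order.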
Packed Bitstream
A video stream that uses packed bitstream tackles this issue using a different method:
The change here is that a single frame in the stream can include 2 VOPs. Whenever a future P-VOP or I-VOP is needed, it's included within the frame that needs it. Certain frames need no VOP data, since it was already included in a previous frame. In this case an N-VOP (a "not coded" VOP) is used. This is an empty VOP that takes up very little space.
There's no need to delay the playback, but on the other hand, the player has to be capable of decoding two VOPs during the time of a single frame.
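To make the layout concrete, here is a sketch of packing as I understand it from the description above. It is a simplified model under my own assumptions: the labels carry the display position, each I/P-VOP that is followed by B-VOPs shares a frame with the first of them, and an N-VOP fills the slot the reference would otherwise have occupied. The function names are hypothetical, and real files may differ in details.

```python
STREAM = ["I0", "P3", "B1", "B2", "P6", "B4", "B5", "P9", "B7", "B8"]


def pack(stream):
    """Group stream-order VOPs into container frames, packed-bitstream style."""
    frames = []
    owed_nvop = False  # True while a packed reference still needs its placeholder
    i = 0
    while i < len(stream):
        vop = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if vop[0] == "B":
            frames.append([vop])
            i += 1
            continue
        if owed_nvop:
            frames.append(["N"])  # slot where the earlier reference gets shown
            owed_nvop = False
        if nxt is not None and nxt[0] == "B":
            frames.append([vop, nxt])  # two VOPs in a single frame
            owed_nvop = True
            i += 2
        else:
            frames.append([vop])
            i += 1
    if owed_nvop:
        frames.append(["N"])
    return frames


def play_packed(frames):
    """Return the VOP shown for each container frame (no playback delay)."""
    held, shown = None, []
    for frame in frames:
        for vop in frame:
            if vop == "N":
                shown.append(held)   # the reference decoded earlier
            elif vop[0] == "B" or len(frame) == 1:
                shown.append(vop)    # shown right away
            else:
                held = vop           # packed reference: decode now, show later
    return shown


packed = pack(STREAM)
print(packed)
print(play_packed(packed))
```

Every container frame produces exactly one displayed frame, so there is no delay, but the frames holding two VOPs are where the player has to decode twice as much in one frame time.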
Unpacking Packed Bitstream Files
I don't know if packed bitstream is better or not for encoders, players, developers, etc. I only know that my standalone DVD player can't handle packed bitstream, so I have to "unpack" such files. If you need this as well, I know of two options you can use:
- Use Moitah's Mpeg4 Modifier, either in its GUI or command line version.
- Use UnpackMP4, a Java conversion of Moitah's command line version. I created the Java port since I wanted to use it on Linux and didn't want to install Mono. I was also missing basic Linux behavior like being able to convert multiple files in a single call. When I ported the application, I also optimized it so that it runs faster than the version of Moitah's tool I ported from, but I have to admit that the newest Mpeg4 Modifier runs a bit faster than my port (tested only on Windows).