Digital video is made up of individual pixels: dots that have some level of brightness, from absolute black to bright white, and that may also have some color. A picture is a rectangular array of pixels. For example, High Definition (HD) video is typically 1920 pixels wide and 1080 pixels high, and Ultra High Definition (4K) video is made up of pictures with twice the horizontal and twice the vertical pixels (3840 pixels wide by 2160 pixels high). This is often referred to as the resolution of the video (because the more pixels in a picture, the easier it is to resolve fine details), but it is more accurate to refer to the pixel dimensions as the picture size.
A digital camera has a sensor that captures raw, uncompressed video. Each pixel is made up of sub-pixel values for the 3 primary colors our eyes can see: Red, Green and Blue. These Red, Green and Blue (RGB) values are typically encoded with 8 or 10 bits each. An HD camera capturing 1920 x 1080 pixel pictures as 8-bit video produces 1920 x 1080 x 3 bytes = 6,220,800 bytes per picture. If the HD camera is capturing 30 pictures (frames) per second, it is generating 186,624,000 bytes per second. If the camera is shooting 10-bit video, and those 10-bit values are stored in 2 bytes each (which is typical when stored on a computer), it is generating 373,248,000 bytes per second, or roughly 373 Megabytes per second. A 4K camera, with 4 times the pixels, generates 1,492,992,000 bytes per second (roughly 1.5 Gigabytes per second of raw video data). Obviously, this volume of data is far too great to store and transmit cost effectively, so we need to find a way to compress that raw video to a much lower bit rate.
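As a quick check of that arithmetic, a few lines of Python reproduce the numbers above (the function is purely illustrative):

```python
# Sanity check of the raw data-rate arithmetic above. The picture sizes,
# bit depths, and frame rate are the ones used in the text.

def raw_bytes_per_second(width, height, frame_rate, bytes_per_sample):
    # 3 samples (R, G, B) per pixel, each stored in bytes_per_sample bytes
    bytes_per_picture = width * height * 3 * bytes_per_sample
    return bytes_per_picture * frame_rate

# HD, 8-bit RGB (1 byte per sample), 30 frames per second
print(raw_bytes_per_second(1920, 1080, 30, 1))  # 186,624,000 bytes/second
# HD, 10-bit RGB stored as 2 bytes per sample
print(raw_bytes_per_second(1920, 1080, 30, 2))  # 373,248,000 bytes/second (~373 MB/s)
# 4K (UHD), 10-bit RGB stored as 2 bytes per sample
print(raw_bytes_per_second(3840, 2160, 30, 2))  # 1,492,992,000 bytes/second (~1.5 GB/s)
```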
The first step in the process of storing or transmitting video is usually to convert the Red, Green and Blue values to a different, but equivalent, representation known as YUV. This conversion is mathematically lossless: RGB can be converted to YUV (or vice versa) with no loss of fidelity. The symbol Y is used for the luma channel, and the symbols U and V are used for two color-difference (chroma) channels. The Y (luma) channel is a black and white version of the video. The U and V channels store the color information, and they are used by the video display device to reconstruct the RGB version of each picture. We can also refer to Y, U and V as planes, since they are really just layers of information that are composited together to create RGB color video.
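To make this concrete, here is a minimal sketch of the conversion for a single pixel, assuming the BT.709 coefficients and full-range floating-point values (the text does not specify which conversion matrix is used; real pipelines also quantize to 8- or 10-bit integers, which introduces small rounding errors):

```python
# Sketch only: BT.709 luma/chroma coefficients (an assumption), full range,
# floating point. R, G and B are expected in the range 0.0 to 1.0.

def rgb_to_yuv(r, g, b):
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b  # luma: weighted sum of R, G, B
    u = (b - y) / 1.8556                      # U (Cb): scaled blue difference
    v = (r - y) / 1.5748                      # V (Cr): scaled red difference
    return y, u, v

def yuv_to_rgb(y, u, v):
    r = y + 1.5748 * v
    b = y + 1.8556 * u
    g = (y - 0.2126 * r - 0.0722 * b) / 0.7152
    return r, g, b

# Round-tripping a pixel recovers the original values (up to floating-point error).
print(yuv_to_rgb(*rgb_to_yuv(0.25, 0.5, 0.75)))
```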
Typically, the next step is chroma subsampling. Human eyes have two kinds of light-sensing cells (photoreceptors): rods and cones. We have about 75 to 150 million rods, which are used mainly in dim light. Rods cannot detect color; they are monochromatic. We have roughly 7 million cones, which do detect color. Since we have far more rods than cones, humans have much higher visual acuity in black and white than we do in color. Because of this, the color information can be encoded at lower resolution without a noticeable decrease in overall fidelity, so most of the video you see has been chroma subsampled. This is typically done by combining each square of 4 adjacent chroma samples into a single sample, halving the chroma planes both vertically and horizontally. This scheme of reducing the sample count of the chroma channels by a factor of 4 is labeled 4:2:0 chroma subsampling (as opposed to video that has not been chroma subsampled, which is labeled 4:4:4). After RGB to YUV conversion and chroma subsampling, the video is known as YUV 4:2:0. Because we reduced the sample count of the U and V planes by a factor of 4, we have 1 + 1/4 + 1/4 = 1.5 times the information of the Y channel, where we used to have 3 times. So, 4:2:0 chroma subsampling reduces the data rate of the raw video by a factor of 2. While chroma subsampling is a “lossy” process, for natural video (scenes shot with video cameras) the conversion from 4:4:4 to 4:2:0 is generally imperceptible to the human eye. For computer-generated graphics, or video with text or graphics overlaid, chroma subsampling will reduce the fidelity of the graphics or text.
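As an illustration, the 4-to-1 reduction described above can be sketched in a few lines of Python with NumPy (a sketch only; production converters use proper resampling filters and account for chroma sample positioning):

```python
import numpy as np

def subsample_chroma_420(plane):
    # Average each 2x2 block of a chroma plane, halving it both vertically
    # and horizontally. Assumes the plane has even width and height.
    h, w = plane.shape
    blocks = plane.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

# A full-resolution U (or V) plane for a 1920 x 1080 picture...
u_full = np.random.rand(1080, 1920)
u_sub = subsample_chroma_420(u_full)
print(u_sub.shape)  # (540, 960): one quarter as many chroma samples
```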
When we speak of “video encoding” or “video compression”, we are usually referring to the conversion of raw, uncompressed YUV video into a file or bitstream that complies with a video coding standard such as MPEG-2, MPEG-4 Part 10 (commonly known as Advanced Video Coding, AVC, or H.264), or MPEG-H Part 2 (commonly known as High Efficiency Video Coding, HEVC, or H.265). These standards define the syntax of the compressed bitstream and the method for decoding it. They do not actually define how to encode the video, which leaves developers of video encoders free to invent novel encoding methods. That said, the structure of the compressed bitstream and the decoding method imply quite a bit about how to go about encoding the video.