Creative Digital Media Design Course Note
Made by Mike_Zhang
Notice
PERSONAL COURSE NOTE, FOR REFERENCE ONLY
Personal course note of COMP3422 Creative Digital Media Design, The Hong Kong Polytechnic University, Sem2, 2023/24.
Mainly focuses on Basics of Image, Basics of Video, Basics of Audio, Lossless Compression Algorithms, Lossy Compression Algorithms, Image Compression Techniques, Video Compression Techniques, Audio Compression Techniques, Virtual Reality and Augmented Reality, Image and Video Retrieval, and AI-generated Content.
1. Introduction to Multimedia
1.1 Human communication
1.2 What is multimedia?
1.3 What are multimedia technologies?
1.4 History
2. Basics of Image
2.1 Color Science
2.1.1 What is color?
Visible light spans wavelengths of roughly 400 nm to 700 nm.
- In 1666, Newton discovered that a beam of sunlight passing through a prism breaks into a spectrum of colors.
- No color in the spectrum ends abruptly.
- Continuous, not discrete.
2.2 Color Models in Images
2.2.1 RGB-CMY-HSI
A color model is a coordinate system in which each color is represented by a single point.
Practical reason: hardware oriented.
RGB
- Red, green, blue
- Each color appears in its primary spectral components of R,G,B.
CMY
- Cyan, magenta, yellow
- Each color appears as a combination of the secondary colors of light: cyan, magenta, and yellow.
HSI
- Hue, saturation, intensity
- Each color is specified by its brightness (I) and color information (H, S). It is close to human visual system.
Each model is oriented toward a hardware or application.
RGB
- Red, green, blue
- Color monitors, cameras, color image processing
CMY
- Cyan, magenta, yellow
- Color printing
HSI
- Hue, saturation, intensity
- Broadly used in color image processing
2.2.2 Primary Colors
- About 6~7 million cones in the human eye can be divided into three principal sensing categories: red, green and blue.
- For human eyes, colors are perceived as variable combinations of primary colors R, G and B.
2.2.3 Secondary colors
- Magenta: (red plus blue, opposite to green)
- Cyan: (green plus blue, opposite to red)
- Yellow: (red plus green, opposite to blue)
2.2.4 RGB / CMY conversion
- Cyan, magenta and yellow are the secondary colors of light, or the primary colors of pigments.
- Most devices that deposit colored pigments on paper (color printers and copiers) require CMY data input.
- Pure cyan does not reflect red; pure magenta does not reflect green; pure yellow does not reflect blue.
2.2.5 HSI
- Intensity: the most useful descriptor of gray images, distinguishes gray levels.
- Hue is associated with the dominant wavelength (color) of the pure color observed in a mixture of light waves. It represents the dominant color perceived by an observer.
- Saturation: relative purity or the amount of white light mixed with a hue.
2.2.6 RGB / HSI conversion
- Hue: color, frequency of the light
- Saturation: variance
- Intensity: amplitude
2.3 Digital Images
2.3.1 Image acquisition
- Image can be generated by energy of the illumination source reflected (natural scenes) or transmitted through objects (e.g., X-ray).
- A sensor detects the energy and converts it to electrical signals.
- Sensor should have a material that is responsive to the energy being detected.
2.3.2 What is a digital image?
- A two-dimensional function f(x, y) where (x, y) are spatial coordinates and the amplitude f is called the intensity or gray level.
- When x, y and f are discrete quantities, the image is called a digital image.
2.3.3 Different image modalities
Electromagnetic Spectrum
Imaging in visible band
X-ray imaging
- Used extensively in medical imaging and industry.
Imaging in radio band
- Magnetic Resonance Imaging (MRI): radio waves are passed through the patient's body in short pulses.
Acoustic Imaging
- Ultrasound system transmits high-frequency (1 to 5 MHz) sound pulses into the body.
- Seismic images
Fluorescence Microscopy
- Operates in the ultraviolet band.
Computer generated Images
- Fractals are examples of computer-generated images.
- 3-D modeling is another example.
2.4 1-8-24-bits Images
2.4.1 Sampling and quantization
Computer processing: image f(x,y) must be digitized both spatially and in amplitude.
- Sampling >> digitization in spatial coordinates.
- Quantization >> digitization in amplitude.
An image will be represented as a matrix $[f(i,j)]_{N\times M}$ with $i=1:N;\ j=1:M;\ f\in [0,G-1]$
- Normally, we set $G=2^k$.
The number of bits to store the image: $N\times M \times k$.
- The larger $N, M$ and $G=2^k$, the better approximation of an image, the larger storage and processing required.
- $N\times M$ is called spatial resolution.
- $G$ is called gray-level resolution, $k$ is the bit depth.
2.4.2 Sampling and quantization
Reducing Spatial Resolution
Reducing Gray Level Resolution
2.4.3 Bit depth
Bit Depth (Number of bits used to define each pixel)
- Higher order bit planes of an image carry a significant amount of visually relevant details.
- Lower order bit planes add fine (often imperceptible) details to the images.
2.4.4 1-bit Image
- Each pixel is stored as a single bit (0 or 1), so it is also referred to as a binary image.
- Such an image is also called a black-and-white image, because black is represented by 0 and white by 1.
2.4.5 8-bit image
- Each pixel has a grey-value between 0 and 255.
- Each pixel can be represented by a single byte, i.e. 8 bits.
- 8-bit image can be thought of as a set of bit-planes, where each plane consists of a 1-bit representation of the image at higher and higher levels of “elevation”.
2.4.6 24-bit image
- To represent color images, normally we use 24 bits (3 bytes) per pixel; each byte represents a color channel.
- Many 24-bit color images are stored as 32-bit images, with the extra byte of data for each pixel used to store an alpha value representing special-effect information (e.g., transparency).
2.4.7 Image channels
- Color digital images are represented using a color model; normally we use 24 bits (3 bytes) per pixel.
- Each byte represents a color channel.
- RGB
- CMYK
- HSI
- YUV, YIQ
- …
2.4.8 Practice
A 4x4 image is given below. What is the 7-th bit plane of this image?
17 64 128 128
15 63 132 133
11 60 142 140
11 60 142 138
AND each pixel with $2^7 = 128$ (binary 10000000):
0 0 1 1
0 0 1 1
0 0 1 1
0 0 1 1
3. Basics of Video
3.1 Color Models in Videos
3.1.1 Color for TV
Color models in digital video are largely derived from older analog methods of coding color for TV.
NTSC
- defined by the US National Television Systems Committee;
- US, Japan, South America, …
PAL
- Phase Alternating Line;
- Asia, Western Europe, New Zealand, …
SECAM
- Sequential Color with Memory (Séquentiel couleur à mémoire);
- France, Eastern Europe, Soviet Union, …
3.2 Sampling and Quantization of Motion
Temporal
- How fast the pictures/frames are captured/played back;
Spatial
- How large the pictures/frames are captured/played back;
3.2.1 Frame rate
Frame Rate (frames per second/fps):
- higher sampling rate: higher frame rate;
- higher sampling rate: finer motion captured;
Video Type | Frame Rate (fps) |
---|---|
NTSC (black-and-white) | 30 |
NTSC (color) | 29.97 |
PAL | 25 |
SECAM | 25 |
3.2.2 Frame size
Frame Size (pixel dimensions/resolution):
- larger frame size: higher resolution;
- larger frame size: more spatial information captured;
Video Type | Frame Size |
---|---|
NTSC (standard definition) | 720 x 480 pixels |
NTSC (high definition) | 1280 x 720 pixels or 1440 x 1080 pixels |
PAL | 720 x 576 pixels |
SECAM | 720 x 576 pixels |
3.2.3 Frame aspect ratio
Frame Aspect Ratio:
- the ratio of a frame’s viewing width to height;
- NOT equivalent to ratio of the frame’s pixel width to height;
- Pixels were often rectangular (non-square) in older video standards;
3.2.4 Frame size & frame aspect ratio
Comparison between Standard Definition and High Definition
3.3 Common Video File Types
File Type | Acronym for | Created by | Platforms |
---|---|---|---|
.mov | QuickTime movie | Apple | Apple QuickTime player; |
.avi | Audio Video Interleave | Microsoft | Windows; Apple QuickTime player; |
.mpg .mpeg | MPEG | Moving Picture Experts Group | Cross-platform |
.mp4 | MPEG-4 | Moving Picture Experts Group | Web browsers; Safari/IE; |
.flv | Flash Video | Adobe | Cross-platform; Adobe Media player; |
.f4v | Flash Video | Adobe | A newer format than .flv; |
.wmv | Windows Media | Microsoft | Windows Media player; |
Choosing considerations:
File size restriction
- for web: high compression;
- for CD/DVD playback: use appropriate frame rate;
Intended audience
- multiple platforms: Apple QuickTime, MPEG, etc.;
- target audience;
Future editing
- lower compression level;
- uncompressed given sufficient disk space;
- strike a balance between file size and quality;
3.4 3D Video
3.4.1 Monocular Cues
Shading
- refers to the depiction of depth perception by varying the level of darkness;
- tries to approximate local behavior of light on the object’s surface;
- Not to be confused with shadows.
Haze
- due to light scattering by the atmosphere, distant objects show lower contrast and lower color saturation;
Perspective Scaling
- parallel lines appear to converge with distance, meeting at infinity;
Texture Gradient
- the appearance of a texture changes with viewing distance;
- the texture gradient is a local feature of the image.
3.4.2 Binocular Cues
Binocular Vision
- stereo vision, stereopsis;
- the left and right eyes have slightly different views;
- objects are shifted horizontally; the amount of shift is called disparity;
- the disparity is dependent on the object’s distance from eyes, i.e., depth;
4. Basics of Audio
4.1 What is Sound
- A sound is a form of energy that is produced by vibrating bodies.
- It requires a medium for propagation, such as air.
- The vibrating air then causes the human eardrum to vibrate, which the brain interprets as sound.
- Sound is a mechanical wave.
Frequency and amplitude are two key properties of sound.
- Frequency $f$ is defined as the number of oscillations per second.
- If the period is $T$, then $f = \frac{1}{T}$.
- Amplitude $A$ of a sound wave is the height of the wave.
- Sampling is discretizing the analog signal in the time dimension.
- Quantization is discretizing the analog signal in the amplitude dimension.
- Typical sampling rates are from 8kHz (8000 samples per second) to 48kHz; (Human voice is around 4kHz)
- Typical uniform quantization rates are 8-bit (256 levels) and 16-bit (65536 levels);
4.2 Sampling of Audio
Signals can be decomposed into a sum of sinusoids.
- Fourier transform
- In practice, we usually consider band-limited signals.
- (i.e., there is a lower frequency limit $f_1$ and an upper frequency limit $f_2$ in the signal).
- The difference $\Delta f = f_2 - f_1$ is called the bandwidth.
Aliasing occurs when a system is measured at an insufficient sampling rate.
- the machine can only see the discrete sample points, not the continuous signal.
4.2.1 Nyquist Theorem
The sampling frequency should be at least twice the highest frequency contained in the signal: $f \geq 2 f_m$.
- sampling frequency: the number of samples per second;
- $f$ is the sampling frequency and $f_m$ is the maximum frequency in the signal;
- The minimum required sampling rate $f_S = 2f_m$ is called the Nyquist rate.
- ==But sampling at exactly the minimum required rate does NOT guarantee that the original signal can be reconstructed from the samples==.
- Larger is not always better; the purpose is to find the most appropriate sampling rate.
- E.g., $f=6$ may already be good, while $f=6.1$ can give a bad result.
4.3 Quantization of Audio
- In the magnitude direction, we select breakpoints and remap any value within an interval to the nearest level.
- N-bit quantization results in $2^N$ levels. (N-bit: quantization bit depth)
==Non-uniform (nonlinear) quantization:==
- ==Pro: makes full use of the bits and records more information==
- ==Con: prior knowledge is required to determine the bit depth.==
4.3.1 Signal-to-Noise Ratio (SNR)
SNR is defined as the ratio of the signal power $P_{signal}$ to the noise power $P_{noise}$; it is a measure of the quality of the signal. The power in a signal is proportional to the square of its amplitude (the voltage after recording), so we have
$SNR = 10\log_{10}\frac{P_{signal}}{P_{noise}} = 20\log_{10}\frac{A_{signal}}{A_{noise}} \ \text{(in dB)}$
- signal power: the power of the original signal;
- noise power: quantization introduces noise; the noise is the difference between the original signal and the quantized signal.
- SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. Thus, in units of dB, SNR value is defined in terms of base-10 logarithm of squared amplitude.
- ==SNR is for the general purpose: a measure of the quality of a signal.==
- ==SQNR is for quantization: a measure of the quality of the quantization.==
- In practice, the actual (noise-free) signal value is unknown.
4.3.2 Signal-to-Quantization-Noise Ratio (SQNR)
- Suppose the original signal is X=[1.1, 2.2, 3.4, 3.8]. After uniform quantization, it is quantized into Y=[1, 2, 3, 4]. What is the SQNR of Y?
- The quantization noise is N=X-Y=[0.1, 0.2, 0.4, -0.2].
4.3.3 Peak SQNR (PSQNR)
- If we choose a quantization accuracy of $N$ bits per sample, one bit is used to indicate the sign of the sample. Then the maximum absolute signal value (amplitude) is mapped to $2^{N-1}-1 (\approx 2^{N-1})$, and the PSQNR can be simply expressed as follows.
- ==For a quantized value $v$, the actual value may lie anywhere in $[v-\frac{1}{2}, v+\frac{1}{2}]$, therefore the maximum quantization noise is $\frac{1}{2}$.==
- ==And $2^{N-1}-1 (\approx 2^{N-1})$ is the maximum signal value.==
==The more bits used to record the sound, the higher the quality of the sound.==
We can see that for a uniformly quantized source, adding 1 bit/sample improves the SNR by 6 dB. This is called the 6 dB rule.
5. Lossless Compression Algorithms
5.1 Background
000101 : sdafjlasdjflkasdjfljsdalfjsdaklfsdlkflkjsdf
0001 : eoweielrjweqjrqweroweroiewjrojweldsjf
100101 : 222esfdsf23432423423423423423423jlfd
…
01010 : e2343242jkfjkfdsj230423423skdlsjfasdkf
Each binary string on the left is a codeword.
- fixed-length codes (FLC):
- Run-length coding , LZW , etc..
- variable-length codes (VLC):
- Huffman coding , Arithmetic coding, etc..
5.2 Run-length Compression
- Original data: AAAAABBBAAAAAAAABBBB (20 Bytes)
- Run-length coding result: 5A3B8A4B (8 Bytes)
- Compression ratio: 20/8 = 2.5
Example #1: AAAAAAAA ~ 8A
- Compression ratio: 8/2 = 4
- ==Positive effect==
Example #2: ABABABAB ~ 1A1B1A1B1A1B1A1B
- Compression ratio: 8/16 = 0.5
- ==Negative effect==
5.3 LZW Compression
- Originally created by Lempel and Ziv in 1978 (LZ78); refined by Welch in 1984;
- Uses fixed-length codewords to represent variable-length strings of symbols/characters that commonly/repeatedly occur together;
- Dynamically build up the same dictionary in compression and decompression stages;
- LZW does not have a picture of the entire data, only a local view.
Input data: bananabandana
- Step 1: Build an initial dictionary.
- Find all the unique characters in the input, using the key as the index. In practice, we use the pre-defined dictionary.
- Step 2: Dynamically increase the dictionary and assign codes for incoming strings.
- Use the ==(shorter) index== to represent the ==(longer) repeated strings==.
- If the char(s) are in the dictionary, move to the next char (these char(s) will be processed ==together== with the next char).
- Else:
  - Step 2-1: Emit the code for the ==seen string== (append it to the final result).
  - Step 2-2: ==Add== the ==unseen string== to the dictionary.
If in the dictionary:
- Do nothing; move to the ==next character together== with the current one.
If not in the dictionary:
- Do step 2-1 and step 2-2.
- The ==full dictionary is not required== for decompression.
- (An identical dictionary is created during decompression.)
- Both encoding and decoding programs ==must start with the same initial dictionary==.
- (Usually the existing 256 ASCII table)
5.4 Huffman Compression
5.4.1 Entropy
The entropy of a source $S = \{s \mid s \in \{v_1, v_2, \dots, v_n\}\}$ is
$H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i$
- where $p_i$ is the probability that discrete symbol $v_i$ occurs in $S$.
- Does the order of symbols matter? NO
- Do the actual values of symbols matter? NO
- Entropy represents the average amount of information contained per symbol in $S$.
5.4.2 Huffman Coding
S = aaabbccdcacddaaccadddddaccccc…ddaaaccc
S = {a/20, b/16, c/6, d/10, e/35}
How to assign a unique code for each symbol?
- Step 1: pick two letters $x, y$ with the smallest frequencies, create a subtree with these two letters, label the root of this subtree as $z$.
- Step 2: set frequency $f(z) = f(x) + f(y)$, remove $x, y$ and add $z$ to $S$.
- Repeat this procedure, until there is only one letter left.
- The resulting tree is the Huffman code.
average code length = $(20\times 2 + 16\times 3 + 6\times 4 + 10\times 4 + 35\times 1)/87 = 187/87 \approx 2.15$ bits/symbol
- Proved optimal for a given data model; (i.e., a given data probability distribution)
- The two least frequent symbols will have the same length for their Huffman codes, differing only at the last bit.
- Symbols that occur more frequently will have shorter Huffman codes than symbols that occur less frequently.
- Variable-length codes
- ==There will be no ambiguity in the decoding process.==
- ==Since all codes sit at leaf nodes, no code is the prefix of another code.==
6. Lossy Compression Algorithms
6.1 Background
- In lossless compression, compression ratio is low;
- (RLC, LZW, Huffman coding)
- For a higher compression ratio , lossy compression is needed;
- In lossy compression, compressed data are perceptual approximation to the original data;
6.1.1 Distortion Measures
Original data: $[x_1, x_2, …, x_N]$
Decompressed data: $[y_1, y_2, …, y_N]$
Three common measures:
==Mean Squared Error (MSE)==: $MSE = \sigma_d^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - y_n)^2$
==Signal-to-Noise Ratio (SNR)==: $SNR = 10\log_{10}\frac{\sigma_x^2}{\sigma_d^2}$
- where $\sigma_x^2 = \frac{1}{N} \sum_{n=1}^{N} x_n^2$ is the power of the original signal;
- $\sigma_d^2$ is the power of the noise;
==Peak-Signal-to-Noise Ratio (PSNR)==: $PSNR = 10\log_{10}\frac{x_{peak}^2}{\sigma_d^2}$
6.1.2 Rate-Distortion Theory
- Lossy compression always involves a ==tradeoff between rate and distortion==.
- ==Rate== is the average number of bits required to represent each source symbol.
- ==$D$== is a tolerable amount of distortion.
- ==$R(D)$== specifies the lowest rate at which the source data can be encoded.
6.2 1D Discrete Cosine Transform (DCT)
6.2.1 Transform Coding
==If most information can be described by few components in transformed domain, the remaining components can be coarsely quantized or set as 0 (removed).==
- Keep the most informative part of the signal, and remove the less informative part.
- ==Transform the signal into a new domain, then decide what to keep and what to remove==.
6.2.2 General Framework
6.2.3 Forward DCT
$F(u) = \sqrt{\frac{2}{M}}\, C(u) \sum_{i=0}^{M-1} \cos\frac{(2i+1)u\pi}{2M}\, f(i)$
- where $i,u=0,1,2,…,M-1$;
- $C(u) = \begin{cases} \frac{\sqrt{2}}{2} & \text{if } u=0 \\ 1 & \text{otherwise} \end{cases}$
6.2.4 Inverse DCT
$\tilde{f}(i) = \sum_{u=0}^{M-1} \sqrt{\frac{2}{M}}\, C(u) \cos\frac{(2i+1)u\pi}{2M}\, F(u)$
- where $i,u=0,1,2,…,M-1$;
- $C(u) = \begin{cases} \frac{\sqrt{2}}{2} & \text{if } u=0 \\ 1 & \text{otherwise} \end{cases}$
6.2.5 DCT & Inverse DCT Demo
6.2.6 Forward DCT
The ==dot product== of two vectors measures the ==similarity== of the two vectors.
- How similar the pattern $w$ and the original signal $f$ are.
- High value: the original signal is more like the pattern.
- Low value: the original signal is less like the pattern.
==The larger the dot product, the more similar the original signal is to the corresponding coefficient pattern;==
- DCT is a linear transformation.
- Each transformed symbol is a linear combination of the original signal.
- Each transformed symbol indicates how likely the ==original signal== has the ==corresponding coefficient pattern==.
- ==The transformed value indicates whether the original signal contains the corresponding pattern.==
- F(1) indicates the low frequency of original signal f; F(9) indicates the high frequency of signal f.
Inverse DCT
- Inverse DCT is also a linear transformation.
- Each inverse transformed symbol is a linear combination of the signal F.
6.2.7 DCT-based Compression
DCT itself does not compress the signal.
- The length is the same, and we can invert the DCT to recover the original signal.
Only after we select the important part of the DCT coefficients can we compress the signal.
- Low-frequency coefficients: more important, kept.
- High-frequency coefficients: less important, can be coarsely quantized or discarded.
6.3 2D Discrete Cosine Transform (DCT)
6.3.1 (Forward) 2D-DCT
$F(u,v) = \frac{2\,C(u)\,C(v)}{\sqrt{MN}} \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} \cos\frac{(2i+1)u\pi}{2M}\cos\frac{(2j+1)v\pi}{2N}\, f(i,j)$
- where $i,u=0,1,2,…,M-1$; $j,v=0,1,2,…,N-1$;
- $C(u) = \begin{cases} \frac{\sqrt{2}}{2} & \text{if } u=0 \\ 1 & \text{otherwise} \end{cases}$, and $C(v)$ is defined analogously.
2D-DCT is a linear transformation.
- Each transformed symbol is a linear combination of the original signal.
- Each transformed symbol indicates how likely the original signal has the corresponding coefficient pattern.
6.3.2 Inverse 2D-DCT
$\tilde{f}(i,j) = \sum_{u=0}^{M-1}\sum_{v=0}^{N-1} \frac{2\,C(u)\,C(v)}{\sqrt{MN}} \cos\frac{(2i+1)u\pi}{2M}\cos\frac{(2j+1)v\pi}{2N}\, F(u,v)$
- where $i,u=0,1,2,…,M-1$; $j,v=0,1,2,…,N-1$;
- $C(u) = \begin{cases} \frac{\sqrt{2}}{2} & \text{if } u=0 \\ 1 & \text{otherwise} \end{cases}$, and $C(v)$ is defined analogously.
- Inverse 2D-DCT is a linear transformation.
- Each inverse transformed symbol is a linear combination of the signal F.
6.3.3 Basis Functions
- There are $M\times N$ basis functions;
- Each function is an $M \times N$ spatial frequency image;
- Low frequencies should be preserved;
- High frequencies can be compressed;
7. Image Compression Techniques
7.1 Background
JPEG, 1986
Joint Photographic Expert Group
- The Joint Photographic Experts Group (JPEG) committee has a long tradition in the creation of still image coding standards.
- The committee was created in 1986 and meets usually four times a year to create the standards for still image compression and processing.
JPEG Standard, 1992~
- JPEG is an image compression standard developed by JPEG; It was formally accepted as an international standard in 1992.
- JPEG is a lossy image compression method. It is a transform coding method using DCT.
- JPEG typically achieves ==10 : 1 compression== with ==little perceptible loss==, usually with a filename extension of “.jpg” or “.jpeg”.
7.1.1 Observation 1
- In a small area, intensity values do not vary widely.
- For example, within an 8×8 block, pixel information may be repeated. (“==spatial redundancy==”)
7.1.2 Observation 2
- Psychophysical experiments show that humans are less sensitive to the loss of ==high spatial frequency information==.
7.1.3 Observation 3
- Visual acuity is much greater for ==grey== than for color. (more accurate in distinguishing closely spaced lines)
7.1.4 YCbCr Color Model
- YCbCr model is used in JPEG image compression.
- Normalize R, G, B to [0, 1]; then:
  - $Y = 0.299R + 0.587G + 0.114B$
  - $Cb = \frac{B−Y}{1.772}+0.5, \quad Cr = \frac{R−Y}{1.402}+0.5$
- YCbCr in [0, 255] is obtained via the transform:
  - $\begin{bmatrix} Y \\ Cb \\ Cr \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix}$
7.2 Chroma Subsampling
Subsampling is done ==only on the chroma==, not on the gray (luma).
In the $j:a:b$ notation:
- $j$: width of the sampling patch in pixels (usually 4, taken over two rows);
- $a$: the number of chroma samples kept in the first row of $j$ pixels;
- $b$: the number of chroma samples kept in the second row.
- 4:4:4 $\to$ no subsampling, no compression;
  - 8 bits $\times$ 8 (Y) + (8+8) bits $\times$ 8 (Cb, Cr) = 24 $\times$ 8 = 192 bits
- 4:2:2 $\to$ compression: keep only 2 chroma samples in each of the first and second rows (==just copy the missing pixels==); only the retained samples need to be stored;
  - 8 bits $\times$ 8 + (8+8) bits $\times$ 4 = 64 + 64 = 128 bits
- 4:2:0 $\to$ compression: keep only 2 chroma samples in the first row and drop all chroma in the second row (==just copy the missing pixels==); only the retained samples need to be stored;
  - 8 bits $\times$ 8 + (8+8) bits $\times$ 2 = 64 + 32 = 96 bits
- 4:1:1 $\to$ compression: keep only 1 chroma sample in each of the first and second rows (==just copy the missing pixels==); only the retained samples need to be stored;
  - 8 bits $\times$ 8 + (8+8) bits $\times$ 2 = 64 + 32 = 96 bits
7.3 JPEG Compression
7.3.1 Compression Pipeline
- The effectiveness of JPEG compression relies on 3 major observations.
Compression Pipeline
- Splitting: ==Split== the image into 8×8 non-overlapping blocks, with ==zero-padding== if needed.
- Color Conversion from [R,G,B] to [Y, Cb, Cr], and ==Chroma Subsampling==.
- DCT: Apply ==DCT== to each 8-by-8 block.
- Quantization: ==quantize== the DCT coefficients according to psycho-visually tuned quantization tables.
- Serialization: by ==zigzag scanning== pattern to exploit redundancy.
- Vectoring: ==Differential Pulse Code Modulation (DPCM)== on DC components; ==Run-Length Encoding== (RLE, counting repeated values) on AC components.
- Entropy Coding.
Order of Splitting and Color Conversion can be changed
7.3.2 Splitting, Color Conversion, Chroma Subsampling
7.3.3 DCT
- F(0,0) is called DC coefficient. (==direct current== term, zero frequency)
- Others are called AC coefficients. (==alternating current== term, non-zero frequencies)
7.3.4 Quantization
- Obtained from psychophysical studies;
- ==Larger== values at ==lower right corner==;
- More loss at higher frequencies;
7.3.5 Zig-Zag Scan
- Maps the 8×8 matrix to a 1×64 vector.
- To group ==low frequency== coefficients at the ==top==, ==high frequency== coefficients at the ==bottom== of the vector.
- To exploit the presence of the large number of zeros in the quantized matrix.
7.3.6 Vectoring
- Differential Pulse Code Modulation (DPCM) is applied to the DC coefficients.
- Why? DC component is ==large and varies==, but often close to previous value.
- Run length encoding (RLE) is applied to the AC coefficients.
- Why? The remaining 1×63 vector (AC) has ==lots of zeros== in it.
7.3.7 Huffman Coding
- DC/AC coefficients finally undergo Huffman coding for further compression.
- Why? Further compression by ==replacing long strings== of binary digits by ==code words==.
8. Video Compression Techniques
8.1 Background
- A video consists of a ==time-ordered== sequence of images (frames).
- ==Consecutive frames are similar==, i.e., ==Temporal Redundancy== exists.
8.1.1 Temporal Redundancy
- ==Not every frame needs to be coded independently as a new image==.
- ==Difference between the current frame and other frame(s) can be coded==.
- Difference between frames is caused by ==camera and/or object motion==.
- motion can be compensated!
8.1.2 Motion Compensation
- Compression algorithms based on motion compensation (MC) consist of two main steps:
  - Motion estimation (motion vector search)
  - Motion compensation-based prediction (derivation of the difference)
8.2 Block-based Motion Estimation
- Each image is ==divided into macroblocks== of size N×N. Motion estimation is performed at the macroblock level.
- (In MPEG-1, N=16 for the luminance channel and N=8 for the chrominance channels, since 4:2:0 chroma subsampling is adopted.)
- The current image frame is referred to as the Target Frame.
- A match is sought between the macroblock in the Target Frame and the most similar macroblock in a previous/future frame, referred to as the Reference Frame.
- The ==displacement== of the reference macroblock to the target macroblock is called the ==motion vector $MV(u, v)$==.
8.2.1 Block Matching
- Horizontal and vertical displacements $i$ and $j$ are limited to a small range $[−p, p]$, where $p$ is a positive integer with a relatively small value.
- This makes a ==small search window== of size ==$(2p+1)\times (2p+1)$==.
- The difference between two macroblocks is measured by their Mean Absolute Difference (MAD):
$MAD(i,j) = \frac{1}{N^2}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1} \left| C(x+k,\, y+l) - R(x+i+k,\, y+j+l) \right|$
- where $C$ is the target (current) frame, $R$ is the reference frame, and $(x,y)$ is the position of the target macroblock.
- The goal of the search is to find a vector $(i,j)$ as the motion vector $MV = (u,v)$, such that $MAD(i,j)$ is minimum.
8.2.2 Full Block Search
- Full search : sequentially ==search the whole $(2p+1)\times (2p+1)$ window== in the ==reference frame==.
- The vector (i, j) that offers the least MAD is set as the $MV(u,v)$ for the macroblock in the target frame.
Assume that each pixel comparison requires three operations: subtraction, absolute value, addition.
The cost to obtain a motion vector for a single macroblock is then $(2p+1)^2 \times N^2 \times 3$ operations:
- very costly!
8.2.3 Logarithmic Block Search
- Initially only ==nine locations== in the search window are used as seeds for a MAD-based search; and they are marked as “1”.
- After the one that yields the ==minimum MAD== is located, the centre of the ==new search region is moved to it== and the step-size is reduced to half.
- In next iteration, the nine new locations are marked as “2”, and so on.
==The complexity of 2D logarithmic search is $O(N^2 \log_2 p)$==
==What we store==:
- The ==motion vector==: the displacement of the coordinates.
- The ==difference== between the pixels of the two macroblocks; the difference is easy to compress, since it contains lots of zeros.
- Storing the motion vector plus the exact difference makes this a ==lossless== representation.
8.3 MPEG Compression Pipeline
8.3.1 Standards
8.3.2 Compression Pipeline
- GOP (Group of Pictures): Smallest independent coding unit in sequence.
8.3.3 GOP
- ==I=Intra-Picture coding (Intra-frames)==, encoded as a still image, doesn’t depend on ==reference frames==.
- ==P=Predictive Coding (Inter-frames)==, ==depends on previously displayed reference frame (I)==. Can be referenced.
- ==B=Bi-directional coding (Inter-frames)==, ==depends on previous and future reference frames==. Never referenced.
==Pictures are coded and decoded in a different order than they are displayed, due to bidirectional prediction for B pictures.==
8.3.4 MPEG Compression
==Coding of Intra-frame (I-frames)==
- Just like the JPEG compression.
==Coding of Inter-frames (P- and B-frames)==
- The ==motion vector and the difference== (prediction residual) are compressed.
9. Audio Compression Techniques
9.1 Background
- Frequency $f$ is defined as the number of oscillations per second.
- The period is $T$, then $f = \frac{1}{T}$.
- Amplitude $A$ of a sound wave is the height of the wave.
9.1.1 Digitization
- Sampling is discretizing the analog signal in the time dimension.
- Quantization is discretizing the analog signal in the amplitude dimension.
9.2 Pulse-code Modulation (PCM)
- Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. (Standard in applications such as CD audio and telephony; the earliest, best-established, and most frequently applied coding system.)
- Pulse comes from an engineer’s point of view that the resulting digital signals can be thought of as infinitely narrow vertical “pulses”.
- In PCM, the amplitude of signal is sampled ==regularly at uniform intervals (linear coding)==, and each sample is quantized to the nearest value.
- Sampling : values are taken from the analog signal every $\frac{1}{f_s}$ seconds.
- Quantization : assigns these samples a value by approximation.
- Entropy Coding : assign a codeword (binary digits) to each value. (e.g., Huffman coding)
PCM Encoding and Decoding Pipeline
- Audio is often stored not as simple PCM but in a form that ==exploits the differences $d_n = f_n - f_{n-1}$==, where $f_n$ denotes the integer input signal.
- The difference between consecutive samples is often very small.
- Smaller numbers need fewer bits to store; assign short codewords to prevalent values and long codewords to rarely occurring ones.
9.3 Lossless Predictive Coding (LPC)
- Predictive coding predicts the next sample as being the current sample (the naive approach: just use the previous sample as the prediction of the current one).
- The difference between the two samples is kept for encoding/decoding.
- It is common to use a combination of a few previous samples, $f_{n-1}, f_{n-2}, f_{n-3}$, etc., to provide a better prediction.
- usually, $m$ (the number of previous samples used) can be set between 2 and 4;
- the prediction coefficients are predefined.
==Motivation: The original signal has a strong temporal correlation==
==Ideally, the goal is to design the best prediction function (100% accuracy), such that $e_n = 0$; then we can save a lot of space and only transmit the function or the model.==
What we store:
- the prediction function;
- the error $e_n$ (which can be further compressed).
9.3.1 Example
Suppose we devise a predictor as follows:
$\hat{f}_n = \left\lfloor \frac{1}{2}(f_{n-1} + f_{n-2}) \right\rfloor, \qquad e_n = f_n - \hat{f}_n$
- The symbol $\lfloor\cdot\rfloor$ means the largest integer not greater than the argument (e.g., $\lfloor 1.8\rfloor = 1$).
- How to code the following sequence?
10. Virtual Reality and Augmented Reality
10.1 Virtual Reality
10.1.1 What is VR?
- Virtual Reality (VR) is a technology that uses software to generate images, sound, and other sensations that ==replicate a real-world environment==.
- Users can ==interact with and manipulate the virtual objects== of the virtual world with the help of specialized devices such as displays.
- VR equipment typically allows the user to ==“look around”== the artificial world.
- Virtual realities are displayed on ==computer monitors, projectors, headsets==, etc.
- A VR environment is usually captured using special 360-degree devices.
10.1.2 Brief History of VR
- 1965 – The beginnings of VR.
- 1977 – Interaction through body movement.
- 1982 – The first computer-generated movie.
- 1983 – The first virtual environment.
- 1987 – Development of immersive VR.
- 1993 – Invention of surgery rehearsal system.
- 2007 – Google introduced panoramic Street View service.
10.1.3 Applications of VR
- Movies – Virtual reality is applied in 3D movies to try and immerse the viewers into the movie and/or virtual environments.
- Video Games – Users can physically interact with a game, using their body and motions to control characters and other elements of the game.
- Education – Virtual reality enables large groups of students to interact with each other as well as within a 3D environment.
- Training – VR can provide learners with a virtual environment where they can develop their skills without the real-world consequences of failing.
- Design – VR can aid designers to setup virtual models of proposed building or product designs.
10.2 Augmented Reality
10.2.1 What is AR?
- A combination of a real scene viewed by a user and a virtual scene generated by a computer that augments the scene with additional information.
- An AR system adds virtual computer-generated objects, audio and other sense enhancements to real-world environment in real-time.
- The goal of AR is to create a system such that a user ==CANNOT tell the difference== between the real world and virtual augmentation of it.
10.2.2 Brief History of AR
- 1980 – The first wearable computer with text and graphical overlays on a mediated reality.
- 1989 – Jaron Lanier coins the phrase Virtual Reality and creates the first commercial business around virtual worlds.
- 2013 – Google announces an open beta test of Google Glass.
- 2015 – Microsoft announces HoloLens headset.
- 2016 – Niantic released Pokemon GO.
- 2022 – Facebook focuses on Metaverse.
10.2.3 Applications of AR
- Medical
- Entertainment
- Training
- Design
- Robotics
- Manufacturing
- etc.
10.2.4 Technology of AR
- Display Technology: 1) helmet displays, 2) handheld device displays, 3) other display devices, such as PCs.
- Geometry, semantics, physics and so on.
3D Registration Technology:
- 3D registration technology enables virtual images to be superimposed accurately on the real environment.
- Determine the relationships between virtual images, 3D models, and display devices.
- The virtual rendered image and model are accurately projected into the real environment.
Intelligent Interaction Technology: hardware device interactions, location interactions, tag-based or information-based interactions.
10.2.5 Latest Research
11. Image and Video Retrieval
11.1 Image Retrieval
- Tag Based Image Retrieval (TBIR)
- (Visual) Content Based Image Retrieval (CBIR)
11.1.1 Framework of CBIR
11.1.2 Main Components of CBIR
11.1.3 Feature Extraction
Color Histogram
Deep Neural Networks
- Automatically learn robust features. Avoid handcrafted features.
- Pretrain a CNN model on ImageNet dataset as a feature extractor.
- Finetune the model on new image datasets.
11.1.4 Similarity Measurement
- Histogram intersection similarity
- Bhattacharyya distance
- Chi-Square distance
- Cosine similarity
- Correlation similarity
11.1.5 Image Indexing
11.2 Video Retrieval
- Explosive growth of video data in recent years.
- Need efficient ways to retrieve video content from large-scale video databases.
- Video consists of text (captions, subtitles), audio, and images.
- Video retrieval methods include:
- Metadata based method
- Text based method
- Audio based method
- (Visual) Content based method
- Integrated approach
11.2.1 Metadata Based Method
- Video is indexed and retrieved based on structured metadata information.
- Metadata examples are the title, author, producer, date, types of video.
- Well defined and well structured.
11.2.2 Text Based Method
- Video is indexed and retrieved based on associated subtitles (text).
- Can be done automatically where subtitles / transcriptions exist.
11.2.3 Audio Based Method
Video is indexed/retrieved based on soundtracks using methods for audio indexing and retrieval.
Speech recognition is applied if necessary.
11.2.4 Visual Content Based Method
Applications:
- Brand monitoring: search YouTube using product images
- New videos: search event footage using photos
- Online education: search lectures using slides
12. AI-generated Content
12.1 Background
Artificial Intelligence-generated Content (AIGC) uses artificial intelligence to assist or replace manual content generation by generating content based on user-inputted keywords or requirements.
- Text Generation
- Image Generation
- Video Generation
- Audio Generation
12.2 Text Generation
12.2.1 GPT-2
12.2.2 GPT-3
12.3 Image Generation
12.4 Video Generation
12.5 Audio Generation
References
Slides of COMP3422 Creative Digital Media Design, The Hong Kong Polytechnic University.
PERSONAL COURSE NOTE, FOR REFERENCE ONLY
Made by Mike_Zhang