A Convolutional Neural Network Approach for Objective Video Quality Assessment
P. Le Callet, C. Viard-Gaudin, and D. Barba, IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1316-1327, 2006.
The quality of video can be assessed with a reduced reference (RR) approach using a CNN. Objective features are extracted on a frame-by-frame basis from both the reference and the distorted sequences. An RR method typically extracts the amount of motion and spatial detail, and the comparison with the tested video is based only on those features. RR quality assessment has two main subsystems: construction of the information to be extracted from the reference and decoded video, and comparison between the two feature sets followed by pooling. Four features are extracted from the frames:
(I) Frequency content measures (GHV and GHVP): to detect blurring artifacts and tiling distortions.
(II) Power of frame difference: to detect flicker, judder, moving blurred images, random noise and edge jitter.
(III) Blocking effect measure: to detect blocking artifacts.
Since GHV and GHVP are two distinct measures, this yields four features in total.
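As a minimal sketch of one of these features, the power of the frame difference can be computed as the mean squared difference between consecutive luminance frames; a spike in this signal suggests flicker, judder, or edge jitter. The exact normalization used in the paper may differ.

```python
import numpy as np

def frame_difference_power(frames):
    """Power of the temporal frame difference (illustrative sketch).

    frames: array of shape (T, H, W) holding luminance values.
    Returns the mean squared difference for each consecutive frame pair.
    """
    diffs = np.diff(frames.astype(np.float64), axis=0)  # (T-1, H, W)
    return (diffs ** 2).mean(axis=(1, 2))               # power per frame pair
```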
The CNN is defined along the temporal axis and is called a TDNN (Time Delay Neural Network). There are four layers in total: two layers for local feature extraction, one fully connected hidden layer, and one output layer. Features extracted from the reference and distorted frames are given as input. The original image is in YUV format and is transformed into three perceptual components: A (achromatic), Cr1 (red-green axis) and Cr2 (yellow-blue axis). The input vector therefore has size (4x3xT)x2, where T is the number of frames, with 4 features and 3 perceptual components for each of the two sequences. Local receptive fields are applied to the input to extract elementary visual distortions, which are then combined to detect higher-order features. The output layer is a single neuron, fully connected to the previous layer, that is trained to estimate the DMOS value. The TDNN is trained with standard stochastic gradient descent and backpropagation.
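The time-delay idea above can be sketched as a forward pass in which shared-weight filters slide along the temporal axis, followed by temporal pooling, a hidden layer, and a single output neuron. The layer sizes, window length, pooling choice, and sigmoid activation here are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tdnn_forward(x, conv_w, conv_b, hid_w, hid_b, out_w, out_b, window=5):
    """Minimal TDNN-style forward pass (illustrative sketch).

    x:      (C, T) feature matrix; C could be 4 features x 3 components
            x 2 sequences = 24 channels per frame, as in the paper.
    conv_w: (F, C, window) shared time-delay filters.
    """
    C, T = x.shape
    F = conv_w.shape[0]
    # Local receptive fields slid along the temporal axis (shared weights)
    feats = np.empty((F, T - window + 1))
    for t in range(T - window + 1):
        patch = x[:, t:t + window]                            # (C, window)
        feats[:, t] = sigmoid((conv_w * patch).sum(axis=(1, 2)) + conv_b)
    pooled = feats.mean(axis=1)                               # temporal pooling
    hidden = sigmoid(hid_w @ pooled + hid_b)                  # hidden layer
    return float(out_w @ hidden + out_b)                      # single neuron ~ DMOS
```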
The performance is measured using three indicators: RMSE, LCC (linear correlation coefficient) and OR (outlier ratio, the percentage of outliers).
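These three indicators can be computed as sketched below. The outlier threshold used here (errors beyond two standard deviations) is a common convention and an assumption; the paper may define outliers differently, e.g. via confidence intervals on the subjective scores.

```python
import numpy as np

def quality_metrics(predicted, subjective, outlier_k=2.0):
    """RMSE, LCC and outlier ratio between predicted and subjective scores."""
    p = np.asarray(predicted, dtype=float)
    s = np.asarray(subjective, dtype=float)
    err = p - s
    rmse = np.sqrt((err ** 2).mean())
    lcc = np.corrcoef(p, s)[0, 1]                  # Pearson linear correlation
    outliers = np.abs(err) > outlier_k * err.std() # assumed 2-sigma convention
    return rmse, lcc, outliers.mean()              # OR as a fraction
```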