
Convolutional neural networks, which have achieved outstanding performance in image recognition, have been extensively applied to action recognition. Mainstream approaches to video understanding can be categorized into two-dimensional (2D) and three-dimensional (3D) convolutional neural networks. Although 3D convolutional filters can learn the temporal correlation between different frames by extracting the features of multiple frames simultaneously, they incur an explosive number of parameters and a high calculation cost. Methods based on 2D convolutional neural networks use fewer parameters; they often incorporate optical flow to compensate for their inability to learn temporal relationships. However, calculating the corresponding optical flow results in additional calculation cost, and it necessitates another model to learn the features of the optical flow. We propose an action recognition framework based on a 2D convolutional neural network; therefore, it was necessary to compensate for the missing temporal relationships. To expand the temporal receptive field, we propose a multi-scale temporal shift module, which is combined with a temporal feature difference extraction module to extract the differences between the features of different frames. Finally, the model is compressed to make it more compact. We evaluated our method on two major action recognition benchmarks: the HMDB51 and UCF-101 datasets. Before compression, the proposed method achieved an accuracy of 72.83% on the HMDB51 dataset and 96.25% on the UCF-101 dataset. After compression, the accuracy remained competitive, at 72.19% and 95.57%, respectively. The final model is more compact than those of most related works.

Human action recognition has recently become an increasingly important research topic in the field of computer vision, and with the development of technology it now has wide applications. Deep ConvNets, such as Inception-V1 [

In contrast to image recognition, video understanding requires learning the relevance of frames; therefore, the disadvantage of the 2D CNN method is its relatively limited performance when only RGB images are used for recognition. To improve their accuracy, most mainstream 2D CNN approaches, such as two-stream [

In this study, we aimed to further improve the performance of the traditional 2D CNN architecture for action recognition by expanding the temporal receptive fields. Although TSM [

Recently, many approaches to video understanding and action recognition have been proposed. We discuss some mainstream works in this section. Compared with the traditional methods [

It is difficult to capture temporal relationships, which are crucial in video recognition, using methods based on 2D CNNs; hence, most works incorporate other streams, such as optical flow or motion vectors [

Simonyan et al. [

Wang et al. [

Lin et al. [

Carreira et al. [

Wang et al. [

T-C3D, a framework proposed by Liu et al. [

Although the compression technique reduces the model size, T-C3D still requires several frames for inference, which still results in a high FLOP count.

We combined the two proposed modules, the MSTSM and the temporal feature difference extraction module (TFDEM), on ResNet-50 [

Similar to Wang et al. [

For example, given an input video with N frames, we divided the N frames into n parts of equal length; thus, each part was composed of k = N / n frames. Then, the sampled frames from each part formed a set and were denoted as follows:

Frames = {F_1, F_2, F_3, ⋯, F_n}  (1)

where the frame number F_i for the training stage is a random number in the interval [1, k] plus (i − 1) ∗ k. For the testing stage, the frame number F_i is the median of the interval [1, k] plus (i − 1) ∗ k. In our experiments, n is 8, unless otherwise specified.
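The sampling strategy above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name is ours, and we assume 1-based frame indices and round the median of an even-length interval down:

```python
import random

def sample_frame_indices(num_frames, n_segments=8, training=True):
    """Segment-based sparse sampling: split the video into n_segments
    equal parts of k = N / n frames each, then pick one frame (1-based
    index) per part: random during training, central during testing."""
    k = num_frames // n_segments
    indices = []
    for i in range(n_segments):
        # training: random offset in [1, k]; testing: median of [1, k]
        offset = random.randint(1, k) if training else (1 + k) // 2
        indices.append(i * k + offset)
    return indices
```

For a 64-frame video with n = 8, the testing-stage indices fall at the center of each 8-frame segment: 4, 12, 20, ..., 60.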

As the name implies, the MSTSM shifts features by different scales along the temporal dimension. The main purpose of shifting the feature maps bi-directionally is to merge the features of different frames when performing convolution. In the following, the subscript i denotes the frame index; for brevity, we exclude the batch size, height, and width dimensions of the feature maps. Next, we describe the benefits of the two temporal shift blocks with different shift units.

1) One-unit Temporal Shift Block: In the case of shifting one unit, features are shifted from the frames before and after the current frame; this is also the relationship that needs to be learned most in action recognition. We selected the number of input channels with a higher ratio to be shifted. The intuitive idea was to replace the original feature maps with the shifted ones, as shown on the left side of

2) Two-unit Temporal Shift Block: After one-unit temporal shifting, to further increase the model’s temporal receptive fields, we shifted the features by two units. Thus, we shifted the features of the frame before the previous frame and after the next frame to the current frame. Hence, as illustrated in

To circumvent the aforementioned risk, we concatenated the shifted features with the original identical feature maps, as shown on the right side of
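The bi-directional shifting and concatenation described above can be sketched as follows. This is a simplified standalone sketch rather than the in-network module: the function name, the default ratios (taken from the shift-ratio ablation, R_one = 1/8 and R_two = 1/16), and the channel ordering of the concatenation are our assumptions:

```python
import numpy as np

def multi_scale_temporal_shift(x, r_one=8, r_two=16):
    """x: feature maps of shape (n_frames, channels, h, w).
    Shifts 1/r_one of the channels by one frame and 1/r_two of the
    channels by two frames, in both temporal directions, then
    concatenates the shifted copies with the original feature maps
    so that the original information is preserved."""
    t, c, h, w = x.shape
    outputs = [x]
    for units, ratio in ((1, r_one), (2, r_two)):
        fold = c // ratio  # number of channels shifted at this scale
        fwd = np.zeros((t, fold, h, w), dtype=x.dtype)
        bwd = np.zeros((t, fold, h, w), dtype=x.dtype)
        fwd[units:] = x[:-units, :fold]   # features from earlier frames
        bwd[:-units] = x[units:, :fold]   # features from later frames
        outputs += [fwd, bwd]
    return np.concatenate(outputs, axis=1)
```

Because the shifted copies are concatenated rather than written in place, the channel count grows by 2·(c/r_one) + 2·(c/r_two), while the original features remain untouched.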

In several cases, the difference between the frames was subtle even though we adopted the sparse sampling strategy. Another problem was that the movements that constituted some actions differed only slightly. Thus, the proposed TFDEM module was designed to address these problems.
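The core operation the TFDEM builds on can be sketched as a frame-to-frame feature subtraction. This is only the central idea, not the full module (which also has its own convolutions and classification path, as described below); the function name is ours:

```python
import numpy as np

def temporal_feature_difference(feats):
    """feats: (n_frames, channels, h, w) feature maps of the sampled
    frames. Subtracting consecutive frames highlights the (possibly
    subtle) changes between them, which static features may miss."""
    return feats[1:] - feats[:-1]
```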

We had two overall objective functions during the training stage: L_{main} and L_{TFDEM}. The first objective function, L_{main}, was calculated based on the definition of the cross entropy with the output probability p_{main} from the main path and the ground-truth label y_{true}. The equation can be written as follows:

arg min L_{main}(W_{main}) = arg min ( −(1/N) ∑_{i=1}^{N} ∑_{c=1}^{C} (y_{true})_{i,c} ln (p_{main})_{i,c} )  (2)

where N is the total number of training videos, W_{main} is the learnable weight of the main path, y_{true} is the ground-truth label, C is the total number of categories, and (p_{main})_{i,c}, produced by the main path, is the probability of the i-th video belonging to the c-th category.

Similarly, the equation of the second objective function, L_{TFDEM}, can be written as follows:

arg min L_{TFDEM}(W_{TFDEM}) = arg min ( −(1/N) ∑_{i=1}^{N} ∑_{c=1}^{C} (y_{true})_{i,c} ln (p_{TFDEM})_{i,c} )  (3)

where W_{TFDEM} is the learnable weight of the TFDEM path, and (p_{TFDEM})_{i,c}, produced by the TFDEM path, is the probability of the i-th video belonging to the c-th category.

L_{total} is the sum of both objective functions. Hence, the equation can be written as follows:

arg min L_{total}(W) = arg min ( L_{main}(W_{main}) + L_{TFDEM}(W_{TFDEM}) )  (4)

where W is the learnable weight of the entire model.
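Equations (2)-(4) can be written out as a short sketch. This is an illustration of the loss definitions only, with our own function names, using one-hot labels and pre-computed class probabilities:

```python
import numpy as np

def cross_entropy(p, y):
    """Mean categorical cross-entropy over N videos, as in
    Equations (2) and (3). p, y: arrays of shape (N, C);
    y is one-hot ground truth, p holds class probabilities."""
    return -np.mean(np.sum(y * np.log(p), axis=1))

def total_loss(p_main, p_tfdem, y):
    """L_total = L_main + L_TFDEM (Equation (4)); both paths are
    supervised by the same ground-truth labels."""
    return cross_entropy(p_main, y) + cross_entropy(p_tfdem, y)
```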

The kernel space of each convolutional layer contains some “similar” kernels.

Performing convolution using similar kernels will result in similar outputs. Highly similar outputs may be redundant in the model; hence, some of them can be removed with no significant effect.

1) Defining Similar Kernels: We used two steps to define “similar” kernels. First, we searched for the kernel K_{GM} with the smallest total distance to all other kernels in the kernel space; the equation can be written as follows:

td_i = ∑_{j ∈ [1, Channel_out]} ‖K_i − K_j‖_2  (5)

td_i is the sum of the distances from a kernel K_i to all other kernels in the same kernel space.

K_{GM} = arg min_{K_i} td_i  (6)

where Channel_out is the number of output channels of each target layer and K_i is the i-th kernel in the kernel space. Second, we ranked all the kernels in ascending order of their distance from K_{GM} using the following equation:

sort(d_1, d_2, ⋯, d_i, ⋯, d_{Channel_out}),

d_i = ‖K_i − K_{GM}‖_2, ∀ i ∈ [1, Channel_out]  (7)

Then, we selected kernels according to the pruning ratio.

Therefore, the “similar kernels” referred to hereafter are the kernels with a smaller distance from K_{GM}.
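The two-step selection in Equations (5)-(7) can be sketched as follows. The function name and the use of the pruning ratio to set the number of selected kernels are our assumptions:

```python
import numpy as np

def select_similar_kernels(kernels, prune_ratio):
    """kernels: array of shape (C_out, C_in, kh, kw).
    Step 1 (Eqs. (5)-(6)): find K_GM, the kernel whose total L2
    distance to all other kernels is smallest.
    Step 2 (Eq. (7)): rank kernels by distance from K_GM and return
    the indices of the C_out / prune_ratio closest ("similar") ones."""
    flat = kernels.reshape(kernels.shape[0], -1)
    # Pairwise L2 distances; td_i = sum_j ||K_i - K_j||_2   (Eq. 5)
    diff = flat[:, None, :] - flat[None, :, :]
    td = np.linalg.norm(diff, axis=2).sum(axis=1)
    gm = np.argmin(td)                                    # Eq. 6
    d = np.linalg.norm(flat - flat[gm], axis=1)           # Eq. 7
    n_select = kernels.shape[0] // prune_ratio
    return np.argsort(d)[:n_select]
```

On a toy kernel space with values {0, 1, 2, 3, 100}, the outlier kernel (100) is never selected: K_GM is the kernel at 2, and the selected kernels cluster around it.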

2) Target Layers: The target layers in our method were the layers with the MSTSM because kernel similarity may result in redundant spatial or temporal features after shifting. Pruning these layers can filter out redundant spatial and temporal features, as illustrated in

3) Averaging Selected Input Feature Maps: It is known that each input feature map of layer L_i is the output feature map of L_{i−1}. Therefore, we first found C_{out}^{i−1} ∗ 1/R_{in} similar kernels in the kernel space of L_{i−1}. Then, we averaged the corresponding channels of the output feature map into one channel, where R_{in} is the prune ratio in this case and C_{out}^{i−1} is the number of output channels of L_{i−1}.
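The channel-averaging step can be sketched as below. This is a simplified per-sample illustration with our own function name; the similar-channel indices would come from the kernel-similarity selection applied to layer L_{i−1}:

```python
import numpy as np

def average_similar_channels(feat, similar_idx):
    """feat: (C, H, W) output feature map of layer L_{i-1}.
    Merges the channels produced by similar kernels (similar_idx)
    into a single averaged channel and keeps the remaining
    channels unchanged, shrinking C by len(similar_idx) - 1."""
    similar = set(similar_idx)
    keep = [c for c in range(feat.shape[0]) if c not in similar]
    merged = feat[list(similar_idx)].mean(axis=0, keepdims=True)
    return np.concatenate([feat[keep], merged], axis=0)
```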

4) Pruning Selected Kernels: To prune the kernels in layer L_i, we found and eliminated similar kernels at ratio R_{out}; therefore, both C_{out}^{i} and C_{in}^{i+1} are reduced by C_{out}^{i} ∗ 1/R_{out} channels, where C_{out}^{i} is the number of output channels of L_i and C_{in}^{i+1} is the number of input channels of L_{i+1}.

Both cases are illustrated in

In this section, we first describe our experimental environment and datasets. Then, we present the experimental results to demonstrate that the proposed modules can improve the performance. Further, the ablation studies are described to show how we determined the optimal settings. At the end of this section, we present the comparison results of the proposed method and several state-of-the-art methods.

We implemented our proposed method using the PyTorch [

We mainly evaluated our framework on the UCF-101 dataset provided by [

1) UCF-101 [

2) HMDB51 [

To improve the performance of the model and avoid overfitting, we applied data augmentation similar to [

In this section, we present the results of the proposed methods step by step. Unless otherwise specified, the results were evaluated on the UCF-101 dataset.

We show the various settings of the proposed modules and discuss their influence on different aspects. Furthermore, we also attempted to apply our proposed method to a different backbone; we obtained competitive results. We also demonstrate the possibility of incorporating the optical flow into our method to achieve greater accuracy.

Based on the results shown in

Then, we applied our pruning method to the architecture based on the MSTSM and TFDEM and denoted the result as MSTSM-TFDEM-p. After filtering out similar kernels, we preserved the useful and informative features. Based on the experimental results, we reduced the number of parameters by approximately 2M while maintaining an accuracy of 95.57%.

The influence of each module is summarized in

1) Incorporation with Optical Flow: Similar to other works, we also attempted to incorporate an optical flow stream into the proposed module to achieve higher accuracy. The resultant architecture is shown in

| Backbone | Method | Accuracy | # Params (M) | GFLOPs |
|---|---|---|---|---|
| ResNet-50 | [ | 94.93% | 23.7 | 3.8 |
| ResNet-50 | MSTSM | 95.64% | 24.2 | 3.9 |
| ResNet-50 | [ | 95.71% | 23.9 | 3.9 |
| ResNet-50 | MSTSM + TFDEM | 96.25% | 24.5 | 4.0 |
| ResNet-50 | MSTSM + TFDEM + pruning | 95.57% | 22.4 | 3.6 |
| ResNet-50 | MSTSM-TFDEM-OF | 97.98% | 49.0 | 7.9 |
| EfficientNet-B0 | MSTSM | 95.60% | 4.23 | 0.7 |
| EfficientNet-B0 | MSTSM + TFDEM | 96.08% | 4.49 | 0.7 |

2) Application to Different Backbones: In this section, we demonstrate that our proposed modules can be applied to any backbone. We take the EfficientNet-B0 [

3) Shift Ratio of Each Temporal Shift Block: Regarding the experiments, the first aspect to be discussed is the shift ratio of each temporal shift block. We experimented with various combinations of the ratio of each temporal shift block. The ratios of the one-unit temporal shift block and two-unit temporal shift block are denoted as R o n e and R t w o , respectively. As shown in

4) Temporal Receptive Fields of ConvNet: In this section, we discuss the impact of the temporal receptive field (TRF). In addition to the two-unit temporal shift block, we also experimented with other scales, as shown in

| (R_one, R_two) | (1/16, 1/8) | (1/8, 1/12) | (1/8, 1/16) | (1/8, 1/20) |
|---|---|---|---|---|
| Accuracy | 94.58% | 95.79% | 95.72% | 95.26% |
| # Params (M) | 24.79 | 24.43 | 24.25 | 24.14 |

In

5) Effect of Minimizing L_{TFDEM}: Although subtracting the feature maps of different frames can highlight the difference between them, our TFDEM can still extract features inaccurately. Hence, we also minimized the loss value of the proposed TFDEM path to update the weights in it. We present the comparison result between the performance of the TFDEM with and without minimizing the loss value of the TFDEM path in

We evaluated our framework on the UCF-101 and HMDB51 datasets and compared the performance with other modules in this section. As shown in

| Type of MSTSM | Accuracy | Temporal Receptive Fields |
|---|---|---|
| MSTSM_1 | 95.00% | 2 - 3 |
| MSTSM_2 | 95.72% | 3 - 5 |
| MSTSM_2var | 95.20% | 4 - 5 |
| MSTSM_3 | 95.16% | 4 - 7 |

| Method | Accuracy |
|---|---|
| w/ L_{TFDEM} | 96.25% |
| w/o L_{TFDEM} | 95.59% |

| Works | Architecture | Modality | Sampling frames | # Params (M) | Accuracy (UCF-101) | Accuracy (HMDB51) |
|---|---|---|---|---|---|---|
| I3D-LSTM [ | 3D CNN | RGB | whole video | - | 95.1% | - |
| STH [ | 3D and 2D CNN | RGB, MV | 16 | 88 | 94.3% | 68.6% |
| T-C3D [ | 3D CNN | RGB | 24 | 31.7 | 92.5% | 62.4% |
| IP-LSTM [ | LSTM | RGB, OF | 25 | 27.6 | 91.4% | 68.2% |
| STDDCN [ | 2D CNN | RGB, OF | 25 | 59 | 94.8% | 69.49% |
| Heterogeneous Two-Stream [ | 2D CNN | RGB, OF | 25 | 45.5 | 94.4% | 67.2% |
| LVR [ | 2D CNN | RGB, OF | 25 | 92.8 | 94.4% | 71.0% |
| Multi-teacher KD [ | 2D CNN | RGB, MV, Residual | (1 + 11) | 33.6 | 88.5% | 56.16% |
| TSM [ | 2D CNN | RGB | 8 | 23.7 | 94.9% | 70.91% |
| TSN [ | 2D CNN | RGB, OF | 25 | 22.6 | 94.9% | 71.0% |
| MSTSM-TFDEM (ours) | 2D CNN | RGB | 8 | 24.5 | 96.25% | 72.83% |
| MSTSM-TFDEM-p (ours) | 2D CNN | RGB | 8 | 22.4 | 95.57% | 72.19% |
| MSTSM-TFDEM (ours, EfficientNet) | 2D CNN | RGB | 8 | 4.5 | 96.08% | 72.48% |

In this study, we designed an action recognition framework based on a 2D ConvNet. When an RGB image alone is used as the 2D ConvNet input, there is no information regarding the temporal relationship. To expand the temporal receptive fields without increasing the number of parameters, we proposed the MSTSM, whose multi-scale shifts allow the network to learn features from other frames. We also proposed the TFDEM to avoid mispredictions in the case of similar actions. Further, our pruning method made it possible to filter out similar kernels and obtain a compact model. Experimental results show that both the MSTSM and TFDEM are effective and that our modules can effortlessly be applied to other backbones. With both proposed modules, we achieved an accuracy of 96.25% on the UCF-101 dataset, which is a 1.1% improvement over I3D-LSTM [

The authors declare no conflicts of interest regarding the publication of this paper.

Wu, K.-H. and Chiu, C.-T. (2021) Action Recognition Using Multi-Scale Temporal Shift Module and Temporal Feature Difference Extraction Based on 2D CNN. Journal of Software Engineering and Applications, 14, 172-188. https://doi.org/10.4236/jsea.2021.145011