Video Recognition

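Also called Action Recognition or video classification: the task of classifying a sequence of consecutive frames in a video (either the whole video or a single segment of it).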

Clip & video: these terms come from a paper published by FAIR (A Closer Look at Spatiotemporal Convolutions for Action Recognition).
For clip, select X frames as a clip.
For video, use center crops of 10 clips randomly sampled from the video and average these 10 clip predictions to obtain the final video prediction.
Top1-Acc & Top5-Acc: the accuracy obtained by comparing the Top-1 class, or the Top-5 classes, of the predicted class-probability vector against the Ground Truth Label.
Other (speed or parameters): GFLOPs (giga floating-point operations, i.e. the number of operations one forward pass needs; this is a count, not the per-second rate GFLOPS)
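The clip-to-video averaging and the Top-k accuracy described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the paper: the function names `video_prediction`, `top_k_correct` and `top_k_accuracy` are hypothetical, and the probability values are made up.

```python
def video_prediction(clip_probs):
    """Average per-clip class probabilities into one video-level vector."""
    num_clips = len(clip_probs)
    num_classes = len(clip_probs[0])
    return [sum(clip[c] for clip in clip_probs) / num_clips
            for c in range(num_classes)]


def top_k_correct(probs, label, k):
    """True if the ground-truth label is among the k highest-scoring classes."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return label in ranked[:k]


def top_k_accuracy(all_probs, labels, k):
    """Fraction of samples whose label appears in the top-k predictions."""
    hits = sum(top_k_correct(p, y, k) for p, y in zip(all_probs, labels))
    return hits / len(labels)


# Toy example: 3 clips from one video, 4 classes (values are made up).
clips = [
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.30, 0.30, 0.10],
    [0.20, 0.50, 0.20, 0.10],
]
video_probs = video_prediction(clips)
```

With 10 clips per video, `clips` would simply hold 10 rows. Top-1 is the special case `k=1`, and Top-5 uses `k=5`.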

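A traditional 2D-conv works well on single-frame images, but when it is applied to multi-frame video it loses the temporal relationship between frames, or other sequential relationships (for example between the slices of CT or MRI medical images).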

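3D-Conv kernel: a 3D-Conv kernel has at least 3 dimensions. It is important to distinguish a 2d-conv over multiple channels from a 3d-conv:
Let $C$ be the number of channels, $H$ the image height, $W$ the image width, $L$ the length along the temporal dimension, and $K$ the kernel size along H and W:
Assume $C=1$ (this is easier to follow). The 2d-conv kernel size is $L \times K \times K$, and the output collapses to size 1 along L;
the 3d-conv kernel size is $d \times K \times K$, where $d < L$, so the output keeps the ordering information along L;
If $C \neq 1$, 2d-conv kernel size=$L \times C \times K \times K$, 3d-conv kernel size=$d \times C \times K \times K$;
The depth $d$ along L is chosen in much the same way as the 2d-conv kernel size, e.g. 3-3-3, 3-5-5-7-7, 7-7-5-5-3, etc.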
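To make the $C=1$ case concrete, here is a naive pure-Python sketch. It assumes stride 1 and no padding, and the helper names `conv3d_valid` and `shape` are hypothetical. The same routine covers both cases: a kernel that spans all $L$ frames behaves like a 2d-conv with the frames stacked as channels, while a kernel of depth $d < L$ slides along time.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation, stride 1, no padding.

    volume: nested list of shape L x H x W
    kernel: nested list of shape d x K x K
    returns: nested list of shape (L-d+1) x (H-K+1) x (W-K+1)
    """
    L, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, K = len(kernel), len(kernel[0])
    out = []
    for t in range(L - d + 1):
        plane = []
        for i in range(H - K + 1):
            row = []
            for j in range(W - K + 1):
                acc = 0.0
                for dt in range(d):
                    for di in range(K):
                        for dj in range(K):
                            acc += volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(x):
    """Shape of a 3-level nested list."""
    return (len(x), len(x[0]), len(x[0][0]))


L, H, W, K, d = 8, 6, 6, 3, 3
# Toy video: every pixel of frame t has value t, so time is easy to track.
video = [[[float(t) for _ in range(W)] for _ in range(H)] for t in range(L)]

# 2d-conv with the L frames stacked as channels: the kernel spans all of L.
kernel_2d = [[[1.0] * K for _ in range(K)] for _ in range(L)]
# 3d-conv: the kernel covers only d < L frames and slides along time.
kernel_3d = [[[1.0] * K for _ in range(K)] for _ in range(d)]

out_2d = conv3d_valid(video, kernel_2d)  # temporal axis collapses to 1
out_3d = conv3d_valid(video, kernel_3d)  # temporal axis keeps L-d+1 steps
```

In this toy input, `out_3d` grows from one temporal step to the next, so the frame order is still visible in the output, whereas `out_2d` has a single temporal slice.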

In Conclusion: the core difference between 3d and 2d is whether the shape of the output layer is 3-dimensional or 2-dimensional.
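As a worked instance of this conclusion, the output shape of a single filter can be written out with the symbols defined earlier. Stride 1 and no padding are assumptions here; other settings change the sizes but not the number of dimensions.

$$\text{2d-conv: } L \times H \times W \;\longrightarrow\; (H-K+1) \times (W-K+1)$$

$$\text{3d-conv: } L \times H \times W \;\longrightarrow\; (L-d+1) \times (H-K+1) \times (W-K+1)$$

For example, with $L=8$, $H=W=6$, $K=3$ and $d=3$, the 2d-conv output is $4 \times 4$, while the 3d-conv output is $6 \times 4 \times 4$.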

We separate the method into two parts, extraction and classification, as introduced in (Unsupervised Learning from Video with Deep Neural Embeddings).
