Pipeline
2D skeleton sequence --> Spatial Transformer --> Temporal Transformer --> Regression Head.
2D skeleton sequence
\(X\in R^{f\times (2J)}\), where \(f\) is the number of input frames (the receptive field), \(J\) is the number of joints, and \(2\) covers the 2D coordinates \((x, y)\).
\(x^i \in R^{1\times (2J)}\): the joint-coordinate vector of a single skeleton frame.
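As a quick shape sketch of these inputs (the sizes f=9 and J=17 are hypothetical choices, not from the text):

```python
import numpy as np

f, J = 9, 17                 # hypothetical receptive field and joint count
X = np.zeros((f, 2 * J))     # X in R^{f x (2J)}: the 2D skeleton sequence
x_i = X[0:1]                 # x^i in R^{1 x (2J)}: one frame's (x, y) coordinates
print(X.shape, x_i.shape)    # each frame stores J interleaved (x, y) pairs
```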
Spatial Transformer
The spatial transformer module extracts a high-dimensional feature embedding
from each single frame.
Spatial embedding
For a single-frame 2D pose \(x^i\in R^{1\times (2J)}\), the Spatial Transformer Module treats each
2D coordinate \((x, y)\) as a patch (as in ViT) and maps it into a high-dimensional space with a
linear projection. An embedding encoding each joint's position is then added, and the result is fed
to the Spatial Transformer to learn the spatial relations among joints.
\(x^i \in R^{1\times (2J)}\) -- linear projection --> \(\tilde{x}^i \in R^{J\times c}\) -- positional embedding --> \(z_0^i \in R^{J\times c}\)
\(z_0^i = [p^1E;\ p^2E;\ ...;\ p^JE] + E_{Spos}\), where \(p\) denotes a single 2D coordinate \((x, y)\),
\(E\) the linear projection matrix, and \(E_{Spos}\) the positional embedding.
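A minimal numpy sketch of this embedding step, with hypothetical sizes J=17 and c=32 and random stand-ins for the learned \(E\) and \(E_{Spos}\):

```python
import numpy as np

J, c = 17, 32                            # hypothetical joint count and width
rng = np.random.default_rng(0)

x = rng.standard_normal((1, 2 * J))      # single-frame 2D pose, shape (1, 2J)
p = x.reshape(J, 2)                      # one (x, y) "patch" per joint
E = rng.standard_normal((2, c))          # linear projection matrix (learned)
E_Spos = rng.standard_normal((J, c))     # joint positional embedding (learned)

z0 = p @ E + E_Spos                      # z_0^i = [p^1 E; ...; p^J E] + E_Spos
print(z0.shape)                          # (J, c): one c-dim token per joint
```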
Self-attention
The embedding \(z_0^i\in R^{J\times c}\) is fed into the Spatial Transformer, a stack of \(L\) ViT-style blocks:
\(z'_l = MSA(LN(z_{l-1})) + z_{l-1}, \quad l = 1, 2, ..., L\)
\(z_l = MLP(LN(z'_l)) + z'_l, \quad l = 1, 2, ..., L\)
\(Y = LN(z_L)\)
where \(LN(\cdot)\) denotes layer normalization and \(Y\in R^{J\times c}\) has the same shape as the input.
In other words, each block applies multi-head self-attention followed by a feed-forward MLP; layer normalization precedes each sub-layer, and a residual connection wraps around it.
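The per-block computation can be sketched as follows. This uses single-head attention for brevity (the actual module is multi-head), ReLU instead of ViT's GELU, one shared set of random placeholder weights for all blocks, and hypothetical sizes:

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def block(z, Wq, Wk, Wv, W1, W2):
    # z'_l = MSA(LN(z_{l-1})) + z_{l-1}   (single head shown)
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    z = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v + z
    # z_l = MLP(LN(z'_l)) + z'_l          (two-layer MLP, ReLU here)
    h = layer_norm(z)
    return np.maximum(h @ W1, 0) @ W2 + z

J, c = 17, 32                             # hypothetical sizes
rng = np.random.default_rng(1)
z = rng.standard_normal((J, c))
params = [rng.standard_normal(s) * 0.1
          for s in [(c, c), (c, c), (c, c), (c, 4 * c), (4 * c, c)]]
for _ in range(4):                        # L = 4 stacked blocks (hypothetical)
    z = block(z, *params)
Y = layer_norm(z)                         # Y = LN(z_L); shape preserved: (J, c)
print(Y.shape)
```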
Temporal Transformer
The goal of the temporal transformer module is to model dependencies across the
sequence of frames.
For frame \(i\), the Spatial Transformer output \(z_L^i \in R^{J\times c}\) is flattened into a
vector \(z^i\in R^{1\times (Jc)}\). The \(f\) vectors (one per frame in the receptive field) are
concatenated, and a positional embedding encoding frame order is added to form the input
\(Z_0 = [z^1;\ z^2;\ ...;\ z^f] + E_{Tpos}, \quad Z_0\in R^{f\times (Jc)}\).
\(Z_0\) is passed through \(L\) identical ViT blocks (the same structure as in the Spatial Transformer), producing the output \(Y\in R^{f\times (Jc)}\).
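A sketch of how the temporal-transformer input is assembled from the per-frame spatial features, with hypothetical sizes and random stand-ins for the learned positional embedding:

```python
import numpy as np

f, J, c = 9, 17, 32                       # hypothetical sizes
rng = np.random.default_rng(2)

zL = rng.standard_normal((f, J, c))       # Spatial Transformer outputs, one (J, c) map per frame
z_flat = zL.reshape(f, J * c)             # flatten each frame into a (1, Jc) vector
E_Tpos = rng.standard_normal((f, J * c))  # frame positional embedding (learned)
Z0 = z_flat + E_Tpos                      # Z_0 in R^{f x (Jc)}: one token per frame
print(Z0.shape)
```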
Regression Head
Since the model predicts the 3D skeleton of the center frame only, the output is first reduced
from \(f\times (Jc)\) to \(1\times (Jc)\). This is done with a learnable weighted mean over the
frame dimension \(f\). Finally, a simple \(MLP\) produces the output \(y\in R^{1\times (3J)}\),
the 3D joint-coordinate vector.
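A sketch of the regression head under the same hypothetical sizes. The softmax-normalized frame weights and the single linear layer below are illustrative stand-ins for the learned weighted mean and MLP, not the exact parameterization:

```python
import numpy as np

f, J, c = 9, 17, 32                           # hypothetical sizes
rng = np.random.default_rng(3)

Y = rng.standard_normal((f, J * c))           # Temporal Transformer output
logits = rng.standard_normal((1, f))          # learnable per-frame weights
w = np.exp(logits) / np.exp(logits).sum()     # normalize so the pooling is a weighted mean
pooled = w @ Y                                # (1, Jc): frame dimension collapsed
W_out = rng.standard_normal((J * c, 3 * J))   # linear head standing in for the MLP
y = pooled @ W_out                            # (1, 3J): 3D joint-coordinate vector
print(y.shape)
```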