Deep Learning Week10 Notes

1. Auto-Regression

Auto-regression methods model components of a signal serially, each one conditionally to the ones already modeled.

They rely on the chain rule:

\[\begin{align} P(X_1 = x_1,...,X_T= x_T) = P(X_1 = x_1)P(X_2=x_2|X_1=x_1)...P(X_T|X_{T-1},...,X_1) \end{align} \]

The conditioning can be represented with two tensors of dimension \(T\): the first a Boolean mask stating which variables are conditioned on, and the second the actual conditioning values.
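
For instance, a minimal sketch of this representation (hypothetical values, assuming a sequence of length \(T=5\) whose first three variables are known):

import torch

# Hypothetical example: T = 5, the first three variables are conditioned on
mask   = torch.tensor([1., 1., 1., 0., 0.])      # which variables are known
values = torch.tensor([0.3, -1.2, 0.7, 0., 0.])  # their values, 0 elsewhere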

Now we consider finite distributions over \(C\) real values. Hence we can model a conditional distribution with a mapping that maps a pair mask / known values to a distribution for the next value of the sequence:

\[f:\{ 0,1 \}^Q\times \mathbb{R}^Q\rightarrow\mathbb{R}^C \]

where the \(C\) output values can be either probabilities or, as we will prefer, logits.

\(\Large\text{Note:}\)

  • In mathematics, the logit of a probability \(p\) is:

\[\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) \]

  • In ML:

the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
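
For instance, \(\operatorname{logit}(0.8)=\ln(0.8/0.2)\approx 1.386\), and in PyTorch torch.softmax maps a vector of logits back to normalized probabilities (a quick check, not from the lecture):

>>> import torch
>>> from math import log
>>> torch.softmax(torch.tensor([log(0.8), log(0.1), log(0.1)]), dim = 0)
tensor([0.8000, 0.1000, 0.1000])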

Given such a model and a sampling procedure \(\text{sample}\), the generative process can be written as:

\[\begin{align} x_1&\leftarrow \text{sample}(f(\{\}))\\ x_2&\leftarrow \text{sample}(f(\{X_1=x_1\}))\\ &...\\ x_T&\leftarrow \text{sample}(f(\{X_1=x_1,X_2=x_2,...,X_{T-1} =x_{T-1}\})) \end{align} \]

A sampling procedure takes as input the probabilities (or \(\text{logits}\)) output by the model (a tensor in \(\mathbb{R}^C\)) and outputs a value sampled randomly according to the provided probabilities or logits.

\(\text{Details: see }\)Lecture-P6

torch.distributions:

>>> l = torch.tensor([ log(0.8), log(0.1), log(0.1) ])
>>> dist = torch.distributions.categorical.Categorical(logits = l)
>>> s = dist.sample((10000,))
>>> (s.view(-1, 1) == torch.arange(3).view(1, -1)).float().mean(0)
tensor([0.8037, 0.0988, 0.0975])

This can also be done in a batch:

>>> l = torch.tensor([[ log(0.90), log(0.10) ],
...                   [ log(0.50), log(0.50) ],
...                   [ log(0.25), log(0.75) ],
...                   [ log(0.01), log(0.99) ]])
>>> dist = torch.distributions.categorical.Categorical(logits = l)
>>> dist.sample((8,))
tensor([[0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 1, 1]])

In the batch case, the sampler is parameterized by a tensor of size

\[M_1\times M_2\times...\times M_K\times C \]

that represents

\[M_1\times M_2\times...\times M_K \]

vectors of logits over \(C\) classes.

The sampling itself takes \((N_1,...,N_L)\) as input and returns a tensor of size

\[N_1\times N_2\times...\times N_L\times M_1\times...\times M_K \]

of values in \(\{0,1,\ldots,C-1\}\).
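
Continuing with this batched usage, a quick shape check (a small example, not from the lecture):

>>> import torch
>>> dist = torch.distributions.categorical.Categorical(logits = torch.randn(4, 3, 10))
>>> dist.sample((5, 2)).size()
torch.Size([5, 2, 4, 3])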

\(\text{Minimize the loss:}\)

\[\begin{align} L(f)&= -\sum_n\sum_t \log{\hat{p}(X_t = x_{n,t}|X_{t-1} = x_{n,t-1},...,X_1 = x_{n,1})}\\ &=\sum_n\sum_t l[f((1,1,...,1,0,...,0),(x_{n,1},x_{n,2},...,x_{n,t-1},0,...,0)), x_{n,t}] \end{align} \]

where \(l\) is the cross-entropy loss.
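
As an illustration of one term of this sum, here is a minimal sketch; the two-layer MLP f below, which takes the concatenated mask and masked values and returns \(C\) logits, is a hypothetical choice, not the lecture's model:

import torch
from torch import nn
import torch.nn.functional as F

C, T = 256, 784
f = nn.Sequential(nn.Linear(2 * T, 128), nn.ReLU(), nn.Linear(128, C))

x = torch.randint(C, (T,))                    # one training sequence
t = 300                                       # number of already-known values
mask = (torch.arange(T) < t).float()          # 1s for the known positions
values = mask * x.float()                     # known values, 0 elsewhere
logits = f(torch.cat((mask, values))[None])   # f(mask, known values)
loss_t = F.cross_entropy(logits, x[t][None])  # l[f(...), next value]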

For the training procedure in practice, refer to Lecture P9-15

Image auto-regression

MNIST samples are \(28 × 28\) gray-scale images. Pixels are in \([0, 255]\). For auto-regression, such a \(28 × 28\) image will be interpreted as a sequence of length \(784\), corresponding to the pixels visited from top to bottom, and from left to right.

Define two functions to convert between image tensors and pixel sequences:

def seq2tensor(s):
    # (N, 784) pixel sequences -> (N, 1, 28, 28) image tensors
    return s.reshape(-1, 1, 28, 28)

def tensor2seq(t):
    # (N, 1, 28, 28) image tensors -> (N, 784) pixel sequences
    return t.reshape(-1, 28 * 28)
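
A quick roundtrip check (a small example, not from the lecture):

import torch

x = torch.rand(64, 1, 28, 28)          # a batch of 28 x 28 images
s = tensor2seq(x)
print(s.size())                        # torch.Size([64, 784])
print(torch.equal(seq2tensor(s), x))   # True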

For the whole training process, see Lecture P20-22

2. Causal convolution

Instead of predicting [the distribution of] one component, the model could make a prediction at every position of the sequence, that is

\[f:\mathbb{R}^T\rightarrow\mathbb{R}^{T\times C} \]

In detail:

\[\begin{aligned} x_{1} & \leftarrow \text { sample }\left(f_{1}(0, \ldots, 0)\right) \\ x_{2} & \leftarrow \text { sample }\left(f_{2}\left(x_{1}, 0, \ldots, 0\right)\right) \\ x_{3} & \leftarrow \operatorname{sample}\left(f_{3}\left(x_{1}, x_{2}, 0, \ldots, 0\right)\right) \\ & \ldots \\ x_{T} & \leftarrow \text { sample }\left(f_{T}\left(x_{1}, x_{2}, \ldots, x_{T-1}, 0\right)\right) \end{aligned} \]

where the \(0\)s simply fill in for unknown values, and the mask is not needed.

If, additionally, the model is such that “future values” do not influence the prediction at a certain time, that is

\[\begin{aligned} \forall t, x_{1}, \ldots, x_{t}, \alpha_{1}, \ldots, \alpha_{T-t}, \beta_{1}, \ldots, \beta_{T-t} \\ & f_{t+1}\left(x_{1}, \ldots, x_{t}, \alpha_{1}, \ldots, \alpha_{T-t}\right)=f_{t+1}\left(x_{1}, \ldots, x_{t}, \beta_{1}, \ldots, \beta_{T-t}\right) \end{aligned} \]

then in particular:

\[\begin{aligned} f_{1}(0, \ldots, 0) &=f_{1}\left(x_{1}, \ldots, x_{T}\right) \\ f_{2}\left(x_{1}, 0, \ldots, 0\right) &=f_{2}\left(x_{1}, \ldots, x_{T}\right) \\ f_{3}\left(x_{1}, x_{2}, 0, \ldots, 0\right) &=f_{3}\left(x_{1}, \ldots, x_{T}\right) \\ & \cdots \\ f_{T}\left(x_{1}, x_{2}, \ldots, x_{T-1}, 0\right) &=f_{T}\left(x_{1}, \ldots, x_{T}\right) \end{aligned} \]

This provides a tremendous computational advantage during training, since

\[\begin{aligned} \ell(f, x) &=\sum_{t} \ell\left(f_{t}\left(x_{1}, \ldots, x_{t-1}, 0, \ldots, 0\right), x_{t}\right) \\ &=\sum_{t} \ell(\underbrace{f_{t}\left(x_{1}, \ldots, x_{T}\right)}_{f \text { is computed once }}, x_{t}) . \end{aligned} \]

\(\large\text{More details and illustrations, see }\) Lecture.
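
This is exactly the property that causal convolutional layers provide. As an illustration (a minimal sketch, not the lecture's module), a standard way to make nn.Conv1d causal is to left-pad the input by (kernel_size - 1) x dilation, so that the output at time t depends only on inputs at times <= t:

import torch
from torch import nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    # Conv1d whose output at time t depends only on inputs at times <= t
    def __init__(self, in_channels, out_channels, kernel_size, dilation = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              dilation = dilation)

    def forward(self, x):                          # x is of shape (N, C, T)
        return self.conv(F.pad(x, (self.pad, 0)))  # pad on the left only

x = torch.randn(1, 1, 16)
y = CausalConv1d(1, 8, kernel_size = 3, dilation = 2)(x)
print(y.size())  # torch.Size([1, 8, 16]): same length, y[..., t] uses x[..., :t+1]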

3. Non-volume preserving networks

Original paper: NVP Networks
Related blog: Blog

Given a dimension \(d\), a Boolean vector \(b \in \{0, 1\}^d\) and two mappings:

\[\begin{align} s &: \mathbb{R^d}\rightarrow\mathbb{R^d}\\ t &: \mathbb{R^d}\rightarrow\mathbb{R^d} \end{align} \]

define a [fully connected] coupling layer as the transformation:

\[\begin{align} c: \mathbb{R^d}&\rightarrow \mathbb{R^d}\\ x&\rightarrow b \odot x+(1-b) \odot(x \odot \exp (s(b \odot x))+t(b \odot x)) \end{align} \]

where \(\text{exp}\) is component-wise, and \(\odot\) is the Hadamard component-wise product. The quantities \(t\) and \(s\) stand respectively for translation and scale.

For clarity in what follows, \(b\) has all \(1\)s first, followed by \(0\)s, but this is not required:

\[b=(\underbrace{1,1, \ldots, 1}_{\Delta}, \underbrace{0,0, \ldots, 0}_{d-\Delta}) \]

\(\large\text{Illustration: }\) Lecture-P14

The first property of this mapping is that it is invertible in closed form (see the invert method in the code below). The second property is the simplicity of its Jacobian: see Lecture-P16. We have

\[\begin{aligned} \log \left|J_{c}(x)\right| &=\sum_{i: b_{i}=0} s_{i}(x \odot b) \\ &=\sum_{i}((1-b) \odot s(x \odot b))_{i} \end{aligned} \]

\(\text{Code:}\)

import torch
from torch import nn, autograd

dim = 6

x = torch.randn(1, dim).requires_grad_()
b = torch.zeros(1, dim)
b[:, :dim//2] = 1.0

s = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
t = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

c = b * x + (1 - b) * (x * torch.exp(s(b * x)) + t(b * x))

# Flexing a bit: compute the Jacobian of c w.r.t. x row by row with autograd
j = torch.cat([autograd.grad(c_k, x, retain_graph=True)[0] for c_k in c[0]])

print(j)

prints

tensor([[ 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [ 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [ 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000],
        [ 0.4001, -0.3774, -0.9410, 1.0074, 0.0000, 0.0000],
        [-0.1756, 0.0409, 0.0808, 0.0000, 1.2412, 0.0000],
        [ 0.0875, -0.3724, -0.1542, 0.0000, 0.0000, 0.6186]])

To recap, with \(f^{(k)},k=1,2,...,K\) coupling layers:

\[\begin{align} f = f^{(K)}\circ ... \circ f^{(1)} \end{align} \]

and \(x_n^{(0)} = x_n, x_n^{(k)} = f^{(k)}(x_n^{(k-1)})\), we train by minimizing

\[\mathscr{L}(f)=-\sum_{n}\left(-\frac{1}{2}\left(\left\|x_{n}^{(K)}\right\|^{2}+d \log 2 \pi\right)+\sum_{k=1}^{K} \log \left|J_{f^{(k)}}\left(x_{n}^{(k-1)}\right)\right|\right) \]

with

\[\log \left|J_{f^{(k)}}(x)\right|=\sum_{i}\left(\left(1-b^{(k)}\right) \odot s^{(k)}\left(x \odot b^{(k)}\right)\right)_{i} \]
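
This objective is the negative log-likelihood given by the change-of-variables formula: assuming a standard Gaussian density for the mapped samples \(x_n^{(K)} = f(x_n)\),

\[\log \mu_{X}\left(x_{n}\right)=-\frac{1}{2}\left(\left\|x_{n}^{(K)}\right\|^{2}+d \log 2 \pi\right)+\sum_{k=1}^{K} \log \left|J_{f^{(k)}}\left(x_{n}^{(k-1)}\right)\right| \]

and \(\mathscr{L}(f)\) is minus the sum of these log-probabilities over the training samples.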

A coupling layer can be implemented with:

class NVPCouplingLayer(nn.Module):
    def __init__(self, map_s, map_t, b):
        super().__init__()
        self.map_s = map_s
        self.map_t = map_t
        self.register_buffer('b', b.unsqueeze(0))

    def forward(self, x, ldj): # ldj for log det Jacobian
        s, t = self.map_s(self.b * x), self.map_t(self.b * x)
        ldj = ldj + ((1 - self.b) * s).sum(1)
        y = self.b * x + (1 - self.b) * (torch.exp(s) * x + t)
        return y, ldj

    def invert(self, y):
        s, t = self.map_s(self.b * y), self.map_t(self.b * y)
        return self.b * y + (1 - self.b) * (torch.exp(-s) * (y - t))
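
As a quick sanity check (not part of the original code; the maps below are arbitrary linear layers, and torch and nn are imported as above), invert should undo forward up to floating-point error:

dim = 6
b = torch.zeros(dim)
b[:dim // 2] = 1.0
layer = NVPCouplingLayer(nn.Linear(dim, dim), nn.Linear(dim, dim), b)

x = torch.randn(5, dim)
y, ldj = layer(x, torch.zeros(5))
print(torch.allclose(layer.invert(y), x, atol = 1e-6))  # True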

We can then define a complete network with one-hidden-layer tanh MLPs for the \(s\) and \(t\) mappings:

class NVPNet(nn.Module):
    def __init__(self, dim, hidden_dim, depth):
        super().__init__()
        b = torch.empty(dim)
        self.layers = nn.ModuleList()
        for d in range(depth):
            if d % 2 == 0:
                # Pick half of the components at random to be conditioned on
                i = torch.randperm(b.numel())[0:b.numel() // 2]
                b.zero_()
                b[i] = 1
            else:
                # Swap the roles of the two halves for the next layer
                b = 1 - b

            map_s = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Tanh(),
                                  nn.Linear(hidden_dim, dim))
            map_t = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Tanh(),
                                  nn.Linear(hidden_dim, dim))

            self.layers.append(NVPCouplingLayer(map_s, map_t, b.clone()))

    def forward(self, x, ldj):
        for m in self.layers: x, ldj = m(x, ldj)
        return x, ldj

    def invert(self, y):
        for m in reversed(self.layers): y = m.invert(y)
        return y

  • torch.randperm(n): returns a random permutation of integers from \(0\) to \(n - 1\).
  • torch.numel(input): returns the total number of elements in the input tensor.

And the log-probability of the individual samples of a batch:

import math

def LogProba(x, ldj):
    # log N(x; 0, I) per sample, plus the accumulated log det Jacobian
    log_p = - 0.5 * (x**2 + math.log(2 * math.pi)).sum(1) + ldj
    return log_p
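
As a sanity check (not from the lecture), with ldj = 0 this is just the log-density of a standard Gaussian, which matches torch.distributions.Normal:

x = torch.randn(5, 2)
ref = torch.distributions.Normal(0.0, 1.0).log_prob(x).sum(1)
print(torch.allclose(LogProba(x, 0), ref))  # True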

Training is achieved by maximizing the mean log-probability:

from torch import optim

batch_size = 100

model = NVPNet(dim = 2, hidden_dim = 2, depth = 4)
optimizer = optim.Adam(model.parameters(), lr = 1e-2)

# args.nb_epochs and train_input are assumed to be defined by the
# surrounding setup (command-line options and the training data)
for e in range(args.nb_epochs):
    for input in train_input.split(batch_size):
        output, ldj = model(input, 0)
        loss = - LogProba(output, ldj).mean()
        model.zero_grad()
        loss.backward()
        optimizer.step()

Finally, we can sample according to \(\mu_X\) with

z = torch.randn(nb_generated_samples, 2)
x = model.invert(z)
