- Initialize tokens completely randomly in an n-dimensional space
- Tune the embeddings via backpropagation
ok so define a block_size (how many tokens we take to predict the next one)
build ur x and y (inputs and labels)
do below:
ralts # add these to our x and y tensors
... ---> r
..r ---> a
.ra ---> l
ral ---> t
alt ---> s
lts ---> .
x = torch.tensor(x)
y = torch.tensor(y)

in the paper they cram 17k words (they do one token per word) into a 30-dimensional space. since we have way fewer tokens we can cram ours into a 2d space (- ‿ ◦)
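the sliding-window loop that produces those rows can be sketched like this (the `words` list here is a made-up stand-in with just "ralts"; the real dataset has all the pokemon names):

```python
import torch

# hypothetical tiny dataset: one pokemon name
words = ["ralts"]
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0  # special char gets index 0
itos = {i: ch for ch, i in stoi.items()}

block_size = 3  # how many tokens we take to predict the next one
X, Y = [], []
for w in words:
    context = [0] * block_size  # start every word from '...'
    for ch in w + ".":
        X.append(context)
        Y.append(stoi[ch])
        context = context[1:] + [stoi[ch]]  # slide the window right

X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)  # torch.Size([6, 3]) torch.Size([6])
```

"ralts" plus the trailing `.` gives the 6 example rows shown above.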
Build our awesome lookup table:
C = torch.randn((34, 2)) # pokemon is 33+1 cos of the special char lol
# the below approaches are identical, but indexing is just faster
C[5]
# tensor([random_float, random_float])
F.one_hot(torch.tensor(n), num_classes=34).float() @ C
# tensor([random_float, random_float])
Tip
`F.one_hot(...)`'s first param takes a `torch.tensor` and not a plain int. `F.one_hot(...)` returns `int64`s so cast to a float :3
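a quick check that the two approaches really agree (with a made-up 34×2 table):

```python
import torch
import torch.nn.functional as F

C = torch.randn((34, 2))  # random lookup table, same shape as above
n = 5

direct = C[n]  # plain indexing
onehot = F.one_hot(torch.tensor(n), num_classes=34).float() @ C  # one-hot matmul

print(torch.allclose(direct, onehot))  # True
```

same embedding either way; indexing just skips building the one-hot vector.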
Embedding C[5] is easy enough, but how do we embed the torch.Size([examples_size, 3]) stored in x? Luckily, PyTorch indexing is flexible and powerful. You can index using lists! `C[[5, 6, 7]]` is valid. You can also index with a tensor. You can even index with a multi-dimensional tensor of indices:
C[X].shape
# torch.Size([examples, 3, 2])
C[X][13, 2] # gives us `tensor([randf, randf])` (the embedding)
# so.
emb = C[X]

![[bengio03a_p6_diagram.png]]
Source: A Neural Probabilistic Language Model
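a quick demo of those indexing modes (the `C` and `X` here are made-up random stand-ins with the shapes from above):

```python
import torch

C = torch.randn((34, 2))              # lookup table: 34 tokens, 2-d embeddings
X = torch.randint(0, 34, (32, 3))     # fake dataset: 32 examples, block_size 3

print(C[[5, 6, 7]].shape)             # torch.Size([3, 2]) -- index with a list
print(C[X].shape)                     # torch.Size([32, 3, 2]) -- index with a 2-d tensor

# C[X][13, 2] is the embedding of the token stored at X[13, 2]
print(torch.equal(C[X][13, 2], C[X[13, 2]]))  # True
```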
Let’s Construct the Hidden Layer
oooh we get to pick our weights and biases 👻
erm… where in `y = wx + b`, `w` is our weight and `b` is our bias lol
# the number of inputs is 3 * 2 because we have two dimensional embeddings and we have 3 of them (6)
# number of neurons is arbitrary lets pick 100 because we're humans
W1 = torch.randn((6, 100))
b1 = torch.randn(100)

ok chat we need to concat the input layer so we can do emb @ W1 + b1 :3
but how do we transform a num×3×2 into a num×6?
do this: `torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)` which results in a 32×6, but this doesn't generalize if we want to change the block size, so instead let's do `emb.view(-1, 6)`
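sanity check that `cat` and `view` give the same thing (made-up `emb` with the shapes from above):

```python
import torch

emb = torch.randn(32, 3, 2)  # fake embeddings: 32 examples, 3 tokens, 2-d each

# concat the three token embeddings side by side
a = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)  # (32, 6)

# view just reinterprets the same storage: no copy, and the 6 can be
# computed as block_size * emb_dim for any block size
b = emb.view(-1, 6)

print(torch.equal(a, b))  # True
```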
Ok, now to calculate h we gotta tanh (hyperbolic tangent, it squashes each value into (-1, 1)). Thus:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # Size([32, 100])

Ok lets make a new layer :3
W2 = torch.randn((100, 34))
b2 = torch.randn(34)
logits = h @ W2 + b2 # of shape 32x34
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True) # 32x34

print(prob[torch.arange(32), Y]) # iters the rows, grabs each column given by Y
# so..
loss = -prob[torch.arange(32), Y].log().mean()

TL;DR
X.shape, Y.shape # dataset
(torch.Size([32, 3]), torch.Size([32]))

t = 33 + 1 # tokens: amount of chars + 1
g = torch.Generator().manual_seed(1337) # for reproducibility
C = torch.randn((t, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, t), generator=g)
b2 = torch.randn(t, generator=g)
parameters = [C, W1, b1, W2, b2]
# lol this is important cos its false by default 🤓
for p in parameters:
p.requires_grad = True
print(f" num of param: {sum(p.nelement() for p in parameters)}")
# == Forward Pass ==
emb = C[X] # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, t)
"""
# inefficient alternative to the cross_entropy call below
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
loss = -prob[torch.arange(32), Y].log().mean()
"""
# combines `nn.LogSoftmax()` and `nn.NLLLoss()`
loss = F.cross_entropy(logits, Y) # see [[Cross-Entropy Loss]]
print(f"loss: {loss.item()}")

# == backward pass ==
# (nerd shit)
for p in parameters:
p.grad = None # set it to zero more efficiently
loss.backward() # backprop: fills in p.grad for every parameter
for p in parameters:
p.data += -0.1 * p.grad # learning_rate * p.grad
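quick sanity check that the manual softmax-then-NLL from the `"""` block really matches `F.cross_entropy` (made-up logits and labels, same shapes as above):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 34)           # fake logits
Y = torch.randint(0, 34, (32,))        # fake labels

# the manual way
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
manual = -prob[torch.arange(32), Y].log().mean()

print(torch.allclose(manual, F.cross_entropy(logits, Y)))  # True
```

`F.cross_entropy` also does the exp in a numerically safer way, which is another reason to prefer it.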
Tip
repeat the forward pass and backward pass a few times and print the loss each time :3
Tip
the reason this is so straightforward rn is because we’re only overfitting 32 examples. hear that? we’re overfitting (BAD)
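putting the TL;DR pieces together, the repeat-forward-and-backward loop might look like this (the `X`/`Y` here are random stand-ins, not the real pokemon dataset):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(1337)
# hypothetical stand-in dataset with the same shapes as above
X = torch.randint(0, 34, (32, 3), generator=g)
Y = torch.randint(0, 34, (32,), generator=g)

C = torch.randn((34, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 34), generator=g)
b2 = torch.randn(34, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

losses = []
for step in range(100):
    # forward pass
    emb = C[X]                                  # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)   # (32, 100)
    logits = h @ W2 + b2                        # (32, 34)
    loss = F.cross_entropy(logits, Y)
    losses.append(loss.item())
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    for p in parameters:
        p.data += -0.1 * p.grad

print(losses[0], losses[-1])  # loss goes down -- we're memorizing 32 examples
```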
9/17/24 9/18/24 9/25/24