- Initialize tokens completely randomly in an n-dimensional space
- Tune the embeddings via backpropagation
ok so define a block_size (how many tokens we take to predict the next one)
build ur x and y (inputs and labels)
do below:
ralts # add these to our x and y tensors
... ---> r
..r ---> a
.ra ---> l
ral ---> t
alt ---> s
lts ---> .
x = torch.tensor(x)
y = torch.tensor(y)

in the paper they cram 17k words (they do one token per word) into a 30-dimensional space. since we have way fewer tokens we can cram ours into a 2d space (- ‿ ◦)
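the sliding-window loop that produces those rows can be sketched like this (the `words` list here is a made-up stand-in with just "ralts"; the real dataset has all the pokemon names):

```python
import torch

# hypothetical tiny dataset: one pokemon name
words = ["ralts"]
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0  # special char gets index 0
itos = {i: ch for ch, i in stoi.items()}

block_size = 3  # how many tokens we take to predict the next one
X, Y = [], []
for w in words:
    context = [0] * block_size  # start every word from '...'
    for ch in w + ".":
        X.append(context)
        Y.append(stoi[ch])
        context = context[1:] + [stoi[ch]]  # slide the window right

X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)  # torch.Size([6, 3]) torch.Size([6])
```

"ralts" plus the trailing `.` gives the 6 example rows shown above.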
Build our awesome lookup table:
C = torch.randn((34, 2)) # pokemon is 33+1 cos of the special char lol
# the below approaches are identical, but indexing is just faster
C[5]
# tensor([random_float, random_float])
F.one_hot(torch.tensor(n), num_classes=34).float() @ C
# tensor([random_float, random_float])
Tip
`F.one_hot(...)`'s first param takes a `torch.tensor` and not a plain int. `F.one_hot(...)` returns `int64`s so cast to a float :3
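a quick check that the two approaches really agree (with a made-up 34×2 table):

```python
import torch
import torch.nn.functional as F

C = torch.randn((34, 2))  # random lookup table, same shape as above
n = 5

direct = C[n]  # plain indexing
onehot = F.one_hot(torch.tensor(n), num_classes=34).float() @ C  # one-hot matmul

print(torch.allclose(direct, onehot))  # True
```

same embedding either way; indexing just skips building the one-hot vector.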
Embedding C[5] is easy enough, but how do we embed the torch.Size([examples_size, 3]) stored in x? Luckily, PyTorch indexing is flexible and powerful. You can index using lists! `C[[5, 6, 7]]` is valid. You can also index with a tensor. You can even index with a multi-dimensional tensor of indices:
C[X].shape
# torch.Size([examples, 3, 2])
C[X][13, 2] # gives us `tensor([randf, randf])` (the embedding)
# so.
emb = C[X]

![[bengio03a_p6_diagram.png]]
Source: A Neural Probabilistic Language Model
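a quick demo of those indexing modes (the `C` and `X` here are made-up random stand-ins with the shapes from above):

```python
import torch

C = torch.randn((34, 2))              # lookup table: 34 tokens, 2-d embeddings
X = torch.randint(0, 34, (32, 3))     # fake dataset: 32 examples, block_size 3

print(C[[5, 6, 7]].shape)             # torch.Size([3, 2]) -- index with a list
print(C[X].shape)                     # torch.Size([32, 3, 2]) -- index with a 2-d tensor

# C[X][13, 2] is the embedding of the token stored at X[13, 2]
print(torch.equal(C[X][13, 2], C[X[13, 2]]))  # True
```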
Let’s Construct the Hidden Layer
oooh we get to pick our weights and biases 👻
erm… where in `y = wx + b`, `w` is our weight and `b` is our bias lol
# the number of inputs is 3 * 2 because we have two dimensional embeddings and we have 3 of them (6)
# number of neurons is arbitrary lets pick 100 because we're humans
W1 = torch.randn((6, 100))
b1 = torch.randn(100)

ok chat we need to concat the input layer so we can do emb @ W1 + b1 :3
but how do we transform a num×3×2 into a num×6?
do this: `torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)` which results in a 32×6, but this doesn't generalize if we want to change the block size, so instead let's do `emb.view(-1, 6)`
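sanity check that `cat` and `view` give the same thing (made-up `emb` with the shapes from above):

```python
import torch

emb = torch.randn(32, 3, 2)  # fake embeddings: 32 examples, 3 tokens, 2-d each

# concat the three token embeddings side by side
a = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1)  # (32, 6)

# view just reinterprets the same storage: no copy, and the 6 can be
# computed as block_size * emb_dim for any block size
b = emb.view(-1, 6)

print(torch.equal(a, b))  # True
```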
Ok, now to calculate h we gotta tanh (hyperbolic tangent, it squashes each value into (-1, 1)). Thus:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # Size([32, 100])

Ok lets make a new layer :3
W2 = torch.randn((100, 34))
b2 = torch.randn(34)
logits = h @ W2 + b2 # of shape 32x34
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True) # 32x34

print(prob[torch.arange(32), Y]) # iters the rows, grabs each column given by Y
# so..
loss = -prob[torch.arange(32), Y].log().mean()

TL;DR
X.shape, Y.shape # dataset
(torch.Size([32, 3]), torch.Size([32]))

t = 33 + 1 # tokens: amount of chars + 1
g = torch.Generator().manual_seed(1337) # for reproducibility
C = torch.randn((t, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, t), generator=g)
b2 = torch.randn(t, generator=g)
parameters = [C, W1, b1, W2, b2]
# lol this is important cos its false by default 🤓
for p in parameters:
p.requires_grad = True
print(f" num of param: {sum(p.nelement() for p in parameters)}")
# == Forward Pass ==
emb = C[X] # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, t)
"""
# inefficient alternative to the cross_entropy call below
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
loss = -prob[torch.arange(32), Y].log().mean()
"""
# combines `nn.LogSoftmax()` and `nn.NLLLoss()`
loss = F.cross_entropy(logits, Y) # see [[Cross-Entropy Loss]]
print(f"loss: {loss.item()}")

# == backward pass ==
# (nerd shit)
for p in parameters:
p.grad = None # set it to zero more efficiently
loss.backward() # backprop: fills in p.grad for every parameter
for p in parameters:
p.data += -0.1 * p.grad # learning_rate * p.grad
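quick sanity check that the manual softmax-then-NLL from the `"""` block really matches `F.cross_entropy` (made-up logits and labels, same shapes as above):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 34)           # fake logits
Y = torch.randint(0, 34, (32,))        # fake labels

# the manual way
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
manual = -prob[torch.arange(32), Y].log().mean()

print(torch.allclose(manual, F.cross_entropy(logits, Y)))  # True
```

`F.cross_entropy` also does the exp in a numerically safer way, which is another reason to prefer it.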
Tip
repeat the forward pass and backward pass a few times and print the loss each time :3
Tip
the reason this is so straightforward rn is because we’re only overfitting 32 examples. hear that? we’re overfitting (BAD)
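putting the TL;DR pieces together, the repeat-forward-and-backward loop might look like this (the `X`/`Y` here are random stand-ins, not the real pokemon dataset):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(1337)
# hypothetical stand-in dataset with the same shapes as above
X = torch.randint(0, 34, (32, 3), generator=g)
Y = torch.randint(0, 34, (32,), generator=g)

C = torch.randn((34, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 34), generator=g)
b2 = torch.randn(34, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

losses = []
for step in range(100):
    # forward pass
    emb = C[X]                                  # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)   # (32, 100)
    logits = h @ W2 + b2                        # (32, 34)
    loss = F.cross_entropy(logits, Y)
    losses.append(loss.item())
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    for p in parameters:
        p.data += -0.1 * p.grad

print(losses[0], losses[-1])  # loss goes down -- we're memorizing 32 examples
```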
9/17/24 9/18/24 9/25/24