Simple language model: looks at the last character to guess the next one (a bigram model)

N = torch.zeros((28, 28), dtype=torch.int32) (it's 28x28 cos 26 letters of the alphabet + <start> and <end>, and the other dimension is the corresponding next token)

(e.g. ab, az, ba, etc.)
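a sketch of how N gets filled, using a made-up three-word list and a hypothetical `stoi` mapping (the real notebook builds these from the full names dataset; this sketch uses a single `.` marker for both start and end, the 27x27 variant of the 28x28 idea above):

```python
import torch

# made-up tiny dataset standing in for the real word list
words = ["ab", "az", "ba"]

# hypothetical char -> index mapping; 0 is reserved for the "." marker
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    tokens = ["."] + list(w) + ["."]      # wrap each word in the marker
    for ch1, ch2 in zip(tokens, tokens[1:]):
        N[stoi[ch1], stoi[ch2]] += 1      # count this bigram
```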

import matplotlib.pyplot as plt
plt.imshow(N)

turn each row of our array into floats that add up to one and use torch.multinomial to sample the next character
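that step as a sketch, with made-up counts standing in for one row of N (the seeded generator just makes sampling reproducible):

```python
import torch

g = torch.Generator().manual_seed(2147483647)

row = torch.tensor([1.0, 2.0, 7.0])   # fake counts for one row of N
p = row / row.sum()                   # floats that add up to one
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
# ix is the sampled next-token index, drawn in proportion to p
```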

— erm what the sigma

  • [6] Vish: one-hot encoding works for as many tokens as you have. if i have like 10,000 tokens, would that mean for any model i would need a vector of size 10,000 for each token given to the model? that seems inefficient

  • [1] Claude: You’re right to be concerned about efficiency - one-hot encoding for a large vocabulary can indeed lead to very high-dimensional, sparse vectors. For 10,000 tokens, each input would be a vector of 10,000 elements, with only one element being 1 and the rest 0. This is inefficient in terms of both storage and computation. To address this issue, most modern natural language processing models use more efficient representations [like Word Embeddings, Subword Tokenization, etc.]

thanks, claude
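the embedding idea Claude mentions is just a learned lookup table; a sketch with an assumed 10,000-token vocabulary and a made-up embedding size of 64:

```python
import torch
import torch.nn as nn

vocab_size = 10_000   # the hypothetical 10k-token vocabulary
emb_dim = 64          # each token becomes a dense 64-dim vector instead

emb = nn.Embedding(vocab_size, emb_dim)   # a (10000, 64) lookup table
tokens = torch.tensor([3, 141, 5926])     # token indices, no one-hot needed
vecs = emb(tokens)                        # shape (3, 64)
```

this is mathematically the same as `one_hot(tokens) @ emb.weight`, just without materializing the giant sparse vectors.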

building the network

these neurons perform a simple function, logits = xenc @ W, where @ is a dot product (matrix multiply)

import torch
import torch.nn.functional as F

W = torch.randn((27, 27)) # random weights, one column per neuron
xenc = F.one_hot(xs, num_classes=27).float() # one-hot encode the input indices xs
logits = xenc @ W # log-counts
counts = logits.exp() # e ^ logit # equivalent to the counts N
probs = counts / counts.sum(1, keepdim=True) # probabilities for next char

the last two lines are called the softmax, the more you know
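quick sanity check that the exp-then-normalize pair really is the softmax, using random stand-in logits:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 27)                      # fake log-counts
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)     # manual softmax, as above

probs2 = F.softmax(logits, dim=1)                # same thing in one call
```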


study how the neural net works, i still don't get it lol

anyways

the way we feed something into our neural net: get the index of the character, one-hot encode it, pass it through the net, and out comes a probability distribution over the next token. its shape is the number of possible tokens (27 in our case)
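that pipeline end to end, with random weights standing in for a trained W and a made-up input index:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)            # stand-in weights

ix = torch.tensor([5])                            # 1. get the index
xenc = F.one_hot(ix, num_classes=27).float()      # 2. one-hot encode it
logits = xenc @ W                                 # 3. through the net
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)      # 4. distribution over the next char
```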

our loss function is the average negative log-likelihood, nlls.mean().item()
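what that number actually is, with made-up probabilities the model assigned to the correct next characters:

```python
import torch

# made-up probabilities the model gave to the *correct* next chars
p_correct = torch.tensor([0.9, 0.5, 0.1])

nlls = -p_correct.log()     # negative log-likelihood per example
loss = nlls.mean().item()   # the average: lower = higher prob on the truth
```

the 0.1 example dominates the average, which is the point: the loss punishes the model hardest where it was most wrong.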

we're tryna find a good setting of W via gradient-based optimization
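a minimal sketch of that optimization loop, with made-up bigram training pairs and an assumed learning rate of 10 (the forward pass is the same softmax as above; backward() fills W.grad and we nudge W against it):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

xs = torch.tensor([0, 5, 13])   # made-up input indices
ys = torch.tensor([5, 13, 0])   # made-up target indices

losses = []
for _ in range(10):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(3), ys].log().mean()
    losses.append(loss.item())

    # backward pass + update: step W opposite the gradient
    W.grad = None
    loss.backward()
    W.data += -10.0 * W.grad
```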

tldr; it SUCKS

8/29/24 9/13/24 9/16/24