What you can - and can't - hope to understand about how AI works
Part two of a four-part series of blogs
I'm sick of reading blogs pretending to explain how Large Language Models work, when the truth is that few people on earth can ever hope to understand how they are constructed. This blog is my attempt to explain as much as 99.99% of us can ever hope to understand about how AI works, and why the other bits will always remain opaque to most people.
In this part of the blog we'll look at tokens (which you probably need to know about) and embedding (which you probably don't).
AI tools convert text into numbers using something called tokens. You can use tools like OpenAI's Tokenizer to see this in action:
Here our text contained 7 words and 27 characters (including spaces), and yielded 8 tokens (shown in different colours above).
One token corresponds - on average - to about 4 characters of English text (or to put it another way, each word generates on average one-and-a-third tokens).
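If you'd like to see this rule of thumb as arithmetic, here's a minimal sketch. The function names are my own invention, and the figures are only the rough averages quoted above, so treat the results as estimates rather than real token counts.

```python
import math

# rule-of-thumb averages for English text (approximations, not exact values)
CHARS_PER_TOKEN = 4


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count as one token per 4 characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate the token count as one-and-a-third tokens per word."""
    word_count = len(text.split())
    return math.ceil(word_count * 4 / 3)


sample_text = "Write a haiku about an owl"
print(estimate_tokens_from_chars(sample_text))  # 26 characters -> 7
print(estimate_tokens_from_words(sample_text))  # 6 words -> 8
```

The two estimates won't always agree with each other (or with a real tokenizer), which is what you'd expect from averages.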
If you want to dive deeper into this process you can use the tiktoken module in Python to see the numbers behind the tokens.
To do this you'd also obviously need to know how to program in Python, but fortunately I've done this for you so you can just read the results below!
Just in case you want to try this out yourself, here's the simple code I used (you'll obviously need to install the tiktoken module first):
# the module giving tokens
import tiktoken
# which LLM we'll use to generate tokens
tokenizer_to_use = tiktoken.encoding_for_model("gpt-4")
# the text for which we will generate tokens
ai_question = "Write a haiku about an owl"
# get and print the corresponding numbers
tokens = tokenizer_to_use.encode(ai_question)
print(tokens)
# reverse the process
decoded_question = tokenizer_to_use.decode(tokens)
print(decoded_question)
And here's the output from this program:
So for example the word haiku is rendered by ChatGPT as tokens 6520 and 39342.
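To get a feel for how one word can end up as two tokens, here's a toy sketch that needs no extra modules. Everything in it is an assumption made for illustration: the vocabulary and its ID numbers are invented (they are not the real GPT-4 token numbers), and it uses a simple "longest matching piece first" rule rather than the byte-pair encoding that real tokenizers use.

```python
# hypothetical miniature vocabulary - the pieces and IDs are invented for
# illustration and are NOT the real GPT-4 token numbers
TOY_VOCABULARY = {
    "Write": 1,
    " a": 2,
    " ha": 3,
    "iku": 4,
    " about": 5,
    " an": 6,
    " owl": 7,
}
TOY_PIECES_BY_ID = {token_id: piece for piece, token_id in TOY_VOCABULARY.items()}


def toy_encode(text: str) -> list[int]:
    """Split text into the longest known pieces, working left to right."""
    token_ids = []
    position = 0
    while position < len(text):
        # try the longest candidate piece first, then progressively shorter ones
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in TOY_VOCABULARY:
                token_ids.append(TOY_VOCABULARY[piece])
                position = end
                break
        else:
            raise ValueError(f"No piece matches the text at position {position}")
    return token_ids


def toy_decode(token_ids: list[int]) -> str:
    """Reverse the process by joining the pieces back together."""
    return "".join(TOY_PIECES_BY_ID[token_id] for token_id in token_ids)


toy_tokens = toy_encode("Write a haiku about an owl")
print(toy_tokens)              # [1, 2, 3, 4, 5, 6, 7]
print(toy_decode(toy_tokens))  # Write a haiku about an owl
```

Because this toy vocabulary has no single piece for haiku, the word comes out as two pieces (" ha" and "iku"), which is the same effect you can see in the real output above.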
There are only so many tokens that any AI tool can understand in one go (a subject I'll return to in the final part of this blog, looking at the constraints of using AI tools).
How deeply do you want to descend the rabbit-hole? The next thing that an AI tool does is to turn these tokens into vectors of numbers, giving context about the question you've asked. You can do this yourself, but you'll need to subscribe to OpenAI's API first to get an API key (and again, you'll need to know how to program in Python). Here's the code I used should you want to try this:
# the bits of text in the prompt
prompt_questions = [
    "Write a haiku about an owl"
]
# you'll need to install the openai module first
from openai import OpenAI
# create an API session
openapi_session = OpenAI(api_key="YOUR-LONG-API-KEY-HERE")
# which embedding model to use
embedding_model = "text-embedding-3-small"
# get the embedding vector (a list of numbers) for the first bit of text
embedding_vectors = openapi_session.embeddings.create(
    input=prompt_questions,
    model=embedding_model
).data[0].embedding
# show the number of elements in this vector, and the first few
print("There are {0} elements in the vector - here are the first 2:\n{1}".format(
    len(embedding_vectors),
    embedding_vectors[0:2]
))
For those without the time, money or expertise to try the above, here are the results of running this program:
The first 2 of the 1,536 elements in the embedding vector.
Notice that the vector contains 1,536 numbers. This is the number of dimensions that the text-embedding-3-small model returns by default, and it stays the same however many tokens your text contains. Here's how to think about these dimensions. Suppose that you typed in the word small and one of the dimensions was size.
An imaginary representation of the size dimension, running from 0 (no size at all) to 1 (infinite size).
Then for this dimension, the embedding for this word might return the number 0.378825161688980441 (a number I have invented to illustrate the point).
One of the problems with AI tools is that not only do you not know what the dimensions represent, it's likely that the people who built the large language models (see the next part of this blog for more on these) don't know either, since most models are generated by computers.
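Here's a toy sketch of the idea, using three imaginary dimensions instead of 1,536. The dimension names and every number are invented for illustration (real dimensions carry no such labels), and cosine similarity is just one common way of comparing two vectors, not necessarily what any particular AI tool does internally.

```python
import math

# imaginary dimensions and invented values, purely for illustration - real
# embedding dimensions have no human-readable labels like these
DIMENSION_NAMES = ["size", "animal", "poetry"]
TOY_EMBEDDINGS = {
    "small": [0.38, 0.05, 0.02],
    "tiny": [0.30, 0.04, 0.03],
    "owl": [0.25, 0.95, 0.10],
}


def cosine_similarity(first: list[float], second: list[float]) -> float:
    """Return a score near 1 when two vectors point in a similar direction."""
    dot_product = sum(a * b for a, b in zip(first, second))
    first_length = math.sqrt(sum(a * a for a in first))
    second_length = math.sqrt(sum(b * b for b in second))
    return dot_product / (first_length * second_length)


# look up the invented value for the word small on the size dimension
size_index = DIMENSION_NAMES.index("size")
print(TOY_EMBEDDINGS["small"][size_index])

# compare small with a similar word and with an unrelated one
small_vs_tiny = cosine_similarity(TOY_EMBEDDINGS["small"], TOY_EMBEDDINGS["tiny"])
small_vs_owl = cosine_similarity(TOY_EMBEDDINGS["small"], TOY_EMBEDDINGS["owl"])
print(small_vs_tiny > small_vs_owl)
```

With these made-up numbers, small scores as closer to tiny than to owl, which gives a flavour of how words with similar meanings can end up with similar vectors.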
Now that your AI tool has a numerical representation of what you're asking, it can start to answer the question; and to do that, it needs a large language model.
© Wise Owl Business Solutions Ltd 2024. All Rights Reserved.