Tokenisation and embedding

What you can - and can't - hope to understand about how AI works

Part two of a four-part series of blogs

I'm sick of reading blogs pretending to explain how Large Language Models work, when the truth is that few people on earth can ever hope to understand how they are constructed. This blog is my attempt to explain as much as 99.99% of us can ever hope to understand about how AI works, and why the other bits will always remain opaque to most people.

What you can - and can't - hope to understand about how AI works
Tokenisation and embedding (this blog)
Large language models
Constraints on the use of AI

Posted by Andy Brown on 16 July 2024

In this blog

Generating Tokens from Questions
Turning these Tokens into Numbers
Embedding

In this part of the blog we'll look at tokens (which you probably need to know about) and embedding (which you probably don't).

Generating Tokens from Questions

AI tools convert text into numbers using something called tokens. You can use tools like OpenAI's Tokenizer to see this in action:

Here our text contained 7 words and 27 characters (including spaces), and yielded 8 tokens (shown in different colours above).

One token corresponds - on average - to about 4 characters of English text (or to put it another way, each word generates on average one-and-a-third tokens).
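If you'd like to apply these two rules of thumb yourself, here's a minimal sketch. The function names and the two constants are my own illustrative choices, and the results are only rough estimates for English text (a real tokenizer may well give a different count):

```python
import math

# Rough token estimates based on the two rules of thumb above.
# These are approximations for English text, not real token counts.
CHARS_PER_TOKEN = 4  # about 4 characters per token


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count from the number of characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate the token count at one-and-a-third tokens per word."""
    word_count = len(text.split())
    return math.ceil(word_count * 4 / 3)


question = "Write a haiku about an owl"
print(estimate_tokens_from_chars(question))
print(estimate_tokens_from_words(question))
```

The two estimates won't always agree with each other, which is a useful reminder that these are averages rather than rules.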

Turning these Tokens into Numbers

If you want to dive deeper into this process you can use the tiktoken module in Python to see the numbers behind the tokens.

To do this you'd also obviously need to know how to program in Python, but fortunately I've done this for you so you can just read the results below!

Just in case you want to try this out yourself, here's the simple code I used (you'll obviously need to install the tiktoken module first):

# the module giving tokens
import tiktoken

# which model's tokenizer we'll use to generate tokens
tokenizer_to_use = tiktoken.encoding_for_model("gpt-4")

# the text for which we will generate tokens
ai_question = "Write a haiku about an owl"

# get and print the corresponding numbers
tokens = tokenizer_to_use.encode(ai_question)
print(tokens)

# reverse the process
decoded_question = tokenizer_to_use.decode(tokens)
print(decoded_question)

And here's the output from this program:

So, for example, the word haiku is rendered by the GPT-4 tokenizer as two tokens, numbered 6520 and 39342.
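To get a feel for why one word can become two tokens, here's a toy tokenizer that needs no extra modules. It's only a sketch: the vocabulary, the token numbers and the greedy longest-match rule are all invented for illustration, whereas real tokenizers such as tiktoken's use byte-pair encoding with a learned vocabulary of many thousands of pieces. I've assumed that haiku splits into the pieces " ha" and "iku", which may not be the exact split a real tokenizer uses.

```python
# A toy tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# The pieces and their numbers are hypothetical, not taken from any real model.
TOY_VOCAB = {
    "Write": 0,
    " a": 1,
    " ha": 2,
    "iku": 3,
    " about": 4,
    " an": 5,
    " owl": 6,
}


def toy_encode(text: str) -> list[int]:
    """Split text into the longest known pieces, working left to right."""
    ids = []
    position = 0
    while position < len(text):
        # try the longest candidate piece first, then shorter ones
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                position = end
                break
        else:
            raise ValueError(f"No token for text starting at {text[position:]!r}")
    return ids


def toy_decode(ids: list[int]) -> str:
    """Reverse the process by joining the pieces back together."""
    pieces = {token_id: piece for piece, token_id in TOY_VOCAB.items()}
    return "".join(pieces[token_id] for token_id in ids)


tokens = toy_encode("Write a haiku about an owl")
print(tokens)
print(toy_decode(tokens))
```

Because haiku isn't in the toy vocabulary as a whole word, it gets built from two smaller pieces, while common words like about get a token of their own.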

There are only so many tokens that any AI tool can understand in one go (a subject I'll return to in the final part of this blog, looking at the constraints of using AI tools).

Embedding

How deeply do you want to descend the rabbit-hole? The next thing that an AI tool does is to turn these tokens into vectors of numbers, giving context about the question you've asked. You can do this yourself, but you'll need to subscribe to OpenAI's API first to get an API key (and again, you'll need to know how to program in Python). Here's the code I used should you want to try this:

# the bits of text in the prompt
prompt_questions = [
    "Write a haiku about an owl"
]

# you'll need to install the openai module first
from openai import OpenAI

# create an API session
openapi_session = OpenAI(api_key="YOUR-LONG-API-KEY-HERE")

# which embedding model to use
embedding_model = "text-embedding-3-small"

# create the vector of numbers giving context for the first (and only) prompt
embedding_vectors = openapi_session.embeddings.create(
    input=prompt_questions,
    model=embedding_model
).data[0].embedding

# show the number of elements in this vector, and the first few
print("There are {0} elements - here are the first 2:\n{1}".format(
    len(embedding_vectors),
    embedding_vectors[0:2]
))

For those without the time, money or expertise to try the above, here are the results of running this program:

The first 2 of the 1,536 elements in the embedding vector.

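Notice that 1,536 doesn't depend on the number of tokens (7) in the question: it's the fixed number of dimensions which the text-embedding-3-small model returns by default, so you get a single vector of this length for the whole piece of text, however long or short it is. Here's how to think about these dimensions. Suppose that you typed in the word small, and that one of the dimensions was size.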

An imaginary representation of the size dimension, running from 0 (no size at all) to 1 (infinite size).

Then for this dimension, embedding might return the number 0.378825161688980441 (a number I have invented to illustrate the point).
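Here's a toy version of this idea in code. Everything in it is invented for illustration: the three dimension names, the words and all of the numbers are hypothetical, and a real model has many more dimensions, none of which come with labels. Looking up one word at a time is also a simplification of what a real embedding model does.

```python
# A toy embedding table. The dimension names and all of the values are
# invented for illustration; real models do not label their dimensions.
DIMENSIONS = ["size", "animal", "poetry"]  # hypothetical dimension names

TOY_EMBEDDINGS = {
    "small": [0.378825161688980441, 0.02, 0.01],
    "owl": [0.35, 0.97, 0.10],
    "haiku": [0.20, 0.03, 0.95],
}


def describe(word: str) -> dict[str, float]:
    """Pair each made-up dimension name with the word's value for it."""
    vector = TOY_EMBEDDINGS[word]
    return dict(zip(DIMENSIONS, vector))


# every word gets a vector of the same length, one number per dimension
for word, vector in TOY_EMBEDDINGS.items():
    print(word, len(vector), vector)

print(describe("small")["size"])
```

The thing to take away is that every entry has the same number of elements, one per dimension, whatever the word is.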

One of the problems with AI tools is that not only do you not know what the dimensions represent, it's likely that the people who built the large language models (see the next part of this blog for more on these) don't know either, since most models are generated by computers rather than designed by hand.

Now that your AI tool has a numerical representation of what you're asking, it can start to answer the question; and to do that, it needs a large language model.
