LLM Series, Post 1: What is an LLM?
A lot of people probably know what an LLM is at this point. Most people call it a large language model, but I call it an autoregressive multi-class classifier, because the functionality of an LLM is to predict the next word and move on until it encounters the end of the sentence. In this series of posts I am going to explore what an LLM is, how it is trained, and what industry terms like "prompt engineering", "one shot" prompting, and other buzzwords mean in machine learning terms.
This post is written by me and may not be that fluent, because I did not run it through a large language model; I find it counterintuitive to write an LLM post using an LLM. I am not a person who is for or against LLMs, because with the correct mindset and usage I think they can be very useful.
Although a lot of people write a lot of articles about LLMs, in general I find them all to be the same: lacking depth or blocked by paywalls. So I am writing what I understand and how I use LLMs in my workflow. There are scenarios where an LLM can increase your productivity 100x, but at the same time there are scenarios where it can decrease your productivity 100x. So this is not a "do this", "do that", "prompt like a pro" post; this series will tell you how an LLM works so you can work out the input (prompt) yourself.
In this introductory post I will explain why an LLM is an "autoregressive multi-class classifier" and how an LLM works at the inference layer (output). In the next post I will explore its inner workings by designing one. The reason I split this into a series is that NLP is very resource intensive and needs more time to preprocess and clean the data, train the model, and so on. For example, if we take a book as a corpus (structured text used for ML training), that is thousands of tokens, so it may take some time.
NLP (natural language processing) is an advanced machine learning technique, so please learn some basic machine learning techniques like regression or classification before proceeding, to understand the concepts better.
Before diving deep into the world of LLMs, let us first convert the marketing terms to machine learning terms.
1 token is not equal to 1 word. A word may be 2 or more tokens.
Prompt Engineering -> Preprocessing and cleaning the Data
Context Engineering -> Preprocessing and cleaning the Data
Prompt -> Cleaned and Pre-processed Data or Input
Context -> Cleaned and Pre-processed Data or Input
LLM Skills -> Cleaned and Pre-processed Data or Input
Hallucination -> Wrong Predictions
One Shot Prompting -> Cleaned and Pre-processed Data or Input
Two Shot Prompting -> Cleaned and Pre-processed Data or Input
RAG -> Fancy way of saying clean data
Classes -> words in a natural language.
Agents -> different instances of LLM with different initial information(context)
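To make the "one shot" / "two shot" rows concrete, here is a tiny sketch of the idea: "one shot" prompting is just prepending one worked example to the input, which in ML terms is simply more pre-processed input data. The task and example strings below are made up for illustration.

```python
# "One shot" prompting is nothing more than extra input data: the prompt
# gains one worked example before the actual task. (Strings are hypothetical.)
task = "Translate to French: cheese"
worked_example = "Translate to French: sea otter -> loutre de mer"

zero_shot_input = task                          # no examples in the input
one_shot_input = worked_example + "\n" + task   # one example in the input

print(one_shot_input)
```

"Two shot" prompting would simply prepend two such examples; nothing about the model changes, only the input.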
For the non-ML person, below is the standard process in machine learning to train a model or get inference from it.
Preprocess the data (structure and clean it) -> split the data (70:30) or (80:20) into training and test sets -> train the model -> test it on the test set -> get inference
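The pipeline above can be sketched in a few lines of plain Python. This is a toy version with made-up email strings; a real pipeline would use a library such as scikit-learn for the split and the model.

```python
# A minimal sketch of the standard ML workflow: preprocess -> split -> (train/test).
# The email strings are hypothetical placeholder data.

def preprocess(texts):
    # Step 1: clean and structure the raw data.
    return [t.strip().lower() for t in texts]

def train_test_split(data, train_ratio=0.8):
    # Step 2: split into training and test sets (here 80:20).
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

emails = ["Win a FREE prize now ", "Meeting at 3pm tomorrow ",
          "Cheap loans available ", "Lunch on Friday? ",
          "You have been selected "]
cleaned = preprocess(emails)
train, test = train_test_split(cleaned)
print(len(train), len(test))  # 4 1
```

The remaining steps (train the model, test it, get inference) depend on the model, which the naive Bayes example below the glossary walks through.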
Let us start at the inference level with a commercial model, Claude or ChatGPT for example. We will go step by step through how the model predicts text based on the input you give it. To understand it, take a simple naive Bayes classifier that predicts whether an email is spam or not.
In machine learning model training you will have X and Y values, where X is the email and Y is one of the classes Spam / Not Spam.
Imagine the same with multiple classes in a language. I think I am complicating it a bit, but the example below makes it much easier to understand. Training data for a naive Bayes classifier looks as follows, one record per (X, Y) pair:
X = This is a mail from the local news paper
Y = SPAM
X = Hi Aravindh. How are you we are doing great here
Y = NOT SPAM
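That spam classifier can be implemented by hand in a few lines, which makes the comparison later concrete. This is a toy sketch: I added two extra made-up records so each class has more than one example, and a real project would use something like scikit-learn's MultinomialNB instead.

```python
import math
from collections import Counter

# Toy naive Bayes spam classifier. Training records mirror the X/Y pairs
# above, padded with two hypothetical extra rows per class.
training = [
    ("this is a mail from the local news paper", "SPAM"),
    ("buy now from the local news paper offer", "SPAM"),
    ("hi aravindh how are you we are doing great here", "NOT SPAM"),
    ("how are you doing today", "NOT SPAM"),
]

# "Training" = counting word frequencies per class.
word_counts = {"SPAM": Counter(), "NOT SPAM": Counter()}
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text):
    # Score each class: log prior + sum of Laplace-smoothed log likelihoods.
    vocab = {w for c in word_counts.values() for w in c}
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / len(training))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("this is a mail from the local news paper"))  # SPAM
print(predict("hi how are you doing"))                      # NOT SPAM
```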
Training data for an LLM, if we take the same sentences from our email spam classifier, might be:
X = This is a mail from the local
Y = Newspaper SPAM
X = Hi Jon. How are you we are
Y = doing great here NOT SPAM
During inference, a naive Bayes classifier will output SPAM if you type the first sentence and NOT SPAM if you type the second.
During inference, an LLM classifier will produce a series of inputs and outputs as below. Since it is an autoregressive loop, the prediction follows a method like this:
Initial input -> "This is a mail from the local" output 1 -> "News"
Input 2 -> "This is a mail from the local News" output 2 -> "paper"
Input 3 -> "This is a mail from the local News paper" output 3 -> "SPAM"
Final output -> "This is a mail from the local News paper SPAM"
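The loop above can be sketched directly. The "model" here is just a hypothetical lookup table mapping each context to its most likely next token; a real LLM computes these probabilities with a neural network, but the feed-the-output-back-in loop is the same.

```python
# A toy autoregressive loop. NEXT_TOKEN stands in for the model; the
# contexts and tokens are the ones from the walkthrough above.
NEXT_TOKEN = {
    "This is a mail from the local": "News",
    "This is a mail from the local News": "paper",
    "This is a mail from the local News paper": "SPAM",
}
STOP_TOKENS = {"SPAM", "NOT SPAM"}

def generate(prompt):
    context = prompt
    while True:
        token = NEXT_TOKEN[context]       # predict the next token
        context = context + " " + token   # feed the output back into the input
        if token in STOP_TOKENS:          # stop at the end marker
            return context

print(generate("This is a mail from the local"))
# This is a mail from the local News paper SPAM
```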
The above example is an oversimplification of what is happening. LLMs predict tokens based on prediction scores computed from token embeddings, positional embeddings, and attention scores, not from word sequences like a naive Bayes classifier, but the basic comparison still stands. Also, with only two rows of training data, an LLM may not infer any useful predictions at all.
There are some steps a machine learning model goes through for inference (the steps below are not NLP specific):
1. Input buffer or input data
2. Preprocessing step that transforms raw data to a form that the model understands
3. Randomness of the model prediction or temperature (softmax output / temperature)
4. Get prediction
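Step 3 is worth a small sketch: temperature reshapes the model's softmax output before a token is picked. The logits below are made-up scores for three candidate next tokens; dividing by a low temperature sharpens the distribution toward the top token, a high temperature flattens it toward uniform.

```python
import math

# Softmax with temperature scaling. Logits are hypothetical scores
# for three candidate next tokens.
def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # e.g. scores for "News", "paper", "dog"

sharp = softmax(logits, temperature=0.5)  # low temp: nearly deterministic
flat = softmax(logits, temperature=2.0)   # high temp: closer to uniform
print(sharp)
print(flat)
```

At temperature 0.5 the top token gets roughly 86% of the probability mass; at 2.0 it gets only about 50%, so sampling becomes noticeably more random.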
We will explore all these steps in detail in the next posts. But to end this article, I will give you a way to play with an LLM to get some understanding of how it works.
prompt 1: what is red
it will output: Red is a color ..........
prompt 2: write it in British English spellings
it will output: Red is a colour ..........
Open a new conversation and ask the same prompt 1
prompt 1: what is red
it will output: Red is a color ..........
You get the same answer as the first time, even though you corrected it to British English spellings earlier. Why is that happening? This is why input matters, and by knowing the internals you can leverage it to be productive rather than counterproductive. In the next article we will explore more theory: NLP, RNNs and their limitations, and Transformers and how they overcome those limitations. Please learn horizontal scaling, parallel processing, and NLP basics before reading that article.
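The short answer, sketched below: the model itself is stateless, and a "conversation" is just the whole history concatenated into one input on every turn. A new conversation has an empty history, so the input is byte-for-byte the same as the very first prompt. The helper function and strings here are hypothetical, not any real chat API.

```python
# Why a new conversation "forgets": each request re-sends the full history
# as part of the input. (build_input is a made-up stand-in for a chat API.)
def build_input(history, new_prompt):
    return "\n".join(history + [new_prompt])

conversation = []
turn1 = build_input(conversation, "what is red")
conversation += ["what is red", "Red is a color ..."]
turn2 = build_input(conversation, "write it in British English spellings")

fresh = build_input([], "what is red")  # new conversation: empty history

print(turn1 == fresh)  # identical input, hence the identical answer
print("British" in turn2 and "British" not in fresh)  # the correction lives only in that conversation's input
```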