Local LLM Part 1 - llama.cpp
What is llama.cpp?
Llama.cpp is an open source library that performs inference on large language models. It's much lighter than frameworks like Ollama or LM Studio.
What is it used for?
For running LLMs on consumer hardware. It uses quantized models, which massively reduce the memory requirements without significantly reducing quality. This allows people to run LLMs locally, which is good for experimentation and for commercial applications where privacy is a major concern.
Why not [insert other tool]?
Because the point of this exercise is to learn about LLMs by doing it "the hard way". The more friction we encounter, the more we can learn.
What are we going to build?
For now, the only thing we are going to do is set up llama.cpp locally and prompt it a couple of times to make sure it works. We will make further improvements in the following parts, such as a UI, sessions and RAG.
How to install
I use an M3 MacBook, so I will be following the instructions for that system. I will provide links to the instructions for each of the operating systems.
- Windows, aka just download the latest exe and run it.
- Linux
- Mac
Which model to use?
For local LLMs, the models used are ones big enough to be useful and small enough to fit into memory. There is a class of models called quantized models which fit this niche perfectly. They have been modified in such a way that they use a lot less memory at the cost of some quality loss. This comes in degrees marked as precision variants, which are measured in bits. These range from 2 to 16, with 2 being the most quantized and 16 being the least.
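To build some intuition, here is a toy sketch (nothing like the actual GGUF quantization schemes, just the basic idea) of what squeezing 32-bit floats into 4-bit integers does to memory and accuracy:

import numpy as np

# Toy quantization: map float32 weights onto 4-bit integers with a single scale factor
weights = np.random.randn(1000).astype(np.float32)
scale = np.abs(weights).max() / 7                       # fit the range into [-7, 7]
q4 = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize and see what the memory savings cost us in precision
restored = q4.astype(np.float32) * scale
print("mean absolute error:", np.abs(weights - restored).mean())
print("bytes: float32 =", weights.nbytes, "| 4-bit ≈", weights.size // 2)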
Other than being quantized, we also need to pay attention to how many parameters a model has. The parameter count is a proxy for how much "knowledge" the model contains: the more parameters, the "smarter" the model. Naturally we want the highest number possible; however, we still need to fit all of this into memory.
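As a back-of-the-envelope check (ignoring context and other overhead), memory is roughly parameters × bits per weight / 8:

# Rough memory estimate for a 7B-parameter model at 5-bit quantization
params = 7e9
bits = 5
print(f"~{params * bits / 8 / 1e9:.1f} GB")  # ≈ 4.4 GB, in the ballpark of the ~5 GB Q5_K_M file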
The ones I recommend are Llama-2-7B and Mistral-7B; they are big enough to be somewhat useful and small enough to run on most modern computers.
There are many quantized sizes of each; choose the biggest one you can run, since the bigger the model, the better the quality.
I chose Mistral, mostly because I liked the name more. It sounds like a DnD character.
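If you want to script the download instead of clicking through a browser, here is a minimal sketch using the huggingface_hub package (an extra dependency; TheBloke/Mistral-7B-Instruct-v0.2-GGUF is just one common source of the file used below):

from huggingface_hub import hf_hub_download

# Download the quantized model into the current directory
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",   # one common GGUF conversion
    filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    local_dir=".",
)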
How to prompt
Create a Jupyter notebook and add the following code.
- Install the necessary packages.
pip install llama-cpp-python sentence-transformers numpy torch pandas
- Import the libraries
from llama_cpp import Llama
- Load the model. The model file should be in the same directory as the notebook.
# Load the model
model_path = "mistral-7b-instruct-v0.2.Q5_K_M.gguf"
llm = Llama(
    model_path=model_path,
    n_ctx=2048,        # context window size in tokens
    n_gpu_layers=30,   # layers to offload to the GPU/Metal; -1 offloads all of them
    verbose=False      # suppress llama.cpp's loading logs
)
- Run a prompt. Make sure to be very specific about what type of response you expect; otherwise the model will ramble until it hits the token limit.
# Generate a response
question = "What is artificial intelligence?"  # renamed to avoid shadowing the built-in `input`
prompt = f"Be extremely concise. Answer in one sentence only: {question}"
response = llm(
    prompt,
    max_tokens=50,      # hard cap on the length of the reply
    temperature=0.5,    # lower = more deterministic, higher = more varied
    top_p=0.9,          # nucleus sampling cutoff
    stop=["</s>", "Human:", "User:"],  # strings that end generation early
    echo=False          # don't repeat the prompt in the output
)
- Check response
# Print the response
print(response["choices"][0]["text"])
The result should look something like this:
Artificial intelligence is a branch of computer science that enables systems to perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation.
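Mistral's instruct models were trained with a chat template (the [INST] ... [/INST] wrapper). The raw completion call above works, but you can also let llama-cpp-python apply the template for you via its chat API. A minimal sketch, reusing the llm object from before and assuming the GGUF file carries the template metadata (if not, a chat_format argument can be passed when constructing Llama):

# Chat-style call: the library formats the prompt with the model's chat template
chat = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Be extremely concise. What is artificial intelligence?"}
    ],
    max_tokens=50,
    temperature=0.5,
)
print(chat["choices"][0]["message"]["content"])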