I'll be honest: when I first heard about running LLMs locally, I assumed it was something for people with server farms in their basement. Turns out, I was wrong.
I spent last weekend testing different configurations, and my local LLM now runs happily on a mid-range PC. Here's how I did it, what didn't work, and what did.
Why I Even Bothered with Local LLMs
Before we build anything, let me explain why I went down this rabbit hole. I was fed up with waiting 2-3 seconds every time I sent a query to ChatGPT, and I'm not particularly fond of handing my data to third-party servers. Local LLMs promised near-instant responses with no network round trip, plus total privacy. That alone got me curious.
And if you're making more than 10,000 API calls a month, going local can actually pay for itself within a year. Not everyone operates at that scale, but it's worth knowing.
Step 1: Check Whether Your Hardware Can Handle It
Straight talk: you do not need an $8,000 graphics card. Mine is an RTX 3060 with 12GB of VRAM, paired with 32GB of system RAM, and it has no trouble with 7B-13B models.
The sweet spot for beginners:
- CPU: 2.5GHz or faster
- GPU: 12GB of VRAM minimum (RTX 3060 / 3080 or better)
- RAM: 16-32GB of system memory
- Storage: a fast SSD (models weigh in at 4-8GB each)
With 24GB of VRAM you can run quantized 70B models. But honestly? Start small. A 7B model will surprise you with what it can do.
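If you're not sure what GPU or how much VRAM you have, a quick terminal check will tell you. This assumes an NVIDIA card with the drivers already installed:

nvidia-smi --query-gpu=name,memory.total --format=csv

The memory figure it prints is the number that matters most when deciding which model sizes to try.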
Step 2: Pick the Right Tool (I Tried Three)
I tested Ollama (the ollama/ollama project on GitHub), LM Studio, and GPT4All. Here's my take:
Ollama won for me. It's command-line, which sounds harder but is actually easier. You type ollama run llama3 and, bang, you're talking to an AI. No configuration hell.
LM Studio skips the terminal entirely. It has a slick GUI where you can browse models and do everything with clicks. I'd recommend it if you're not comfortable on the command line.
GPT4All is often described as the easiest to use, but I found it more rigid once I wanted to go beyond the basics.
For this guide I'm using Ollama, since that's what I stuck with.
Step 3: Install Ollama (a 5-Minute Job)
I'm on Windows, so I downloaded the installer from ollama.com. Mac and Linux users have it even easier thanks to package managers.
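For reference, these were the usual one-liners at the time I set this up; the ollama.com download page has the current instructions if they've changed since:

# macOS, via Homebrew
brew install ollama

# Linux, via the official install script
curl -fsSL https://ollama.com/install.sh | sh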
Once the install finished, I typed this command into my terminal:
ollama run llama3
That's it. Ollama automatically downloaded the Llama 3 model (around 4.7GB) and started it up. The first reply took a bit longer (around 10 seconds) because the model had to load into memory. After that? Lightning fast.
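One aside that matters if you're replacing paid API calls: Ollama also listens on a local HTTP API (port 11434 by default), so your scripts can talk to it much the way they would talk to a cloud endpoint. A minimal sketch, assuming the default setup and the llama3 model already downloaded:

curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'

By default the reply streams back as a series of JSON chunks.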
Step 4: Try Different Models
Here's where it got fun. Ollama's model library gives you plenty to play with:
- Llama 3 (8B): Versatile, handles most everyday tasks without trouble.
- Qwen 2.5 (7B): Noticeably good at coding questions.
- Mistral (7B): Felt the strongest at reasoning in my conversations.
I switched between them with ollama run [model-name]. Each model has its own strengths, so I kept several installed.
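If you want to do the same, you can pull the models ahead of time and then hop between them. Something like this, using tags that were current when I tested (check the Ollama library for the latest ones):

ollama pull qwen2.5:7b
ollama pull mistral
ollama list
ollama run qwen2.5:7b

ollama list shows everything you've downloaded along with its size on disk, which is handy once your SSD starts filling up.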
Step 5: Make It Actually Useful with RAG
Running a base model is fun, but here's what made it genuinely useful for me: connecting it to my personal documents with what's known as RAG (Retrieval-Augmented Generation).
Think of it this way: instead of guessing at answers, the AI first searches your documents and then responds based on the real information it finds. I used Open WebUI (the open-webui/open-webui project), a web interface that handles document uploads and connects to Ollama automatically.
Setup took about 20 minutes. I can now ask questions about my project notes and it knows exactly what I'm talking about.
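For anyone who wants to reproduce the setup: the quickest route is the Docker command from the Open WebUI README. Roughly this, though the exact flags may have changed since I ran it, so treat it as a sketch and check the project's current docs:

docker run -d -p 3000:8080 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Once the container is up, the interface lives at http://localhost:3000. It should find a locally running Ollama on its own; if it can't reach it from inside the container, the README covers the extra networking flag you need.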
Step 6: Optimizing for Speed
Out of the box, my 7B model ran at roughly 25 tokens per second. Not bad, but I wanted faster. Here's what helped:
Quantization. I switched to 4-bit quantized models, which are compressed versions that use far less memory. My RTX 3060 went from straining under 13B models to running them smoothly.
Here's an example of running a quantized model in Ollama (exact tags vary by model, so check the library page for the one you want):
ollama run llama3:8b-instruct-q4_0
The q4_0 in the tag means it's 4-bit quantized. My speed jumped to 40+ tokens per second, and honestly, I couldn't tell the difference in quality.
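If you want to see your own tokens-per-second numbers rather than eyeballing it, Ollama can print timing stats. Assuming a reasonably recent build:

ollama run llama3:8b-instruct-q4_0 --verbose

After each reply it prints load time, prompt evaluation speed, and the eval rate in tokens per second, which is the figure to watch when comparing quantizations.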
What Actually Surprised Me
Three things I didn’t expect:
- It's far less technical than I imagined. If you can install software and copy-paste commands, you're good.
- Offline mode is amazing. I tried it on a flight with no internet, and my local AI kept working. Can't do that with ChatGPT.
- The community is huge. Whenever I needed help, I found answers on Reddit's LocalLLaMA community within minutes. People are genuinely helpful.
Should You Build One?
If you only use ChatGPT casually a couple of times a week, it's probably not worth it. But if you:
- Care about privacy
- Need fast responses for coding or writing
- Process sensitive data
- Or simply enjoy tinkering with technology
Then yes, give it a weekend. I went from knowing nothing to a working setup in 6-7 hours of hands-on time (the rest was spent waiting on downloads).
Start with Ollama and a 7B model. You can always upgrade your hardware and move to bigger models later if it clicks. The barrier to entry in 2025 is lower than you'd imagine, and frankly, having your own local AI running is deeply satisfying.
Also Read: ChatGPT vs Gemini: Which AI Assistant Wins for Students and IT Pros in 2025?
