Are you curious about running a powerful large language model (LLM) right on your own laptop? With recent advancements in quantized models and GPU-accelerated tools, it's entirely possible! In this post, we walk through the exact steps to set up and run a local LLM (like Mistral 7B) using an NVIDIA GPU on a Windows laptop with WSL (Windows Subsystem for Linux).
System Setup & Prerequisites
1. Hardware Requirements
- NVIDIA GPU (e.g., RTX 3060 with 6GB+ VRAM)
- SSD recommended
2. Software Stack
- Windows 10/11 with WSL2 (Ubuntu)
- NVIDIA GPU Driver (latest)
- CUDA Toolkit installed in WSL
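Before the Python check in the next step, a quick shell-level sanity check can confirm the GPU is visible inside WSL. Both commands below are standard NVIDIA utilities: nvidia-smi is injected into WSL by the Windows driver, while nvcc comes from the CUDA toolkit.

```shell
# Check that the Windows NVIDIA driver is exposed to WSL
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv
else
    echo "nvidia-smi not found - update the Windows NVIDIA driver"
fi
# Check that the CUDA toolkit is installed inside WSL
nvcc --version 2>/dev/null | grep release || echo "nvcc not found - install the CUDA toolkit in WSL"
```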
3. Check GPU Compatibility in WSL
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
Expected Output:
True
NVIDIA GeForce RTX 3060 Laptop GPU
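Note that this check assumes PyTorch is already installed in WSL. If it isn't, a minimal setup might look like the following (the cu121 index URL is an assumption for CUDA 12.1 wheels; match it to your installed toolkit version):

```shell
# Create an isolated environment so torch doesn't pollute the system Python
python3 -m venv ~/llm-env
source ~/llm-env/bin/activate
# Install a CUDA-enabled PyTorch build (adjust cu121 to your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu121
```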
Installing llama.cpp with CUDA Support
Step 1: Clone the Repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Step 2: Install Build Tools
sudo apt update
sudo apt install -y cmake build-essential python3-dev nvidia-cuda-toolkit
Step 3: Build with CMake (CUDA Enabled)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON   # on older checkouts the flag was -DLLAMA_CUDA=on
cmake --build . --config Release -- -j$(nproc)
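Once the build finishes, it's worth confirming that the binaries exist and that CUDA actually got linked in before moving on. A quick sanity check from inside the build/ directory (the grep pattern is a heuristic, not an official check):

```shell
# Binaries land in bin/ under the build directory
ls bin/llama-* 2>/dev/null || echo "no binaries found - re-check the cmake steps"
# A CUDA-enabled build links against the CUDA runtime libraries
ldd bin/llama-run 2>/dev/null | grep -iE 'cuda|cublas' || echo "no CUDA libs linked - re-run cmake with CUDA enabled"
```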
Downloading a GGUF Model
Step 1: Create the Models Directory
mkdir -p ../models
cd ../models
Step 2: Download Mistral 7B Instruct GGUF
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf -O mistral.gguf
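The Q4_K_M quantization of Mistral 7B is roughly 4.4 GB, so a much smaller file usually means an interrupted download. GGUF files also begin with the magic bytes "GGUF", which allows a quick integrity check:

```shell
# Verify the file landed and looks plausible in size
ls -lh mistral.gguf 2>/dev/null || echo "mistral.gguf not found"
# A valid GGUF file starts with the magic bytes "GGUF"
head -c 4 mistral.gguf 2>/dev/null; echo
```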
Running the Model
Run in Interactive Mode
cd ../build
./bin/llama-run --ngl 99 ../models/mistral.gguf -i
Type your prompt at the > and press Enter.
Run with a One-Off Prompt
./bin/llama-run --ngl 99 ../models/mistral.gguf "Say hello in three languages."
Common Issues & Fixes
Issue: llama.cpp built without libcurl, downloading from an url not supported.
This warning is safe to ignore if you’re loading a local file.
Issue: failed to open GGUF file
Make sure your model path is correct and the file exists.
Issue: Model output scrolls endlessly
Ensure you are passing a proper prompt, or use the -i flag for manual interaction.
What You’ve Achieved
- Installed and built llama.cpp with CUDA
- Downloaded and loaded a quantized Mistral 7B model (GGUF format)
- Verified GPU-accelerated inference locally
- Ran both interactive and one-shot queries
You’re now ready to move toward more advanced use cases like building a Retrieval-Augmented Generation (RAG) system or integrating with a web UI or chatbot framework.
Next up? Connect your model to a local API, vector DB, or UI and bring your LLM-powered apps to life!
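As a first taste of that, llama.cpp already builds an OpenAI-compatible HTTP server (llama-server) alongside llama-run. A minimal sketch, assuming the build/ and models/ paths used earlier (the sleep is a crude placeholder for waiting until the model has loaded):

```shell
# Start the server in the background, offloading layers to the GPU
./bin/llama-server -m ../models/mistral.gguf -ngl 99 --port 8080 &
sleep 5  # give the server a moment to load the model
# Query it like any OpenAI-style chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in three languages."}]}'
```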
Stay tuned for Part 2: Building a Local RAG Pipeline with Your LLM


