Are you curious about running a powerful large language model (LLM) right on your own laptop? With recent advancements in quantized models and GPU-accelerated tools, it’s entirely possible! In this post, we walk through the exact steps to set up and run a local LLM (like Mistral 7B) using an NVIDIA GPU on a Windows laptop with WSL (Windows Subsystem for Linux).
📚 System Setup & Prerequisites
1. Hardware Requirements
- NVIDIA GPU (e.g., RTX 3060 with 6GB+ VRAM)
- SSD recommended
2. Software Stack
- Windows 10/11 with WSL2 (Ubuntu)
- NVIDIA GPU Driver (latest)
- CUDA Toolkit installed in WSL
3. Check GPU Compatibility in WSL
python3 -c "import torch;
print(torch.cuda.is_available());
print(torch.cuda.get_device_name(0))"
Expected Output:
True
NVIDIA GeForce RTX 3060 Laptop GPU
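This check assumes PyTorch is already installed in your WSL environment; on Linux the standard wheel ships with CUDA support, so a plain pip install is usually enough. If you'd rather not pull in PyTorch just for a sanity check, nvidia-smi confirms the Windows driver is exposed to WSL:
# Install PyTorch if the import above fails (the Linux wheel includes CUDA support)
pip3 install torch
# Driver-level check that doesn't require PyTorch
nvidia-smi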
🔧 Installing llama.cpp with CUDA Support
Step 1: Clone the Repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Step 2: Install Build Tools
sudo apt update
sudo apt install -y cmake build-essential python3-dev nvidia-cuda-toolkit
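If the toolkit installed cleanly, the CUDA compiler should now be on your PATH. Ubuntu's packaged nvidia-cuda-toolkit can lag behind NVIDIA's own CUDA repository, but either works for this build as long as it matches your driver:
# Confirm nvcc is available before building the CUDA backend
nvcc --version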
Step 3: Build with CMake (CUDA Enabled)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -- -j$(nproc)
Note: recent llama.cpp versions use the GGML_CUDA flag; older releases used -DLLAMA_CUDA=ON instead, which current builds reject as deprecated.
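If the build succeeds, the compiled tools land in build/bin. Listing them is a quick way to confirm the step worked; exact names vary a little between llama.cpp versions, but you should see entries such as llama-cli, llama-run, and llama-server:
# Confirm the binaries were produced
ls bin/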
📂 Downloading a GGUF Model
Step 1: Create the Models Directory
mkdir -p ../models
cd ../models
Step 2: Download Mistral 7B Instruct GGUF
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf -O mistral.gguf
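The download is several gigabytes, so it's worth confirming it completed; a Q4_K_M quantization of a 7B model typically weighs in at roughly 4 GB:
# Sanity-check the download size (expect roughly 4 GB for this quantization)
ls -lh mistral.gguf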
⚖️ Running the Model
Run in Interactive Mode
cd ../build
./bin/llama-run --ngl 99 ../models/mistral.gguf -i
Type your prompt at the > and press Enter.
Run with a One-Off Prompt
./bin/llama-run --ngl 99 ../models/mistral.gguf "Say hello in three languages."
🤖 Common Issues & Fixes
Issue: llama.cpp built without libcurl, downloading from an url not supported.
This warning is safe to ignore if you’re loading a local file.
Issue: failed to open GGUF file
Make sure your model path is correct and the file exists.
Issue: Model output scrolls endlessly
Ensure you are passing a proper prompt or using -i mode for manual interaction.
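One more check that helps when generation feels slow: watch GPU utilization while the model is running. If --ngl actually offloaded the layers, VRAM usage and GPU load should jump; if they stay near idle, the model is probably running on the CPU instead.
# Refresh nvidia-smi every second to watch VRAM usage and GPU load during generation
watch -n 1 nvidia-smi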
✨ What You’ve Achieved
- Installed and built llama.cpp with CUDA support
- Downloaded and loaded a quantized Mistral 7B model (GGUF format)
- Verified GPU-accelerated inference locally
- Ran both interactive and one-shot queries
You’re now ready to move toward more advanced use cases like building a Retrieval-Augmented Generation (RAG) system or integrating with a web UI or chatbot framework.
Next up? Connect your model to a local API, vector DB, or UI and bring your LLM-powered apps to life!
Stay tuned for Part 2: Building a Local RAG Pipeline with Your LLM