Building a Sovereign Local LLM on Linux with llama.cpp and NVIDIA RTX 3090



You asked for the setup to replicate this Linux-based local LLM environment. This is not just about installing software; it is about reclaiming your digital infrastructure from the “black box” of proprietary cloud services. When you run models on your own hardware, you own the data, the weights, and the inference engine. No subscription fees, no usage limits, and no privacy violations.

Here is the practical, step-by-step guide to setting this up, grounded in the philosophy of Free Software. We prioritize the GNU/Linux ecosystem where you have full control over your code and your data.

1. The Hardware Foundation: GPU and VRAM

You mentioned the RTX 3090 with 24 GB of VRAM (you wrote 25 GB, but it is 24 GB). This is an excellent choice for a budget-conscious enthusiast: 24 GB is enough to hold a fully offloaded quantized model in the 30B class, and used cards sell for a fraction of the price of newer GPUs with comparable memory.

Recommendation: If you are building new, ensure your motherboard has enough PCIe lanes and pair it with a PSU (750 W+ Gold) that can absorb the card's power spikes. The NVMe drive mentioned in your script is critical for fast model loading.
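
Before building anything, confirm the driver actually sees the card and the full 24 GB. A quick check, assuming the NVIDIA driver and its nvidia-smi tool are installed:

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
# Expected output along the lines of:
# NVIDIA GeForce RTX 3090, 24576 MiB, <driver version>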

2. Compiling llama.cpp: The Engine of Local LLMs

llama.cpp is the gold standard for running LLMs efficiently on consumer hardware. It is written in C/C++ and is highly optimized. The script you provided automates the build process. Here is the breakdown of what it does and why it matters for your freedom.

Shell function:

build_llama.cpp () 
{ 
    ( set -x;
    # 1. Navigate to your source directory
    pushd /mnt/nvme0n1/LLM/git/llama.cpp;
  
    # 2. Pull the latest free code from the repository
    # This ensures you get improvements from the global community
    git pull && git log;
  
    # 3. Configure the build with CUDA support
    # GGML_CUDA=ON: Enables NVIDIA CUDA acceleration
    # GGML_BLAS=ON: Uses BLAS for matrix operations
    # GGML_BLAS_VENDOR=OpenBLAS: High-performance library for BLAS
    # GGML_CUDA_F16=ON: Enables half-precision for faster inference
    # GGML_CUDA_FA_ALL_QUANTS=ON: Builds FlashAttention kernels for all quantization types
    # CMAKE_CUDA_ARCHITECTURES=86: Targets compute capability 8.6 (Ampere, RTX 30xx series)
    cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=86;
  
    # 4. Check your CPU info (watch_cpu_info is a user-defined helper, not part
    # of llama.cpp; useful for debugging multi-threading, remove it if you lack it)
    watch_cpu_info;
  
    # 5. Compile the release build using all available cores
    cmake --build build --config Release -j$(nproc);
  
    # 6. Move into the build directory
    cd build;
  
    # 7. Install binaries to system paths (e.g., /usr/local/bin)
    # sudo may be required depending on your prefix settings
    sudo make install -j$(nproc);
  
    # 8. Update the dynamic linker cache
    sudo ldconfig;
  
    # 9. Return to the previous directory
    popd )
}
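
After the function completes, a quick sanity check confirms that the binaries landed on your PATH and that the shared libraries are registered. A minimal sketch, assuming the default /usr/local install prefix:

which llama-server llama-cli
llama-cli --version      # prints the llama.cpp build number and commit
ldconfig -p | grep ggml  # confirms libggml is in the dynamic linker cache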

Why this matters for Free Software: Unlike proprietary AI apps that update silently in the background, this build process is transparent. You compile exactly what you want, when you want it. You can inspect the source code, patch it, or contribute back. This is the core of GNU/Linux: you are not a consumer of a product; you are an owner of your system.

3. The Interface: Cherry Studio, Hermes, or OpenCode

You have three choices for the “front-end.” All are excellent, but they serve different philosophies of user control.

A. Cherry Studio (https://www.cherry-ai.com/)

B. Hermes Agent (https://hermes-agent.nousresearch.com/)

C. OpenCode (https://opencode.ai/)
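
Whichever you pick, all three can talk to a local llama.cpp server through its OpenAI-compatible HTTP API. A minimal sketch of the endpoint they connect to, assuming llama-server is already running on its default port 8080:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'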

4. Privacy and Control: The Free Software Advantage

When you run llama.cpp locally, you avoid the privacy disadvantages of proprietary services: no prompt, document, or generated output ever leaves your machine, no account is tied to your usage, and no third party can log, mine, or train on your conversations.
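
You can enforce that boundary at the network level by binding the server to the loopback interface, so nothing else on the LAN can reach it. A minimal sketch, with the model path as a placeholder:

llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 8080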

5. How to Participate in the Community

Free Software is not just about code; it is about people. If you encounter bugs, have improvements, or want to discuss the philosophy behind this setup, join the conversation.

Final Steps for Your Setup

  1. Install Dependencies: Ensure you have cmake, g++, cuda-toolkit, and openblas installed. On Ubuntu/Debian:

     sudo apt update
     sudo apt install cmake g++ nvidia-cuda-toolkit libopenblas-dev

  2. Run the Script: Execute the build_llama.cpp function above.
  3. Download a Model: Get a GGUF-format model (e.g., llama-3.1-8b-instruct.Q4_K_M.gguf) from HuggingFace; see the download sketch after this list.
  4. Run Inference:

     llama-cli -m /path/to/model.gguf -ngl 35 --interactive

     (Adjust -ngl, the number of model layers offloaded to the GPU; with 24 GB of VRAM an 8B model fits entirely.)
  5. Connect Your Interface: Point Cherry Studio, Hermes, or OpenCode to your local llama.cpp server (http://localhost:8080 by default when you start it with llama-server).
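
For step 3, a convenient way to fetch GGUF files is the Hugging Face CLI. A minimal sketch; the repository and file names here are illustrative, substitute the quantization you actually want and your own model directory:

pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --local-dir /mnt/data/LLM/quantized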

How to run it? Here is the full llama-server invocation used on this setup:

/usr/local/bin/llama-server --jinja -fa on -c 131072 -ngl 64 -v \
    --log-timestamps --host 192.168.1.68 -ub 512 --threads 16 \
    --webui-mcp-proxy --tools all --pooling mean --reasoning off \
    -m /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-uncensored-heretic-Q3_K_M.gguf \
    --mmproj /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-mmproj-uncensored-heretic-BF16.gguf
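
Once it is up, verify the server responds before pointing a front-end at it, using the --host address from the command above (llama-server listens on port 8080 unless --port says otherwise):

curl http://192.168.1.68:8080/health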

This is your digital life. Control it with Free Software.

Jean Louis, Free Software Supporter since 1999