Building a Sovereign Local LLM on Linux with llama.cpp and NVIDIA RTX 3090



You asked for the setup to replicate this Linux-based local LLM environment. This is not just about installing software; it is about reclaiming your digital infrastructure from the “black box” of proprietary cloud services. When you run models on your own hardware, you own the data, the weights, and the inference engine. No subscription fees, no usage limits, and no privacy violations.

Here is the practical, step-by-step guide to setting this up, grounded in the philosophy of Free Software. We prioritize the GNU/Linux ecosystem where you have full control over your code and your data.

1. The Hardware Foundation: GPU and VRAM

You mentioned the RTX 3090 with 24 GB of VRAM (you wrote 25 GB, but it is 24 GB). This is an excellent choice for a budget-conscious enthusiast: 24 GB is enough to hold a fully offloaded quantized model in the 30B class, and used cards sell for a fraction of the price of newer GPUs with comparable memory.

Recommendation: If you are building new, ensure your motherboard has enough PCIe lanes and pair it with a PSU (750 W+ Gold) that can absorb the card's power spikes. The NVMe drive mentioned in your script is critical for fast model loading.
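
Before building anything, confirm the driver actually sees the card and the full 24 GB. A quick check, assuming the NVIDIA driver and its nvidia-smi tool are installed:

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
# Expected output along the lines of:
# NVIDIA GeForce RTX 3090, 24576 MiB, <driver version>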

2. Compiling llama.cpp: The Engine of Local LLMs

llama.cpp is the gold standard for running LLMs efficiently on consumer hardware. It is written in C/C++ and is highly optimized. The script you provided automates the build process. Here is the breakdown of what it does and why it matters for your freedom.

Shell function:

build_llama.cpp () 
{ 
    ( set -x;
    # 1. Navigate to your source directory
    pushd /mnt/nvme0n1/LLM/git/llama.cpp;
  
    # 2. Pull the latest free code from the repository
    # This ensures you get improvements from the global community
    git pull && git log;
  
    # 3. Configure the build with CUDA support
    # GGML_CUDA=ON: Enables NVIDIA CUDA acceleration
    # GGML_BLAS=ON: Uses BLAS for matrix operations
    # GGML_BLAS_VENDOR=OpenBLAS: High-performance library for BLAS
    # GGML_CUDA_F16=ON: Enables half-precision for faster inference
    # GGML_CUDA_FA_ALL_QUANTS=ON: Builds FlashAttention kernels for all quantization types
    # CMAKE_CUDA_ARCHITECTURES=86: Targets compute capability 8.6 (Ampere, RTX 30xx series)
    cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=86;
  
    # 4. Check your CPU info (watch_cpu_info is a user-defined helper, not part
    # of llama.cpp; useful for debugging multi-threading, remove it if you lack it)
    watch_cpu_info;
  
    # 5. Compile the release build using all available cores
    cmake --build build --config Release -j$(nproc);
  
    # 6. Move into the build directory
    cd build;
  
    # 7. Install binaries to system paths (e.g., /usr/local/bin)
    # sudo may be required depending on your prefix settings
    sudo make install -j$(nproc);
  
    # 8. Update the dynamic linker cache
    sudo ldconfig;
  
    # 9. Return to the previous directory
    popd )
}
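
After the function completes, a quick sanity check confirms that the binaries landed on your PATH and that the shared libraries are registered. A minimal sketch, assuming the default /usr/local install prefix:

which llama-server llama-cli
llama-cli --version      # prints the llama.cpp build number and commit
ldconfig -p | grep ggml  # confirms libggml is in the dynamic linker cache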

Why this matters for Free Software: Unlike proprietary AI apps that update silently in the background, this build process is transparent. You compile exactly what you want, when you want it. You can inspect the source code, patch it, or contribute back. This is the core of GNU/Linux: you are not a consumer of a product; you are an owner of your system.

3. The Interface: Cherry Studio, Hermes, or OpenCode

You have three choices for the “front-end.” All are excellent, but they serve different philosophies of user control.

A. Cherry Studio (https://www.cherry-ai.com/)

B. Hermes Agent (https://hermes-agent.nousresearch.com/)

C. OpenCode (https://opencode.ai/)
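
Whichever you pick, all three can talk to a local llama.cpp server through its OpenAI-compatible HTTP API. A minimal sketch of the endpoint they connect to, assuming llama-server is already running on its default port 8080:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'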

4. Privacy and Control: The Free Software Advantage

When you run llama.cpp locally, you avoid the privacy disadvantages of proprietary services: no prompt, document, or generated output ever leaves your machine, no account is tied to your usage, and no third party can log, mine, or train on your conversations.
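
You can enforce that boundary at the network level by binding the server to the loopback interface, so nothing else on the LAN can reach it. A minimal sketch, with the model path as a placeholder:

llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 8080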

5. How to Participate in the Community

Free Software is not just about code; it is about people. If you encounter bugs, have improvements, or want to discuss the philosophy behind this setup, join the conversation.

Final Steps for Your Setup

  1. Install Dependencies: Ensure you have cmake, g++, cuda-toolkit, and openblas installed. On Ubuntu/Debian:

     sudo apt update
     sudo apt install cmake g++ nvidia-cuda-toolkit libopenblas-dev

  2. Run the Script: Execute the build_llama.cpp function above.
  3. Download a Model: Get a GGUF-format model (e.g., llama-3.1-8b-instruct.Q4_K_M.gguf) from HuggingFace; see the download sketch after this list.
  4. Run Inference:

     llama-cli -m /path/to/model.gguf -ngl 35 --interactive

     (Adjust -ngl, the number of model layers offloaded to the GPU; with 24 GB of VRAM an 8B model fits entirely.)
  5. Connect Your Interface: Point Cherry Studio, Hermes, or OpenCode to your local llama.cpp server (http://localhost:8080 by default when you start it with llama-server).
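
For step 3, a convenient way to fetch GGUF files is the Hugging Face CLI. A minimal sketch; the repository and file names here are illustrative, substitute the quantization you actually want and your own model directory:

pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --local-dir /mnt/data/LLM/quantized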

How to run it? Here is the full llama-server invocation used on this setup:

/usr/local/bin/llama-server --jinja -fa on -c 131072 -ngl 64 -v \
    --log-timestamps --host 192.168.1.68 -ub 512 --threads 16 \
    --webui-mcp-proxy --tools all --pooling mean --reasoning off \
    -m /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-uncensored-heretic-Q3_K_M.gguf \
    --mmproj /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-mmproj-uncensored-heretic-BF16.gguf
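
Once it is up, verify the server responds before pointing a front-end at it, using the --host address from the command above (llama-server listens on port 8080 unless --port says otherwise):

curl http://192.168.1.68:8080/health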

This is your digital life. Control it with Free Software.

Jean Louis, Free Software Supporter since 1999