Private Local AI Setup with Qwen3.6-35B and RTX 3090


Controlling Your Digital Life with Local LLMs: A Guide to Your RTX 3090 Setup

You have installed the right tool for the job. llama.cpp is free software. It is not just “open source” in the vague sense of the term often used by proprietary vendors; it is free as in liberty. You can study the source, modify it, and run it on your hardware without asking permission from a corporation. You are running it on a Linux system, likely compiled with GCC (GNU Compiler Collection), which is the heart of the GNU operating system. This is a pragmatic choice. You have a powerful NVIDIA GeForce RTX 3090 with 24 GB of VRAM, and you are using it to run local language models (LLMs) that respect your data, not ones that send your private thoughts to a black-box cloud service to be harvested for ad revenue.

Let’s look at exactly how you are configuring your primary model instance and why the surrounding ecosystem matters.

The Primary Model: Qwen3.6-35B-A3B

Your command starts with:

/usr/local/bin/llama-server --jinja -fa on -c 131072 -ngl 64 -v --log-timestamps --host 192.168.1.68 -ub 512 --threads 16 --webui-mcp-proxy --tools all --reasoning off

This command line tells llama-server exactly what you want it to do. Let’s break down the flags to understand how you are taking control.

The model itself is:

-m /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-uncensored-heretic-Q3_K_M.gguf

This is a quantized version of the Qwen 3.6 35B-A3B model. Quantization reduces the precision of the numbers (weights) in the model to save space and increase speed. Q3_K_M is a quantization method that aims for a good balance between size and quality. It is “uncensored” and “heretic,” suggesting a version free from corporate alignment layers, allowing you to get the raw output of the model’s training.

The multimodal projector is:

--mmproj /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-mmproj-uncensored-heretic-BF16.gguf

This allows the model to “see” images. You are not relying on a cloud vision API; you are processing visual data locally on your GPU.

The second model in your process list is:

3043 /usr/local/bin/llama-server --rerank -m /mnt/nvme0n1/LLM/quantized/bge-reranker-v2-m3-q8_0.gguf ...

This is a reranking model. Instead of asking the large LLM to do everything, you use a smaller, specialized model to rank the relevance of search results. This is efficient. It keeps the heavy lifting for the main model and uses lightweight models for specific tasks.

The third process is an embedding model:

5399 /usr/local/bin/llama-server --embedding ... -m .../nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf

This generates vector embeddings for your documents. This is the foundation of local RAG (Retrieval-Augmented Generation). You can index your local files and query them using your own hardware, keeping your data private. You are building your own knowledge base without uploading it to a proprietary vector database service.

Why This Setup Matters: Privacy and Control

When you use proprietary software, you are not the customer; you are the product. You have no idea what data is being collected, how it is stored, or who it is sold to. With your current setup using free software like llama.cpp, you control the entire stack.

  1. No Telemetry: By default, llama.cpp does not phone home. You decide if you log data (--log-timestamps). You see the logs directly in your terminal or log file.
  2. Local Processing: Your 24 GB of VRAM is fully utilized (83.7% in the main process). The data stays on your machine. The llama-server instance bound to 192.168.1.68 is yours.
  3. Open Standards: You are using GGUF format for model weights, a free, open format. You are using MCP for tool communication. You are not locked into a proprietary vendor’s API.

If you want to dive deeper into the philosophy behind this freedom, I recommend reading about What is Free Software?. It explains why “free as in liberty” matters more than just “free as in beer.” It is about your freedom to run the program for any purpose, study how it works, redistribute copies, and improve it.

For those who want to participate in the development or support of these free software projects, the GNU mailing lists at https://lists.gnu.org/ are a great place to engage with the community.

You have built a powerful, private, local AI infrastructure. You are using your RTX 3090 to process language and images on your terms. This is the practical application of free software: you get control, performance, and privacy.

Jean Louis, Free Software Supporter since 1999