Controlling Your Digital Life with Local LLMs: A Guide to Your RTX 3090 Setup
You have installed the right tool for the job. llama.cpp is free software. It is not just “open source” in the vague sense often used by proprietary vendors; it is free as in liberty. You can study the source, modify it, and run it on your hardware without asking permission from a corporation. You are running it on a Linux system, likely compiled with GCC (GNU Compiler Collection), the heart of the GNU operating system. This is a pragmatic choice: you have a powerful NVIDIA GeForce RTX 3090 with 24 GB of VRAM, and you are using it to run large language models (LLMs) locally, models that respect your data rather than sending your private thoughts to a black-box cloud service to be harvested for ad revenue.
Let’s look at exactly how you are configuring your primary model instance and why the surrounding ecosystem matters.
The Primary Model: Qwen3.6-35B-A3B
Your command starts with:
/usr/local/bin/llama-server --jinja -fa on -c 131072 -ngl 64 -v --log-timestamps --host 192.168.1.68 -ub 512 --threads 16 --webui-mcp-proxy --tools all --reasoning off
This command line tells llama-server exactly what you want it to do. Let’s break down the flags to understand how you are taking control.
- /usr/local/bin/llama-server: You are calling the server binary directly. No bloatware wrappers.
- --jinja: You are enabling the Jinja template engine for chat. This gives you precise control over how the system prompt and user messages are formatted, rather than relying on a hardcoded, potentially changing proprietary format.
- -fa on: Flash Attention is enabled. This optimization reduces memory usage and speeds up inference by computing attention in a more memory-efficient way. On an RTX 3090, it is essential for keeping things fast.
- -c 131072: You have set the context size to 128k tokens. This is massive. It means the model can “read” a very long document or maintain a very long conversation history in its working memory. Most proprietary cloud APIs charge by the token or limit you to 8k or 32k. You have 128k of free, local context.
- -ngl 64: You are offloading 64 layers of the neural network to your NVIDIA VRAM. The RTX 3090 has 24 GB. If the model needs more than that to hold 64 layers, the remainder stays in system RAM and runs on the CPU, which is slower. You are balancing speed and capacity.
- --host 192.168.1.68: You are binding the server to a local IP address. You are not exposing it to the public internet by default; it stays within your local network, under your control.
- -ub 512: The physical batch size used when processing prompts, kept at a moderate value to limit VRAM spikes on long inputs.
- --threads 16: You are dedicating 16 CPU threads to the work that stays on the CPU.
- -v --log-timestamps: Verbose logging with timestamps, written to your own terminal or log file, not to someone else's analytics server.
- --tools all: You are enabling all built-in tools for AI agents. This allows the model to interact with your system, read files, or execute commands if configured correctly. It transforms the model from a chatbot into a functional tool.
- --webui-mcp-proxy: This enables the Model Context Protocol (MCP) proxy in the Web UI. MCP is an open standard for connecting applications and tools to LLMs. By using it, you are adopting open standards for communication rather than a closed API.
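With those flags set, you can check that the server is answering from any machine on your LAN. A minimal sketch, assuming the default llama-server port of 8080 (the command above does not pass --port, so the default applies):

curl -s http://192.168.1.68:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello from my own hardware."}]}'

The reply comes back as OpenAI-style JSON, so any client that speaks that API can be pointed at your own server instead of a cloud endpoint.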
The model itself is:
-m /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-uncensored-heretic-Q3_K_M.gguf
This is a quantized version of the Qwen 3.6 35B-A3B model. Quantization reduces the precision of the numbers (weights) in the model to save space and increase speed. Q3_K_M is a quantization method that aims for a good balance between size and quality. It is “uncensored” and “heretic,” suggesting a version free from corporate alignment layers, allowing you to get the raw output of the model’s training.
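If you ever want to produce a quantization like this yourself from a full-precision GGUF, current llama.cpp builds ship the llama-quantize tool. A sketch, with a hypothetical source file path:

llama-quantize /mnt/data/LLM/full/Qwen3.6-35B-A3B-BF16.gguf /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-Q3_K_M.gguf Q3_K_M

You trade some precision for a model small enough to keep most of its layers in 24 GB of VRAM.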
The multimodal projector is:
--mmproj /mnt/data/LLM/quantized/Qwen3.6-35B-A3B-mmproj-uncensored-heretic-BF16.gguf
This allows the model to “see” images. You are not relying on a cloud vision API; you are processing visual data locally on your GPU.
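To exercise the vision path, you can send an image inline with your prompt. A sketch, assuming your build accepts OpenAI-style image_url content parts (recent llama-server builds loaded with an mmproj do), with photo.jpg as a placeholder filename:

curl -s http://192.168.1.68:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\": \"user\", \"content\": [
        {\"type\": \"text\", \"text\": \"Describe this image.\"},
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$(base64 -w0 photo.jpg)\"}}
      ]}]}"

The image is encoded on your machine and decoded on your GPU; it never touches a third-party vision API.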
The second model in your process list is:
3043 /usr/local/bin/llama-server --rerank -m /mnt/nvme0n1/LLM/quantized/bge-reranker-v2-m3-q8_0.gguf ...
This is a reranking model. Instead of asking the large LLM to do everything, you use a smaller, specialized model to rank the relevance of search results. This is efficient: the reranker handles the narrow task of scoring relevance, and the heavy lifting is reserved for the main model.
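You can call the reranker directly over HTTP. A sketch, assuming that instance listens on port 8081 (substitute whatever --port the truncated command actually uses) and that your build exposes the /rerank endpoint for models started with --rerank; the request shape follows the Jina-style rerank API that llama.cpp's server implements:

curl -s http://192.168.1.68:8081/rerank \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I set the context size?",
       "documents": ["Use -c to set the context window in tokens.",
                     "GCC is the GNU Compiler Collection.",
                     "Use -ngl to offload layers to the GPU."]}'

The response scores each document against the query, so your search pipeline can pass only the most relevant passages into the main model's context.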
The third process is an embedding model:
5399 /usr/local/bin/llama-server --embedding ... -m .../nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf
This generates vector embeddings for your documents. This is the foundation of local RAG (Retrieval-Augmented Generation). You can index your local files and query them using your own hardware, keeping your data private. You are building your own knowledge base without uploading it to a proprietary vector database service.
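Generating an embedding is one HTTP call. A sketch, assuming that instance listens on port 8082 (again, substitute the real --port) and exposes the OpenAI-compatible /v1/embeddings route; note that nomic-embed models expect a task prefix such as search_document: on indexed text and search_query: on queries:

curl -s http://192.168.1.68:8082/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "search_document: Free software means liberty, not price."}'

Store the returned vectors in whatever local index you like; nothing about your documents ever leaves your network.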
Why This Setup Matters: Privacy and Control
When you use proprietary software, you are not the customer; you are the product. You have no idea what data is being collected, how it is stored, or who it is sold to. With your current setup using free software like llama.cpp, you control the entire stack.
- No Telemetry: By default, llama.cpp does not phone home. You decide what gets logged (-v, --log-timestamps), and the logs go to your own terminal or log file, nowhere else.
- Local Processing: Your 24 GB of VRAM is put to work (83.7% used by the main process), and the data stays on your machine. The llama-server instance bound to 192.168.1.68 is yours.
- Open Standards: You are using GGUF, a free, open format, for model weights, and MCP for tool communication. You are not locked into a proprietary vendor's API.
If you want to dive deeper into the philosophy behind this freedom, I recommend reading the GNU essay “What is Free Software?” at https://www.gnu.org/philosophy/free-sw.html. It explains why “free as in liberty” matters more than just “free as in beer.” It is about your freedom to run the program for any purpose, study how it works, redistribute copies, and improve it.
For those who want to participate in the development or support of these free software projects, the GNU mailing lists at https://lists.gnu.org/ are a great place to engage with the community.
You have built a powerful, private, local AI infrastructure. You are using your RTX 3090 to process language and images on your terms. This is the practical application of free software: you get control, performance, and privacy.
Jean Louis, Free Software Supporter since 1999