Running Local AI on a Single RTX 3090: Process Management and VRAM Optimization

GNU.Support — Practical Free Software Applications


Introduction

The GNU operating system, built on the four essential software freedoms, gives users complete control over their computing environment. This control extends to modern artificial intelligence applications, where running models locally is not just a technical choice—it’s a political and ethical requirement for those who value privacy and digital autonomy.

In this article, I demonstrate how a single NVIDIA GeForce RTX 3090 with 24 GB of VRAM is sufficient to run a complete local AI stack, including large language models, embedding systems, reranking, and vision-language understanding. I will also share practical command-line tools for monitoring these processes and their GPU memory usage.


The RTX 3090: Sufficient Hardware for Local AI

The RTX 3090, with its 24 GB of VRAM, represents a sweet spot for free software enthusiasts running local AI. This single consumer GPU can simultaneously handle:

  1. Large Language Model Inference — Models like Qwen 3.5 (35B parameters in 3-bit quantization) using under 19 GB of VRAM
  2. Embedding Generation — For retrieval-augmented generation (RAG), with models like Nomic Embed Text v1.5 using under 500 MB
  3. Re-ranking — Cross-encoders for improving search relevance, using under 700 MB
  4. Vision-Language Models — For image understanding and cross-modal search

The key insight is that with careful model selection and quantization, one GPU is enough. No expensive clusters, no cloud dependencies, no corporate surveillance.
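
A quick back-of-envelope estimate shows why this works: a quantized model needs roughly parameters × bits-per-weight ÷ 8 bytes for its weights, plus room for the KV cache. Q3_K_M averages somewhere near 3.9 bits per weight (an approximation; the exact figure varies by tensor type):

# Rough weight memory for a 35B model at ~3.9 bits/weight (approximate)
echo "scale=1; 35 * 3.9 / 8" | bc    # ~17 GB for the weights alone
# The remaining budget goes to the KV cache, which grows with context length.

That estimate lands close to the roughly 18 GiB the model actually occupies once its context buffers are allocated, as the measurements below confirm.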


A Complete Local AI Stack in Action

Here is a typical setup running on a single RTX 3090, as shown by process inspection:

=== LLAMA & VISION PROCESSES ===
197960 /usr/local/bin/llama-server --rerank -m /path/to/bge-reranker-v2-m3-q8_0.gguf -c 8192 --port 7676
198194 /usr/local/bin/llama-server -ngl 999 --embedding --port 9999 -m /path/to/nomic-embed-text-v1.5-Q8_0.gguf
289901 /usr/local/bin/llama-server --jinja -fa on -c 65536 -m /path/to/Qwen3.5-35B-A3B-Uncensored-Q3_K_M.gguf
290988 python /path/to/nomic-embed-vision-v1.5-api.py

=== GPU SUMMARY ===
0, NVIDIA GeForce RTX 3090, 20754 MiB, 24576 MiB, 5 %

=== VRAM PER PROCESS ===
PID: 197960 | VRAM:   676 MiB ( 2.7%)
PID: 198194 | VRAM:   468 MiB ( 1.9%)
PID: 289901 | VRAM: 18614 MiB (75.8%)
PID: 290988 | VRAM:   480 MiB ( 1.9%)

Total VRAM usage comes to roughly 20 GiB out of the 24 GiB available, leaving comfortable headroom for the operating system's display stack and other processes.
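
If you want to reproduce this setup, a launch script might look like the following sketch. The commands mirror the process listing above; the /path/to/ placeholders stand in for your actual model files:

#!/bin/bash
# Sketch of a stack launcher based on the processes listed above.
# Replace /path/to/... with your actual model locations.

# Re-ranker (cross-encoder) on port 7676
llama-server --rerank -m /path/to/bge-reranker-v2-m3-q8_0.gguf \
    -c 8192 --port 7676 &

# Text embedding server on port 9999, all layers on the GPU
llama-server -ngl 999 --embedding --port 9999 \
    -m /path/to/nomic-embed-text-v1.5-Q8_0.gguf &

# Main LLM with Jinja templates, flash attention, 64K context
llama-server --jinja -fa on -c 65536 \
    -m /path/to/Qwen3.5-35B-A3B-Uncensored-Q3_K_M.gguf &

# Custom vision embedding API (Python script from the listing above)
python /path/to/nomic-embed-vision-v1.5-api.py &

wait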


What Each Process Does

1. Re-ranker (Port 7676)

The BGE Reranker v2 M3 improves search result relevance. After embedding-based retrieval returns candidates, this cross-encoder rescores them for higher accuracy. Using only 676 MiB, it provides significant quality improvements at minimal VRAM cost.
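
As a sketch of how this fits into a pipeline: llama-server started with --rerank exposes a Jina-style rerank endpoint, and a request might look like this (the exact path and response fields follow llama.cpp's server API, so check your build's documentation):

# Score candidate documents against a query (sketch)
curl -s http://localhost:7676/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
          "query": "What is GNU?",
          "documents": [
            "GNU is a free operating system.",
            "The weather is nice today."
          ]
        }'
# The response assigns a relevance_score to each document index.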

2. Embedding Server (Port 9999)

Nomic Embed Text v1.5 converts text into vector representations for semantic search. At 468 MiB, this model enables RAG pipelines without cloud dependencies. The -ngl 999 flag offloads all layers to the GPU for maximum speed.
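
Because llama-server speaks the OpenAI-compatible embeddings API when started with --embedding, existing RAG tooling can simply be pointed at localhost. A minimal request, as a sketch:

# Request an embedding vector for a text snippet (sketch)
curl -s http://localhost:9999/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "software freedom and local AI"}'
# The vector appears in the response under data[0].embedding.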

3. Main Language Model (Port unspecified)

Qwen 3.5 with 35 billion parameters, quantized to 3-bit, consumes about 18.2 GiB (18,614 MiB) while providing instruction following, tool use, and a 65,536-token context window. The --jinja flag enables Jinja chat-template processing, which llama.cpp uses for its tool-calling support, and -fa on enables flash attention.
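
With no --port flag given, llama-server falls back to its default port (8080 in current builds). A chat request against the OpenAI-compatible endpoint looks roughly like this sketch:

# Chat with the local model (sketch; 8080 is llama-server's default port)
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "user", "content": "Summarize the four software freedoms."}
          ]
        }'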

4. Vision Model (Python process)

The Nomic Embed Vision v1.5 API enables image understanding and cross-modal search (finding images from text descriptions or vice versa). At 480 MiB, it adds multi-modal capabilities to the local AI stack.


Monitoring Tool: llama-gpu Alias

To easily inspect running processes and their VRAM usage, add this alias to your ~/.bashrc:

alias llama-gpu='(echo "=== LLAMA & VISION PROCESSES ==="; pgrep -fa "llama|vision"; echo -e "\n=== GPU SUMMARY ==="; nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader; echo -e "\n=== VRAM PER PROCESS ==="; total_mem=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1 | tr -d " "); nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader | while IFS=, read -r pid mem; do mem_clean=$(echo "$mem" | tr -d " " | sed "s/MiB//"); percent=$(echo "scale=1; $mem_clean * 100 / $total_mem" | bc); echo "PID: $pid | VRAM: ${mem_clean} MiB (${percent}%)"; done)'

After adding this alias, reload your shell (for example with source ~/.bashrc) and run llama-gpu to produce the combined report shown above: the process list, the GPU summary, and per-process VRAM usage.


Why Local AI Matters for Free Software Supporters

Running AI locally preserves the four fundamental freedoms:

Freedom 0 — Run the software for any purpose. Cloud AI providers can terminate your access, change terms, or censor your queries. Local models answer to no one.

Freedom 1 — Study and adapt the software. With open-weight models and local inference engines like llama.cpp, you can inspect every operation, modify behavior, and audit for privacy violations.

Freedom 2 — Redistribute copies. Local models can be shared with colleagues, students, or community members without permission from corporate gatekeepers.

Freedom 3 — Improve and release modifications. Fine-tune models on your own data, optimize for your hardware, and share your improvements.


Practical Applications

With this single-GPU setup, you can:

  1. Private RAG on personal documents — Emails, notes, contracts, or research papers never leave your hardware (see the sketch after this list)
  2. Cross-modal search — Find images by describing them; find text by describing images
  3. Local coding assistant — With tool-use capabilities, the model can execute commands, read files, and help with development
  4. Private customer support — For small businesses handling sensitive customer data
  5. Educational tools — Language learning, tutoring, or research assistance without student data leakage
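
To make the first item concrete, here is a minimal sketch wiring the reranker and the main model together. It assumes the ports shown earlier and jq for JSON handling; the candidate passages are hard-coded where a real pipeline would fetch them via the embedding server and a vector store:

#!/bin/bash
# Minimal RAG sketch: rerank candidates, then answer from the best one.
# A real pipeline would first embed the query (port 9999) and pull
# candidates from a vector store; here the candidates are hard-coded.

QUERY="What does my contract say about termination notice?"

# Stand-in candidates, as a JSON array
DOCS='["Either party may terminate with 30 days written notice.",
       "The office kitchen is cleaned on Fridays."]'

# 1. Rerank the candidates against the query (port 7676);
#    field names follow the Jina-style rerank API
BEST=$(curl -s http://localhost:7676/v1/rerank \
    -H "Content-Type: application/json" \
    -d "{\"query\": \"$QUERY\", \"documents\": $DOCS}" \
    | jq '.results | max_by(.relevance_score).index')

CONTEXT=$(echo "$DOCS" | jq -r ".[$BEST]")

# 2. Ask the main model, grounded in the winning passage (default port 8080)
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Context: $CONTEXT\n\nQuestion: $QUERY\"}]}" \
    | jq -r '.choices[0].message.content'

Everything in this loop runs against localhost; not a single token of the document leaves the machine.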

Conclusion

A single RTX 3090 with 24 GB VRAM is sufficient for a complete, production-ready local AI stack. The ability to run embedding, retrieval, reranking, generation, and vision models simultaneously on one consumer GPU proves that local AI is not only possible but practical.

Free software supporters should embrace this capability. Every query answered locally is one less query sent to corporate servers. Every model run on personal hardware is one less dependency on cloud providers. And every user who learns to monitor and manage these processes takes another step toward true digital autonomy.

The command-line tools shared here—simple aliases that reveal exactly what is running and how much VRAM each component uses—empower users to understand and control their systems. That is the GNU way: transparency, control, and freedom.


GNU.Support promotes the use of GNU operating systems and free software in practical applications. Running local AI is not just a technical demonstration—it is a political statement and a practical necessity for those who value software freedom.