
Closed
Posted
Paid on delivery
I need a solution that combines an LLM (selected by the client) with RAG, deployed on our own server (recommended by the freelancer), that scales automatically to 1,000 or more simultaneous conversations. Instances/pods should be added and removed automatically to save costs. For now we will use only online dedicated servers/clouds; later we want a hybrid of an on-premises GPU server plus online servers. Currently, additional information about users is stored only in PostgreSQL. We want to give users the option to talk with the RAG data and the LLM. The system should also count usage and store in our database when each conversation started and finished. If there is a better recommended solution for talking with the data, I am open to it. In the future I would like to add sending voice to this server and getting voice back (in addition to text). Please share the price and time plan for text only and for text + voice, plus full support and documentation.
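The usage-counting requirement above (recording when each conversation started and finished) can be illustrated with a minimal sketch. It assumes a single hypothetical `conversations` table; the table, column, and function names are illustrative and not part of the posting. `sqlite3` stands in for PostgreSQL only so that the example runs without a server.

```python
import sqlite3
from datetime import datetime, timezone

# Assumption: sqlite3 is a stand-in for the client's PostgreSQL database.
# The table and column names below are hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    started_at TEXT NOT NULL,
    finished_at TEXT,
    message_count INTEGER NOT NULL DEFAULT 0
)
"""


def open_db(path=":memory:"):
    """Open the database and make sure the tracking table exists."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def now_utc():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def start_conversation(db, user_id):
    """Insert a row with the start time and return the conversation id."""
    cur = db.execute(
        "INSERT INTO conversations (user_id, started_at) VALUES (?, ?)",
        (user_id, now_utc()),
    )
    db.commit()
    return cur.lastrowid


def record_message(db, conversation_id):
    """Count one user message against the conversation."""
    db.execute(
        "UPDATE conversations SET message_count = message_count + 1 WHERE id = ?",
        (conversation_id,),
    )
    db.commit()


def finish_conversation(db, conversation_id):
    """Store the finish time for the conversation."""
    db.execute(
        "UPDATE conversations SET finished_at = ? WHERE id = ?",
        (now_utc(), conversation_id),
    )
    db.commit()


def usage_for_user(db, user_id):
    """Return the number of conversations and messages for one user."""
    row = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(message_count), 0) "
        "FROM conversations WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return {"conversations": row[0], "messages": row[1]}
```

In a real deployment the same three calls would sit around the chat endpoint: start on the first message, count each turn, and finish when the session closes or times out.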
Project ID: 40415137
9 proposals
Remote project
Active 8 days ago
9 freelancers are bidding on average $969 USD for this job

Hi,
You’re essentially looking for a production-grade, auto-scaling LLM + RAG platform, not just an API wrapper, and that’s exactly how I’d approach it.
Architecture (scalable & cost-efficient):
Backend: Python (FastAPI) with async processing
RAG: LangChain/LlamaIndex + pgvector or Qdrant (tight PostgreSQL integration for your existing data)
LLM: Client-selectable (OpenAI/Claude/local via Ollama if needed)
Orchestration: Docker + Kubernetes (K8s) with HPA (auto-scale pods based on CPU/queue load)
Queue layer: Redis/RabbitMQ for handling 1000+ concurrent conversations smoothly
Usage tracking: middleware logs start/end, tokens, latency → stored in PostgreSQL
Timeline & Cost:
Text-only system: 2–3 weeks | $1,200 – $1,800
Text + Voice: 4–5 weeks | $2,000 – $2,800
Includes deployment, documentation, and scaling setup
I’ve built similar RAG + scalable AI systems with usage tracking and optimization:
https://www.freelancer.com/projects/php/Sharepoint-RAG-SQL-GPT-agent/reviews
https://www.freelancer.com/projects/php/SQL-RAG-GPT-Agent-with/details
If you want, I can also suggest a lower-cost alternative to Kubernetes depending on traffic patterns.
Thanks.
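For readers unfamiliar with how the HPA mentioned in this bid picks a pod count, the sketch below reproduces the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1 (0.1 by default). The function name and the min/max defaults are hypothetical, and the real controller applies further rules (stabilization windows, pod readiness) that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric.

    current_metric and target_metric are in the same unit, for example
    average CPU utilization in percent or queued requests per pod.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults above).
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 pods averaging 90% CPU against a 60% target give a ratio of 1.5, so the rule asks for 6 pods; when load drops, the same rule shrinks the count, which is the cost-saving behaviour the client asks for.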
$4,500 USD in 60 days
4.9

Hi, we have developed a RAG-based model with a UI. It is easy to use and supports multiple languages, in both voice and text. If you would like, we will send a link through chat. We have extensive knowledge and experience in AI. If you have any questions related to this project, kindly contact us through chat. Thank you.
$2,500 USD in 7 days
2.8

Hello, I trust you're doing well. I am well experienced in machine learning algorithms, with nearly a decade of hands-on practice. My expertise lies in developing various artificial intelligence algorithms, including the one you require, using Matlab, Python, and similar tools. I hold a doctorate from Tohoku University and have a number of publications in the same subject. My portfolio, which showcases my past work, is available for your review. Your project piqued my interest, and I would be delighted to be part of it. Let's connect to discuss in detail. Warm regards. please check my portfolio link: https://www.freelancer.com/u/sajjadtaghvaeifr
$25 USD in 7 days
2.0

Hi there, Building a scalable LLM + RAG system with auto-scaling pods and real-time usage tracking is a high-impact architecture challenge, and with my experience in AI systems and backend scaling, I can deliver a robust, cost-efficient solution ready for 1000+ concurrent conversations with future voice support. Do you have a preferred cloud/provider or should I propose a Kubernetes-based scalable setup? Should usage tracking include token-level billing and per-user analytics from day one? Let's have a quick chat to discuss more in detail. I am looking forward to hearing from you. Best, Sajid.
$15 USD in 7 days
0.0

Your focus on a scalable, on-prem RAG deployment is the right move for data sovereignty, a transition I’ve recently executed for clients migrating away from high-cost cloud APIs. I specialize in deploying local LLMs like Llama-3 or Mistral using frameworks that prioritize low-latency response times and high throughput. My previous work involved optimizing on-prem hardware to handle massive vector datasets, ensuring retrieval remains near-instantaneous even as your knowledge base expands. My strategy involves using vLLM for inference to maximize GPU utilization, coupled with a high-performance vector database like Qdrant for efficient semantic search. I will implement a containerized Docker environment to ensure seamless scalability and reliability across your local server resources. For hardware, I’ll provide a specific hardware BOM—likely NVIDIA-based—utilizing 4-bit AWQ quantization to maintain accuracy while minimizing the physical footprint. This architecture will be tied together with a robust RAG orchestration layer to handle complex document parsing and context-aware retrieval. To provide an accurate hardware recommendation, what is the total volume of data you intend to ingest, and are you leaning toward a specific parameter size like 8B or 70B? Do you require a web-based UI for users, or should I focus on an API-first deployment for software integration? I’m open to a brief chat to align on your hardware constraints and ensure we build a future-proof environment that meets your performance benchmarks.
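The question about 8B versus 70B models matters for the hardware list because weight memory scales with parameter count times bits per weight. The sketch below is a rough lower bound only: it counts the weights and ignores the KV cache, activations, and the per-group scale data that quantization formats such as AWQ add. The function name is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare 16-bit and 4-bit weights for the two sizes named in the bid.
for size in (8, 70):
    for bits in (16, 4):
        print(f"{size}B at {bits}-bit: about {weight_memory_gb(size, bits):.0f} GB")
```

By this estimate an 8B model needs about 4 GB for weights at 4 bits and a 70B model about 35 GB, so the larger model generally calls for a high-memory GPU or several GPUs once the KV cache for many concurrent conversations is added.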
$1,139 USD in 21 days
0.0

Hi, I specialize in building production-ready RAG systems. I recently delivered a full enterprise RAG assistant with FastAPI, PostgreSQL, Pinecone vector DB, and GPT-4o that handles multilingual conversations with citation tracking and confidence scoring. Your project fits exactly what I build.
What I will deliver:
✅ RAG pipeline with your chosen LLM (GPT-4o, Llama, Mistral; your choice)
✅ FastAPI backend: clean REST API
✅ PostgreSQL integration: your existing user data connected, conversation start/end logged, usage counted per user
✅ Docker containerization: easy to scale
✅ Auto-scaling setup: instances added/removed automatically based on load
✅ Voice support: Whisper AI for speech-to-text and text-to-speech response
✅ Full documentation: setup guide, API docs, deployment manual
✅ 30 days free support after delivery
Pricing:
• Text only RAG + scaling: $500 / 7 days
• Text + Voice: $650 / 10 days
• Full package with docs + 30 day support: $800 / 14 days
Tech stack:
• LLM: your choice (OpenAI / Llama / Mistral)
• Vector DB: Pinecone or Qdrant (on-premise)
• API: FastAPI (Python)
• Database: PostgreSQL (your existing)
• Scaling: Docker + Kubernetes
• Voice: OpenAI Whisper AI
• Deployment: AWS / Azure / own server
I can share a live demo of my current RAG system that handles 112 documents with hybrid search (BM25 + semantic) and streams answers in real time. Ready to start immediately. Let me know if you want to discuss.
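The bid does not say how the BM25 and semantic result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the bidder's method. Each document scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is the conventional constant, and the function name and document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    rankings: iterable of lists, each ordered best-first.
    Returns document ids sorted by fused score, best first;
    ties are broken by document id for a stable result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) and a vector (semantic) search.
bm25_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

RRF uses only ranks, so it needs no tuning to reconcile BM25 scores with cosine similarities, which live on different scales.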
$500 USD in 7 days
0.0

Hello, I can help you design and deploy a scalable production-ready LLM + RAG platform on dedicated servers/cloud infrastructure with automatic scaling, usage tracking, and future support for voice conversations.
Proposed Architecture
LLM Serving: vLLM / Ollama / TGI depending on selected model
RAG Layer: LangChain or LlamaIndex
Vector Database: PostgreSQL + pgvector or Qdrant
API Backend: FastAPI (Python)
Containerization: Docker + Kubernetes
Auto Scaling: Kubernetes HPA + KEDA for dynamic pod scaling
Monitoring: Prometheus + Grafana
Authentication & Usage Tracking: PostgreSQL integration
Deliverables
Full deployment setup
Kubernetes manifests / Docker configs
CI/CD pipeline
Usage tracking system
Monitoring dashboards
Documentation & support
I have experience with Docker, Kubernetes, scalable cloud infrastructure, monitoring stacks, and AI deployment workflows. Let’s discuss the preferred LLM model and expected traffic patterns so I can finalize the architecture and pricing. Best regards.
$12 USD in 7 days
0.0

As an AI Engineer with a strong background in RAG, I am confident that I can provide you with the scalable on-prem LLM RAG deployment that you need. My expertise lies in deploying LLM models on servers which aligns perfectly with your requirement of deploying it on your own server. I understand the importance of cost-effectiveness, and thus, I am well-versed in designing systems that can automatically add or remove instances/pods based on load - saving you valuable resources. Moreover, my proficiency in setting up hybrid GPU and online server infrastructures ensures a future-proof solution for you, as you mentioned your plan to incorporate GPU servers. I have hands-on experience in working with PostgreSQL to store and retrieve user data, providing a seamless communication channel between data and the LLM model. Lastly, I aim at not just delivering the solution but also ensuring its smooth operation. Thus, I offer full support alongside detailed documentation. Additionally, I look forward to extending the system to include voice interaction as part of my commitment to serving all your potential future needs. Please let me know if there's anything else you'd like to discuss regarding the project timeline or pricing
$10 USD in 7 days
0.0

Krakow, Poland
Member since Feb 3, 2026