
Build a Self-Hosted LLM Workflow with Ollama, REST API, and Gradio

Understanding the Target Audience

The tutorial on building a self-hosted LLM workflow with Ollama, REST API, and Gradio Chat Interface is tailored for a diverse audience. Key groups include:

  • Data Scientists and AI Practitioners: These individuals are eager to implement machine learning models in real-world applications.
  • Software Developers: Developers looking to integrate AI capabilities into their applications will find this guide beneficial.
  • Business Analysts: Professionals who aim to leverage AI for data analysis and informed decision-making.

Common challenges faced by these groups include:

  • Setting up and managing AI models in a self-hosted environment.
  • Integrating various components of AI workflows effectively.
  • Limited resources for running complex models, especially in CPU-only environments.

Their primary goals often involve:

  • Creating efficient and scalable AI solutions.
  • Enhancing technical skills in AI and machine learning.
  • Finding cost-effective methods to deploy AI models.

Interests typically include:

  • Staying updated on the latest trends in AI and machine learning technologies.
  • Engaging in hands-on coding tutorials and practical implementations.
  • Participating in community discussions through forums and collaborative projects.

When it comes to communication, they prefer:

  • Clear, concise instructions with practical examples.
  • Technical documentation that includes code snippets and explanations.
  • Interactive content that allows for experimentation and feedback.

Tutorial Overview

This tutorial provides a step-by-step guide to implementing a fully functional Ollama environment within Google Colab. The process includes:

  • Installing Ollama on the Colab VM using the official Linux installer.
  • Launching the Ollama server to expose the HTTP API on localhost:11434.
  • Pulling lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, optimized for CPU-only environments.
  • Interacting with these models programmatically via the /api/chat endpoint using Python’s requests module with streaming enabled.
  • Integrating a Gradio-based UI to facilitate user interaction with the models.

Implementation Steps

To set up the environment, we start by checking whether Ollama is already installed on the Colab VM and installing it with the official script if it is not:

import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
   """Run a shell command, stream output."""
   p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for line in p.stdout:
       print(line, end="")
   p.wait()
   if check and p.returncode != 0:
       raise RuntimeError(f"Command failed: {cmd}")

if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
   print(" Installing Ollama ...")
   sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
   print(" Ollama already installed.")

This code checks for the presence of Ollama and installs it if necessary. Gradio, which we will use later for the user interface, still needs to be available separately, as sketched below.
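
Gradio is not guaranteed to be preinstalled on every Colab runtime, so a quick availability check keeps the later UI code from failing on import. This is a minimal sketch that reuses the sh helper defined above and assumes pip is available on the VM:

try:
    import gradio  # only checking availability here
    print(" Gradio already installed.")
except ImportError:
    print(" Installing Gradio ...")
    sh(f"{sys.executable} -m pip install -q gradio")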

Starting the Ollama Server

Next, we start the Ollama server in the background and verify its status:

def start_ollama():
   try:
       requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
       print(" Ollama server already running.")
       return None
   except Exception:
       pass
   print(" Starting Ollama server ...")
   proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for _ in range(60):
       time.sleep(1)
       try:
           r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
           if r.ok:
               print(" Ollama server is up.")
               break
       except Exception:
           pass
   else:
       raise RuntimeError("Ollama did not start in time.")
   return proc

server_proc = start_ollama()

This function ensures that the Ollama server is running and ready to accept API requests.
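
As an optional sanity check, we can probe the running server directly from a notebook cell. The snippet below is a small sketch that assumes the standard Ollama /api/version endpoint on the default port:

# Confirm the API answers and report the server version.
ver = requests.get("http://127.0.0.1:11434/api/version", timeout=5).json()
print("Ollama version:", ver.get("version"))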

Model Management

We define the model to use and check its availability:

MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f" Using model: {MODEL}")
try:
   tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
   have = any(m.get("name")==MODEL for m in tags.get("models", []))
except Exception:
   have = False

if not have:
   print(f"  Pulling model {MODEL} (first time only) ...")
   sh(f"ollama pull {MODEL}")

This code checks if the specified model is available on the server and pulls it if necessary.
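
The same pull can also be performed over the REST API rather than the CLI, which is handy when shelling out is inconvenient. The helper below is a sketch that assumes the standard POST /api/pull endpoint, which streams JSON status lines while the model downloads:

def pull_model(name=MODEL):
    """Pull a model through the REST API, printing streamed status updates."""
    with requests.post("http://127.0.0.1:11434/api/pull",
                       json={"model": name}, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if line:
                print(json.loads(line.decode("utf-8")).get("status", ""))

# pull_model()  # uncomment to pull over HTTP instead of `ollama pull`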

Chat Functionality

We create a streaming client for the chat functionality:

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
   """Yield streaming text chunks from Ollama /api/chat."""
   payload = {
       "model": model,
       "messages": messages,
       "stream": True,
       "options": {"temperature": float(temperature)}
   }
   if num_ctx:
       payload["options"]["num_ctx"] = int(num_ctx)
   with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
       r.raise_for_status()
       for line in r.iter_lines():
           if not line:
               continue
           data = json.loads(line.decode("utf-8"))
           if "message" in data and "content" in data["message"]:
               yield data["message"]["content"]
           if data.get("done"):
               break

This function allows for real-time interaction with the model, yielding responses as they are generated.
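
As a quick usage example, the generator can be consumed directly in a notebook cell before wiring up any UI; this short sketch simply joins the streamed chunks into one reply string:

example_messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Explain what a context window is in one sentence."},
]
reply = "".join(ollama_chat_stream(example_messages, temperature=0.2))
print(reply)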

Smoke Testing

We run a smoke test to ensure everything is functioning correctly:

def smoke_test():
   print("n Smoke test:")
   sys_msg = {"role":"system","content":"You are concise. Use short bullets."}
   user_msg = {"role":"user","content":"Give 3 quick tips to sleep better."}
   out = []
   for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
       print(chunk, end="")
       out.append(chunk)
   print("n Done.n")
try:
   smoke_test()
except Exception as e:
   print(" Smoke test skipped:", e)

This test sends a prompt to the model and checks for a valid response.

Building the Gradio Interface

Finally, we integrate Gradio to create an interactive chat interface:

import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."

def chat_fn(message, history, temperature, num_ctx):
   msgs = [{"role":"system","content":SYSTEM_PROMPT}]
   for u, a in history:
       if u: msgs.append({"role":"user","content":u})
       if a: msgs.append({"role":"assistant","content":a})
   msgs.append({"role":"user","content": message})
   acc = ""
   try:
       for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
           acc += part
           yield acc
   except Exception as e:
       yield f" Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
   gr.Markdown("#  Ollama Chat (Colab)nSmall local-ish LLM via Ollama + Gradio.n")
   with gr.Row():
       temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
       num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
   chat = gr.Chatbot(height=460)
   msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
   clear = gr.Button("Clear")

   def user_send(m, h):
       m = (m or "").strip()
       if not m: return "", h
       return "", h + [[m, None]]

   def bot_reply(h, temperature, num_ctx):
       u = h[-1][0]
       stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
       acc = ""
       for partial in stream:
           acc = partial
           h[-1][1] = acc
           yield h

   msg.submit(user_send, [msg, chat], [msg, chat]).then(
       bot_reply, [chat, temp, num_ctx], [chat]
   )
   clear.click(lambda: None, None, chat)

print(" Launching Gradio ...")
demo.launch(share=True)

This code sets up the Gradio interface, allowing users to interact with the model through a simple chat interface.
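
When the session is over, the interface and the background server can be shut down from a later cell. This teardown sketch relies on the server_proc handle returned earlier, which is None if the server was already running:

demo.close()                     # stop the Gradio app and any public share link
if server_proc is not None:
    server_proc.terminate()      # stop the background `ollama serve` process
    server_proc.wait(timeout=10)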

Conclusion

In conclusion, this tutorial establishes a reproducible pipeline for running Ollama in Google Colab. It covers installation, server startup, model management, API access, and user interface integration. The system utilizes Ollama’s REST API as the core interaction layer, enabling both command-line and Python streaming access, while Gradio manages session persistence and chat rendering. This approach adapts the self-hosted design for Colab’s constraints, allowing experimentation with multiple LLMs and dynamic parameter adjustments.


FAQ

  • What is Ollama? Ollama is a platform for deploying and managing language models in a self-hosted environment.
  • Can I run this on my local machine? Yes. The tutorial is written for Google Colab, but the same steps can be adapted to a local setup.
  • What models can I use with Ollama? You can use lightweight models like qwen2.5:0.5b-instruct or llama3.2:1b, which are optimized for CPU-only environments.
  • Is Gradio necessary for this setup? While Gradio enhances the user interface, it is not strictly necessary; you can interact with the API directly, as shown in the example after this list.
  • How can I modify the chat functionality? You can adjust parameters like temperature and context tokens to change how the model responds.
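
As noted above, the UI layer is optional: a single non-streaming request to the /api/chat endpoint is enough to get a complete reply. The sketch below assumes the same local server and model used throughout the tutorial:

import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "qwen2.5:0.5b-instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "stream": False,  # return one complete JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])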

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.
