How to orchestrate multi-agents locally with Gemma-4 and LM Studio

Here’s a practical guide to orchestrating multiple agents powered by Gemma-4 E2B running locally in LM Studio. Depending on how much RAM your computer has, you can run your own swarm of AI agents, but let’s start with two.

This tutorial shows how to do it using LM Studio on a MacBook. LM Studio is also available for Windows. A minimum of 16 GB of RAM is needed, 24 GB is better, and anything over 24 GB allows you to use more powerful models. Don’t expect a fast production workflow for business on 16 GB. It’s adequate for research and as a task assistant.

I’d like to target beginners with this article, but this flow can also be useful for more advanced developers who are new to LLMs. Compared to using an API service, you don’t have to pay for LLM calls on your own computer.

At the time of publishing, Gemma-4 is one of the leading open-source models, but that can change any day and this article will be updated with more content.

The first step is to download LM Studio from https://lmstudio.ai/download for your MacBook. If you’re using a Windows PC, the configuration steps may vary but the models and Python code will be the same. If you are new to programming, you’ll encounter requirements such as Xcode tools, Python, and a Python environment manager like conda or Astral uv that need to be installed first. Conda is the more popular of the two, having been around longer than uv, but uv is faster since it is built with Rust. Read the quickstart guide for conda here.

After installing LM Studio, open it, search for and download google/gemma-4-e2b.

A 4-8B parameter model is considered lightweight and suitable for personal use. Anything much bigger will struggle on 16 GB of RAM because you also need RAM for context memory, the memory used to hold the prompts you send and the responses you receive.
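
To make the RAM budget concrete, here is a rough back-of-the-envelope sketch. It assumes weight memory is roughly parameter count times bits per weight, and it ignores context memory and runtime overhead, so treat the numbers as a lower bound rather than a measurement. The function name `estimate_weight_gb` is made up for this illustration.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare an 8B-parameter model at common precisions (weights only, no context).
for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: ~{estimate_weight_gb(8, bits):.0f} GB")
```

Under these assumptions an 8B model needs about 16 GB at 16-bit, 8 GB at 8-bit, and 4 GB at 4-bit, which is why quantized variants are the practical choice on a 16 GB machine.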

Gemma-4 is a reasoning model, which means it’s going to “think” and take longer to respond to your prompts. On an Apple M4 processor with 24 GB, I get a quick speed of 49 tokens per second and a thought time of 12.9 seconds. Feel free to download a model with a different parameter count for lower or higher performance.

There are many variants, including distillations and quantized versions from the community (unofficial). Their performance (tokens per second) will vary according to the factors mentioned above.

Next, navigate to “Developer” in the left sidebar (⌘2), load the model, and start the server. Depending on the version of LM Studio, the placement of the Developer options may differ. The server exposes an OpenAI-compatible API at http://localhost:1234/v1 by default.
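
Before writing any agent code, it can help to confirm the server is reachable and to see the exact model identifier it reports. The sketch below assumes the server is on the default address and uses the OpenAI-style `/models` listing; it relies only on the standard library so it runs before you install anything. The helper names are invented for this example.

```python
import json
import urllib.request

LLM_URL = "http://localhost:1234/v1"  # LM Studio's default address from this article


def extract_model_ids(models_response: dict) -> list[str]:
    """Pull the model identifiers out of an OpenAI-style /models listing."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_local_models(base_url: str = LLM_URL, timeout: float = 5.0) -> list[str]:
    """Ask the running server which models it currently exposes."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

With the server running, `print(list_local_models())` should include the identifier you will paste into app.py. A connection error here usually means the server has not been started yet.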

Now that you have a local server running and ready for inference, let’s write the Python part. No API key is needed for this. If a client library insists on one, any placeholder text will do.

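Because the server speaks the OpenAI-compatible API, the code only needs the base URL, LLM_URL="http://localhost:1234/v1", the model name, and the messages to send; it then prints the output. The code in this tutorial sends plain HTTP requests with the requests library, so that is the only dependency to install. If you want to use a separate Python environment, create it with conda or uv first and then pip install requests. Note: if using uv, dependency installation commands need to be prefixed with “uv”.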

Your directory structure will look like this:

agent_eval/
│
├── app.py
├── llm_client.py
├── generator_agent.py
├── evaluator_agent.py
└── orchestrator.py

mkdir agent_eval
cd agent_eval
uv venv agent_eval
source agent_eval/bin/activate
uv pip install requests
touch app.py orchestrator.py llm_client.py generator_agent.py evaluator_agent.py

Let’s create the entry point, app.py, which wires together the classes for each agent and the orchestrator. Don’t run the file until we finish writing the remaining classes and the server request code.

# we will create these classes in the next step
from generator_agent import GeneratorAgent
from evaluator_agent import EvaluatorAgent
from orchestrator import Orchestrator

# Your local server started in LM Studio
LLM_URL = "http://localhost:1234/v1"

# This is the API model identifier, copy it from LM Studio
GENERATOR_MODEL = "gemma-4-e2b"
EVALUATOR_MODEL = "gemma-4-e2b"

def main():

    generator = GeneratorAgent(
        base_url=LLM_URL,
        model=GENERATOR_MODEL,
    )

    evaluator = EvaluatorAgent(
        base_url=LLM_URL,
        model=EVALUATOR_MODEL,
    )

    orchestrator = Orchestrator(
        generator=generator,
        evaluator=evaluator,
    )
    # Our example prompt for this tutorial
    task = """
Write a technical explanation of Retrieval Augmented Generation (RAG)
for software engineers.

Include:
- What it is
- Why it helps
- Limitations
- A simple example
"""

    result = orchestrator.run(
        task=task,
        max_rounds=3,
        min_score=6,
    )

    print("\n=== FINAL RESULT ===\n")
    print(result)


if __name__ == "__main__":
    main()

Here’s the llm_client.py file, which sends the prompts to the model. It makes a POST request to our localhost endpoint with the prompt payload, parses the JSON response, and returns the generated text. It is called by the evaluator and generator agent files.

import requests

def call_llm(
    base_url: str,
    model: str,
    system_prompt: str,
    user_prompt: str,
    temperature: float = 0.5,
) -> str:

    payload = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            },
        ],
        "temperature": temperature,
    }

    response = requests.post(
        f"{base_url}/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"},
        timeout=120,
    )

    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
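
The last line of call_llm assumes the response always contains a non-empty message. As a hedged sketch, the helper below does the same extraction but fails with a readable error instead of a bare KeyError or an empty string. The name `extract_content` is hypothetical, and the empty-content case is an assumption about what may happen if a reasoning model runs out of tokens before it answers.

```python
def extract_content(response_json: dict) -> str:
    """Return the assistant text from a chat-completions response, or raise a clear error."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError(f"No choices in response: {response_json}")

    content = choices[0].get("message", {}).get("content")
    if not content:
        # Report finish_reason so a truncated reply is easy to spot.
        raise ValueError(
            f"Empty content; finish_reason={choices[0].get('finish_reason')!r}"
        )
    return content
```

If you adopt it, the final line of call_llm would become `return extract_content(response.json())`.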

Just like a typical LLM call from any AI service, generator_agent.py is a basic wrapper that invokes text generation. It is structured as a Python class that takes the server URL and model name as parameters. The temperature parameter is set to a typical value of 0.5 so the results are non-deterministic. Lower this value to get more consistent results.

# generator_agent.py
from llm_client import call_llm

class GeneratorAgent:

    def __init__(
        self,
        base_url: str,
        model: str,
    ):
        self.base_url = base_url
        self.model = model

    def generate(
        self,
        task: str,
        feedback: str | None = None,
    ) -> str:

        prompt = f"""
TASK:
{task}
"""
        if feedback:
            prompt += f"""

REVIEW FEEDBACK:
{feedback}

Revise the response and address all concerns.
"""
        return call_llm(
            base_url=self.base_url,
            model=self.model,
            system_prompt=(
                "You are a content generation agent. "
                "Produce the highest quality response."
            ),
            user_prompt=prompt,
            temperature=0.5,
        )
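
To see what the generator actually sends in round one versus a revision round, the prompt assembly can be pulled out and run without a model. This is a small sketch that mirrors the string building in generate; the function name `build_generator_prompt` is made up for this illustration.

```python
from typing import Optional


def build_generator_prompt(task: str, feedback: Optional[str] = None) -> str:
    """Mirror the prompt assembly in GeneratorAgent.generate, without calling the model."""
    prompt = f"\nTASK:\n{task}\n"
    if feedback:
        # Later rounds append the evaluator's feedback and a revision instruction.
        prompt += (
            f"\n\nREVIEW FEEDBACK:\n{feedback}\n\n"
            "Revise the response and address all concerns.\n"
        )
    return prompt


print(build_generator_prompt("Explain RAG."))
print(build_generator_prompt("Explain RAG.", feedback="The example is vague."))
```

The first prompt contains only the task. The second carries the reviewer's comments, which is the only memory the generator has of the previous round.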

This is the second agent, whose role is to evaluate the generator agent’s output. The file evaluator_agent.py contains a class structured like the generator’s, just with different prompts and a different output format.

# evaluator_agent.py
import json

from llm_client import call_llm

class EvaluatorAgent:

    def __init__(
        self,
        base_url: str,
        model: str,
    ):
        self.base_url = base_url
        self.model = model

    def evaluate(
        self,
        task: str,
        candidate: str,
    ) -> dict:

        prompt = f"""
Evaluate the candidate response.

TASK:
{task}

CANDIDATE:
{candidate}

Return ONLY valid JSON.
Rules:
- score must be an integer from 0 to 10
- approved must be false when score < 6
- approved must be true when score >= 6
- feedback must explain deficiencies

Schema:

{{
  "approved": boolean,
  "score": integer,
  "feedback": string
}}
"""

        result = call_llm(
            base_url=self.base_url,
            model=self.model,
            system_prompt=(
                "You are a strict reviewer. "
                "Return only JSON."
            ),
            user_prompt=prompt,
            temperature=0.4,
        )

        try:
            review = json.loads(result)

            required_fields = [
                "approved",
                "score",
                "feedback"
            ]

            for field in required_fields:
                if field not in review:
                    raise ValueError(
                        f"Missing field: {field}"
                    )
            return review

        except Exception:
            return {
                "approved": False,
                "score": 0,
                "feedback": (
                    "Reviewer returned invalid JSON."
                ),
            }
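
Small local models do not always return bare JSON; the reply may be wrapped in a markdown code fence or surrounded by a sentence of prose, in which case json.loads fails and the round is scored 0. The sketch below is one possible, more forgiving parser, not the article's method. It assumes the reply contains a single JSON object, and it recomputes approved from the score so the two fields cannot disagree. `parse_review` and `PASSING_SCORE` are names invented for this example.

```python
import json
import re

PASSING_SCORE = 6  # same threshold the evaluator prompt states


def parse_review(raw: str) -> dict:
    """Best-effort parse of the evaluator's reply into a review dict."""
    fallback = {
        "approved": False,
        "score": 0,
        "feedback": "Reviewer returned invalid JSON.",
    }
    # Take the span from the first "{" to the last "}", skipping fences or prose.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        review = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not all(key in review for key in ("approved", "score", "feedback")):
        return fallback
    try:
        score = max(0, min(10, int(review["score"])))  # clamp to the 0-10 range
    except (TypeError, ValueError):
        return fallback
    return {
        "approved": score >= PASSING_SCORE,
        "score": score,
        "feedback": str(review["feedback"]),
    }
```

To try it, the try/except block in evaluate could be swapped for `return parse_review(result)`.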

Now for the important part of the code: a basic orchestrator that loops for the number of rounds we define, in this case 3.

# orchestrator.py
from generator_agent import GeneratorAgent
from evaluator_agent import EvaluatorAgent

class Orchestrator:

    def __init__(
        self,
        generator: GeneratorAgent,
        evaluator: EvaluatorAgent,
    ):
        self.generator = generator
        self.evaluator = evaluator

    def run(
        self,
        task: str,
        max_rounds: int = 3,
        min_score: int = 6,
    ) -> str:

        feedback = None
        best_response = ""

        for round_num in range(1, max_rounds + 1):

            print(f"\nROUND {round_num}")

            candidate = self.generator.generate(
                task=task,
                feedback=feedback,
            )

            review = self.evaluator.evaluate(
                task=task,
                candidate=candidate,
            )

            print("\nCandidate Response:")
            print(candidate)
            print("\nReview:")
            print(review)

            best_response = candidate

            approved = review.get(
                "approved",
                False,
            )

            score = review.get(
                "score",
                0,
            )

            if approved or score >= min_score:
                print("\nResponse accepted.")
                return candidate

            feedback = review.get(
                "feedback",
                "",
            )

        return best_response
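
The loop can be exercised without a model by passing in stand-in functions, which is handy for checking the control flow before spending minutes on inference. The sketch below is a function-style variant written for this illustration, not the class above: it takes plain callables and, as a design change, keeps the highest-scoring draft instead of the most recent one when no round passes. The stub agents and their scores are invented.

```python
from typing import Callable, Optional


def run_rounds(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], dict],
    task: str,
    max_rounds: int = 3,
    min_score: int = 6,
) -> str:
    """Generate, review, and revise until a draft passes or rounds run out."""
    feedback = None
    best_response, best_score = "", -1
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        review = evaluate(task, candidate)
        score = review.get("score", 0)
        if score > best_score:  # remember the strongest draft seen so far
            best_response, best_score = candidate, score
        if score >= min_score:
            return candidate
        feedback = review.get("feedback", "")
    return best_response


# Stub agents: three drafts whose scores never reach the passing mark.
drafts = iter(["draft A", "draft B", "draft C"])
scores = iter([5, 3, 4])


def fake_generate(task: str, feedback: Optional[str]) -> str:
    return next(drafts)


def fake_evaluate(task: str, candidate: str) -> dict:
    return {"score": next(scores), "feedback": "needs work"}


result = run_rounds(fake_generate, fake_evaluate, "any task")
print(result)  # draft A
```

Here no draft reaches 6, so the function falls back to "draft A", the one that scored 5. The class above would return "draft C" in the same situation because it keeps the latest candidate.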

You can monitor the raw output in LM Studio. On your first call from the generator agent, the result will look similar to this:

{
  "id": "chatcmpl-74qdgsd32m3nq4f4kbn",
  "object": "chat.completion",
  "created": 1781743549,
  "model": "gemma-4-e2b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "### Technical Overview: Retrieval-Augmented Generation (RAG)\n\nRetrieval-Augmented Generation (RAG) is a neural architecture that augments LLM inference by injecting relevant, domain-specific context into the prompt at runtime. Instead of relying solely on in-weights knowledge—which is static and limited to the training cutoff—the model consumes a retrieved document chunk as an in-context reference for generating grounded responses.\n\n### Why RAG Helps\n*   **Freshness:** You can update your vector store without retraining or fine-tuning the model; new documents are immediately discoverable via embedding lookup.\n*   **Verifiability & Grounding:** By including source citations, you can trace generated content back to a specific chunk in the retrieval index rather than accepting opaque \"hallucinations.\"\n*   **Privacy and Access Control:** Sensitive or user-specific data stays in your indexed store; only relevant slices are surfaced into the context window on demand.\n\n### Limitations\n*   **Retrieval Failure:** If the embedding model fails to find semantically similar chunks (e.g., due to poor chunking or out-of-vocabulary terms), the generation will suffer regardless of LLM quality.\n*   **Context Window Saturation:** Including too many retrieved segments can dilute the attention over relevant content and may exceed max sequence lengths.\n*   **Embedding Drift:** Updates to your embedding model require re-indexing the entire corpus, which creates a dependency between the retriever and generator pipelines.\n\n### Conceptual Example\nA user asks about internal company policy on remote work. Instead of generating from weights, the system embeds the query (`vector_q`), performs cosine similarity against an index of HR documents (top-k = 2), appends those chunks to `USER <query>\\nCONTEXT [chunk1]\\n[chunk2]`, and requests generation: \"Answer using only the provided context.\"",
        "reasoning_content": "",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 94,
    "completion_tokens": 389,
    "total_tokens": 483,
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  },
  "stats": {},
  "system_fingerprint": "gemma-4-e2b"
}
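
The usage block in that response is enough to measure your own generation speed. As a rough sketch, divide completion_tokens by the wall-clock time of the request; the elapsed time is something you would measure yourself, for example with time.perf_counter() around call_llm, and the 10-second figure below is purely hypothetical.

```python
def tokens_per_second(response_json: dict, elapsed_seconds: float) -> float:
    """Completion tokens divided by the wall-clock duration of the request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return response_json["usage"]["completion_tokens"] / elapsed_seconds


# Usage numbers copied from the sample response; the elapsed time is made up.
sample = {"usage": {"prompt_tokens": 94, "completion_tokens": 389, "total_tokens": 483}}
print(f"{tokens_per_second(sample, elapsed_seconds=10.0):.1f} tokens/s")
```

Because the elapsed time includes prompt processing and any thinking time, this figure will usually be lower than the generation speed LM Studio reports in its own interface.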

This happens after you execute python app.py on the command line, so run the file now; hopefully you won’t get any errors. The evaluator agent takes this output as input and performs its evaluation. This will take a couple of minutes depending on your hardware.

python app.py

LM Studio will log the following request, and your terminal will show the print() output from app.py:

Received request: POST to /v1/chat/completions with body  
{
  "model": "gemma-4-e2b",
  "messages": [
    {
      "role": "system",
      "content": "You are a strict reviewer. Return only JSON."
    },
    {
      "role": "user",
      "content": "\nEvaluate the candidate response.\n\nTASK:\n\nWrite a ... <Truncated in logs> ...approved\": true,\n  \"score\": 0,\n  \"feedback\": \"\"\n}\n"
    }
  ],
  "temperature": 0
}

The following is the candidate response printed by the finished app, in markdown format:

Candidate Response:
Retrieval-Augmented Generation (RAG) sits at the intersection of Information Retrieval and Generative AI. Instead of relying solely on a model’s internal weights to answer a query, RAG treats the LLM as a reasoning engine over an external context window injected at runtime.

### What it is
The core architecture decouples **knowledge** from **reasoning**. A standard pipeline consists of three stages:

1.  **Retrieval:** The system searches a non-parametric corpus (a vector database, document store, or search index) for the top-$k$ relevant chunks based on embedding similarity (e.g. cosine distance via `all-MiniLM-L6-v2`).
2.  **Augmentation:** These retrieved fragments are concatenated with the original query into a structured prompt: `"Answer using only these snippets: [chunk1, chunk2...] Query: {query}"`.
3.  **Generation:** The LLM decodes over that expanded context rather than hallucinating from pre-training weights alone.

### Why it helps engineers
For production software systems, RAG solves three critical failure modes of vanilla inference:

*   **Grounding & Factuality:** By forcing the model to use provided evidence, you shift from "probabilistic guessing" toward verifiable generation. This is the primary mitigation against hallucinations on niche facts (e.g., internal API specs).
*   **Knowledge Updateability:** Changing a fact requires updating an index entry (O(1) or O(log n)) rather than fine-tuning the weights, which is slow and expensive. It enables real-time knowledge injection for news, inventory, or evolving documentation.
*   **Verifiability/Attribution:** Since chunks are indexed with source metadata, every response can be programmatically linked to a `source_id` or URL—critical for audit trails in enterprise systems.

### Limitations
RAG is not an "infinite knowledge" solution; it inherits several engineering bottlenecks:

*   **Retrieval Failure:** If the embedding model fails to map semantic similarity correctly, the LLM receives garbage context and will still produce a bad answer ("garbage-in/garbage-out").
*   **Needle in a Haystack / Lost in Middle:** For very large context windows, models can lose coherence on information buried in middle chunks. Chunking strategy (size vs. overlap) becomes a critical hyperparameter.
*   **Embedding Drift:** As the corpus evolves, the vector space may become stale or misaligned with newer query semantics unless reindexed regularly.

### Simple Example: `get_answer` pipeline
Instead of calling an LLM on a naked string, wrap it in a retriever abstraction.

```python
from haystack.embeddings import SentenceTransformerEmbedding
from haystack.retrievers import EmbeddingRetriever
from haystack.generation import GeneratedAnswerGenerator
from haystack.types import Response

# 1. Index the "source of truth" (could be a DB query)
docs = [Document(content="User ID must be UUIDv4 format.")]

embedder = SentenceTransformerEmbedding(model_name="all-MiniLM-L6-v2")
retriever = EmbeddingRetriever(document_type=str, embedding_provider=embedder)

# 2. The RAG Loop: Retrieve -> Augment → Generate
def answer_query(query):
    # Step A: Semantic search (the "Retrieval" in RAG)
    result = retriever.run(query)  # returns a list of relevant docs
    
    # Step B & C combined by the framework: 
    # Embedder embeds query → Generator builds augmented prompt → LLM generates answer
    generator = GeneratedAnswerGenerator()
    answer = generator.run(answer=result, question=query)

    return answer  # A Response object containing the generated string and confidence score
```

Pay attention to the temperature parameter in your LLM calls. Increasing the number up to 1 will increase the creativity of the response. Your results will vary considering we are using a model with a low parameter count and heavy quantization. Because of this, we set the evaluator agent to be semi-strict with a temperature of 0.4 and a passing score of 6, and the generator temperature to 0.5.

Full output from app.py

> python app.py

---------ROUND 1--------------

Candidate Response:
### Technical Overview: Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a design pattern that augments the ...
---

### Why it helps
LLMs suffer from two fundamental failure modes that RAG mitigates:
...

---

### Limitations
RAG is not a silver bullet; its performance is bounded by the retrieval pipeline:
...

---

### A simple example: "Ask your documentation"

Instead of asking a model, *"How do I authenticate with our internal API?"* (which would fail on private data), the RAG pipeline executes three steps:
...

Review:
{'approved': False, 'score': 5, 'feedback': "The candidate provides a concise overview but lacks depth in explaining how RAG improves accuracy. The example is vague and doesn't clearly illustrate the retrieval process or its impact on hallucination reduction."}

---------ROUND 2--------------
Candidate Response:
### Retrieval-Augmented Generation (RAG): A Technical Overview

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Model (LLM) outputs by grounding generation in retrieved external information rather than relying solely on the model’s internal weights. Instead of retraining or fine-tuning, RAG injects relevant context directly into the prompt at inference time.

#### Why It Helps
Standard LLMs are limited to their training cutoff and can hallucinate when asked about private data or niche details. RAG addresses these issues by:
*   **Reducing Hallucination:** The model synthesizes an answer from provided evidence rather than guessing.
*   **Enabling Private/Dynamic Data:** You can query a live database, documentation, or internal wikis without retraining the model on that content.
*   **Attribution:** Because the source text is in the context window, you can ask the system to cite which document informed each part of its answer.

#### Limitations
RAG is not a silver bullet and has several failure modes:
*   **Retrieval Quality:** If the retriever returns irrelevant or incomplete chunks (bad embedding match, bad chunking strategy), the generator cannot produce a correct answer.
*   **Context Window Limits:** You can only inject so many retrieved documents; dense information in too few tokens degrades quality.
*   **Lost in the Middle:** LLMs may struggle to weigh every piece of context equally if the relevant information is buried deep within a long prompt.

#### Example Workflow
Think of RAG as an **"open-book exam."** Instead of asking the model "What was our company's Q3 revenue?", you use a retriever (e.g., FAISS or Pinecone with embeddings) to fetch the relevant paragraph from a financial report and construct this prompt:

> Context: [Retrieved Chunk about Q3 Revenue]
> Question: What was our company's Q3 revenue?
> Answer based only on the context above.

The model then summarizes that specific chunk rather than hallucinating a number from its weights.

Review:
{'approved': True, 'score': 6, 'feedback': 'The explanation lacks technical depth on how retrieval quality impacts hallucination reduction and misses key limitations like context window constraints. A clearer breakdown would strengthen the assessment.'}

Response accepted.

=== FINAL RESULT ===

### Retrieval-Augmented Generation (RAG): A Technical Overview

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Model (LLM) outputs by grounding generation in retrieved external information rather than relying solely on the model’s internal weights. Instead of retraining or fine-tuning, RAG injects relevant context directly into the prompt at inference time.

#### Why It Helps
Standard LLMs are limited to their training cutoff and can hallucinate when asked about private data or niche details. RAG addresses these issues by:
*   **Reducing Hallucination:** The model synthesizes an answer from provided evidence rather than guessing.
*   **Enabling Private/Dynamic Data:** You can query a live database, documentation, or internal wikis without retraining the model on that content.
*   **Attribution:** Because the source text is in the context window, you can ask the system to cite which document informed each part of its answer.

#### Limitations
RAG is not a silver bullet and has several failure modes:
*   **Retrieval Quality:** If the retriever returns irrelevant or incomplete chunks (bad embedding match, bad chunking strategy), the generator cannot produce a correct answer.
*   **Context Window Limits:** You can only inject so many retrieved documents; dense information in too few tokens degrades quality.
*   **Lost in the Middle:** LLMs may struggle to weigh every piece of context equally if the relevant information is buried deep within a long prompt.

#### Example Workflow
Think of RAG as an **"open-book exam."** Instead of asking the model "What was our company's Q3 revenue?", you use a retriever (e.g., FAISS or Pinecone with embeddings) to fetch the relevant paragraph from a financial report and construct this prompt:

> Context: [Retrieved Chunk about Q3 Revenue]
> Question: What was our company's Q3 revenue?
> Answer based only on the context above.

The model then summarizes that specific chunk rather than hallucinating a number from its weights.

You have completed a multi-agent orchestration! Good job!
Now onto more complex agent patterns.