This project is a small CrewAI-based multi-agent system for analyzing Kubernetes-style deployment logs, finding root causes, investigating related issues online, and generating a remediation plan.
The workflow starts from a real log file, then uses three specialized AI agents in sequence:
- The Log Analyzer reads the log and extracts the main failure patterns.
- The Issue Investigator searches online for known causes and documented fixes.
- The Solution Specialist turns the findings into a practical remediation plan.
The system is designed to simulate a DevOps troubleshooting workflow with AI agents, tools, and basic guardrails.
The repository is organized as follows:
- main.py — entry point that creates the crew and starts the workflow.
- agents/agents.py — defines the three agents and their LLM configuration.
- tasks/tasks.py — defines the tasks, expected outputs, and guardrail validation.
- tools/tools.py — wires the file-reader tool and the EXA web-search tool.
- kubernetes_log.log — sample log used for the demo run.
- task_outputs/ — generated reports from the crew run.
At run time, the workflow proceeds as follows:
- main.py loads the sample log path and creates a Crew with three tasks.
- The first task asks the Log Analyzer to inspect the log file.
- The second task asks the Issue Investigator to search online for similar issues.
- The third task asks the Solution Specialist to turn the findings into a remediation plan.
- Task outputs are saved under task_outputs/.
The current crew uses a sequential process:
- Process: sequential
- Agents:
  - log_analyzer
  - issue_investigator
  - solution_specialist
- Tasks:
  - analyze_logs_task
  - investigate_issue_task
  - provide_solution_task
Each task depends on the previous findings for context, which makes the system behave like a chain of reasoning:
- Analyze the raw log
- Investigate the likely root cause
- Generate a concrete solution
This is the core multi-agent design of the project.
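The chaining described above can be illustrated without any LLM calls. The following is a minimal sketch in plain Python, not the project's actual code: `run_sequential` and the three stage functions are hypothetical stand-ins for the crew and its agents, and each stage simply receives the previous stage's output as its input.

```python
from typing import Callable, List

# A stage takes the previous output as context and returns its own output.
Stage = Callable[[str], str]


def run_sequential(stages: List[Stage], initial_input: str) -> List[str]:
    """Run stages in order, feeding each one the previous stage's output."""
    outputs: List[str] = []
    context = initial_input
    for stage in stages:
        context = stage(context)
        outputs.append(context)
    return outputs


# Hypothetical stand-ins for the three agents (no LLM or web calls are made).
def analyze_logs(log_text: str) -> str:
    return f"analysis of: {log_text}"


def investigate_issue(analysis: str) -> str:
    return f"investigation based on: {analysis}"


def provide_solution(investigation: str) -> str:
    return f"plan derived from: {investigation}"


results = run_sequential(
    [analyze_logs, investigate_issue, provide_solution],
    "pod failed",
)
```

In the real project this ordering is handled by the crew's sequential process; the sketch only shows why a weak first-stage result degrades everything downstream.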
Log Analyzer role: analyze log files and identify incidents, errors, warnings, timelines, and likely root causes.
Responsibilities:
- Parse deployment and runtime log lines
- Detect error patterns such as ImagePullBackOff, CrashLoopBackOff, and sandbox failures
- Create a structured analysis report
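To make the pattern-detection responsibility concrete, here is a small sketch of how the named error patterns could be counted with regular expressions. In the project this work is done by the LLM agent, so the function below is only an illustration; the sandbox regex and the sample lines are made up for this example and are not taken from kubernetes_log.log.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Patterns named in this README; the sandbox regex is an assumed approximation.
ERROR_PATTERNS = {
    "ImagePullBackOff": re.compile(r"ImagePullBackOff"),
    "CrashLoopBackOff": re.compile(r"CrashLoopBackOff"),
    "SandboxFailure": re.compile(r"sandbox", re.IGNORECASE),
}


def count_error_patterns(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many lines match each known error pattern."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return dict(counts)


# Invented sample lines, for illustration only.
sample_lines = [
    'Warning  Failed  kubelet  Back-off pulling image "app:v2": ImagePullBackOff',
    "Warning  BackOff  kubelet  Back-off restarting failed container: CrashLoopBackOff",
    "Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox",
    "Normal  Scheduled  default-scheduler  Successfully assigned default/app to node-1",
]
summary = count_error_patterns(sample_lines)
```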
Issue Investigator role: research the identified problem using external search.
Responsibilities:
- Search the internet for related error messages
- Gather official docs, forum posts, and known troubleshooting guidance
- Rank likely causes and proven fixes
Solution Specialist role: convert investigation findings into actionable remediation steps.
Responsibilities:
- Produce a step-by-step remediation plan
- Include commands and verification steps
- Recommend monitoring and prevention measures
The log analyzer uses a file reader tool to inspect the sample log file.
The investigation agent uses EXA to search the public web for similar issues and community guidance.
The project includes a simple guardrail in tasks/tasks.py:
- The log analysis task validates that at least one error was actually found.
- If the model output is too vague or empty, the task is retried.
This helps reduce low-quality or empty analysis results.
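A guardrail of this kind might look like the sketch below. It follows the (success, value) tuple convention that CrewAI documents for function guardrails, but the real check in tasks/tasks.py may differ: the function name, the keyword list, and the minimum length are all assumptions chosen for illustration. The function accepts either a plain string or an object with a `raw` attribute so that it can be tried without CrewAI installed.

```python
from typing import Any, Tuple

# Assumed heuristics; the project's real thresholds and keywords may differ.
ERROR_KEYWORDS = ("error", "failed", "backoff", "exception")
MIN_LENGTH = 50


def validate_log_analysis(output: Any) -> Tuple[bool, Any]:
    """Return (True, text) if the analysis names an error, else (False, reason)."""
    # Use the `raw` attribute when present, otherwise treat the input as text.
    text = str(getattr(output, "raw", output)).strip()
    if len(text) < MIN_LENGTH:
        return False, "Analysis is empty or too short; describe the errors found."
    if not any(keyword in text.lower() for keyword in ERROR_KEYWORDS):
        return False, "No concrete error was identified in the analysis."
    return True, text
```

When the first element is False, the second element serves as feedback for the retry, which is what makes the rejection useful rather than a silent failure.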
After a run, the project writes reports to task_outputs/:
- log_analysis.md
- investigation_report.md
- solution_plan.md
These files capture the different stages of the multi-agent reasoning process.
To set up the environment:
- Use Python 3.13; the current dependency set is tested against that version.
- Create the project virtual environment: py -3.13 -m venv .venv
- Activate it in PowerShell: .\.venv\Scripts\Activate.ps1
- Install dependencies: python -m pip install -r requirements.txt
- Add your API keys to .env before running the crew.
The default model is configured in agents/agents.py and can be overridden with OPENROUTER_MODEL in .env.
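The override logic can be sketched as follows. This is an assumption about how agents/agents.py might resolve the model, not a copy of it: `resolve_model` is a hypothetical helper, `DEFAULT_MODEL` is a placeholder value, and the sketch assumes the .env values have already been loaded into the process environment.

```python
import os
from typing import Mapping, Optional

# Hypothetical placeholder; the real default is defined in agents/agents.py.
DEFAULT_MODEL = "openrouter/example-provider/example-model"


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENROUTER_MODEL if set and non-empty, else the default model."""
    env = os.environ if env is None else env
    override = env.get("OPENROUTER_MODEL", "").strip()
    return override or DEFAULT_MODEL
```

Treating an empty value the same as an unset one avoids passing a blank model name to the LLM client when the variable is present in .env but left empty.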
Notes on model usage:
- Free OpenRouter models are rate-limited and may return 429 or 402 errors when quota is exhausted.
- If you hit those limits, add credits at OpenRouter or switch to a paid model in .env.
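The two status codes call for different handling: a 429 is usually worth retrying after a delay, while a 402 will not succeed until credits are added or the model is changed. The sketch below shows that distinction under simplifying assumptions. `call_with_backoff` and its `request` callable (returning a status code and body) are hypothetical; real client libraries typically raise exceptions rather than return status tuples, and the project itself may not implement any retry logic.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    request: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry on 429 with exponential backoff; fail immediately on 402."""
    for attempt in range(max_attempts):
        status, body = request()
        if status == 200:
            return body
        if status == 402:
            # Retrying cannot help until credits are added or the model changes.
            raise RuntimeError("402: credits exhausted; add credits or switch models")
        if status == 429 and attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
            continue
        raise RuntimeError(f"request failed with status {status}")
    raise RuntimeError("no attempts were made")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.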
Run the crew from the project root:
python main.py
The sample input used by the demo is the repository file kubernetes_log.log.
- A .env file is required because the agents and tools load environment variables at startup, so create one before running the crew.
- The OpenRouter key must have available credits for the LLM requests to succeed.
- The current system is intentionally simple and easy to extend with more agents, more tools, or more structured output formats.
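Because a missing key only surfaces once an agent or tool makes its first request, it can help to check the environment up front. The following is a minimal sketch of such a check; the variable names `OPENROUTER_API_KEY` and `EXA_API_KEY` are assumptions and should be replaced with whatever names the project's .env actually uses, and `missing_env_keys` is a hypothetical helper that is not part of the repository.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Assumed variable names; adjust to match the keys used in your .env file.
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "EXA_API_KEY")


def missing_env_keys(
    required: Sequence[str] = REQUIRED_KEYS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key, "").strip()]
```

A caller could run this before building the crew and exit with a clear message if the returned list is not empty.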