Straylabs

Really cool ASCII art of a Möbius band. Unfortunately, it has nothing to do with the article below. We just like topology.

If you are a security team looking to become more efficient, an SMB or startup that wants to move fast and safely, or you would simply like access to the private beta, please contact us on Discord or by email at contact@straylabs.ai

Introduction

Today, we are launching our platform for web application security reviews. For the last year, we have been working extensively on discovering what works and what doesn’t in agentic AI for offensive security. By the end of 2025, we had achieved one of the highest scores at the time on one of the best-known offensive security benchmarks with our harness and agentic architecture, using minimal resources, no external dependencies and mostly open-weight models.

But it was incomplete. We couldn’t really use it in production for several reasons.

First of all, we were not handling authentication properly. Authentication differs from one website to the next: the agent may need to follow redirects, keep track of the particular authentication mechanism and reuse the resulting session in other subagents.
We also had to deal with WAFs and bot protections (e.g. Cloudflare), which could stop the whole agentic workflow from working at all.
In real applications, we did not have full coverage of the application we were targeting. We were able to handle specific tasks, but an assessment that does not test every path is not a real security assessment.
Another technical problem was validation. In some cases, the agent reported hallucinated findings and false positives, and we needed a proper way to validate them.

For the past few months, we’ve been working on all of these problems to make our harness production-ready and usable on real targets. As a result, we are opening our private beta today to a limited number of users.

The problem we are trying to solve

In most of our discussions with security professionals, dev teams, SaaS companies and consulting companies, we heard the same requests:

A plug-and-play harness for web application pentesting.
Full coverage of the web application under review.
Validated findings and reporting.

Our goal is to build the best agentic tooling for devs and security teams, tooling that adds real value.

For dev teams: being able to run agentic security testing to get a direct overview of their security posture (from an attacker’s point of view) without extensive prior knowledge, with validated weaknesses that they can remediate directly.
For security teams: giving them the ability to use agentic tooling in their audits to move faster through repetitive security assessments and audit reports, while relying on their own technical expertise to push further on specific attack paths (for example, by letting the agent surface interesting source-to-sink hints that can then be explored). In short, it lets security professionals quickly build up an analysis and a set of findings.

Why we believe that it is not just about the model

In the last few months (at the time of writing this article), we saw rapid growth in AI models’ capabilities in offensive security, notably with Mythos and GPT 5.5.

Anthropic’s recent results (for example on Firefox) are undeniably impressive at finding bugs, and critical ones at that. Their benchmark results are off the charts too. But no matter how capable the model is, we saw (and we are of course not the only ones) that without a really good validation pipeline, we mostly end up generating a lot of false positives and erroneous results.

We also focus on runtime: we primarily test the application live. Even if we find a flaw in the code, we need to test it against the running application to evaluate its impact and severity, given the layers of protection (WAFs…) already in place.

What we believe still needs to be done is the whole pipeline: delivering the right context, testing specific weaknesses with specialized tooling, generating exploits and PoCs, validating the findings and reporting.

How we differ / The value that we want to produce

Among the many AI agent projects for offensive security, we’ve tried to stay concise and scientific in our approach, relying on evaluation and benchmarking.

Our goal is not to reinvent the wheel. In pentesting, we understand that most people already have their own scripts and tooling, especially for the reconnaissance phase.

This is one of the reasons we never really showcased AI agents that run subfinder, nmap or any other recon tool. It felt like using a powerful tool for tasks that are already easy to configure.

We see however a lot of value in:

Leaving the possibility for each team to implement their own recon scripting and tooling as skills or pre-hooks.
Correctly aggregating reconnaissance data so that it can be analysed by an LLM. More specifically, using a knowledge graph lets the agent understand the web application’s behavior and gives an overview of how much of the application has been covered. This gives the agent the power to understand and look for behavioral vulnerabilities (such as business logic vulnerabilities, IDORs…), not through a specific prompt but by understanding a feature and its business impact.
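
To make the knowledge-graph idea more concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the actual implementation: the `ReconGraph` class, the `exposes` and `takes` relation names and the sample endpoints are all hypothetical. It only shows how recon observations could be stored as labeled edges and how coverage could be read off the graph.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ReconGraph:
    """Tiny in-memory graph: string nodes, labeled edges (hypothetical design)."""
    edges: dict = field(default_factory=lambda: defaultdict(set))
    tested: set = field(default_factory=set)

    def add(self, source: str, relation: str, target: str) -> None:
        """Record one observation, e.g. a feature exposing an endpoint."""
        self.edges[source].add((relation, target))

    def endpoints(self) -> set:
        """Every node reached through an 'exposes' edge counts as an endpoint."""
        return {t for links in self.edges.values() for rel, t in links if rel == "exposes"}

    def mark_tested(self, endpoint: str) -> None:
        self.tested.add(endpoint)

    def coverage(self) -> float:
        """Fraction of known endpoints that have been tested at least once."""
        endpoints = self.endpoints()
        return len(endpoints & self.tested) / len(endpoints) if endpoints else 0.0

    def untested(self) -> list:
        return sorted(self.endpoints() - self.tested)


# Sample data invented for the example.
graph = ReconGraph()
graph.add("feature:orders", "exposes", "GET /api/orders/:id")
graph.add("feature:orders", "exposes", "POST /api/orders")
graph.add("feature:login", "exposes", "POST /login")
graph.add("GET /api/orders/:id", "takes", "param:id")
graph.mark_tested("POST /login")

print(f"coverage: {graph.coverage():.0%}")
print("still to test:", graph.untested())
```

Keeping features, endpoints and parameters in the same graph is what would let a task generator ask which endpoints of a given feature are still untested, rather than iterating over a flat list of URLs.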

Our interesting features

In what we want to deliver, we have resolved most of the infuriating issues that AI pentest agents encounter:

We can authenticate against any type of authentication (form, JSON, OAuth2…), given test credentials provided beforehand.
We bypass bot protections, captchas… so you don’t have to whitelist IPs or disable your WAF or any other protection you already have (as a matter of fact, this also gives us the possibility to look at the WAF’s implementation).
Given an OpenAPI specification (or a Swagger link), we are able to gather that information and build more robust security tests.

We didn’t achieve this by relying on third-party services, because we felt that would open up a whole different class of issues around how the traffic is sent and handled. So we rebuilt everything and integrated it natively into our agent (and soon into our open-source project, see below).
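
Returning to the OpenAPI point in the list above, here is a minimal sketch of how a specification could be flattened into a list of operations to test. The `list_operations` helper and the inline `SPEC` document are assumptions made for this example, not the platform's code; the `paths`, `parameters` and `security` keys follow the OpenAPI 3 layout. The sketch does not resolve `$ref` entries.

```python
import json

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

# Inline sample specification, invented for the example.
SPEC = json.loads("""
{
  "openapi": "3.0.0",
  "paths": {
    "/api/orders/{id}": {
      "get": {
        "parameters": [{"name": "id", "in": "path", "required": true}],
        "security": [{"bearerAuth": []}]
      }
    },
    "/login": {
      "post": {"security": []}
    }
  }
}
""")


def list_operations(spec: dict) -> list:
    """Return one record per (method, path) found under 'paths'."""
    default_security = spec.get("security", [])
    operations = []
    for path, item in spec.get("paths", {}).items():
        shared_params = item.get("parameters", [])  # path-level parameters
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            params = shared_params + op.get("parameters", [])
            operations.append({
                "method": method.upper(),
                "path": path,
                # Unresolved "$ref" parameters have no "name" and are skipped.
                "params": [(p["in"], p["name"]) for p in params if "name" in p],
                # An empty operation-level "security" list means no auth required.
                "requires_auth": bool(op.get("security", default_security)),
            })
    return operations


for operation in list_operations(SPEC):
    print(operation)
```

Each record carries enough context (method, path, parameters, whether authentication is expected) to seed a targeted test such as an IDOR check on a path parameter.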

Lessons learned from previous experiments

Summary of experiments

In our latest experiments, we achieved feedback-driven reasoning that lets the agent find the right payload to inject depending on the target and the task we were exploring.

While most implementations rely on explicitly specifying the type of vulnerability (basically a long prompt listing all the mutations and payloads to test per vulnerability: XSS, IDOR…), we didn’t find that approach really “agentic”; it felt more like a DAST approach.

Another issue we experienced is that when a model is given specific payload instructions, it usually stays bound to those instructions and does not look beyond them (in most cases, with the open-weight models that we tested).

Because most models today are trained with RL, and given the amount of data they are trained on, our thinking was to let the agent use basic tooling to mutate payloads and “try harder”, by feeding the results back to it along with a confidence score.

In a way, the model has already been trained on the data that we were going to give it in the instructions, so we might as well let it find the payloads it needs to inject.
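
The feedback loop described above can be sketched as follows. This is a simplified illustration, not our harness: `propose`, `send` and `score` are hypothetical callables standing in for the model, the HTTP tooling and the scoring step, and the toy target at the bottom is invented so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    payload: str
    response: str
    confidence: float  # 0.0 (no signal) to 1.0 (confirmed)


def feedback_loop(
    propose: Callable[[List[Attempt]], str],
    send: Callable[[str], str],
    score: Callable[[str, str], float],
    max_steps: int = 10,
    threshold: float = 0.9,
) -> Optional[Attempt]:
    """Propose a payload, observe the result, feed it back, repeat."""
    history: List[Attempt] = []
    for _ in range(max_steps):
        payload = propose(history)  # the proposer sees every earlier attempt
        response = send(payload)
        attempt = Attempt(payload, response, score(payload, response))
        history.append(attempt)
        if attempt.confidence >= threshold:
            return attempt
    return None  # budget exhausted without a confident result


# --- Toy stand-ins, invented for the example ---
def toy_send(payload: str) -> str:
    # Naive filter: strips the literal "<script>" tag, then reflects the input.
    return "<p>" + payload.replace("<script>", "") + "</p>"


def toy_score(payload: str, response: str) -> float:
    # Intact reflection is a strong signal; anything else is a weak one.
    return 1.0 if payload in response else 0.3


def toy_propose(history: List[Attempt]) -> str:
    # A real agent would reason over the history; this one just switches vector.
    if not history:
        return "<script>alert(1)</script>"
    return "<img src=x onerror=alert(1)>"


result = feedback_loop(toy_propose, toy_send, toy_score)
print(result)
```

The point of the design is that no payload list is hard-coded in the loop itself: the proposer only receives the history of attempts and their confidence scores, and decides what to try next.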

Harness problem and long running agents

Failure analysis on our benchmarks showed that our agent was always able to find the right sink and reason about it.

Our last public result of 81% on XBOW’s benchmark dates back more than two months. It was obtained not with a high-end model like Opus or GPT 5.5 but with open-weight models like Kimi K2.5, and the whole benchmark run cost us around $122, compared with ~$1200 for earlier tests with Sonnet 4.5, roughly 10x the price.

In those failures, 90% of the time the cause was the tooling we had built and a lack of knowledge accumulation between subagents. From that, we understood the need to improve on:

Tooling, which was not robust enough, particularly for authentication and exploit generation.
General task context handling, where we weren’t giving subagents the ability to track what works and what needs to be done as the next step.
Post-finding validation, which was still basic and relied on looking for a specific flag.

Towards self-improving agentic vulnerability research

In the coming months, we are going to continue our research to improve our results and architecture. While we currently focus mainly on gathering recon and sometimes retrying the same tasks, we have some really interesting ideas for evolving our agents along with the lifecycle of the user’s application. We’re looking to explore self-improving agents in vulnerability research, where the agent could change its own behavior and code to perform better and do more.

Platform launch: attempting total coverage of web applications

One of the biggest issues that we had to solve is how to achieve as much coverage as possible of the application we are targeting.

Our agent was initially built for directed, specific tasks. This was not usable for a whole web application, which calls for a long-running agent that works through several steps. What we have done is create the right workflow to look at every angle, every feature and every source in a web application. We then derive security tests that are specific to the target application.

This is where it is fundamentally different from a DAST or other automated tooling. We do not test for things that have no relation to the application at hand. We only test for what we see in the application and in the technology stack we gathered, mimicking what we would do manually.

Deadend platform high-level schema

%%{init: {"flowchart": {"htmlLabels": true, "wrappingWidth": 260, "padding": 16}}}%%
flowchart TD
  recon("Recon phase<br/>Automated tooling<br/>gathers attack surface")
  gen("Security test generation<br/>App-specific tasks<br/>via knowledge graph")

  recon --> gen

  subgraph swarm["Swarm agents — one agent per task"]
    a1("Agent → IDOR on /api/orders/:id")
    a2("Agent → Auth bypass on /login")
    a3("Agent → …")
  end

  findings("Findings<br/>Only if confirmed")

  gen --> a1
  gen --> a2
  gen --> a3
  a1 & a2 & a3 --> findings
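
The schema above can be read as a small pipeline. The sketch below is a simplified illustration under assumptions, not the platform's code: `generate_tasks`, `run_agent` and `validate` are hypothetical stand-ins with placeholder logic, and the recon data is hard-coded. It only shows the shape of the flow: recon feeds task generation, one agent runs per task, and only confirmed findings are kept.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    kind: str
    target: str


@dataclass(frozen=True)
class Finding:
    task: Task
    evidence: str


def generate_tasks(recon: dict) -> List[Task]:
    """Derive app-specific tasks from what recon actually observed."""
    tasks = []
    for endpoint in recon["endpoints"]:
        if ":id" in endpoint:
            tasks.append(Task("IDOR", endpoint))
        if endpoint.endswith("/login"):
            tasks.append(Task("Auth bypass", endpoint))
    return tasks


def run_agent(task: Task) -> Optional[Finding]:
    """Placeholder agent: pretends only the IDOR task produced evidence."""
    if task.kind == "IDOR":
        return Finding(task, "order of another user returned with HTTP 200")
    return None


def validate(finding: Finding) -> bool:
    """Placeholder validator: a real one would replay and confirm the exploit."""
    return bool(finding.evidence)


def run_pipeline(recon: dict) -> List[Finding]:
    tasks = generate_tasks(recon)
    # One agent per task, run concurrently; map() preserves task order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        candidates = list(pool.map(run_agent, tasks))
    # Findings are reported only if confirmed.
    return [f for f in candidates if f is not None and validate(f)]


recon = {"endpoints": ["/api/orders/:id", "/login", "/about"]}
for finding in run_pipeline(recon):
    print(finding.task.kind, "on", finding.task.target)
```

Note that `/about` generates no task at all: nothing is tested unless recon gives a reason to test it.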

Our open-source harness roadmap

We mostly use open-source tooling throughout our platform, projects and deployments. We cannot express enough gratitude to companies such as Project Discovery for their awesome cybersecurity tooling suite. We are also grateful for the community-maintained cybersecurity tooling that gave us the knowledge and ability to learn in this field.

Our first open-source effort was the deadend CLI, where we presented the work we had done in the agentic pentesting field. But this CLI is nowhere near as optimized as Claude Code, Pi or Codex (and many others).

However, we’ve lately been thinking about at least contributing our work in taint analysis and vulnerability research as a harness that could be used with those coding agents (as an MCP server, a skill or a Pi extension). This harness would be equipped with a set of basic tools (shell, browser for pentesting, Python sandbox and validator), optimized for dynamic analysis, and capable of testing a web application while authenticated, creating exploits, validating results and returning a report.

We are open to ideas and discussions around this subject.

Thank you for reading. If you want to know more or get access to our private beta, please contact us on Discord or by email at contact@straylabs.ai or yassine@straylabs.ai