TLDR: We tested 6 humanization models against 3 AI detectors across 100 texts.

V2 was the only model that could bypass GPTZero (34% pass rate).

We analyzed why some texts pass and others don't, found systematic patterns, and built V3 — same model, better prompt.

Result: 67% GPTZero bypass rate as default, up to 84% with style selection. Short texts went from 17.5% to 67.5% by default, and to 80% with the creative style.

No retraining needed.

Why we ran this experiment

AI detectors are getting better. GPTZero in particular has become the benchmark that most of our users care about. Our v2 model was decent — it was the only model on the market that had any meaningful GPTZero bypass rate — but "decent" wasn't good enough.

We wanted to understand exactly why some texts pass detection and others don't. Not guesswork. Not vibes. We wanted data.

So we designed a proper experiment.

The setup

We generated 100 AI essays using GPT-4o-mini across 50 different topics — science, technology, health, economics, philosophy, culture, and more. We made sure to cover different content lengths:

40 short texts (~300 words)
50 medium texts (~500 words)
10 long texts (~800 words)

Every text used the same generation parameters with a fixed random seed for reproducibility. No cherry-picking.
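A minimal sketch of how a reproducible sample plan like this could be laid out. It assumes two essays per topic and a seeded shuffle of the length buckets; the topic names, the `build_plan` helper, and the seed value are illustrative placeholders, not the actual generation pipeline.

```python
import random

# Length buckets from the setup above: (label, target words, number of texts).
BUCKETS = [("short", 300, 40), ("medium", 500, 50), ("long", 800, 10)]


def build_plan(topics, seed=42):
    """Assign a length bucket to each (topic, essay) slot, reproducibly.

    Assumes two essays per topic (50 topics -> 100 essays). How lengths
    were actually distributed across topics is not stated in the post.
    """
    lengths = [(label, words) for label, words, n in BUCKETS for _ in range(n)]
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(lengths)
    slots = [topic for topic in topics for _ in range(2)]
    return [
        {"topic": topic, "bucket": label, "target_words": words}
        for topic, (label, words) in zip(slots, lengths)
    ]


topics = [f"topic-{i:02d}" for i in range(50)]  # placeholder topic names
plan = build_plan(topics)
```

Calling `build_plan` again with the same seed returns the same plan, which is the property a fixed seed is meant to guarantee.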

6 models, head to head

We ran every generated text through 6 different humanization approaches:

Rephrasy V1 — our first-generation production model
Rephrasy V2 — our second-generation model (previous best)
GPT-5.1 — a vanilla model with instructions to rewrite naturally
Fine-tuned OSM — a custom model trained specifically for humanization
Fine-tuned OSM (DPO) — trained with direct preference optimization
Fine-tuned 8B-small OSM — a lightweight fine-tuned model

Each output was quality-checked: if the humanized text was less than 70% of the original word count, we retried up to 3 times before recording it as a failure.
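The quality check can be sketched roughly as follows. The function names are hypothetical, `humanize` stands for any callable that takes and returns a string, and "retried up to 3 times" is read here as one initial attempt plus up to three retries, which is an assumption.

```python
def word_count(text):
    """Whitespace-delimited word count."""
    return len(text.split())


def humanize_with_retry(original, humanize, min_ratio=0.70, max_retries=3):
    """Return (output, ok).

    An output is accepted when it keeps at least `min_ratio` of the
    original word count. After the initial attempt plus `max_retries`
    retries, the last output is returned with ok=False (a failure).
    """
    threshold = min_ratio * word_count(original)
    output = ""
    for _ in range(1 + max_retries):
        output = humanize(original)
        if word_count(output) >= threshold:
            return output, True
    return output, False


# Usage with a stand-in humanizer that drops half the words.
def halving_stub(text):
    words = text.split()
    return " ".join(words[: len(words) // 2])


source = "one two three four five six seven eight nine ten"
result, ok = humanize_with_retry(source, halving_stub)
```

With the stub above, every attempt keeps only 50% of the words, so the call ends with `ok` set to `False` and the sample would be recorded as a failure.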

3 detectors, not just one

We didn't just test against GPTZero. Every humanized output was evaluated by three independent AI detection systems:

GPTZero — the most widely used commercial detector
ZeroGPT — a popular public detector
Rephrasy AI Detector — our own proprietary detection system

Using multiple detectors matters. Each one uses different detection strategies, and a model that fools one might get caught by another.
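One way to picture the multi-detector evaluation is a per-detector pass rate over the same set of outputs. In this sketch the detectors are toy stand-ins, not real API clients, and the 0.5 cut-off is an assumption, since each real detector defines its own threshold.

```python
def pass_rates(texts, detectors, threshold=0.5):
    """Fraction of texts each detector scores below `threshold`.

    `detectors` maps a name to a callable: str -> AI probability in [0, 1].
    """
    rates = {}
    for name, score in detectors.items():
        passed = sum(1 for text in texts if score(text) < threshold)
        rates[name] = passed / len(texts)
    return rates


# Toy detectors for illustration only.
toy_detectors = {
    "strict": lambda text: 0.9,
    "length_based": lambda text: 0.2 if len(text) > 2 else 0.8,
}

rates = pass_rates(["a", "bb", "ccc", "dddd"], toy_detectors)
```

Here the same four texts score a 0% pass rate against one toy detector and 50% against the other, which is the kind of disagreement that motivates testing against more than one system.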

The results: V2 was already the best

After running all 600+ detection tests, one thing was obvious: V2 was the only model with any real GPTZero bypass capability.

Every other model — including three custom fine-tuned models we built from scratch — stayed below a 7% pass rate against GPTZero. Fine-tuning alone wasn't enough. Throwing more compute at the problem wasn't enough.

V2 had a 34% first-pass bypass rate. Not great, but miles ahead of everything else. That made it the clear foundation to build on.

Finding the patterns

This is where it gets interesting.

We took every V2 output and split them into two groups: texts that passed GPTZero and texts that didn't. Then we analyzed what made them different.

The patterns were surprisingly consistent. Texts that passed detection shared specific stylistic features. Texts that failed shared different ones. The difference wasn't random — it was systematic.

We identified the key signals that GPTZero uses to flag content, and we identified the counter-patterns that make text look human. We won't share the specifics (for obvious reasons), but the data was clear enough to act on.
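The general shape of a pass/fail comparison like this can still be sketched without the specifics. The two features below (mean sentence length and its spread) are generic placeholders chosen for illustration; they are not the signals found in the study, and the helper names are hypothetical.

```python
import re
import statistics

FEATURE_NAMES = ("mean_sentence_len", "sentence_len_stdev")


def features(text):
    """Two placeholder stylistic features; expects at least one sentence."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    lengths = [len(s.split()) for s in sentences]
    return {
        "mean_sentence_len": statistics.fmean(lengths),
        "sentence_len_stdev": statistics.pstdev(lengths),
    }


def compare_groups(samples):
    """`samples` is a list of (text, passed) pairs with both groups non-empty.

    Returns, per feature, the mean over passed texts, the mean over
    failed texts, and their difference (passed minus failed).
    """
    groups = {True: [], False: []}
    for text, passed in samples:
        groups[bool(passed)].append(features(text))
    report = {}
    for name in FEATURE_NAMES:
        passed_mean = statistics.fmean(f[name] for f in groups[True])
        failed_mean = statistics.fmean(f[name] for f in groups[False])
        report[name] = {
            "passed": passed_mean,
            "failed": failed_mean,
            "diff": passed_mean - failed_mean,
        }
    return report


samples = [
    ("Short one. A much longer sentence follows here now.", True),
    ("Same size here. Same size here.", False),
]
report = compare_groups(samples)
```

A consistently large `diff` for a feature across many samples is what would mark it as a candidate signal worth encoding in a prompt.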

Building v3

Armed with these findings, we built an enhanced system prompt that explicitly encodes the patterns we found. No model retraining. No new fine-tuning data. No post-processing chains. Just a better prompt and another pre-processing step, informed by real data.

We tested this enhanced prompt on 100 new samples using the exact same model and parameters as v2.

The v3 results

Previous model (v2): 34% GPTZero bypass rate, average AI probability 0.67

New model (v3): 67% GPTZero bypass rate, average AI probability 0.35

That's a 33 percentage point improvement. The average AI probability dropped by about 48% in relative terms.
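The two figures measure different things: the bypass rate change is an absolute difference in percentage points, while the AI probability change is relative to the v2 value. A short worked calculation using the numbers above:

```python
v2_rate, v3_rate = 0.34, 0.67  # GPTZero bypass rates
v2_prob, v3_prob = 0.67, 0.35  # average AI probability

# Absolute change in bypass rate, in percentage points.
point_gain = (v3_rate - v2_rate) * 100

# Relative change in average AI probability, in percent of the v2 value.
relative_drop = (v2_prob - v3_prob) / v2_prob * 100

print(f"{point_gain:.0f} percentage points")   # 33 percentage points
print(f"{relative_drop:.1f}% relative drop")   # 47.8% relative drop
```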

It works across all text lengths

One of the biggest weaknesses of v2 was short text. If you humanized a paragraph or a short answer, v2 only passed 17.5% of the time. That was a real problem for users who work with shorter content.

v3 closes that gap:

Short texts (~300 words): 17.5% → 67.5% — a 50 percentage point improvement
Medium texts (~500 words): 46% → 66% — a 20 percentage point improvement
Long texts (~800 words): 40% → 70% — a 30 percentage point improvement

The biggest gain is on short texts, which was the hardest category. v3 brings every length category to roughly the same 67% bypass rate. Your content length no longer matters.
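As a consistency check, the per-length rates can be combined into the overall figures with a weighted average. This sketch assumes the v3 samples used the same 40/50/10 length split as the original test set, which is not stated explicitly.

```python
# (number of texts, v2 bypass rate, v3 bypass rate) per length bucket
buckets = {
    "short": (40, 0.175, 0.675),
    "medium": (50, 0.46, 0.66),
    "long": (10, 0.40, 0.70),
}

total = sum(n for n, _, _ in buckets.values())
v2_overall = sum(n * v2 for n, v2, _ in buckets.values()) / total
v3_overall = sum(n * v3 for n, _, v3 in buckets.values()) / total

print(f"v2 overall: {v2_overall:.0%}")  # v2 overall: 34%
print(f"v3 overall: {v3_overall:.0%}")  # v3 overall: 67%
```

Under that assumption, the bucket rates reproduce the headline 34% and 67% figures.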

Best of all, if you humanize your text in small paragraphs, the humanizer can bypass GPTZero almost every time, without a second humanization pass.

What didn't work

Not everything we tried succeeded, and we think that's worth sharing too.

Fine-tuning alone failed. We trained three custom models from scratch on humanization datasets. None of them exceeded 7% on GPTZero. The token-level distribution of fine-tuned models is still detectable, regardless of the training data.

Chaining models failed. We tried passing v2 output through a second model to further refine it. The bypass rate dropped from 60% to 0%. Running text through any standard language model re-introduces the exact token fingerprint that GPTZero detects.

This told us something important: GPTZero doesn't just look at writing style — it detects model-level token distributions.

Double-pass humanization barely helped. Running the same text through the humanizer twice improved v1 from 3% to 7%. Not worth the extra credit cost. A single pass with a better prompt is far more effective than multiple passes with a mediocre one.

What this means for you

Higher bypass rates, no extra steps.
Select "Undetectable Model v3" in the humanizer, paste your text, and click humanize. That's it.

Short content is no longer a weakness. Whether you're humanizing a paragraph, an essay, or a full article — the bypass rate is consistent at around 67%.

Available via API too

The new enhanced model is now available in our AI Humanizer API.

Pass `"v3"` as the model parameter:

```json
{
  "text": "Your AI-generated text here",
  "model": "v3"
}
```
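A minimal sketch of building that request in Python with the standard library. The endpoint URL, the API key, and the bearer-token header are placeholders and assumptions; only the `text` and `model` fields come from the payload above, so check the API documentation for the real values.

```python
import json
import urllib.request

# Placeholder values: the real endpoint and auth scheme are not given here.
API_URL = "https://api.example.com/humanize"
API_KEY = "YOUR_API_KEY"


def build_request(text, model="v3"):
    """Build (but do not send) a POST request with the JSON payload."""
    body = json.dumps({"text": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_request("Your AI-generated text here")
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending credits on a real call.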

v2 is now freely accessible for everyone

We've made the previous best model available to all users at no extra cost. v3 is exclusive to subscribers.

Model v3 is available now. Try it.

Going further: style-optimized prompts

We didn't stop at 67%. After proving that prompt engineering works, we tested whether different writing styles affect bypass rates differently.

We created style-specific prompts (professional, journalistic, and creative) and tested each one on 50 new samples across all text lengths.

The results surprised us:

- Creative style: 84% bypass rate, avg AI probability 0.18

- Journalistic style: 76% bypass rate, avg AI probability 0.26

- Professional style: 68% bypass rate, avg AI probability 0.38

The creative style hit 100% bypass on long texts and 80% on short texts — the category that used to be our weakest.

This means v3 now offers style selection. You're not locked into one tone.

Pick the style that fits your content, and the bypass rate stays high across the board.

Introducing Model v3: Enhanced GPTZero Bypass Rate