OpenAI’s GPT-5 models leap ahead in writing more secure code, study finds

November 19, 2025 HackAcademy

OpenAI’s latest generation of AI models has shown a marked improvement in writing safer software, with new research suggesting GPT-5 now outperforms rival systems at avoiding common security flaws in generated code.

A study by cybersecurity company Veracode put more than 100 large language models through 80 code completion challenges. Each task could be finished in a way that was fully secure, or in a way that worked but contained a known vulnerability.

In those tests, OpenAI’s GPT-5 Mini produced secure code in 72 percent of cases, up from just under 60 percent for previous OpenAI models when the same benchmark was run earlier this year. The standard GPT-5 model was close behind on 70 percent.

Google’s Gemini 2.5 Pro came in at 59 percent, followed by xAI’s Grok 4 at 55 percent. Anthropic’s Claude Sonnet 4.5 lagged further back on 50 percent and performed slightly worse than its predecessor, which had scored 53 percent in the previous round of testing.

The results highlight a positive shift for OpenAI’s newest models, which were criticised at launch for underwhelming improvements in everyday use. When GPT-5 arrived in August, chief executive Sam Altman promoted its high-level reasoning, but users soon complained about its tone and usefulness in general conversation. Altman later admitted that the rollout had not gone as planned.

Where GPT-5 appears to be delivering, however, is in a less visible but highly significant area: secure coding.

More reasoning, fewer bugs

Veracode believes one of the main reasons for the improvement is architectural rather than a matter of raw power. GPT-5 models incorporate extra internal reasoning steps before finalising an answer. According to the researchers, that behaviour looks similar to an automated code review, with the model effectively checking its own work before returning the final snippet.

The company’s chief technology officer, Jens Wessling, credited OpenAI with making a deliberate push to reduce security weaknesses in generated code, and said the gains seen in this test were among the most substantial the team had observed between model generations.

Despite that progress, Wessling warned that the technology is still far from foolproof. Even at a 72 percent success rate, GPT-5 Mini was introducing a known vulnerability in 28 percent of tasks, or more than one in four. In his view, that is not yet reliable enough to allow AI to write production code without human oversight from experienced developers and security specialists.
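To put the reported scores in perspective, the short Python sketch below converts each secure-code rate quoted in this article into the share of tasks that ended up with a known vulnerability. It assumes every task counts equally towards the score, which the article does not confirm, and the function names are purely illustrative.

```python
# Secure-code rates (percent) as quoted in this article.
secure_rate = {
    "GPT-5 Mini": 72,
    "GPT-5": 70,
    "Gemini 2.5 Pro": 59,
    "Grok 4": 55,
    "Claude Sonnet 4.5": 50,
}


def insecure_share(secure_pct):
    """Percentage of tasks completed with a known vulnerability."""
    return 100 - secure_pct


def one_in_n(secure_pct):
    """Express the insecure share as 'one task in N' (assumes equal weighting)."""
    return 100 / insecure_share(secure_pct)


for model, pct in secure_rate.items():
    print(f"{model}: {insecure_share(pct)}% insecure, "
          f"about one in {one_in_n(pct):.1f} tasks")
```

On these figures, even the best performer falls somewhere between one insecure result in three tasks and one in four.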

All of the systems tested, including the top performers, still produced code with basic flaws such as SQL injection weaknesses, which can allow attackers to query or tamper with databases using crafted input.
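For readers unfamiliar with the flaw, the sketch below shows what a SQL injection weakness can look like in practice and how a parameterised query avoids it. It is an illustrative example built on Python's standard sqlite3 module, not a task from the Veracode benchmark; the table, function names and attack string are all made up for the demonstration.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with two example users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: the input is pasted straight into the SQL text,
    # so a crafted value can change the meaning of the query.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: the ? placeholder passes the input as data, never as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"  # crafted input, not a real user name

unsafe_rows = find_user_unsafe(conn, payload)  # every row leaks
safe_rows = find_user_safe(conn, payload)      # no user has this name

print(len(unsafe_rows), len(safe_rows))
```

Both functions work for ordinary input, which is why this kind of flaw can slip through when generated code is only checked for whether it runs.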

Flawed training data, flawed code

Part of the problem, Veracode argues, lies in how these models are trained. Founder Chris Wysopal noted that the systems learn from enormous public code repositories that include everything from polished, professionally reviewed software to student projects and experiments that have never been through any formal security process.

Because the models absorb patterns from all of that material, they can easily reproduce bad habits and unsafe patterns alongside good ones. Without explicit safeguards and additional training focused on security, those weaknesses will surface in generated code.

Veracode’s findings will add weight to calls from security experts who argue that AI-generated code should always be treated as untrusted until it has been reviewed, tested and scanned with traditional tools.

None of the major AI providers named in the report, including OpenAI, Google, Anthropic and xAI, had responded to requests for comment at the time of publication.

For now, the message from the researchers is clear. AI tools can speed up development and help engineers catch mistakes, but they are not yet ready to replace human judgment when it comes to keeping software secure.
