Generated AI makes PEN-TEST vulnerability repair worse

liralbes April 22, 2025

0 4 minutes read

Generated AI makes PEN-TEST vulnerability repair worse

Technology, organizational and cultural factors are preventing businesses from finding vulnerability in penetration testing – problems that generate AI are intensifying rather than mitigating.

According to a study by penetration testing as a service company Cobalt, organizations will fix less than half (48%) of all exploitative vulnerabilities, a figure that drops to 21% for labeled AI Gen Gen App defects.

Vulnerabilities identified in security audits are rated as high or critical for severity, with a score of 69%.

The median time to resolve serious vulnerabilities has been greatly reduced since 2017 – from 112 days last year to 37 days. Kubart said this shows the positive impact of the “left” security plan.

Repair headaches

Sometimes, organizations make conscious business decisions to accept certain risks, rather than disrupting operations or incurring huge costs in addressing certain vulnerabilities.

Poor remediation plans and resource constraints are also a factor in slow patching. In some cases, vulnerabilities are found in older software or hardware that cannot be easily updated or replaced.

“Some organizations can only work for compliance or third-party approvals – to obtain Pentecost,” Cobalt researchers wrote. “Remediation risks are less direct concerns. However, in most cases, it depends on many organizational issues that involve people, processes, and technology.”

Next Gen-ai-con

Cobalt’s latest yearly edition of the Pentagram Report found that most companies have conducted pen tests on large language model (LLM) web applications, with one-third (32%) of the tests finding vulnerabilities worthy of careful evaluation.

Various LLM defects were identified, including timely injection, model manipulation and data leakage, and only 21% of the defects were fixed. Cobalt warns that AI development “has no safety net to play without safety net.”

These figures are based on an analysis of data collected in more than 5,000 pen tests of cobalt. In a survey of their customers, more than half of security leaders (52%) said they were under pressure to prioritize speed rather than safety.

Vulnerability “marked but not fixed”

Independent security experts told CSO that the cobalt discovery was consistent with what they witnessed on the bug repair stage.

“Most organizations are still too slow to resolve known vulnerabilities, which is rarely attributed to a lack of awareness,” Sparrow, chief operating officer of senior engineering executive officer Sparrow, told CSO. “Vulnerabilities are being marked – but they are not fixed.”

Mitigation measures to mitigate vulnerability have been delayed as businesses face competitive priorities.

“The safety team is overstretched, the engineering team focuses on transportation functions, and unless there is regulatory pressure or violations, solving the problem of “known issues” does not get the same attention.” Lei said.

Error fixes in the AI era

AI applications in particular introduce a different set of problems that complicate vulnerability repair.

“Many of them are quickly built, using new frameworks and third-party tools that have not been fully tested in production environments,” Lei said. “You have strange surfaces of attack, unpredictable models, and dependencies that teams cannot fully control.”

“So, even if you find vulnerabilities, solving them can be complex and time-consuming – assuming you even have internal expertise,” Ray added.

The generated AI application has two components: the application and the AI itself, usually an LLM, such as ChatGpt.

“Traditional application vulnerabilities are as easy to resolve as normal vulnerabilities; there is no difference,” said Inti de Ceukelaire, chief hacker officer of Bug Bounty Platform Intigriti.

For example, a Gen AI application can decide to use programming features to find certain documents. If there is a vulnerability in this programming feature, the developer can simply change the code.

In contrast, the vulnerability in LLM itself (the neural network or “brain” of AI) is “more difficult to fix because it is not always easy to understand why certain behaviors are triggered,” de Ceukelaire said.

“People might make assumptions, train or adjust the model to avoid this behavior, but you can’t be 100% sure the problem has been solved,” he said. “In that sense, comparing it to a traditional ‘patch’ might be a bit stretching.”

When asked by Intigriti’s comments, Cobalt said that its work and findings related to Gen Gen focused primarily on “verifying the integrity of the system supported by LLM, rather than evaluating the entire breadth of LLM’s trained behavior or output.”

Bug diversion

If the CISO wants to improve the remediation rate, it needs to make it easier for the team to determine a security fix. This could mean integrating security tools early in the development process, or setting performance metrics within the resolution time of the solution.

“It also means having clear ownership – someone responsible for making sure the vulnerability is actually fixed, not just the submission,” said Sparrow’s Lei.

Other experts believe that security professionals should focus their limited resources on the most risky vulnerabilities, such as the serious vulnerability of direct exposure to the Internet.

According to Tod Beardsley, accidental exposure and reduction of technical debt should also be given priority.

“Good penetration testing will help CISO identify areas where criminals may thrive, rather than simply listing a set of key vulnerabilities without context,” Beardsley told CSO.

Security teams are easily overwhelmed by the number of vulnerability that includes routine penetration tests (results from vulnerability scanning tools) as well as vulnerability tests (including regular penetration tests).

“It’s an information overload, and it’s really hard for the team to manage all of these issues and determine remedial measures based on the severity of the risk,” said Thomas Richards, director of infrastructure security operations at Black Duck, an application security testing company.

Like Runzero’s Beardsley, Richards believes that the results of the pen test need to be viewed in the right environment.

“When a report is obtained after a penetration test, the internal security team will review the report to determine its accuracy and what actions to take next,” Richards said. “This step does take time, but allows organizations to prioritize the highest risk first.”

The results of the vulnerability scanning tool need to be more cautious.

“We often find through automation tools that the default severity of output is not always accurate given other factors such as available exploits, network accessibility and other remedies to reduce the risk of vulnerability,” Richards explained. “Generally, even on critical systems, issues are patched.”