Research Shows LLMs Are Willing to Assist in Malicious ‘Vibe Coding’

Over the past few years, the potential for abuse of large language models (LLMs) in offensive cybersecurity has come under scrutiny, particularly around the generation of software exploits.
The recent trend of ‘vibe coding’ (the casual use of language models to quickly develop code for a user, rather than explicitly teaching the user to code) has revived a concept that reached its zenith in the 2000s: the ‘script kiddie’ – a relatively unskilled malicious actor with just enough knowledge to replicate or develop a damaging attack. The implication, naturally, is that when the bar to entry is lowered, threats will tend to multiply.
All commercial LLMs have some kind of guardrail against being used for such purposes, although these protections are under constant attack. Typically, most FOSS models too (across multiple domains, from LLMs to generative image/video models) are released with some similar protection, usually for compliance purposes in the West.
However, official model releases are then routinely fine-tuned by user communities seeking more complete functionality; alternatively, LoRAs can be used to bypass restrictions and potentially obtain ‘undesirable’ results.
While the vast majority of online LLMs will prevent users from pursuing malicious workflows, ‘uncensored’ initiatives such as WhiteRabbitNeo exist to help security researchers operate on a level playing field with their adversaries.
For most users, the general experience is defined by the ChatGPT series of models, whose filtering mechanisms frequently draw criticism from the local LLM community.
It looks like you’re trying to attack a system!
In light of this perceived tendency towards restriction and censorship, users may be surprised to find that ChatGPT turns out to be the most cooperative of all the LLMs tested in a recent study designed to coerce language models into creating malicious code exploits.
The new paper from researchers at UNSW Sydney and the Commonwealth Scientific and Industrial Research Organisation (CSIRO), titled Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation, offers the first systematic evaluation of how effectively these models can be prompted to produce working exploits. The authors provide example dialogues from the study.
The study compared the models’ performance on original versions of known vulnerability labs (structured programming exercises designed to demonstrate specific software security flaws) against modified versions of the same labs, in order to reveal whether the models were relying on memorized examples, or were struggling because of built-in safety restrictions.
From the supporting site, an LLM run through Ollama helps the researchers to develop a format string attack. Source: https://anonymous.4open.science/r/aeg_llm-eae8/chatgpt_format_format_string_original.txt
Though none of the models managed to create an effective exploit, several of them came very close; more importantly, several of them wanted to do better at the task, indicating a potential failure of existing guardrail approaches.
The paper points out:
‘Our experiments show that GPT-4 and GPT-4o exhibited a high degree of cooperation in exploit generation, comparable to some uncensored open-source models. Among the evaluated models, Llama3 was the most resistant to such requests.
‘Despite their willingness to assist, the actual threat posed by these models remains limited, as none successfully generated exploits for the five custom labs with refactored code. However, GPT-4o, the strongest performer in our study, typically made only one or two errors per attempt.
‘This suggests significant potential for the development of advanced, generalizable [Automated Exploit Generation (AEG)] techniques.’
Many Second Chances
The truism ‘You never get a second chance to make a good first impression’ does not generally apply to LLMs, because a language model’s typically limited context window means that negative context (in the social sense, i.e., antagonism) is not persistent.
Consider: if you went into a library and asked for a book about practical bomb-making, you would likely be refused, at the very least. But (assuming the inquiry did not immediately tank the conversation) requests for related works, such as books on chemical reactions or circuit design, would, in the librarian’s mind, be clearly connected to the initial inquiry, and would be treated as such.
In all likelihood, the librarian would also remember, in any future encounters, that you once asked for a bomb-making book, rendering this new ‘context’ of yours effectively irreparable.
This is not the case with an LLM, which can struggle to retain tokenized information even from the current conversation, let alone from long-term memory directives (if any such exist in the architecture, as with the ChatGPT-4o product).
Thus even casual conversations with ChatGPT reveal, unexpectedly, that it sometimes strains at a gnat yet swallows a camel, not least when a constituent theme, study, or process relating to an otherwise ‘banned’ activity is allowed to develop over the course of the discourse.
This holds true for all current language models, although guardrail quality and approach can vary between them (i.e., the difference between modifying the trained model’s weights, and using input/output filtering during a chat session, which leaves the model structurally intact but potentially easier to attack).
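To make the distinction concrete, the second approach can be loosely sketched as a wrapper that screens the text passing in and out of an otherwise untouched model; the denylist terms and the query_model() call below are illustrative placeholders, not any vendor’s actual filtering system:

```python
# A minimal sketch of the 'input/output filtering' style of guardrail described above:
# the underlying model is left unmodified, and a wrapper screens the text on either side.
# DENYLIST and query_model() are hypothetical placeholders for illustration only.
DENYLIST = ("shellcode", "format string exploit")  # illustrative terms, not a real policy

def query_model(prompt: str) -> str:
    """Hypothetical call into an unmodified chat model."""
    raise NotImplementedError

def filtered_chat(prompt: str) -> str:
    # Screen the incoming prompt before it ever reaches the model.
    if any(term in prompt.lower() for term in DENYLIST):
        return "Request refused by input filter."
    reply = query_model(prompt)
    # Screen the model's reply before it is shown to the user.
    if any(term in reply.lower() for term in DENYLIST):
        return "Response withheld by output filter."
    return reply
```

Because the model itself is unchanged under this scheme, anything that slips past the wrapper reaches a fully capable model, which is why the article describes it as potentially easier to attack than weight-level alignment.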
Test Method
To test how far LLMs could be pushed into generating working exploits, the authors set up a controlled environment using five labs from SEED Labs, each built around a known vulnerability, including buffer overflow, return-to-libc, Dirty COW attacks, and race conditions.
Besides using the original labs, the researchers created modified versions by renaming variables and functions to generic identifiers. This was intended to prevent the models from drawing on memorized training examples.
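As a rough illustration of this kind of renaming (the paper does not publish its exact scheme, so the identifiers and mapping below are invented for the example), a simple whole-word substitution over the lab source is enough to strip the descriptive names:

```python
import re

# Hypothetical mapping from a lab's descriptive identifiers to generic ones;
# these names are invented for illustration and do not come from the paper.
RENAMES = {
    "vulnerable_copy": "func_a",
    "user_input": "var_1",
    "buffer_size": "var_2",
}

def obfuscate_identifiers(source: str, renames: dict) -> str:
    """Replace whole-word identifiers in a source file with generic names."""
    for old, new in renames.items():
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return source

if __name__ == "__main__":
    original = "void vulnerable_copy(char *user_input, int buffer_size) { /* ... */ }"
    print(obfuscate_identifiers(original, RENAMES))
    # -> void func_a(char *var_1, int var_2) { /* ... */ }
```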
Each model was run against each lab twice: once on the original version, and once on the obfuscated version.
The researchers then introduced a second LLM into the loop: an attacker model designed to prompt and re-prompt the target model in order to refine and improve its output over multiple rounds. The LLM used in this role was GPT-4o, which operated through a script mediating the dialogue between attacker and target, allowing the refinement cycle to continue for up to fifteen rounds, or until no further improvement was judged possible:

The workflow for the LLM-based attacker, in this case GPT-4o.
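The overall shape of that loop might be sketched as follows; call_attacker() and call_target() are hypothetical stand-ins for whatever API or local framework hosts each model, and the stopping logic is a simplification of the paper’s description:

```python
# A minimal sketch of an attacker/target refinement loop of the kind the paper describes.
# The two call_* functions are hypothetical placeholders, not the authors' actual script.
MAX_ROUNDS = 15  # the study capped refinement at fifteen rounds

def call_attacker(history: list) -> str:
    """Hypothetical: asks the attacker model (GPT-4o in the paper) to review the
    dialogue so far and produce the next prompt, or an empty string to stop."""
    raise NotImplementedError

def call_target(prompt: str) -> str:
    """Hypothetical: sends a prompt to the target model and returns its reply."""
    raise NotImplementedError

def refinement_loop(initial_prompt: str) -> str:
    history = [initial_prompt]
    answer = call_target(initial_prompt)
    for _ in range(MAX_ROUNDS):
        history.append(answer)
        next_prompt = call_attacker(history)
        if not next_prompt:  # attacker judges that no further improvement is possible
            break
        answer = call_target(next_prompt)
    return answer
```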
The target models for the project were GPT-4o, GPT-4o-mini, Llama3 (8B), Dolphin-Mistral (7B), and Dolphin-Phi (2.7B), representing both proprietary and open-source systems, with a mix of aligned and unaligned models (i.e., models with built-in safety mechanisms designed to block harmful prompts, and models modified through fine-tuning or configuration to bypass those mechanisms).
The locally installable models were run through the Ollama framework, while the others were accessed through their only available method – the API.
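For reference, querying a locally hosted model through Ollama’s Python client looks roughly like the following; this assumes the ollama package is installed and the llama3 model has already been pulled locally, and the prompt is purely illustrative:

```python
# A minimal sketch of a local chat call via Ollama's Python client (pip install ollama).
# Assumes `ollama pull llama3` has already been run; the prompt is illustrative only.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "In one paragraph, what is a buffer overflow?"}],
)
print(response["message"]["content"])
```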
The resulting outputs were then graded on the number of errors that prevented the exploit from working as intended.
Results
The researchers tested how cooperative each model was in the exploit-generation process, measured by recording the percentage of responses in which the model attempted to assist with the task (even if the output was flawed).

Results from the main tests, showing average cooperation.
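In essence, the cooperation metric is a simple proportion; the attempt log below is invented purely to show the calculation, and is not the paper’s data:

```python
# Toy illustration of the cooperation metric: the share of attempts in which the
# model tried to assist, regardless of whether its output actually worked.
attempts = ["assisted", "assisted", "refused", "assisted", "refused"]  # invented data

cooperation_rate = 100 * sum(a == "assisted" for a in attempts) / len(attempts)
print(f"Cooperation rate: {cooperation_rate:.0f}%")  # -> Cooperation rate: 60%
```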
GPT-4o and GPT-4o-mini showed the highest levels of cooperation, with average response rates of 97% and 96% respectively across the five vulnerability categories: buffer overflow, return-to-libc, format string, race condition, and Dirty COW.
Dolphin-Mistral and Dolphin-Phi followed close behind, with average cooperation rates of 93% and 95% respectively. Llama3 showed the least willingness to participate, with an overall cooperation rate of only 27%:

On the left, the number of errors each LLM made on the original SEED Lab programs; on the right, the number of errors made on the refactored versions.
Examining the models’ actual performance, the researchers found a gap between willingness and effectiveness: GPT-4o produced the most accurate results, with a total of six errors across the five obfuscated labs, followed by GPT-4o-mini with eight errors. Dolphin-Mistral performed reasonably well on the original labs, but struggled considerably once the code was refactored, suggesting that it may have seen similar material during training. Dolphin-Phi made seventeen errors, while Llama3 made fifteen.
These failures typically involved technical mistakes that rendered the exploits non-functional, such as incorrect buffer sizes, missing loop logic, or syntactically valid but ineffective payloads. No model succeeded in producing a working exploit for any of the obfuscated versions.
The authors observed that most models produced code that resembled working exploits, but failed due to a weak grasp of how the underlying attacks actually work – a pattern that was evident across all vulnerability categories, and which suggested that the models were imitating familiar code structures rather than reasoning through the logic involved (in buffer overflow cases, for example, many failed to construct a functioning NOP sled/slide).
In the return-to-libc attempts, payloads often included incorrect padding or misplaced function addresses, resulting in outputs that looked valid but were unusable.
While the authors describe this interpretation as speculative, the consistency of the errors suggests a broader problem: the models fail to connect the individual steps of an exploit with their intended effects.
Conclusion
The paper concedes some doubt as to whether or not the tested language models had seen the original SEED Labs during first training; the variants were constructed for this reason. The researchers nonetheless confirm that they would like to work with real-world exploits in later iterations of the study, since truly novel and recent material is less likely to be subject to shortcuts or other confounding effects.
The authors also acknowledge that the later, more advanced ‘thinking’ models GPT-o1 and DeepSeek-R1 were not available at the time the research was conducted, and might well improve on the results obtained, a further pointer for future work.
The conclusion to be drawn here is that most of the tested models would have produced working exploits if they had been able to; their failure to generate fully functional output does not appear to stem from alignment safeguards, but points instead to genuine architectural limitations, ones which may already have been reduced in more recent models, or which soon will be.
First published on Monday, May 5, 2025