This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
Programmers have spent decades writing code for AI models, and now, in a full-circle moment, AI is being used to write code. But how does an AI code generator compare to a human programmer?
A study published in the June issue of IEEE Transactions on Software Engineering evaluated the code produced by OpenAI’s ChatGPT in terms of functionality, complexity, and security. The results show that ChatGPT has a very wide range of success when it comes to producing functional code—with a success rate ranging from 0.66% to 89%—depending on the difficulty of the task, the programming language, and a number of other factors.
Although in some cases the AI generator can produce better code than humans, the analysis also reveals some security issues with AI-generated code.
Yutian Tang, a professor at the University of Glasgow, contributed to the study. He notes that AI-based code generation could boost productivity and help automate software development tasks, but that it is important to understand the strengths and limitations of these models.
“By performing a comprehensive analysis, we can uncover potential issues and limitations that arise in ChatGPT-based code generation… [and] improve generation techniques,” Tang says.
To explore these limitations in more detail, his team sought to test GPT-3.5’s ability to solve 728 coding problems from the LeetCode testing platform in five programming languages: C, C++, Java, JavaScript, and Python.
“A reasonable hypothesis explaining why ChatGPT can better solve algorithm problems before 2021 is that these problems are frequently observed in the training dataset.” —Yutian Tang, University of Glasgow
Overall, ChatGPT was quite effective at solving problems in the five programming languages, especially when attempting coding problems that existed on LeetCode before 2021. For example, it was able to produce working code for easy, medium, and hard problems with success rates of around 89, 71, and 40%, respectively.
“However, when it comes to algorithm issues after 2021, ChatGPT’s ability to generate functionally correct code is affected. It sometimes fails to understand the meaning of questions even for easy-level problems,” Tang notes.
For example, ChatGPT’s ability to produce working code for “easy” coding problems dropped from 89% to 52% after 2021. And its ability to generate working code for “hard” problems also dropped from 40% to 0.66% after this period.
“A reasonable hypothesis for why ChatGPT can better solve algorithm problems before 2021 is that these problems are frequently observed in the training dataset,” Tang explains.
Essentially, as coding evolves, ChatGPT has not yet been exposed to new problems and solutions. It lacks the critical thinking skills of a human and can only solve problems it has already encountered. This could explain why it is so much more effective at solving older coding problems than newer ones.
“ChatGPT may generate incorrect code because it does not understand the meaning of algorithm problems.” —Yutian Tang, University of Glasgow
Interestingly, ChatGPT is able to generate code with lower runtimes and memory overheads than at least 50% of human solutions to the same LeetCode problems.
The researchers also studied ChatGPT’s ability to correct its own coding errors after receiving feedback from LeetCode. They randomly selected 50 coding scenarios in which ChatGPT initially generated incorrect code, either because it didn’t understand the content or because it couldn’t solve the problem.
While ChatGPT was good at fixing compilation errors, it was generally not good at fixing its own errors.
“ChatGPT can generate incorrect code because it doesn’t understand the meaning of algorithm problems. Therefore, this simple error feedback information is not enough,” Tang explains.
The researchers also found that the code generated by ChatGPT had a number of vulnerabilities, such as a missing null test, but many of them were easily fixable. Their results also show that the generated code in C was the most complex, followed by C++ and then Python, whose generated code was similar in complexity to human-written code.
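To make the “missing null test” concrete, here is a small hypothetical illustration in Python (an example of ours, not one drawn from the study’s data): the unguarded version crashes on empty or missing input, and the fix is a one-line check.

```python
# Hypothetical example of the "missing null test" defect class.

def find_longest_unsafe(words):
    longest = words[0]  # IndexError on an empty list, TypeError on None
    for w in words:
        if len(w) > len(longest):
            longest = w
    return longest

def find_longest_safe(words):
    if not words:  # guard against both None and the empty list
        return None
    longest = words[0]
    for w in words:
        if len(w) > len(longest):
            longest = w
    return longest

print(find_longest_safe(["null", "check", "example"]))  # -> "example"
print(find_longest_safe(None))                          # -> None, no crash
```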
Tang says that, based on these findings, it is important for developers using ChatGPT to provide additional information to help it better understand problems or avoid vulnerabilities.
“For example, when encountering more complex programming problems, developers can provide as much relevant knowledge as possible and tell ChatGPT in the prompt what potential vulnerabilities to be aware of,” Tang says.
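As a sketch of what that advice could look like in practice, a developer might fold both the relevant domain knowledge and the vulnerabilities to watch for directly into the request. The prompt wording and model name below are illustrative assumptions, not details taken from the study; the call uses the OpenAI Python SDK.

```python
# Minimal sketch of Tang's prompting advice using the OpenAI Python SDK.
# The model name and prompt wording are illustrative assumptions,
# not details taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write a Python function that parses a user-supplied date string "
    "in ISO 8601 format and returns a datetime object.\n\n"
    # Extra domain knowledge, per Tang's suggestion:
    "Relevant knowledge: the input comes from an untrusted web form and "
    "may be empty, None, or malformed.\n"
    # Explicitly flag vulnerabilities to avoid:
    "Potential vulnerabilities to be aware of: missing null/empty-input "
    "checks and unhandled parsing exceptions."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the paper evaluated GPT-3.5
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```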