Exploring the Generation and Detection of Weaknesses in LLM Generated Code

None, None

Exploring the Generation and Detection of Weaknesses in LLM Generated Code

LLMs can not be trusted to produce secure code, but they can detect it

Bachelor Thesis (2024)

Author(s)

I. Vasiliauskas (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Al-Kaswan – Mentor (TU Delft - Software Engineering)

A. van Deursen – Graduation committee member (TU Delft - Software Engineering)

M. Izadi – Graduation committee member (TU Delft - Software Engineering)

Faculty

Electrical Engineering, Mathematics and Computer Science

Artificial intelligence Large Language Models (LLMs) Software security

To reference this document use:

https://resolver.tudelft.nl/uuid:d2bb29dc-3121-4bac-a77e-33c3a132ff24

More Info

expand_more

Publication Year

2024

Language

English

Graduation Date

28-06-2024

Awarding Institution

Delft University of Technology

Project

['CSE3000 Research Project']

Programme

['Computer Science and Engineering']

Faculty

Electrical Engineering, Mathematics and Computer Science

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) have gained a lot of popularity for code generation in recent years. Developers might use LLM-generated code in projects where the security of software matters. A relevant question is therefore: what is the prevalence of code weaknesses in LLM-generated code, and can we use LLMs to detect them? In this research, we generate prompts based on a taxonomy of code weaknesses and run them on multiple LLMs with varying properties. We evaluate the results on the existence of insecurities both manually and by the LLMs themselves. We can conclude that even when LLMs are not provoked and asked benign realistic requests, they often generate code containing known software weaknesses. We find a correlation between model parameter size and the percentage of secure answers. However, they are exceptionally successful in recognizing these insecurities themselves. Future work should focus on a wider set of models and a larger set of prompts, to get more results on this subject.

Files

CSE3000_Final_Paper_Ignas.pdf

(pdf | 0.174 Mb)

License info not available