Summary
The increasing integration of Artificial Intelligence (AI) into programming is fundamentally transforming the way software is conceived and developed. With advances in generative AI, exemplified by foundation models such as OpenAI Codex, GitHub Copilot, and Google DeepMind's AlphaCode, programmers are increasingly adopting tools that automate crucial parts of the development process, generating code from specific requirements.
However, these advances come with significant challenges. Among them are cybersecurity concerns related to protecting users’ privacy and personal data, the illusion that automatically generated code is secure, and issues related to the data fed into the models. Security policies also emerge as a critical area of concern, along with threats of malware, data poisoning, hallucinations, and a range of biases that generative AI models may incorporate.
These challenges underscore the crucial importance of addressing not only the effectiveness and efficiency of these AI tools but also their ethical and security implications. As we move forward in this era of AI-driven programming, it is essential to tackle these challenges proactively and collaboratively, ensuring that the benefits of these technologies are maximized while risks are effectively mitigated.
The generative AI ecosystem
Generative AI is a branch of artificial intelligence that creates new content rather than merely recognizing or classifying existing data. Applied to programming, it generates new code based on patterns learned from existing datasets, producing original code that mirrors what it observed during training.
The utilization of generative AI in code generation has yielded several advantages:
Speeding up the SDLC
Its application accelerates the Software Development Life Cycle (SDLC) by automating code generation. Leveraging foundation models enables swift delivery of software products and features to end users, capitalizing on market trends and demands. According to a McKinsey study, generative AI can help developers complete common coding tasks up to twice as fast.
Increased Developer Productivity
Developer productivity is boosted by at least 33%, according to a Stack Overflow survey with 39,042 responses. By automating repetitive tasks, these tools reduce arduous work and help developers manage their time more efficiently, allowing them to focus on more complex coding aspects and leaving more time for higher-level design and innovation.
Democratization of Programming
Generative AI reduces entry barriers for both technical and non-technical individuals. Even those with limited experience can utilize these solutions to create functional code snippets without needing to grasp the intricacies of programming deeply. This not only empowers new developers but also fosters inclusivity and diversity in the software development community, making it more open and accessible to all.
Generative AI and Our Data
Tools like ChatGPT are trained on vast amounts of text, which may or may not contain personal information. This training data informs the tool but is not supposed to be recalled verbatim. However, a study from Indiana University showed how an LLM's memory can be jogged into recalling precise personal information with just a bit of fine-tuning.
By default, when you use ChatGPT, OpenAI will save data such as your conversation history, as well as account details like name, email, IP address, and device information. This collected data can be used to train and improve future iterations of OpenAI models. Note that you can opt out of your data being stored and/or used for training, on both the web and mobile versions of ChatGPT.
Similarly, Google’s Bard stores your conversation data for up to 18 months by default. Human reviewers read and annotate conversation data to improve future iterations of the model, so you should never disclose confidential or proprietary information. You can choose to delete Bard conversation history in your Google Account settings, but even with this option, human reviewers will see the depersonalized conversation data.
Risks and Challenges of Cybersecurity in Generative AI
This section delves into the primary security issues associated with code generated by generative AI systems. This critical aspect of integrating generative AI into programming raises various concerns, from security vulnerabilities to potential loopholes that could compromise the integrity and security of developed systems.
There are several risks associated with using code written by generative AI models, such as:
Security Illusion
When developers use generative AI to obtain code snippets, they are more prone to believe the code they receive and write is secure. According to a Stanford University study, AI-based coding tools were observed generating insecure code in controlled laboratory environments, a finding that raises significant questions about their application in real-world scenarios.
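As a hypothetical illustration of the kind of insecure pattern such studies describe, the sketch below contrasts a naively generated SQL query built by string concatenation with a parameterized alternative. The table, column, and function names are invented for the example and do not come from any specific tool's output.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # A pattern code assistants can produce: building SQL by string
    # concatenation. An input like "' OR '1'='1" changes the query's
    # logic (SQL injection), even though the snippet "looks" correct.
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver escapes the value, so user input
    # cannot alter the structure of the statement.
    query = "SELECT id, email FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()
```

Reviewing generated snippets against checks like this, rather than trusting them by default, is one way to counter the security illusion.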
Compliance with the GDPR
Generative AI's ability to create synthetic code highly similar to real code raises crucial questions about identifying the origin of that code. Because AI-generated code is difficult to distinguish from human-written code, concerns arise regarding the correct attribution of authorship and legal responsibility. This is particularly relevant in the context of the General Data Protection Regulation (GDPR), where clear identification of the code's origin is essential to ensure compliance with data protection regulations.
Precautions in Data Input
It is crucial to adopt precautions regarding the content of prompts entered into generative AI models. The code submitted to such services may contain not only confidential intellectual property of the company but also sensitive data such as API keys that grant privileged access to customer information. Transmitting sensitive data to third-party AI providers like OpenAI can pose compliance issues with laws such as GDPR, especially if this data includes Personally Identifiable Information (PII). At Balwurk, we understand the critical importance of Governance, Risk and Compliance (GRC) and provide extensive services to safeguard your systems. Our dedicated team is poised to fortify your defenses and promptly and effectively address any security concerns you may have.
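As a minimal, hypothetical sketch of this kind of precaution, the snippet below redacts obvious secrets and PII from a prompt before it leaves the organization. The patterns, placeholders, and function name are illustrative assumptions and would not catch every kind of sensitive data; a real control would need broader, audited rules.

```python
import re

# Illustrative patterns only; real deployments need broader, audited rules.
SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_API_KEY]"),    # OpenAI-style keys
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),       # AWS access key IDs
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),  # email addresses (PII)
]

def redact_prompt(prompt: str) -> str:
    """Replace recognizable secrets/PII with placeholders before a prompt
    is sent to a third-party generative AI service."""
    for pattern, placeholder in SECRET_PATTERNS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt

if __name__ == "__main__":
    raw = "Refactor: client = Api(key='sk-ABCDEFGHIJKLMNOPQRSTUVWX', owner='dev@example.com')"
    print(redact_prompt(raw))
```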
Understanding Company Policies
Before incorporating generative AI tools into your work, it’s crucial to be aware of your company’s policies regarding their usage. Some organizations, such as Amazon, Wells Fargo and Deutsche Bank, have outright prohibited the use of ChatGPT for company tasks.
For those with more lenient policies, it’s still essential to exercise caution. Avoid inputting sensitive or proprietary information into generative AI tools. Additionally, consider disabling chat history to prevent your conversations from being used to train AI models. Remember, your conversation history might not be entirely private.
Malware Threats
The ability of Generative AI to create new types of complex malware poses a serious cybersecurity threat. Such malware can be designed to bypass conventional detection methods, making it challenging for security systems to identify and neutralize it. This significantly increases the risk of successful cyberattacks, including ransomware attacks, which can cause substantial damage to organizations' systems and data. Furthermore, using code snippets generated by Generative AI without proper prior analysis can inadvertently introduce malicious code, including ransomware, into software systems, compromising data security and integrity.
To address these concerns, Balwurk offers a range of services, including Application Security Testing (AST), specifically Static Application Security Testing (SAST). SAST services provided by Balwurk can be applied to analyze code generated by Generative AI, helping to identify and mitigate potential vulnerabilities and malware threats. By leveraging SAST solutions, organizations can enhance their cybersecurity posture and protect against the risks associated with the use of AI-generated code.
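As one hedged illustration of how a static scan can fit into such a workflow, the sketch below runs Bandit, an open-source SAST tool for Python, over a directory of AI-generated code and prints the reported issues. It assumes Bandit is installed (pip install bandit); the directory name is made up for the example, and other languages would require a different scanner.

```python
import json
import subprocess

def scan_generated_code(path: str) -> list[dict]:
    """Run Bandit over a directory of AI-generated code and return its findings.

    Assumes the `bandit` CLI is installed; the directory name is illustrative.
    """
    result = subprocess.run(
        ["bandit", "-r", path, "-f", "json"],  # recursive scan, JSON report on stdout
        capture_output=True,
        text=True,
    )
    report = json.loads(result.stdout)
    return report.get("results", [])

if __name__ == "__main__":
    for issue in scan_generated_code("generated_src"):
        print(f'{issue["filename"]}:{issue["line_number"]} '
              f'[{issue["issue_severity"]}] {issue["issue_text"]}')
```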
Data Poisoning
Data poisoning is another cybersecurity risk associated with Generative AI. This type of attack targets AI models in their development and testing environments. It involves injecting malicious data into training data, thus influencing the resulting AI model and its outputs. A Generative AI tool that has been targeted by a data poisoning attack may produce significant and unexpected deviations in its output. Detecting model poisoning is challenging, as the poisoned data may appear harmless. As a result, organizations whose models fall victim to data poisoning may violate GDPR and specific AI regulations, depending on their security and organizational measures.
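One basic, illustrative control against tampering with training data is to snapshot and verify the dataset before each training run. The sketch below does this with file hashes; the function names and paths are assumptions, and this is by no means a complete defense, since poisoned samples can also enter upstream, before any snapshot is taken.

```python
import hashlib
import json
from pathlib import Path

def snapshot_dataset(data_dir: str, manifest_path: str) -> None:
    """Record a SHA-256 hash for every training file so later tampering
    (e.g. injected poisoned samples) becomes detectable."""
    manifest = {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(data_dir).rglob("*")) if p.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

def verify_dataset(data_dir: str, manifest_path: str) -> list[str]:
    """Return files that were added, modified, or removed since the snapshot."""
    manifest = json.loads(Path(manifest_path).read_text())
    current = {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(data_dir).rglob("*")) if p.is_file()
    }
    changed = [f for f in current if manifest.get(f) != current[f]]
    removed = [f for f in manifest if f not in current]
    return changed + removed
```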
Package Hallucinations
Generative AI hallucinations refer to phenomena in which artificial intelligence models generate outputs that are incorrect or irrelevant to the task at hand. These hallucinations can result in absurd, distorted, or illogical responses and are often symptoms of flaws in the model’s training or limitations in its ability to comprehend and generalize information. Generative AI is susceptible to hallucinations, particularly in recommending non-existent or outdated packages, known as package hallucinations.
AI hallucinations are akin to how humans sometimes perceive shapes in clouds or faces on the moon. In the case of AI, these misinterpretations occur due to various factors, including overfitting, training data bias/inaccuracy, and high model complexity.
In June 2023, VulnCheck researchers discovered malicious fake GitHub repositories attributed to a non-existent company named High Sierra Cyber Security. It is imperative not to blindly execute code from LLMs. Before installing recommended packages, verify their legitimacy by confirming that they exist and examining factors such as creation date, download count, and feedback from online coding communities.
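As a hedged sketch of that verification step for Python packages, the function below queries PyPI's public JSON API to confirm that a recommended package actually exists and reports basic metadata. Popularity and community feedback still need to be checked separately; the example package names are illustrative.

```python
import json
import urllib.error
import urllib.request

def check_package_exists(name: str) -> dict | None:
    """Query PyPI's JSON API for a recommended package.

    Returns basic metadata, or None if the package does not exist on PyPI,
    which is a common sign of a hallucinated recommendation.
    """
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            data = json.load(response)
    except urllib.error.HTTPError:
        return None  # 404: the package is not published on PyPI
    info = data["info"]
    latest_files = data.get("releases", {}).get(info["version"], [])
    uploaded = latest_files[0]["upload_time"] if latest_files else "unknown"
    return {"name": info["name"], "version": info["version"], "uploaded": uploaded}

if __name__ == "__main__":
    for pkg in ["requests", "definitely-not-a-real-package-xyz"]:
        print(pkg, "->", check_package_exists(pkg))
```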
Balwurk can help in identifying and mitigating the risks associated with package hallucinations by offering robust cybersecurity solutions tailored to the specific needs of organizations.
Generative AI and Bias
Security implications also arise from bias in the code generated by generative AI systems, particularly in in-house implementations. When developing such systems internally, it is paramount to understand the diverse sources of bias that can affect code generation. Three primary areas of concern are biases in training data, algorithmic biases, and cognitive biases.
Training data bias
In the context of code generation, bias in training data can result in models that favor certain programming styles or paradigms over others. For example, if the training dataset predominantly contains examples of code written in a specific language or following certain coding practices, the model may reproduce these patterns more frequently, even if other approaches are equally valid.
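A simple, illustrative way to surface this kind of skew before training is to measure how the corpus is distributed across languages or styles. The sketch below merely counts files by extension in an assumed dataset directory, which is only a rough proxy for the biases discussed here.

```python
from collections import Counter
from pathlib import Path

def corpus_language_distribution(data_dir: str) -> dict[str, float]:
    """Rough proxy for training-data skew: share of files per language,
    inferred from file extensions in the (assumed) dataset directory."""
    extensions = Counter(p.suffix for p in Path(data_dir).rglob("*") if p.is_file())
    total = sum(extensions.values()) or 1  # avoid division by zero on empty dirs
    return {ext or "<none>": count / total for ext, count in extensions.most_common()}

if __name__ == "__main__":
    for ext, share in corpus_language_distribution("training_corpus").items():
        print(f"{ext}: {share:.1%}")
```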
Algorithmic bias
Algorithmic bias in code generation can arise when the model favors certain code solutions or structures based on arbitrary or non-representative criteria. For instance, if the model is trained on a dataset primarily consisting of code solutions from a particular community or source, it may tend to generate code that reflects those preferences, even if there are more effective or appropriate approaches for the given problem.
Cognitive bias
Cognitive bias in code generation can occur when developers interpret and implement the results generated by the model in a biased manner, influenced by their own experiences, biases, or assumptions. For example, if a developer has a predisposition towards certain coding practices or design concepts, they may tend to select and use model results that align with their own preferences, even if it results in less objective or efficient solutions.
It’s essential to proactively address these sources of bias when developing generative AI systems for code generation to ensure that the models can produce unbiased, objective, and culturally sensitive results. This includes diversifying training datasets, implementing bias control mechanisms during training, and fostering an organizational culture that values equity and diversity in decision-making related to software development.
Conclusion
In conclusion, as generative AI becomes increasingly integrated into programming, it presents significant cybersecurity risks. These include concerns about user privacy, code security, and biases within AI models. By prioritizing collaboration and ethical practices, we can effectively address these challenges and harness the full potential of AI-driven programming while safeguarding against potential risks.
References
Manjoo, F. (2023, June 3). Opinion | It’s the end of computer programming as we know it. (And I feel fine.). The New York Times. https://www.nytimes.com/2023/06/02/opinion/ai-coding.html
Wilkinson, L. (2024, March 21). How CIOs can infuse security into generative AI adoption. CIO Dive. https://www.ciodive.com/news/cybersecurity-generative-ai-flaws-platform-ecosystem-vulnerabilities/710627/
SonarSource. (2024, February 9). AI code generation: Benefits and risks. Sonar. https://www.sonarsource.com/learn/ai-code-generation-benefits-risks
Deniz, B. K., Gnanasambandam, C., Harrysson, M., Hussin, A., & Srivastava, S. (2023, June 27). Unleashing developer productivity with generative AI. McKinsey & Company. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/unleashing-developer-productivity-with-generative-ai
Stack Overflow. (2023, June 14). Hype or not? AI's benefits for developers explored in the 2023 Developer Survey. https://stackoverflow.blog/2023/06/14/hype-or-not-developers-have-something-to-say-about-ai/
Wankhede, C. (2023). Does ChatGPT save your data? Here's how to delete your conversations. Android Authority. https://www.androidauthority.com/does-chatgpt-save-data-conversations-3310883/
Mok, A. (2023). Amazon, Apple, and 12 other major companies that have restricted employees from using ChatGPT. Business Insider. https://www.businessinsider.com/chatgpt-companies-issued-bans-restrictions-openai-ai-amazon-apple-2023-7
Perry, N., Srivastava, M., Kumar, D., & Boneh, D. (2022). Do users write more insecure code with AI assistants? arXiv. https://arxiv.org/abs/2211.03622
GlobalSign. (2023, September 7). 8 generative AI security risks that you should know. https://www.globalsign.com/en/blog/8-generative-ai-security-risks
IBM. (n.d.). What are AI hallucinations? https://www.ibm.com/topics/ai-hallucinations
Lanyado, B. (2023, December 13). Can you trust ChatGPT's package recommendations? Vulcan Cyber. https://vulcan.io/blog/ai-hallucinations-package-risk?nab=0