“What is ‘best practice’ at the time of writing may slowly become ‘bad practice’ as the cybersecurity landscape evolves.”
Modern-day deep learning (DL) models, especially those powering sophisticated NLP applications, have become so advanced that they can run code diagnostics and even perform interventions on a codebase. For example, GitHub recently released Copilot, an AI-based programming assistant that can generate code in popular programming languages. All one has to do is give Copilot some context, such as comments, function names, and surrounding code. Copilot is powered by OpenAI Codex, a descendant of GPT-3 that is trained on open-source code, including “public code…with insecure coding patterns”, thus giving rise to the potential for “synthesise[d] code that contains these undesirable patterns”.
Because it is based on the OpenAI Codex family of models, Copilot’s tokenisation step is nearly identical to GPT-3’s: byte pair encoding is used to convert the source text into a sequence of tokens.
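To make the tokenisation step concrete, here is a toy sketch of byte pair encoding in pure Python. It is not Copilot’s actual tokeniser (which uses a large, learned merge table); it only illustrates the core mechanism of repeatedly merging the most frequent adjacent pair of symbols.

```python
# Toy illustration of byte pair encoding (BPE). A real tokeniser applies a
# fixed, pre-learned sequence of merges; this miniature version learns its
# merges greedily from the input text just to show the mechanism.
from collections import Counter

def bpe_merge_steps(text, num_merges):
    """Greedily merge the most frequent adjacent symbol pair, num_merges times."""
    tokens = list(text)  # start from individual characters
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            # replace each non-overlapping occurrence of the pair (a, b)
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merge_steps("aaabdaaabac", 1))  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

After enough merges, frequent substrings of source code (keywords, common identifiers) collapse into single tokens, which is what lets the model treat recurring code patterns as units.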
A Brief Overview of Copilot:
Given a prompt, Codex and Copilot try to autocomplete the code most relevant to it. Because the underlying model is trained on publicly available code, generation is ultimately a probabilistic exercise in finding the most likely code, which can usher bad code into systems. The researchers at NYU caution that the model will not necessarily generate the best code but rather the code that best matches what came before. According to the researchers, the quality of the generated code can also be strongly influenced by semantically irrelevant features of the prompt.
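The point about “best matching” rather than “best” code can be sketched as follows: a completion engine ranks candidate continuations by how probable the model finds them, not by how secure they are. The candidates and per-token log-probabilities below are invented purely for illustration.

```python
# Sketch of "highest-probability, not highest-quality": rank candidate
# completions by mean token log-probability and surface the top-scoring one.
# Both candidates and their log-probabilities are hypothetical; the insecure
# pattern wins here simply because it is assumed to be more common in training data.
def mean_logprob(token_logprobs):
    return sum(token_logprobs) / len(token_logprobs)

# (completion text, per-token log-probabilities), hypothetical values
candidates = [
    ("hashlib.md5(password)", [-0.2, -0.4, -0.3]),                           # common but weak hash
    ("hashlib.pbkdf2_hmac('sha256', pw, salt, 100000)", [-0.9, -1.1, -0.8]), # safer but rarer
]
best = max(candidates, key=lambda c: mean_logprob(c[1]))
print(best[0])  # the statistically likelier pattern wins, secure or not
```

Nothing in this ranking step looks at security at all, which is exactly why prevalence of a pattern in the training corpus, rather than its quality, drives what gets suggested.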
To find out how often such tools generate vulnerable code, NYU researchers investigated the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis, they prompted Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE’s “Top 25” list).
Common Weakness Enumeration (CWE) is an open community initiative sponsored by the Cybersecurity and Infrastructure Security Agency (CISA). According to MITRE, CWE is a community-developed list of common software weaknesses, such as flaws, faults, bugs, or other errors in software implementation, that can result in systems and networks being vulnerable to attack. The CWE List and its glossary are used to identify and describe these weaknesses in terms of CWEs.
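As a concrete example of one “Top 25” weakness the study’s scenarios probe for, here is CWE-89 (SQL injection) in miniature. The schema and data are invented for illustration; the contrast is between interpolating user input into the query text and binding it as a parameter.

```python
# CWE-89 (SQL injection) in miniature. The vulnerable variant splices
# attacker-controlled input into the SQL text; the safe variant binds it
# as data via a parameterised placeholder.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret'), ('bob', 'hunter2')")

def lookup_vulnerable(name):
    # BAD: user input becomes part of the SQL statement itself (CWE-89)
    return conn.execute(f"SELECT secret FROM users WHERE name = '{name}'").fetchall()

def lookup_safe(name):
    # GOOD: input is bound as a value, never parsed as SQL
    return conn.execute("SELECT secret FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(lookup_vulnerable(payload))  # leaks every row: [('s3cret',), ('hunter2',)]
print(lookup_safe(payload))        # returns nothing: []
```

A code completion that emits the first pattern is syntactically valid and often functionally “correct”, which is precisely why such weaknesses slip through when suggestions are ranked only by likelihood.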
The researchers validated Copilot’s performance on three distinct code generation axes, examining how it performs given the diversity of weaknesses, diversity of prompts, and diversity of domains. In total, they produced 89 different scenarios for Copilot to complete, yielding 1,692 programs, of which approximately 40% were found to be vulnerable.
(Image credits: Paper by Pearce et al.)
The above picture illustrates the methodology the researchers used to evaluate Copilot, which can be summarised as follows:

The researchers designed 54 scenarios across 18 different CWEs. From these, Copilot generated options that produced 1,087 valid programs. Of these, 477 (43.88%) were determined to contain a CWE, and 24 of the scenarios (44.44%) had a vulnerable top-scoring suggestion. Breaking down by language: 25 scenarios were in C, generating 516 programs, of which 258 (50.00%) were vulnerable, and 13 of those scenarios (52.00%) had a vulnerable top-scoring program. 29 scenarios were in Python, generating 571 programs, of which 219 (38.4%) were vulnerable.
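The reported rates follow directly from the raw counts above, as a quick check confirms:

```python
# Re-deriving the reported vulnerability rates from the study's raw counts.
results = {
    "all CWE scenarios":                (477, 1087),  # (vulnerable, total valid programs)
    "C programs":                       (258, 516),
    "Python programs":                  (219, 571),
    "scenarios w/ vulnerable top pick": (24, 54),
    "C scenarios w/ vulnerable top":    (13, 25),
}
for label, (vulnerable, total) in results.items():
    print(f"{label}: {100 * vulnerable / total:.2f}%")
```

These reproduce the 43.88%, 50.00%, 38.4% (38.35% at two decimals), 44.44% and 52.00% figures quoted in the study.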
Compared with the other two languages (Python and C), Copilot struggled to generate syntactically correct and meaningful Verilog, mostly due to the smaller amount of training data available. As Copilot is trained on open-source code available on GitHub, the authors believe that the variable security quality stems from the nature of the community-provided code. That is, where certain bugs are more visible in open-source repositories, those bugs will be more often reproduced by Copilot.
The researchers also observed that, because Copilot is a generative model, its outputs are not directly reproducible: for the same prompt, they warn, Copilot can generate different answers at different times. “As Copilot is both a black-box and closed source residing on a remote server, general users cannot directly examine the model used for generating outputs,” wrote the authors. They also admit that the scenarios written to validate Copilot’s performance do not completely reflect real-world coding, which is “messier” and contains larger amounts of context.
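The non-reproducibility is a direct consequence of how such models decode: the next token is typically sampled from a probability distribution (often with a temperature parameter) rather than chosen deterministically. The sketch below uses made-up token probabilities to show why the same prompt can yield different completions on different runs.

```python
# Why the same prompt can yield different completions: generative models
# *sample* the next token from a probability distribution instead of always
# taking the single most likely one. Token logits here are invented.
import math
import random

def sample_next(token_logits, temperature=0.8, rng=random):
    """Softmax over logits with temperature, then draw one token."""
    tokens, logits = zip(*token_logits.items())
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax numerator
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token candidates after the prompt `password = `
candidates = {"getpass(": 1.2, "input(": 1.0, '"hunter2"': 0.7}
print([sample_next(candidates) for _ in range(5)])  # varies from run to run
```

At higher temperatures the distribution flattens and outputs vary more; as temperature approaches zero, sampling degenerates to always picking the highest-logit token.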
There is no doubt that low-code and no-code tools and platforms will flourish. While the coding community can improve its productivity with a coding assistant like Copilot at hand, outsiders and non-technical users can tinker with their ideas without having to dig deep into coding paradigms. The advantages are immense. However, widespread adoption of AI-based coding practices can also open doors to vulnerabilities. The New York University researchers recommend that tools like Copilot be paired with appropriate security-aware tooling during both training and generation to minimise the risk of introducing security vulnerabilities.
I have a master’s degree in Robotics and I write about machine learning advancements. email:[email protected]

Copyright Analytics India Magazine Pvt Ltd

