This article discusses the evolution of Large Language Models (LLMs) for code, from RNNs to Transformers. It covers the development of models like Code2Vec, CodeBERT, Codex, CodeT5, PLBART, and the latest model, Code Llama. These models have advanced code understanding and generation tasks, improving programming efficiency.
How Code LLMs progressed from RNNs to Transformers
Introduction
Recent years have seen a remarkable evolution of language models with the introduction of Transformers, which have revolutionized the way we perform daily tasks like writing emails, creating documentation, searching the web, and even the way we code. With researchers applying Large Language Models to code intelligence tasks, a new field of Neural Code Intelligence has emerged. This domain aims to improve programming efficiency and minimize human errors in the software industry by solving tasks like code summarization, generation, and translation.
With the latest release of Code Llama, the state-of-the-art model from Meta AI for code generation and understanding, this article looks back at the evolution of Large Language Models (LLMs) for code, from RNNs to Transformers.
Code2Vec, 2018
Code2Vec was one of the first attempts to have a language model understand code. It aimed to represent code snippets as embeddings that capture semantic and structural information from the code, making them useful for various software engineering tasks such as code classification, retrieval, and understanding.
Training Set: 14M Java Program Examples
Model Architecture: RNN + Feed-Forward Network
Novelty:
– Path-based Attention Model: The authors propose a novel neural network architecture that uses syntactic paths in the Abstract Syntax Tree (AST) of a code snippet as input features. The model learns to assign different attention weights to each path, and to aggregate them into a single code vector.
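The core of this idea is simple to sketch: embed each AST path-context, score it against a learned attention vector, and take the attention-weighted sum as the code embedding. The snippet below is a minimal illustrative sketch in NumPy, not the authors' implementation; the array names, sizes, and random values are assumptions.

```python
import numpy as np

# Minimal sketch of Code2Vec-style path attention (illustrative, not the authors' code).
# Assumes each AST path-context has already been embedded; names and sizes are hypothetical.
rng = np.random.default_rng(0)
num_paths, dim = 5, 8

path_context_vectors = rng.normal(size=(num_paths, dim))  # one row per AST path-context
attention_vector = rng.normal(size=dim)                   # learned global attention parameter

# Score each path against the attention vector, then softmax over paths.
scores = path_context_vectors @ attention_vector
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The code vector is the attention-weighted sum of the path-context vectors.
code_vector = weights @ path_context_vectors
print(weights.round(3), code_vector.shape)
```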
CodeBERT, 2020
CodeBERT, developed by the Microsoft Research team, represents a significant advancement in the realm of Large Language Models (LLMs) for code by introducing multimodal pre-training that combines Natural Language and Programming Language (NL + PL) on the Transformer-based BERT model. The model is trained on a diverse dataset comprising both bimodal data-point pairs and unimodal data points for the Masked Language Modeling (MLM) and Replaced Token Detection (RTD) tasks.
Training Dataset: CodeSearchNet dataset: 2.1M bimodal data points (NL + PL) and 6.4M unimodal data points across 6 languages (Python, Java, JavaScript, PHP, Ruby, Go)
Parameter Size: 125M
Model Architecture: RoBERTa-base
Novelty:
– Bimodal Training: CodeBERT introduces an innovative training approach that encompasses both Natural Language and Programming Language tokens.
– Replaced Token Detection (RTD) task for code: CodeBERT pre-training used Replaced Token Detection (RTD) instead of Next Sentence Prediction (NSP), which showed superior performance. A sketch of both objectives follows below.
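To make the two objectives concrete, the toy example below builds a bimodal NL + PL input, masks some tokens for MLM, and labels replaced tokens for RTD. It is a simplified sketch: the special tokens, masking rate, and the hand-picked replacement are assumptions, not CodeBERT's actual tokenizer or generator.

```python
import random

# Toy sketch of CodeBERT-style bimodal input with MLM masking and RTD labels.
# Special tokens, masking rate, and the replacement step are simplified assumptions.
random.seed(0)

nl_tokens = ["return", "the", "maximum", "of", "two", "numbers"]
pl_tokens = ["def", "max2", "(", "a", ",", "b", ")", ":",
             "return", "a", "if", "a", ">", "b", "else", "b"]
tokens = ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[SEP]"]

# Masked Language Modeling: hide ~15% of ordinary tokens; the model must recover them.
mlm_input = [t if t.startswith("[") or random.random() > 0.15 else "[MASK]" for t in tokens]

# Replaced Token Detection: a small generator swaps in plausible tokens; the model
# predicts, per position, whether each token is original (0) or replaced (1).
corrupted = list(tokens)
corrupted[tokens.index("max2")] = "min2"   # hypothetical replacement of the function name
rtd_labels = [int(a != b) for a, b in zip(tokens, corrupted)]

print(mlm_input)
print(rtd_labels)
```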
Codex, 2021
Codex was one of the first successful Code LLMs to generate code from doc-strings or natural-language prompts with high accuracy, and it is the predecessor of the widely used GitHub Copilot. Developed by the OpenAI team, Codex uses the GPT-3 architecture and tokenizer, and is pre-trained on a large corpus of GitHub code. This Large Language Model has 12B parameters and was a state-of-the-art model in 2021.
Training Dataset: 159GB of Python files from 54M GitHub repositories.
Parameter Size: 12B (Codex- 12B)
Model Architecture: GPT3
Novelty:
– One of the first successful models to excel at writing code from natural-language prompts.
– The authors also created a new dataset, “HumanEval”, to benchmark models on code-generation tasks by checking generated programs against unit tests; see the pass@k sketch below.
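Alongside HumanEval, the Codex paper reports results with the pass@k metric using an unbiased, numerically stable estimator: generate n samples per problem, count the c samples that pass the unit tests, and compute 1 - C(n-c, k)/C(n, k). A small sketch of that estimator follows; the example numbers are made up.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator used in the Codex paper.
    n = samples generated per problem, c = samples passing the unit tests, k = budget."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples for one problem, 30 of them pass the tests.
print(round(pass_at_k(n=200, c=30, k=1), 3))    # 0.15, i.e. c / n
print(round(pass_at_k(n=200, c=30, k=10), 3))   # higher, since any of 10 tries may pass
```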
CodeT5, 2021
CodeT5 is an encoder-decoder model based on the T5 architecture, distinct from both CodeBERT (encoder-only) and Codex (decoder-only). It introduces a unique identifier-aware denoising pre-training task that helps the model distinguish and recover identifiers in code, enhancing its understanding of code structure.
Training Dataset: CodeSearchNet dataset (same as CodeBERT)
Parameter Size: 220M
Model Architecture: T5 (Encoder-Decoder Architecture)
Novelty:
– Encoder-Decoder Model: One of the first encoder-decoder Code LLMs to support both code-understanding and code-generation tasks.
– Proposes a novel pre-training objective, identifier-aware denoising, which learns token-type information and the structure of the code; see the sketch below.
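A toy illustration of the identifier-aware part: tag which tokens are identifiers, and mask every occurrence of each identifier with a shared sentinel that the decoder must recover. This is a simplified sketch; the hand-written identifier set, sentinel format, and target layout are assumptions rather than CodeT5's exact preprocessing.

```python
# Toy sketch of CodeT5-style identifier-aware denoising on a tiny snippet.
# The identifier set, sentinel format, and target layout are simplified assumptions.
code_tokens = ["def", "add", "(", "x", ",", "y", ")", ":", "return", "x", "+", "y"]
identifiers = {"add", "x", "y"}            # would normally come from the parser / AST

# Identifier tagging (IT): a binary label per token - is it an identifier?
it_labels = [int(tok in identifiers) for tok in code_tokens]

# Masked identifier prediction: replace every occurrence of an identifier with a
# shared sentinel and train the decoder to recover the sentinel-to-name mapping.
sentinels = {name: f"<extra_id_{i}>" for i, name in enumerate(sorted(identifiers))}
masked = [sentinels.get(tok, tok) for tok in code_tokens]
target = " ".join(f"{s} {name}" for name, s in sentinels.items())

print(it_labels)            # [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
print(" ".join(masked))
print(target)
```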
PLBART, 2021
PLBART (Program and Language BART) leverages the BART model architecture to automate a range of software engineering tasks, encompassing code summarization, generation, and translation, under the umbrella of PLUG (Program and Language Understanding and Generation).
Training Dataset: 2M Java and Python functions and their natural-language descriptions, collected from GitHub and Stack Overflow.
Parameter Size: 140M (6 encoder layers + 6 decoder layers + an additional norm layer on the encoder and decoder)
Model Architecture: BART
Novelty:
– Denoising Auto-encoder Approach: Employs a denoising auto-encoder approach that enhances code understanding and generation by pairing a bidirectional encoder with an auto-regressive decoder, combining the strengths of BERT- and GPT-style models.
– Diverse Noising Strategies: Proposes multiple noising strategies, such as token masking, token deletion, and token infilling (sketched below).
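The three noising functions are easy to picture on a toy token sequence. The sketch below is illustrative only; the masking ratio, span choice, and [MASK] symbol are assumptions rather than PLBART's exact hyperparameters.

```python
import random

# Toy sketch of the three noising strategies (token masking, deletion, infilling)
# applied to a small token list. Ratios and the [MASK] symbol are assumptions.
random.seed(0)
tokens = ["def", "square", "(", "n", ")", ":", "return", "n", "*", "n"]

def token_masking(toks, ratio=0.3):
    """Replace a random fraction of tokens with [MASK]."""
    idx = set(random.sample(range(len(toks)), int(len(toks) * ratio)))
    return [("[MASK]" if i in idx else t) for i, t in enumerate(toks)]

def token_deletion(toks, ratio=0.3):
    """Delete a random fraction of tokens; the model must decide where text is missing."""
    idx = set(random.sample(range(len(toks)), int(len(toks) * ratio)))
    return [t for i, t in enumerate(toks) if i not in idx]

def token_infilling(toks, start=2, length=3):
    """Replace a contiguous span with a single [MASK]; the model must infer its length."""
    return toks[:start] + ["[MASK]"] + toks[start + length:]

print(token_masking(tokens))
print(token_deletion(tokens))
print(token_infilling(tokens))
```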
Code Llama, 2023
Code Llama is the latest Code LLM, released by Meta, and it beats the existing open-source models on several benchmark datasets. It scores 53% on the HumanEval dataset and 55% on the MBPP dataset. These gains can be attributed to a longer context length and to training the pre-trained Llama 2 on additional program and natural-language tokens.
Training Dataset: 500B tokens of publicly available code, plus an additional 100B tokens for Code Llama Python
Model Architecture: Llama 2
Parameter Size: Available in 3 sizes — 7B, 13B and 34B.
Novelty:
– Long Context Fine-Tuning: Proposes a dedicated fine-tuning step to handle long input sequences.
– Instruction Fine-Tuning & Self-Instruct: Performs instruction fine-tuning, which uses explicit instructions or prompts during the fine-tuning process, combined with self-instruct data that the model generates for itself. A usage sketch follows below.
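For readers who want to try the model, a minimal completion sketch with the Hugging Face transformers library is shown below. The checkpoint name "codellama/CodeLlama-7b-Python-hf" is an assumption about the published Hub ID, and the prompt and generation settings are illustrative defaults.

```python
# Minimal sketch of prompting a Code Llama checkpoint for code completion via the
# Hugging Face transformers library. The Hub ID below is an assumption; swap in
# whichever Code Llama variant (base, Python, or Instruct) you actually use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Python-hf"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = '''def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number."""
'''
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```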
Conclusion
Transformers have revolutionized the field of Large Language Models for Code, enabling advancements in code understanding, generation, and translation. These models have the potential to redefine how we code as software engineers, improving efficiency and reducing errors. To stay competitive and leverage the power of AI, companies should consider implementing AI solutions like Code LLMs gradually, starting with pilot projects and expanding usage based on measurable impacts on business outcomes.