An Introduction to Open-Source Language Models Part I: The Case for Open-Source GPT

While Generative Pre-trained Transformer (GPT) models have existed for a few years, the launch of ChatGPT in late November 2022 gained massive traction and kicked off an arms race, with every major tech company wanting in. Unlike GPT 1 and GPT 2, GPT 3 and later versions were released as closed source, which poses several problems for smaller organizations that want to take part in the revolution and improve their product offering. In this series of posts, we share some of our research and hands-on experience with these models.

The Case for Open-Source GPT

OpenAI does provide an API to its newer closed-source GPT models, allowing anyone to use them on a pay-per-use basis (with the exception of GPT 4 for now). However, this approach is not without issues.

From an industry standpoint, the choice to release GPT as closed source is often criticized by developers and researchers, who raise concerns about the competitive disadvantage for smaller companies and organizations that may not be able to pay for API access or develop their own LLMs. It also makes it difficult to verify OpenAI’s claims about the quality and biases of the underlying data and training process (in fact, the GPT models have been criticized for bias in the past).

From a technical standpoint, OpenAI’s pay-per-use API licensing model may in some cases incur greater bills than deploying a large language model on your choice of cloud provider or even on-prem. Moreover, using the OpenAI API involves sending client data to be processed on the OpenAI servers, which has significant privacy & compliance implications. In fact, OpenAI states that data that goes through the API is saved for 30 days for “abuse and misuse monitoring purposes”.
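To make the cost argument concrete, here is a minimal back-of-the-envelope sketch of the break-even point between a pay-per-use API and a self-hosted model. The function names and both prices are hypothetical placeholders chosen for illustration, not real quotes from OpenAI or any cloud provider. The sketch also ignores engineering time, the throughput limit of a single machine, and differences in model quality.

```python
def api_monthly_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Pay-per-use: the bill grows linearly with the number of tokens processed."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def self_hosted_monthly_cost(instance_price_per_hour: float,
                             hours_per_month: float = 730.0) -> float:
    """Self-hosting: a roughly flat cost for keeping one inference machine running."""
    return instance_price_per_hour * hours_per_month


def break_even_tokens(price_per_1k_tokens: float,
                      instance_price_per_hour: float,
                      hours_per_month: float = 730.0) -> float:
    """Monthly token volume above which the flat self-hosted cost is the cheaper one."""
    flat_cost = self_hosted_monthly_cost(instance_price_per_hour, hours_per_month)
    return flat_cost / price_per_1k_tokens * 1000


# Hypothetical prices for illustration only (assumptions, not real quotes).
API_PRICE_PER_1K = 0.02    # dollars per 1,000 tokens
GPU_PRICE_PER_HOUR = 1.50  # dollars per hour for one GPU instance

threshold = break_even_tokens(API_PRICE_PER_1K, GPU_PRICE_PER_HOUR)
print(f"Self-hosting is cheaper above roughly {threshold:,.0f} tokens per month")
```

With these made-up prices, the flat monthly cost of the machine is matched at a few tens of millions of tokens per month. Below that volume the API is the cheaper option, which is why the comparison only favors self-hosting in some cases.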

Open-Source GPT and Research at DoControl

We primarily use open-source LLMs with zero-shot and few-shot learning to tag data, which we then use to train and prototype our machine-learning algorithms. LLM-based data annotation is often cheaper and faster than human annotation (though not always of the same quality). Doing this with an open-source model means we can avoid sharing sensitive information with third parties.

This approach allows us to iterate and deliver insights and security alerts to our clients faster.
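As a rough illustration of zero-shot and few-shot tagging, the sketch below builds a classification prompt and parses the model's answer back into a label. It is a simplified example, not our production pipeline: the label set, the example texts, and the helper names `build_prompt` and `parse_label` are all hypothetical. The call to the model itself is left out on purpose, since it depends on which open-source model and serving stack you use.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(text: str,
                 labels: Sequence[str],
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a tagging prompt: zero-shot with no examples, few-shot otherwise."""
    lines = [f"Classify the text into one of: {', '.join(labels)}.", ""]
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Label: {example_label}", ""]
    # The prompt ends with an open "Label:" so the model completes it.
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)


def parse_label(completion: str, labels: Sequence[str]) -> Optional[str]:
    """Map the first line of the raw completion to a known label, else None."""
    stripped = completion.strip()
    first_line = stripped.splitlines()[0].strip().lower() if stripped else ""
    for label in labels:
        if first_line == label.lower():
            return label
    return None


# Hypothetical labels and examples, for illustration only.
LABELS = ["sensitive", "not sensitive"]
EXAMPLES = [
    ("Q3 payroll export with employee salaries", "sensitive"),
    ("Lunch menu for the office party", "not sensitive"),
]

prompt = build_prompt("Customer list with credit card numbers", LABELS, EXAMPLES)
print(prompt)
```

Passing an empty `examples` list turns the same prompt into a zero-shot one. Returning `None` for unrecognized completions makes it easy to route those items to a human annotator instead of silently accepting a bad tag.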

Understanding the Technology and History Behind GPT

The world of open-source large language models offers many different, overlapping options, which can be overwhelming at first. To make sense of it, we should first understand the technology behind these models and their history, in which OpenAI has been the driving force.

Understanding Model Size (Parameter Count)

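For an LLM to process natural language, the text is first split into tokens and mapped to numbers. The model’s parameters are the numerical weights, learned during training, that turn those numbers into a prediction. The parameter count is the usual measure of model size.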

I like to compare the number of parameters of a large language model to the number of bits in a traditional computer processor. The more bits a CPU has, the more sophisticated input it can process at once. 

However, the number of bits is not everything: the output and performance of the CPU are heavily influenced by the algorithm being run and by how that algorithm is compiled. In the LLM world, the counterparts of the algorithm and its compilation are the model architecture and the training process, which influence LLM performance to a high degree.
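To show where a parameter count comes from, here is a small sketch that adds up the weights and biases of a GPT-2-style decoder-only Transformer from its hyperparameters. It assumes the publicly documented GPT-2 layout (learned position embeddings, a feed-forward block four times the model width, and an output layer that shares weights with the token embedding). The function name is our own, and other model families will differ in the details.

```python
def gpt2_style_param_count(n_layer: int,
                           d_model: int,
                           vocab_size: int = 50257,
                           n_ctx: int = 1024) -> int:
    """Count weights and biases of a GPT-2-style decoder-only Transformer."""
    # Embedding tables: one vector per token and one per position.
    token_embedding = vocab_size * d_model
    position_embedding = n_ctx * d_model

    # Self-attention: fused query/key/value projection plus output projection.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward block: expand to 4 * d_model, then project back.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + mlp + layer_norms

    final_layer_norm = 2 * d_model
    # The output layer reuses the token embedding, so it adds no parameters.
    return token_embedding + position_embedding + n_layer * per_layer + final_layer_norm


small = gpt2_style_param_count(n_layer=12, d_model=768)   # smallest GPT 2 configuration
xl = gpt2_style_param_count(n_layer=48, d_model=1600)     # largest GPT 2 configuration
print(f"GPT 2 small: {small:,} parameters")
print(f"GPT 2 XL:    {xl:,} parameters")
```

The two configurations come out at about 124M and 1.56B parameters respectively. Almost all of the growth comes from the `d_model * d_model` terms inside each layer, which is why making a model wider and deeper inflates the count so quickly.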

A Brief History of GPT

Before discussing the open-source GPT offerings, we should briefly go over the evolution of OpenAI’s GPT and its timeline. OpenAI released the first Generative Pre-trained Transformer large language model (LLM) in 2018, with a parameter count of ~117M. GPT was not the first language model, nor did it invent the now-popular Transformer architecture and its self-attention mechanism, which Google researchers introduced in 2017. It did, however, popularize applying that architecture to generative pre-training on a large unlabeled text corpus (BooksCorpus, a collection of several thousand unpublished books). It was also novel in concept: GPT was trained to complete sentences and generate new text, while older language models were trained for more specific tasks (translation, sentiment analysis, etc.).

In February 2019, OpenAI released GPT 2. By tuning the architecture, increasing the number of parameters from ~117M to 1.5B, and training on a then-huge 40GB dataset of web text, GPT 2 became a straight-up improvement over GPT 1, producing more coherent text and handling more tasks.

GPT 3 was released in 2020 and was the first GPT version not to be released as open source. OpenAI’s decision is likely due to both ethical concerns (misuse of the technology) and monetization intentions. GPT 3 came in several model sizes, with parameter counts ranging from ~125M to ~175B.

In early 2022, two important things happened. In January, OpenAI released InstructGPT: models fine-tuned to follow human instructions. The idea is similar to the instruction tuning in Google’s FLAN paper, which had come out a few months earlier. InstructGPT made the model more controllable: instead of a model that just generates text statistically, the instruction prefix helps control how the resulting text is generated. In March, OpenAI released GPT 3.5, an improved version of GPT 3 trained on newer data. In November 2022, building on InstructGPT and GPT 3.5, OpenAI released ChatGPT. All hell broke loose: LLM, GPT, and AI became everybody’s favorite buzzwords.

On March 14, 2023, OpenAI released GPT 4. The new multimodal model accepts images as input and is advertised as even more reliable and nuanced than GPT 3.5: it passes standardized tests such as the SAT and the Uniform Bar Exam, and exceeds the passing score on the United States Medical Licensing Examination. GPT 4 is the most “closed” version of GPT to date. The API is currently in limited beta, and ChatGPT with GPT 4 is available to ChatGPT Plus subscribers with a usage cap. Many technical aspects, such as the training method, dataset construction, and even the model size, remain undisclosed.

Stay Tuned for Part II

In the next post, we will go through a brief history and overview of the open-source GPT models of the post-GPT 3 era, based on our hands-on experience and research. As a spoiler, here are a few models you may want to look up and try out: GPT-J-6B, GPT-NeoX-20B, LLaMA, Alpaca, Dolly, and Flan-UL2 (20B).

Read Part II here.
