OpenAI has launched its latest model, GPT-4, which is a large multimodal model capable of accepting both image and text inputs and producing text outputs. The company claims that the model exhibits human-level performance on various professional and academic benchmarks, making it a significant improvement over its predecessor, GPT-3.5.
GPT-4 is now available via OpenAI’s API with a waitlist, and for ChatGPT Plus, OpenAI’s premium plan for ChatGPT, its AI-powered chatbot.
GPT-4 is capable of accepting both text and image inputs, a major improvement over its predecessor, GPT-3.5, which only accepted text. GPT-4 has already demonstrated human-like performance on various professional and academic benchmarks.
According to OpenAI, GPT-4 passes a simulated bar exam with a score in the top 10% of test takers, which is a considerable improvement over GPT-3.5, which had a score around the bottom 10%.
One of the more interesting aspects of GPT-4 is its ability to understand images as well as text. GPT-4 can caption and interpret relatively complex images, for example, identifying a Lightning Cable adapter from a picture of a plugged-in iPhone. This capability is being tested with a single partner, Be My Eyes, which is using it to develop a new Virtual Volunteer feature. This feature can answer questions about images sent to it, such as identifying the contents of a user’s refrigerator and suggesting recipes based on those ingredients.
OpenAI has also introduced a new API capability, “system” messages, that allows developers to prescribe style and task by describing specific directions. System messages are essentially instructions that set the tone and establish boundaries for the AI’s next interactions. This capability will also come to ChatGPT in the future.
Despite these upgrades, OpenAI acknowledges that GPT-4 is not perfect. It still “hallucinates” facts and makes reasoning errors, sometimes with great confidence. For example, in one instance, GPT-4 described Elvis Presley as the “son of an actor,” a blatant misstep. However, OpenAI notes that GPT-4 is 82% less likely to respond to requests for “disallowed” content compared to GPT-3.5 and responds to sensitive requests, such as medical advice and anything pertaining to self-harm, in accordance with OpenAI’s policies 29% more often.
Microsoft has confirmed that Bing Chat, its chatbot tech co-developed with OpenAI, is running on GPT-4. Other early adopters include Stripe, which is using GPT-4 to scan business websites and deliver a summary to customer support staff, and Duolingo, which has built GPT-4 into a new language learning subscription tier.
OpenAI spent six months iteratively aligning GPT-4 using lessons from an adversarial testing program as well as ChatGPT, resulting in “best-ever results” on factuality, steerability, and refusing to go outside of guardrails, according to the company. The improvements are expected to allow GPT-4 to handle much more nuanced instructions than its predecessor, GPT-3.5.
The company has rebuilt its entire deep learning stack over the past two years and co-designed a supercomputer with Azure from the ground up for its workload. A year ago, they trained GPT-3.5 as a test run of the system, and the bugs were fixed, and theoretical foundations were improved.
OpenAI has released GPT-4’s text input capability via its ChatGPT system and API, but the image input capability will be made available to a single partner before being rolled out more widely. The company is also open-sourcing its OpenAI Evals framework, which automates the evaluation of AI model performance, to allow anyone to report shortcomings in OpenAI’s models and help guide further improvements.
GPT-4 was found to be more reliable, creative, and capable of handling much more nuanced instructions than its predecessor. In tests, OpenAI found that the difference between GPT-3.5 and GPT-4 became apparent when the complexity of the task reached a sufficient threshold. OpenAI tested the models on various benchmarks, including simulating exams originally designed for humans, using the most recent publicly-available tests or purchasing practice exams for the 2022-2023 period.
Additionally, it is important to note that GPT-4 still has limitations in terms of its knowledge base. The model does not have access to information beyond September 2021, which means it may not be able to provide accurate or up-to-date responses to questions that require knowledge of recent events.

