
Reactor Mk.1 performances: MMLU, HumanEval and BBH test results

September 27, 2024
Abstract - This paper presents the performance results of Reactor Mk.1, ARC's flagship large language model, through a benchmarking analysis. The model uses the Lychee AI engine and has fewer than 100 billion parameters, combining efficiency with potency. Reactor Mk.1 outperformed models such as GPT-4o, Claude Opus, and Llama 3, scoring 92% on the MMLU dataset, 91% on the HumanEval dataset, and 88% on the BBH dataset. It excels at both handling difficult tasks and reasoning, establishing itself as a prominent solution in current cutting-edge AI technology.

Index Terms- Benchmark evaluation, BIG-Bench-Hard, HumanEval, Massive Multitask Language Understanding


I. Reactor Mk.1

Reactor Mk.1, developed by ARC [1], is a new AI model for mass adoption built upon Lychee AI, a NASA award-winning AI engine. The model has fewer than 100B parameters in total. ARC's vision for Reactor Mk.1 is to empower everyday users of AI, shaping the future of digital interaction and connectivity. In the long term, ARC plans to support Reactor Mk.1 with educational resources that help users better understand and utilise the full potential of AI technology.

II. Other models

i. GPT-4o

OpenAI has launched GPT-4 Omni (GPT-4o) [2], a new multimodal language model. This model supports real-time conversations, Q&A, and text generation, handling all modalities in a single model to understand and respond to text, image, and audio inputs. One of the main features of GPT-4o is its ability to hold real-time verbal conversations with minimal delay, answer questions using its knowledge base, and perform tasks such as summarizing and generating text. It also processes and responds to combinations of text, audio, and image files.

ii. Claude Opus

Claude Opus [3], created by Anthropic [4], is capable of performing more complex cognitive tasks than simple pattern recognition or text generation. For example, it can analyze static images, handwritten notes, and graphs, and it can generate code for websites in HTML and CSS. Claude can turn images into structured JSON data and debug complex code bases. Additionally, it can translate between various languages in real time, help users practice grammar, and create multilingual content.

iii. Llama 3

Meta Llama 3 [5] is one of the AI assistants designed to help users learn, create content, and connect with others. Two Llama 3 models were released, with 8 billion and 70 billion parameters, supporting a wide range of use cases. Llama 3 demonstrates state-of-the-art performance on industry benchmarks and offers improved reasoning and code generation. It uses a decoder-only transformer architecture with a 128K-token tokenizer vocabulary for efficient language encoding and grouped query attention in both the 8B and 70B sizes for efficient inference. The models are trained on sequences of 8,192 tokens.

iv. Gemini

Gemini [6], introduced by Google, offers different models for various use cases, ranging from data centres to on-device tasks. These models are natively multimodal and capable of understanding and combining text, code, images, audio, and video files. This capability enables the generation of code based on various inputs and the performance of complex reasoning tasks. The new Gemini models, the Gemini Pro and Ultra versions, outperform previous models thanks to pretraining and post-training improvements. Their performance is also superior in reasoning and code generation. Importantly, Gemini models undergo extensive safety testing, including bias assessments, in collaboration with external experts to identify and mitigate potential risks.

v. GPT-3.5

The GPT-3.5 model [7] is designed to understand and generate natural language as well as code. This cost-effective model, featuring 175 billion parameters, is optimized for both chat applications and traditional tasks. As a fine-tuned version of GPT-3, it uses deep learning to produce humanlike text. GPT-3.5 performs well in providing relevant results due to its refined architecture. The latest embedding models, including text-embedding-3-large, text-embedding-3-small, and text-embedding-ada-002, also offer good performance in multilingual retrieval tasks. The text-embedding-3 models allow adjustments to the embedding size through a new dimensions parameter, providing control over cost and performance.

vi. Mistral

On December 11, 2023, Mistral AI [8] released Mixtral 8x7B [9], a Sparse Mixture-of-Experts (SMoE) [10] model with open weights. Mixtral 8x7B demonstrated better performance than Llama 2 70B on most benchmarks and offers six times faster inference. It also compares favourably with GPT-3.5, making it a good choice in terms of cost and performance. Mixtral can handle a context of 32k tokens, shows strong code generation, and can be fine-tuned to follow instructions, achieving a score of 8.3 on MT-Bench. With 46.7 billion total parameters but only 12.9 billion used per token, Mixtral maintains both speed and cost efficiency.
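The gap between total and per-token parameters comes from sparse routing: a gate picks a few experts for each token and only those experts run. The following is a minimal sketch of that idea, not Mistral's implementation. The choice of 8 experts with the top 2 selected, the toy scalar "experts", and the names `top_k_route` and `moe_layer` are assumptions made for illustration.

```python
import math

def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, gate_logits, k=2):
    """Combine the outputs of only the k routed experts for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Eight toy "experts" (hypothetical): each scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

routes = top_k_route(gate_logits)
output = moe_layer(3.0, experts, gate_logits)

# Share of parameters active per token, using the figures quoted in the text.
active_fraction = 12.9 / 46.7
print(routes, output, f"{active_fraction:.1%}")
```

Only two of the eight toy experts are evaluated for the token, which is why per-token compute tracks the active parameter count and not the total.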


Benchmarking of the introduced models will be performed on three globally recognized and widely used datasets for evaluating LLMs: Massive Multitask Language Understanding (MMLU), HumanEval, and BIG-Bench-Hard (BBH).


MMLU [11] was proposed to assess a model's world knowledge and problem-solving ability. It is a benchmark designed to evaluate the multitask accuracy of a language model. The test covers 57 different subjects, including elementary mathematics, US history, computer science, and law. The questions are collected from various sources, such as practice exams, educational materials, and courses, and cover different difficulty levels, from elementary to professional. For example, the "Professional Medicine" task includes questions from medical licensing exams, while "High School Psychology" features questions from Advanced Placement exams. This collection helps measure a model's ability to learn and apply knowledge across different subjects.
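As an illustration of how accuracy across subjects might be aggregated, the sketch below scores multiple-choice predictions per subject and then reports both a pooled (micro) and a per-subject (macro) average. The record layout, the subject identifiers, and the function name `mmlu_accuracy` are assumptions for this example, and the text does not state which averaging scheme underlies the reported scores.

```python
from collections import defaultdict

def mmlu_accuracy(records):
    """Score (subject, predicted_choice, correct_choice) records.

    Returns per-subject accuracy, the micro average over all questions,
    and the macro average over subjects.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subject, predicted, correct in records:
        totals[subject] += 1
        hits[subject] += int(predicted == correct)
    per_subject = {s: hits[s] / totals[s] for s in totals}
    micro = sum(hits.values()) / sum(totals.values())
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro

# Hypothetical predictions for two of the 57 subjects.
records = [
    ("professional_medicine", "B", "B"),
    ("professional_medicine", "C", "A"),
    ("high_school_psychology", "D", "D"),
    ("high_school_psychology", "A", "A"),
    ("high_school_psychology", "B", "B"),
]
per_subject, micro, macro = mmlu_accuracy(records)
print(per_subject, micro, macro)
```

The two averages differ whenever subjects have unequal numbers of questions, so a reported score is easier to compare when the averaging scheme is stated.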

In essence, MMLU tests models in zero-shot and few-shot settings, requiring them to answer questions without additional training. Despite the AI progress witnessed today, even the best models still fall short of expert-level accuracy across all 57 tasks. Additionally, these models commonly perform inconsistently, often failing in areas such as morality and law, where their accuracy is close to random.

II. HumanEval

Chen et al. [12] created HumanEval, an evaluation set for measuring functional correctness in synthesizing programs from docstrings. Codex, the GPT language model introduced in the same work, achieved a 28.8% success rate on this set, while GPT-3 solved almost none of the problems and GPT-J solved 11.4%. HumanEval is widely applied in machine learning, particularly within the domain of LLMs. It presents programming challenges that models solve by generating code from docstrings, and it assesses the functional correctness of that code. Evaluation relies on the code's ability to pass the provided unit tests. The dataset also serves as a benchmark for comparing different LLMs on code generation tasks, providing a standardized set for performance evaluations. In addition, HumanEval has promoted the use of evaluation metrics such as pass@k, which offer additional assessments of a model's ability to solve programming challenges.
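For reference, pass@k can be estimated from n sampled solutions per problem, c of which pass all unit tests, as 1 - C(n-c, k) / C(n, k). This is the unbiased estimator described in the Codex paper. The sketch below implements it; the function names and the sample counts are illustrative assumptions, and the text does not say how many samples were drawn for the scores it reports.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which pass all unit tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random draw of k samples contains no passing one
    # is C(n - c, k) / C(n, k); comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_problem, k):
    """Average pass@k over problems given (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

# Hypothetical results: 10 samples per problem with 3, 0, and 10 passing.
per_problem = [(10, 3), (10, 0), (10, 10)]
p1 = mean_pass_at_k(per_problem, k=1)
p5 = mean_pass_at_k(per_problem, k=5)
print(p1, p5)
```

With k=1 the estimator reduces to the fraction of samples that pass, averaged over problems.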


The BBH [13] dataset is a subset of the BIG-Bench benchmark, which is designed to evaluate the capabilities of LLMs across various domains, including traditional NLP, mathematics, and commonsense reasoning. BIG-Bench encompasses more than 200 tasks that aim to push the limits of current language models. BBH specifically targets 23 unsolved tasks, identified using criteria such as having no more than three subtasks and a minimum of 103 examples.

To assess BBH results accurately, multiple-choice and exact-match evaluation metrics were employed. Analysis of BBH data revealed significant performance improvements with the application of chain-of-thought (CoT) prompting. For instance, CoT prompting enabled the PaLM model to surpass average human-rater performance on 10 out of the 23 tasks, while Codex (code-davinci-002) exceeded human-rater performance on 17 out of the 23 tasks. This enhancement is attributed to CoT prompting's ability to guide models through multi-step reasoning processes, which are essential for tackling complex tasks.
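Exact-match scoring of CoT output needs the final answer separated from the reasoning that precedes it. The sketch below assumes completions end with a phrase of the form "So the answer is X."; the regular expression, the function names, and the sample completions are illustrative assumptions, not the evaluation harness used for the reported results.

```python
import re

# Assumed answer format: "... the answer is X." at the end of the completion.
ANSWER_PATTERN = re.compile(r"the answer is\s*(.+?)\.?\s*$", re.IGNORECASE)

def extract_answer(completion):
    """Pull the final answer out of a chain-of-thought completion."""
    text = completion.strip()
    match = ANSWER_PATTERN.search(text)
    # Fall back to the whole completion when no answer phrase is found.
    return match.group(1).strip() if match else text

def exact_match(completions, targets):
    """Fraction of completions whose extracted answer equals the target."""
    hits = sum(extract_answer(c) == t for c, t in zip(completions, targets))
    return hits / len(targets)

# Hypothetical model outputs and gold answers.
completions = [
    "Let's think step by step.\nThe red ball ends with Bob. So the answer is (B).",
    "The statement holds. So the answer is True.",
    "So the answer is (A).",
]
targets = ["(B)", "False", "(A)"]
score = exact_match(completions, targets)
print(score)
```

Because the comparison is strict string equality, formatting differences such as "B" versus "(B)" count as errors unless answers are normalized first.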


When the models were tested on the described datasets (Table 1), Reactor Mk.1 achieved strong benchmark scores: 92% on the MMLU benchmark, 91% on HumanEval, and 88% on the BBH evaluation.


In comparison to the other models presented in Table 1, Reactor Mk.1 holds a superior position in several of the analysed categories. For instance, on the MMLU benchmark, Reactor Mk.1's score of 92% surpasses that of OpenAI's GPT-4o, which scored 88.7%, and significantly outperforms other models such as Anthropic's Claude Opus and Meta's Llama 3, which scored 86.8% and 86.1%, respectively. Google Gemini and OpenAI GPT-3.5 were further behind, scoring 81.9% and 70%, respectively.

On the HumanEval benchmark, which assesses code generation capabilities, Reactor Mk.1 achieved a 91% score, outperforming all compared models. OpenAI's GPT-4o was close behind with a score of 90.2%, followed by Anthropic's Claude Opus at 84.9% and Meta's Llama 3 at 84.1%. Google Gemini and OpenAI GPT-3.5 scored 71.9% and 48.1%, respectively, indicating a significant performance gap.
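The size of these gaps can be checked with simple arithmetic. The sketch below computes, in percentage points, how far each compared model trails Reactor Mk.1 on MMLU and HumanEval, using only the scores quoted in the text. The dictionary layout, the shortened model labels, and the helper name `margins` are assumptions for illustration.

```python
# Scores in percent, as quoted in the text; model labels are shortened.
SCORES = {
    "MMLU": {
        "Reactor Mk.1": 92.0, "GPT-4o": 88.7, "Claude Opus": 86.8,
        "Llama 3": 86.1, "Gemini": 81.9, "GPT-3.5": 70.0,
    },
    "HumanEval": {
        "Reactor Mk.1": 91.0, "GPT-4o": 90.2, "Claude Opus": 84.9,
        "Llama 3": 84.1, "Gemini": 71.9, "GPT-3.5": 48.1,
    },
}

def margins(scores, reference="Reactor Mk.1"):
    """Percentage-point lead of the reference model over each other model."""
    ref = scores[reference]
    return {model: round(ref - score, 1)
            for model, score in scores.items() if model != reference}

mmlu_margins = margins(SCORES["MMLU"])
humaneval_margins = margins(SCORES["HumanEval"])

for benchmark, result in (("MMLU", mmlu_margins), ("HumanEval", humaneval_margins)):
    for model, gap in result.items():
        print(f"{benchmark:10s} {model:12s} +{gap:.1f} pp")
```

On these figures the lead over the closest model is 3.3 points on MMLU and 0.8 points on HumanEval, so the HumanEval ranking is the more sensitive of the two to small changes in evaluation setup.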

For the BBH evaluation, which focuses on challenging tasks that require complex reasoning, Reactor Mk.1 achieved an 88% score. This result demonstrates Reactor Mk.1's strong capability in reasoning and in handling language understanding tasks.

The dramatic lead achieved by the ARC Reactor Mk.1, especially its score of 92% on MMLU, underscores our significant advancements. Remarkably, these results were accomplished with a handful of GPUs, highlighting the efficiency and power of our model compared to the more resource-intensive approaches used by other leading models. The benchmark scores indicate that the ARC Reactor Mk.1 not only excels at understanding and generating code but also demonstrates strong performance in reasoning and in handling challenging language tasks. These results position the ARC Reactor Mk.1 as a leading model in the current state of the art of AI technology.


This article aims to concisely present the performance of the Reactor Mk.1 AI model when tested on three popular datasets: MMLU, HumanEval, and BBH. In summary, the model achieved a 92% score on the MMLU dataset, a 91% score on HumanEval, and an 88% score on the BBH evaluation. To demonstrate the significance of these results, other popular models such as GPT-4o, Claude, Llama 3, Gemini, and Mistral are used as benchmark models. Reactor Mk.1 exhibited superior performance compared to the benchmark models, establishing itself as a leader in solving various LLM tasks and complex problems.


