Technology

Meta got caught gaming AI benchmarks

Published

April 8, 2025

Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”

Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s Elo score of 1417, which placed it above OpenAI’s 4o and just under Gemini 2.5 Pro. (A higher Elo score means the model wins more often in the arena when going head-to-head with competitors.)
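To make the link between a rating and a win rate concrete, here is a minimal sketch of the classic Elo expected-score formula. It is an illustration only: LMArena’s exact rating method may differ, the function name is made up for this example, and the opponent rating of 1400 is a hypothetical value, not a figure from the leaderboard.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A against B under the classic Elo model."""
    # A 400-point gap corresponds to 10-to-1 odds in favor of the higher-rated side.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1417 is the score cited above; 1400 is a hypothetical opponent rating.
p = elo_expected_score(1417, 1400)
print(f"Expected win rate: {p:.3f}")
```

Under this formula, a gap of a few points translates into a win rate only slightly above 50 percent, which is why small differences near the top of the leaderboard say little on their own.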

The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta’s documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” TechCrunch first reported.

“Meta’s interpretation of our policy did not match what we expect from model providers,” LMArena posted on X two days after the model’s release. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”

A spokesperson for Meta, Ashley Gabriel, said in an emailed statement that “we experiment with all types of custom variants.”

“‘Llama-4-Maverick-03-26-Experimental’ is a chat optimized version we experimented with that also performs well on LMArena,” Gabriel said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback.”

While what Meta did with Maverick isn’t explicitly against LMArena’s rules, the site has shared concerns about gaming the system and taken steps to “prevent overfitting and benchmark leakage.” When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena’s become less meaningful as indicators of real-world performance.

“It’s the most widely respected general benchmark because all of the other ones suck,” independent AI researcher Simon Willison tells The Verge. “When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro — that really impressed me, and I’m kicking myself for not reading the small print.”

Shortly after Meta released Maverick and Scout, the AI community started talking about a rumor that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. VP of generative AI at Meta, Ahmad Al-Dahle, addressed the accusations in a post on X: “We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”

Some also noticed that Llama 4 was released at an odd time. Saturday doesn’t tend to be when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg replied: “That’s when it was ready.”

“It’s a very confusing release generally,” says Willison, who closely follows and documents AI models. “The model score that we got there is completely worthless to me. I can’t even use the model that they got a high score on.”

Meta’s path to releasing Llama 4 wasn’t exactly smooth. According to a recent report from The Information, the company repeatedly pushed back the launch due to the model failing to meet internal expectations. Those expectations are especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated a ton of buzz.

Ultimately, using an optimized model in LMArena puts developers in a difficult position. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as is the case for Maverick, those benchmarks can reflect capabilities that aren’t actually available in the models that the public can access.

As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds. It also shows how Meta is eager to be seen as an AI leader, even if that means gaming the system.

Update, April 7th: The story was updated to add Meta’s statement.

Technology

Tesla driver faces manslaughter charges over Texas crash that killed a woman inside her home

Published

July 2, 2026

On the video, I saw BUTLER’s Tesla continue to increase in speed, and saw the amount of pressure being applied to the accelerator pedal also increase in speed. In about six (6) seconds, the accelerator pedal was pressed all the way down to 100%, “pedal to the metal,” and the vehicle reached a speed of 73 miles per hour, more than double the speed limit on that residential street. The Tesla continued straight towards the middle of the cul-de-sac, struck the curb of the complainant’s driveway, and went airborne towards the front of the home… I noted that the brake pedal was never pressed in the final minute before the crash.”

Technology

Fox News AI Newsletter: American manufacturer says AI is creating jobs, not replacing them

Published

July 2, 2026

Welcome to Fox News’ Artificial Intelligence newsletter with the latest AI technology advancements.

IN TODAY’S NEWSLETTER:

– One of America’s oldest manufacturers says AI is creating jobs — not replacing them

– A missing kitten rode under a car hood. AI brought her home

– Trump says Taiwan is doubling the size of chipmaking plant in Arizona

DOMESTIC OUTPUT: Before Henry Ford rolled out the Model T, before the Wright brothers took to the skies and before the Statue of Liberty welcomed millions to America’s shores, Corning was already charting a course of innovation that continues today.

A Corning employee handles optical fiber as part of the manufacturing process that supports broadband and telecommunications infrastructure. (Courtesy of Corning)

DIGITAL RESCUE: Ame thought Lucy might be hiding upstairs. The family’s kitten had missed dinner, which felt odd. Still, cats hide. They nap in strange places. Sometimes, they ignore everyone.

MANUFACTURING PUSH: President Donald Trump on Wednesday said that Taiwan is doubling the size of the chipmaking plants under construction in Arizona, adding that it could help the U.S. share of the chip market rise to 50% by the end of his term.

LICENSED TO AI: The Trump administration has lifted export restrictions on two of Anthropic’s latest artificial intelligence models after the company worked with the Commerce Department on a national security review, according to statements released Tuesday.

SHIFTING GEARS: Ford has rehired experienced human engineers to help address the shortcomings of artificial intelligence (AI) tools meant to tackle quality issues in the automaker’s production processes.

PULSE CHECK: A routine heart test may be hiding a warning sign that doctors have missed for years. That is the big takeaway from new UC Berkeley research published in Nature. Researchers trained an artificial intelligence model to study ECGs, also called EKGs, and look for patterns tied to sudden cardiac death.

For participants under 65, an increase in the pulse pressure-heart rate index was associated with a 76% higher risk of developing dementia. (iStock)

NEW ERA: A new report is pushing back on artificial intelligence “doomsday” fears, arguing the technology could unleash one of the biggest productivity booms in American history — unless Washington slows it down with premature regulation.

REIN IN GLOOM: A Nobel Prize-winning economist has warned that persistent predictions of artificial intelligence destroying the job market could become a self-fulfilling prophecy. Robert Shiller, who shared the 2013 Nobel Prize in economics for his work on asset prices, wrote a guest essay on Monday in The New York Times that argued the panic over AI is not a new sociological phenomenon.

RAMAGEDDON ARRIVES: Apple has started charging more for some of its products, and AI is one of the big reasons why. The increases apply to select iPads and MacBooks, along with HomePod speakers and Apple TV devices. Apple’s own store pages now show higher prices on several models than earlier launch materials listed. The iPhone was not included in this round, but analysts warn that may not last.

A customer holds a new iPhone during the first day of in-store sales of Apple’s latest products at Apple’s Fifth Avenue store in New York, on Friday, Sept. 19, 2025. (Kena Betancur/Bloomberg via Getty Images)

This article was written by Fox News staff.

Technology

Mystery box shows are complicated for everyone — even the actors

Published

July 2, 2026

Silo is such a complicated show that even its showrunner gets confused sometimes. While filming the final seasons of the Apple TV sci-fi thriller, Graham Yost remembers two instances where he messed up details: in one, an actor realized that a conversation they were about to shoot should’ve already taken place; in the other, the Japanese localization team pointed out that a subtitle didn’t match what was going on onscreen. In both instances, the problem was ultimately fixed, but Yost’s reaction was the same: “Oh shit, you’re right.”

Keeping everything straight is one of the big challenges of working on such a complex series, and as Silo enters into its final two seasons, the challenge has only increased. So it’s a good thing Yost has a team working alongside him looking for those mistakes. “It’s a lot to keep track of, but everyone is pitching in,” he says, “and I love this sense of collaboration.”

Season 3 of Silo starts streaming on July 3rd, and it expands the story’s scope quite a bit. The series follows the lives of the residents of a huge underground bunker hundreds of years in the future. The silo is home to 10,000 people who essentially live in a vertical city, one divided into layers that each have their own jobs and cultures, from the mines at the bottom to the government up top. The only way to navigate the silo is through a massive spiral staircase that goes from top to bottom, creating a very physical form of class division.

Initially it seemed the residents were the last remnants of humanity living in a postapocalyptic wasteland. But over the course of the first two seasons, it became clear that they lived in but one silo of many, each housing their own communities while isolated from the rest. Season 3 adds a new wrinkle: showing how the world came to be this way in the first place, a process that starts in a world that looks much like our own.

The season 3 premiere constantly jumps back and forth between the bleak future where we’ve spent the last two seasons and our present day, when the decisions were made that led to everyone being trapped inside of underground bunkers. Things are already complicated as the show picks up from last season — protagonist / silo mayor / reluctant revolutionary Juliette (Rebecca Ferguson) has just become the first person to venture between silos and is now suffering from memory loss — and the multiple timelines only ratchet that up.

The cast of Silo all have different techniques for dealing with this challenge, which becomes even harder given that scenes are rarely shot in chronological order. For some, daily team meetings with directors can be an invaluable tool. “A lot of days, we’d start the day with story time, and the director would go through where we’re at, where we just came from, what happens next,” explains Alexandria Riley, who plays newly promoted authority figure Camille Sims in the show. “It’s already a complicated story anyway, but then when shooting out of order, you do get a bit foggy.” Ferguson notes that the hair-and-makeup team can be particularly helpful in tracking the story, as they need to be on top of things like scars and burns to maintain consistency. Every detail counts. “The little changes that you do have enormous ripple effects going forward,” she says.

“It’s a lot of pieces you’re trying to put together,” adds Common, who plays Camille’s husband Robert on the show. “It is our job to know where we are, but thank god we had support, too. There are times when I’d have to talk to Alex about something just to be reminded.” The two actors even had separate rehearsals together to make sure they had everything down.

Others took a different approach. Jessica Henwick, for instance, joined the main cast as the present-day investigative reporter Helen in season 3, and says that “I didn’t read any scenes except my own. Because I’m a fan of the show, I wanted to preserve that experience. I will watch season 3 as a fan and see what happens. I don’t know what happens except in our storyline.” (Henwick is such a fan that, soon after she was cast, she had a single goal in mind: “I went to the set and explored the stairs.”)

One thing that doesn’t help much, however, is delving into the source material. Silo is based on a trilogy of books by author Hugh Howey; the first two seasons explored the first book, while the final two will wrap up the rest of the story. But much has changed in the adaptation as the TV show attempts to both make Juliette a more visible figure in the central part of the story and update some of the plotlines to reflect present-day concerns like AI.

“I started reading the books and realized very quickly that that wasn’t going to help, because the books are so different,” explains Ashley Zukerman, who plays a congressman in the present-day storyline. He says that keeping both the novels and the TV show in his mind at the same time wouldn’t have been helpful; instead, he found “that reading the whole scripts and then finding a way to forget [what his character wouldn’t know] was useful.”

With two seasons to go, Silo is racing toward a conclusion as it attempts to wrap everything up. Yost says that four seasons was always the plan, so the process has been figuring out how to fit everything into a set number of episodes. But since the final two seasons were filmed back to back, it also means that the Silo team are done having to worry about keeping all of those complicated plotlines straight. And as much as she says she’ll miss the experience of working on the show, there is one thing Ferguson is excited to be done with beyond memorizing storylines.

“I fucking hated running up and down those stairs,” she says.

Andrew Webster
