Imagine for a moment that the millions of computer chips inside the servers that power the world’s largest data centers had rare, almost undetectable flaws, and that the only way to find those flaws was to throw the chips at giant computing problems that would have been unthinkable just a decade ago.
As the tiny switches of computer chips have shrunk to the width of a few atoms, chip reliability has become another concern for people running the world’s largest networks. Companies like Amazon, Facebook, Twitter and many other sites have experienced surprising outages over the past year.
Outages have several causes, such as programming errors and network congestion. But there are growing concerns that even as cloud computing networks have become larger and more complex, they still depend, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.
Over the past year, researchers from Facebook and Google have published studies describing computer hardware failures whose causes were not easily identified. The problem, they said, was not in the software – it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook, now known as Meta, did not return requests for comment on its study.
“They see these silent errors, basically coming from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in computer hardware testing. Increasingly, Dr. Mitra said, people believe that manufacturing defects are related to these so-called silent errors that cannot be easily detected.
Researchers fear that they are finding rare faults because they are trying to solve ever-larger computing problems, which strain their systems in unexpected ways.
Companies that operate large data centers began reporting systematic issues more than a decade ago. In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists studying hardware reliability at the University of Toronto reported that each year up to 4% of Google’s millions of computers experienced errors that could not be detected and that caused them to shut down unexpectedly.
In a microprocessor that has billions of transistors, or a computer memory board made up of trillions of tiny switches that can each store a 1 or a 0, even the smallest error can disrupt systems that now routinely perform billions of calculations each second.
At the start of the semiconductor age, engineers worried that cosmic rays would occasionally flip a single transistor and alter the result of a calculation. Now they fear that the switches themselves are becoming increasingly unreliable. Facebook researchers even say that the switches are more likely to wear out and that the life span of computer memories or processors could be shorter than previously thought.
There is growing evidence that the problem is getting worse with each new generation of chips. A report published in 2020 by chipmaker Advanced Micro Devices found that the most advanced computer memory chips of the time were around 5.5 times less reliable than the previous generation. AMD did not respond to requests for comment on the report.
Tracking down these errors is difficult, said David Ditzel, a veteran hardware engineer who is the president and founder of Esperanto Technologies, a maker of a new type of processor designed for artificial intelligence applications in Mountain View, California. He said his company’s new chip, which has just hit the market, has 1,000 processors made from 28 billion transistors.
He compares the chip to an apartment building that would cover the area of the entire United States. Using Mr. Ditzel’s metaphor, Dr. Mitra said finding new errors was a bit like looking for a single running faucet in one apartment in that building, one that malfunctions only when a bedroom light is turned on and the apartment door is open.
Until now, computer designers have tried to deal with hardware faults by adding special circuits to the chips that correct the errors. This circuitry automatically detects and corrects bad data. Such errors were once considered extremely rare. But several years ago, Google’s production teams started reporting errors that were extremely difficult to diagnose. The miscalculations occurred intermittently and were difficult to reproduce, according to their report.
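To make the idea of error-correcting circuitry concrete, here is a minimal software sketch of a Hamming(7,4) code, a textbook scheme that stores four data bits alongside three parity bits and can repair any single flipped bit. It is an illustration only: the article does not say which codes the chips in question use, real memory protection typically relies on wider codes implemented in hardware, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Return (data_bits, error_position); error_position is 0 when no error is seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit, or 0
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single bit flip in storage
recovered, position = hamming74_decode(word)
print(recovered, position)  # [1, 0, 1, 1] 5
```

A code like this repairs only one flipped bit per codeword. If two bits flip, the decoder silently "corrects" the wrong position, which is one way bad data can slip past protection of this kind.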
A team of researchers tried to track down the problem, and last year they published their findings. They concluded that the company’s vast data centers, composed of computer systems based on millions of processor “cores,” were experiencing new errors that were likely caused by a combination of two factors: smaller transistors that were approaching physical limits, and inadequate testing.
In their “Cores That Don’t Count” article, Google researchers noted that the problem was difficult enough that they had already spent decades’ worth of engineering time solving it.
Modern CPU chips are made up of dozens of CPU cores, computational engines that allow tasks to be broken down and solved in parallel. The researchers found that a small subset of cores produced inaccurate results infrequently and only under certain conditions. They described the behavior as sporadic. In some cases, the cores produced errors only when the calculation speed or the temperature was changed.
According to Google, the increasing complexity of processor design was a major cause of failure. But engineers also said smaller transistors, three-dimensional chips and new designs that only create errors in certain cases have all contributed to the problem.
In a similar paper published last year, a group of Facebook researchers noted that some processors would pass manufacturers’ tests but then begin exhibiting failures once they were in the field.
Intel executives said they are familiar with research papers from Google and Facebook and are working with the two companies to develop new methods for detecting and correcting hardware errors.
Bryan Jorgensen, vice president of Intel’s Data Platforms Group, said the researchers’ claims were correct and that “the challenge they are issuing to the industry is the right place to go.”
He said Intel had recently started a project to help create standard, open-source software for data center operators. The software would allow them to find and correct hardware errors that the circuits built into the chips failed to detect.
The challenge was underscored last year when several Intel customers quietly issued warnings about undetected errors created by their systems. Lenovo, the world’s largest personal computer maker, informed its customers that design changes in several generations of Intel’s Xeon processors meant that the chips could generate a greater number of errors that could not be corrected than previous Intel microprocessors.
Intel hasn’t spoken publicly about the issue, but Mr. Jorgensen acknowledged the problem and said it had been fixed. The company has since changed its design.
Computer engineers are divided on how to meet the challenge. A popular response is the demand for new types of software that proactively monitor hardware errors and allow system operators to remove hardware when it begins to degrade. This has created an opportunity for new start-ups offering software that monitors the health of underlying chips in data centers.
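The article does not describe how such monitoring software works internally. One common idea, sketched here purely as an assumption, is to run the same deterministic workload on every core and flag any core whose answer disagrees with the majority. The "cores" below are ordinary Python functions, with one deliberately made faulty for the simulation, and all names are hypothetical.

```python
from collections import Counter


def checksum_workload(n):
    """Deterministic integer workload: every healthy core must return the same value."""
    acc = 0
    for i in range(1, n + 1):
        acc = (acc * 31 + i * i) % 1_000_003
    return acc


def screen_cores(cores, n=10_000):
    """Run the workload on each core and return the names that disagree with the majority."""
    results = {name: run(n) for name, run in cores.items()}
    majority_value, _ = Counter(results.values()).most_common(1)[0]
    return sorted(name for name, value in results.items() if value != majority_value)


def healthy_core(n):
    return checksum_workload(n)


def faulty_core(n):
    # Simulated silent error: one bit of the result is flipped, with no crash or warning.
    return checksum_workload(n) ^ 0b100


cores = {
    "core0": healthy_core,
    "core1": healthy_core,
    "core2": faulty_core,
    "core3": healthy_core,
}
suspects = screen_cores(cores)
print(suspects)  # ['core2']
```

A single pass like this would miss most of the faults the researchers describe, since those appear only intermittently and under particular speeds or temperatures. A real screening tool would presumably have to repeat such checks many times and under varied operating conditions.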
One such operation is TidalScale, a Los Gatos, California-based company that makes specialized software for businesses trying to minimize hardware failures. Its managing director, Gary Smerdon, suggested that TidalScale and others faced a daunting challenge.
“It will be a bit like changing an engine while a plane is still flying,” he said.