With Intel poised to enter the datacenter GPU market, the chipmaker used this week's International Supercomputing Conference in Hamburg, Germany, to show off a new software platform meant to simplify management of these devices at scale.
The open-source software, dubbed Intel XPU Manager, is an in-band remote management service for upgrading firmware, monitoring system utilization, and administering GPUs at the individual node level. The code is an important step as Intel prepares to compete against industry stalwarts Nvidia and AMD, which lead it not only in GPU silicon but in management software as well.
XPU Manager is a low-level management interface that runs in Kubernetes and is designed to be integrated into existing cluster management tools and schedulers using RESTful APIs. It also supports local management via the CLI and is validated for use on Ubuntu 20.04 or Red Hat Enterprise Linux 8.4.
Telemetry collected by the software includes GPU utilization, performance metrics, memory bandwidth and package temperatures, among others. It can be imported directly into popular monitoring stacks like Prometheus.
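For readers wondering what GPU telemetry looks like once it lands in a Prometheus-style stack, here is a minimal sketch in Python that parses a few samples in the Prometheus text exposition format and picks out the hottest GPU. The metric names (gpu_utilization_percent, gpu_temperature_celsius), the node and gpu labels, and the sample values are all illustrative assumptions; they are not taken from XPU Manager, and the parser handles only simple label values.

```python
import re

# Hypothetical sample in the Prometheus text exposition format.
# Metric names, labels, and values are assumptions for illustration only;
# they are NOT the names XPU Manager actually exports.
SAMPLE = """\
# HELP gpu_utilization_percent GPU utilization
# TYPE gpu_utilization_percent gauge
gpu_utilization_percent{node="node-a",gpu="0"} 87.5
gpu_utilization_percent{node="node-a",gpu="1"} 42.0
# HELP gpu_temperature_celsius GPU package temperature
# TYPE gpu_temperature_celsius gauge
gpu_temperature_celsius{node="node-a",gpu="0"} 71
gpu_temperature_celsius{node="node-a",gpu="1"} 64
"""

# name, optional {labels}, then a numeric value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def hottest_gpu(samples, metric="gpu_temperature_celsius"):
    """Return (node, gpu, temperature) for the highest reading, or None."""
    temps = [(value, labels) for name, labels, value in samples if name == metric]
    if not temps:
        return None
    value, labels = max(temps, key=lambda pair: pair[0])
    return labels.get("node"), labels.get("gpu"), value


if __name__ == "__main__":
    print(hottest_gpu(parse_metrics(SAMPLE)))
```

In practice a Prometheus server would scrape and store these samples itself; the sketch only shows the shape of the data a scheduler or dashboard might consume.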
The platform is initially available for Intel-based systems, including those using the chipmaker’s upcoming Ponte Vecchio and Rialto Bridge GPUs, but thanks to its open-source nature, Jeff McVeign, VP and GM of Intel’s supercomputer group, expects the platform will be ported to other architectures before long. Eventually, a datacenter could use XPU Manager to manage a mix of Intel, AMD, and Nvidia GPUs at scale, he opined.
Pressed on whether Intel plans to offer a commercial version of XPU Manager in line with CEO Pat Gelsinger’s recent emphasis on driving software revenues, McVeign didn’t rule out the possibility, citing past open-source projects that had been commercialized in that fashion.
“Our goal right now is to make it available… and to get it out there so that people can utilize it effectively. And then, if it’s valuable for others to get a commercial support license, we will entertain that, but that’s not the motivation for this,” he said. “It’s not about software revenue. It’s really around how do we manage those platforms?”
Intel says greener datacenters are just a click away
Much of this emphasis on software is rooted in improving hardware utilization and making datacenters more sustainable in the process.
According to Intel, by 2030, datacenters could be responsible for 3-7 percent of global energy consumption, with compute hardware being the top driver of electricity use. Intel’s Ponte Vecchio GPUs are expected to draw upwards of 600W when they launch later this year, but these sky-high TDPs are hardly unique to Intel. GPUs from its rivals are pulling even more: 700W in the case of Nvidia’s H100 SXM.
In recent months, Intel has leaned on software to address many of these issues. Earlier this year, Intel acquired software startup Granulate in a bid to optimize applications for its hardware at runtime. Similarly, Intel is using software acquired from SigOpt in 2020 to cull the number of parameters required for simulations and so shorten the time to result.
To this end, Intel also used ISC this week to update its Datacenter Manager software platform, extending greater control over power consumption at datacenter scale.
“Intel Datacenter Manager has been out for quite a while now, but we’re really bringing a lot of new sustainability and energy efficiency features into it,” McVeign said.
While XPU Manager is designed to manage accelerators at the node level, Datacenter Manager, as its name suggests, is designed to manage operations at the cluster level. As such, the updates provide tools like thermal mapping and the ability to cap power consumption for an entire compute cluster.
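To make the idea of a cluster-wide power cap concrete, here is a minimal Python sketch of one simple policy: if the nodes' combined demand exceeds the cluster budget, every node's cap is scaled down proportionally. This is an illustration under stated assumptions, not how Intel Datacenter Manager works; the function name allocate_power_caps, the node names, and the wattages are all hypothetical.

```python
def allocate_power_caps(cluster_cap_w, node_demand_w):
    """Split a cluster-wide power budget across nodes.

    cluster_cap_w: total budget for the cluster, in watts.
    node_demand_w: dict mapping node name -> requested draw, in watts.

    Hypothetical proportional policy for illustration only; it is not
    taken from Intel Datacenter Manager.
    """
    total_demand = sum(node_demand_w.values())
    # Under budget: every node gets what it asked for.
    if total_demand <= cluster_cap_w:
        return dict(node_demand_w)
    # Over budget: shrink every node's cap by the same ratio.
    scale = cluster_cap_w / total_demand
    return {node: demand * scale for node, demand in node_demand_w.items()}


if __name__ == "__main__":
    # Illustrative numbers: three nodes asking for 6,000W against a 4,500W cap.
    demand = {"node-a": 2400, "node-b": 1800, "node-c": 1800}
    print(allocate_power_caps(4500, demand))
```

A real scheduler would likely weigh job priority and thermal headroom rather than cutting every node equally; proportional scaling is simply the easiest policy to reason about.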
Intel doesn’t expect software will solve all of its problems. The company has invested heavily to improve the power efficiency of its chips, and earlier this month unveiled a $700 million “mega lab” to investigate novel liquid-cooling tech, including immersion cooling. ®