Challenges, Trends, and Opportunities within Advanced Large-Scale Computing – Part 1
October 21, 2022
Article by Chris Coates, Head of Engineering, HPC, at Q Associates, a Logicalis Company
This will be the first in a series of blog posts about the challenges, trends, and opportunities within Advanced Large-Scale Computing with a particular focus on High Performance Computing, but not limited to it.
The world of advanced large-scale computing is in rapid transition. It now pervades aspects of our society where it was once thought it never would – who would have predicted that High Performance Computing and Artificial Intelligence would be used for farming?
Deployed on premises, at the edge, or in the cloud, HPC and AI solutions are used for a variety of purposes, and they impact how we communicate both locally and globally. Examples include:
- Engaging in politics – analysing and influencing how political campaigns are run, or modelling the public’s response to COVID, being key examples.
- Purchasing goods – how we buy online or in-store, how online orders are fulfilled, and where products are placed on shelves based on analytics and vision-based AI.
- Exploring research and ideas – simulating designs and changes, synthesising chemicals and drugs, and sequencing genomes.
- Travelling – artificial intelligence applied to traffic flow, autonomous driving, and the design of the cars we drive.
- Planning environments for the future – for example, using digital twins to model and simulate the effect of architectural changes in the real world.
It also shapes how we view and understand the world, with the Laser Interferometer Gravitational-Wave Observatory, the Large Hadron Collider, the Square Kilometre Array, and Diamond Light Source, amongst many others, helping to drive the next generation of scientific breakthroughs.
These are just a few examples from a long list. It is safe to say that advanced large-scale computing has a huge influence on how we live day-to-day, even if you don’t realise it – and it could have an even larger influence going forwards. However, to say that all is well within advanced large-scale computing would be untrue; there are several concerns and pain points that need to be addressed.
Rarely has such a pervasive and powerful technology brought with it such an acute shortage of resources and skills. A boom in interest in the technology and the power it brings has created a boom in demand – one that is not easily met, and in many areas has not been. The development of talent to operate and leverage such systems is lagging significantly behind the growth in scale, performance, and demand.
With every rise of a new technology there is an associated ramp-up curve, and that curve is often hard to climb. More on experience curves later.
System administration is also becoming more complex within HPC environments. Systems are growing ever larger, with more components and nodes to meet these demands, and with each new system the scope for failure widens – the delays in bringing up Frontier being a key example. Moving forwards will require a paradigm shift in the way we view and implement computing solutions.
In Research Software Engineering, there is a noted skills gap. This is a multi-faceted issue: an RSE is expected not only to be a skilled researcher, but in some cases also a skilled programmer and an experienced systems administrator. Each of these roles is a specialisation in itself, and traditional tooling is not fit for the purpose of bridging them.
This “talent lag” isn’t helped by the fact that there is no specialist degree-level qualification for this profession. Often RSEs progress from being researchers, or from being software engineers.
The skills shortage is exacerbated by the fact that talented people often follow the money to a small number of either very large companies or creative start-ups.
People gravitate towards opportunities. Many students, faculty, and staff are leaving academia, national laboratories, and traditional computing companies to pursue those roles. The chance to be creative will always be a draw for the most talented.
The academic exodus amongst artificial intelligence researchers is well documented. After all, in the commercial world one can now develop and test ideas at a scale simply not possible in academia, using truly big data and uniquely scaled hardware and software infrastructure. The same is true for chip designers and system software developers. These talent challenges are important national policy considerations for all industrialised countries.
This means there is a “brain drain” within the marketplace, with only a few organisations having access to the talent. This has the potential to stifle growth, which in turn restricts innovation and progress on other key challenges, resulting in less opportunity (and funding) for recruitment within the academic sector – which makes the situation even more critical. It is a vicious cycle.
With this increased demand comes a risk to supply. The global semiconductor shortage has highlighted the interdependence of global supply chains and the economic consequences for industries and countries dependent on them – supply chains that are sometimes geopolitically challenged by the influence of governments on fabrication facilities.
There are substantial social, political, economic, and national security risks for any country that lacks a robust silicon fabrication ecosystem. Fabless semiconductor firms are important, but onshore, state-of-the-art fabrication facilities are critical, as the ongoing global semiconductor shortage has shown.
However, the investment needed to build state-of-the-art facilities is measured in the billions of dollars per facility. There is intense political debate around fabrication capability within the U.S., with similar conversations underway in Europe. Intel, TSMC, and GlobalFoundries recently announced plans to build new chip fabrication facilities in the U.S., each for different reasons, with Intel also looking at fabrication facilities within Europe. Where does the UK stand in all this, considering Brexit? Will semiconductor fabrication simply bypass the UK? If so, those substantial supply chain risks become more acute.
With the end of Dennard scaling, the slowdown of Moore’s Law, and the rising costs of both continuing semiconductor development and powering these systems, the challenges of building progressively faster large-scale compute keep growing – not only economically, but also environmentally. It is hard to generate and accommodate 20 MW of power, for example, and even harder to supply or cool it via green methods.
Ultimately, however, while Moore’s Law is often discussed in relation to power and scaling concerns, some thought should probably also be given to Wright’s Law. This isn’t a new concept: the paper from MIT and the Santa Fe Institute, “Statistical Basis for Predicting Technological Progress”, refers to it, and it has proven accurate so far.
Wright’s Law plots cumulative unit production against price per unit. Wright discovered that progress increases with experience: each percentage increase in cumulative production in a particular industry results in a fixed percentage improvement in production efficiency. He determined this whilst studying aircraft manufacturing – for every doubling of aircraft production, the labour requirement was reduced by 10–15%. He published his findings in a 1936 Journal of the Aeronautical Sciences article titled “Factors Affecting the Cost of Airplanes”.
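Wright’s observation can be expressed as a simple power law: cost(x) = c1 · x^b, where x is cumulative production and b = log2(learning rate). The sketch below is illustrative only – the function name is our own, and the 0.85 learning rate is an assumption modelling the 15% labour reduction per doubling that Wright observed:

```python
import math

def wright_cost(first_unit_cost, cumulative_units, learning_rate=0.85):
    """Projected unit cost under Wright's Law: cost(x) = c1 * x**b,
    where b = log2(learning_rate). A learning_rate of 0.85 means each
    doubling of cumulative production cuts unit cost by 15%."""
    b = math.log2(learning_rate)
    return first_unit_cost * cumulative_units ** b

# Starting from a first-unit cost of 100:
print(wright_cost(100.0, 2))  # 85.0  (one doubling: -15%)
print(wright_cost(100.0, 4))  # 72.25 (two doublings: 0.85^2)
```

The same two-line model, with a different learning rate per industry, is what underpins the learning-curve price predictions discussed below.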
This learning curve (or experience curve) has done well in predicting the prices of many products of completely different natures, including photovoltaic cells (in $/Watt) and DRAMs, even though the processes of cost reduction for these two technologies are dramatically different as are the slopes of their two learning curves.
This also holds true not just in manufacturing but in systems administration and software development – you need only look at Jez Humble’s “If it hurts, do it often” mantra to join the dots and see where the logic comes from. By iterating faster, efficiencies become greater. We will touch on this in one of the next posts.
Whilst it may seem we have digressed a little, the challenge here is that with these increases in power and cost at the initial iterations (i.e., the bleeding edge), it is increasingly difficult to provide advanced large-scale computing sustainably.
There is another, often overlooked, concern regarding sustainability. How many deployed systems sit consuming energy whilst their organisations ramp up along that experience curve to use the technology? That ramp-up time is, for all intents and purposes, waste.
Delays in software development and in solution deployment and stabilisation often account for many thousands of hours of wasted opportunity – and with it, reduced competitive advantage. After all, the very reason for buying the first iteration of a new technology is to get it first and gain that advantage. If a system sits idle, not delivering results for months, the key benefit of early adoption has probably gone with it.
Environmental and Extreme Events
Climate change is having a distinct impact on the development of advanced large-scale computing, how these systems will be built, and where they will be built in future.
In 2018, the California wildfire known as the Camp Fire had a ripple effect at a supercomputer facility operated by Lawrence Berkeley National Laboratory (LBNL) some 230 kilometres away. The National Energy Research Scientific Computing Center (NERSC) typically relied on outside air to help cool its systems, but smoke and soot from the fire forced engineers to cool recirculated air, driving up humidity levels and causing multiple issues, even system blowouts.
California utilities cut NERSC’s power later that year for fear that winds near LBNL might blow trees into power lines, sparking new fires. Although NERSC has backup generators, many machines were shut down for days.
Managers at high-performance computing (HPC) facilities are waking up to the costly, direct effects of climate change. Climate change can bring not only heat but also increased humidity, reducing the efficiency of the evaporative coolers many HPC data centres rely on.
For its next system set to open in 2026, NERSC is planning to install power-hungry chiller units, like air conditioners, that would both cool and dehumidify outside air. The cost implications of such adaptations are motivating some to migrate to cooler and drier climates.
Climate change is also threatening energy supply. Hotter temperatures increase power demand from other users. During California’s heat wave this summer, when air-conditioning use surged, Lawrence Livermore National Laboratory (LLNL) was told to prepare for power cuts of 2 to 8 megawatts. Similar concerns exist in Europe, where the general public has been told to prepare for potential power cuts due to war and its effect on global energy supply. There appears to be a progressive increase in the extreme events that HPC facilities must now consider.
Many HPC facilities are heavy users of water, too, which is piped around components to carry away heat—and which will grow scarcer as droughts in certain areas persist or worsen. A decade ago, Los Alamos National Laboratory in New Mexico invested in water treatment facilities so its systems could use reclaimed wastewater rather than more precious municipal water.
These challenges are not limited to the U.S. During this year’s heatwave in the UK, systems experienced cooling challenges and chiller failures, and some had to be shut down. This phenomenon isn’t going away any time soon, and wider world events such as the energy supply crisis mentioned above are leaving HPC programme managers with new challenges. Future systems will need to be built to handle ever more extreme events – able to cut performance, and with it the need for cooling and power, during such events, and to scale dynamically not only up but down to meet external needs.
These technologies empower us, but that power gives rise to moral questions and challenges: deep fakes, the influencing of political trends, risks to privacy and security, growing inequality between the haves and have-nots, cybercrime, and cyberwarfare, to name but a few.
Most recently, DCMS launched an AI Standards Hub to tackle some of these challenges – bias within AI models, both intentional and unintentional, and the problem of trustworthiness.
Each new rise in technology generally brings with it challenges at a societal level. You need only look at the effect of quantum computing and the concerns around post-quantum cryptography to realise that. With great power comes the opportunity to abuse it, and there must be checks and balances in place to minimise the risks presented by technologies such as these.
Major inequalities are also beginning to appear between the organisations that have access to such disruptive, transformative technology and those that don’t – and this will only become more apparent as time and technology progress. In the advanced computing arms race, who loses? Predominantly the people who can’t get access. This is a major cultural challenge that must be addressed by reducing the barriers to entry and access.
Given these technology and industry shifts, the future of large-scale high-performance computing is at a critical crossroads – even more so as the global competition for scientific computing leadership intensifies, raising several critical technological, economic, and cultural questions.
In the next post, we will investigate trends within not just large-scale advanced computing but the wider area of computing and their implications upon solution design moving forwards. We’ll then finish up the series with a closer look at the opportunities that are available to attempt to address these challenges.
Q Associates and Logicalis can demonstrate solutions and strategies to help you address these issues, empower you to deliver them at the right cost, and support your organisation in getting results, faster.
Get in touch with Chris Coates, Head of Engineering, HPC, at Q Associates and find out more about our offerings and expertise. Email email@example.com or call 01635 248181.