Artificial intelligence is rapidly transforming weather prediction, enabling forecasts that once required hours of supercomputing time to run in just minutes. But as AI tools play an expanding role in high-stakes hazard modeling, researchers at Rice University say an essential question remains: Do AI-generated storms behave realistically?
Their new study, published in the Journal of Geophysical Research: Atmospheres, provides a comprehensive evaluation of how AI-based global weather models simulate tropical cyclones. The researchers found that while leading AI systems perform well at predicting storm tracks and large-scale behavior, they can struggle to reproduce the physical structure of storms, particularly the wind patterns that drive real-world impacts.
“In recent years, we’ve seen an explosion in AI-based weather models,” said corresponding author Avantika Gori, assistant professor of civil and environmental engineering at Rice. “These systems are trained on massive atmospheric datasets, and once trained, they can generate global forecasts in just a minute or two, which is dramatically faster than traditional physics-based models.”
That speed represents a major shift for forecasting. Conventional numerical weather models simulate atmospheric processes by solving complex physical equations, a computationally expensive approach. AI models instead learn statistical relationships from historical data, allowing them to produce forecasts with remarkable efficiency.
But their complexity also introduces challenges.
“Because these models are so large with millions or billions of parameters, we don’t always have visibility into how they generate their predictions,” Gori said. “For high-consequence events like tropical cyclones, that makes systematic evaluation critically important.”
The study evaluated two prominent AI global weather models, Pangu-Weather and Aurora, using storms from the North Atlantic and western North Pacific basins between 2020 and 2025. To ensure a rigorous test, the researchers simulated roughly 200 storms outside the models’ training periods, then compared AI-generated storm characteristics with ERA5 reanalysis data.
“We wanted to determine whether the models could reproduce the climatology and physical behaviors observed in real storms,” said postdoctoral student and first author Yanmo Weng. “Many prior evaluations examined only one or two cyclones, but by analyzing hundreds of storms, we were able to draw more accurate and generalizable conclusions about model performance.”
Their analysis showed that AI modeling was most successful at forecasting storm paths.
“We found that the AI models we evaluated performed remarkably well in predicting cyclone tracks,” Gori said. “They reproduced where storms traveled and where they made landfall with a high degree of consistency, which is reassuring since forecasting a storm’s path helps shape evacuation decisions and early warnings.”
Storm intensity, which has traditionally been a challenge for AI weather models, showed uneven but promising improvement. Earlier AI systems often underestimated the strength of tropical cyclones, missing the highest winds and lowest pressures associated with major storms. In the Rice benchmarking, Aurora more closely matched ERA5 intensity distributions, while Pangu-Weather exhibited larger biases for the most intense cyclones.
Even so, accurately representing extreme storms remains difficult. The researchers emphasized an important caveat: ERA5 itself tends to underestimate peak intensity compared to observations, meaning agreement with reanalysis does not automatically imply accuracy.
The study’s most significant caution involved the physical realism of simulated windfields — the internal structure of winds within AI-generated storms. Although many simulations appeared visually convincing, closer analysis revealed that they did not always satisfy established physical constraints. Tests of gradient wind balance, a fundamental relationship governing mature cyclones, showed notable deviations, particularly near storm centers.
“These inconsistencies are not always obvious,” Gori said. “Windfields can look realistic while still violating key aspects of atmospheric physics.”
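For readers who want to see what a gradient wind balance test can look like, here is a minimal sketch. It is not the study's code. It assumes azimuthally averaged radial profiles of tangential wind and pressure, a constant air density, and a Northern Hemisphere storm; the function names and the density value are illustrative choices. Gradient wind balance requires V²/r + fV = (1/ρ) ∂p/∂r, where V is the tangential wind, r the distance from the storm center, f the Coriolis parameter, ρ the air density and p the pressure.

```python
import numpy as np

RHO_AIR = 1.15    # kg/m^3, assumed constant air density (illustrative value)
OMEGA = 7.292e-5  # Earth's rotation rate in rad/s


def coriolis_parameter(lat_deg):
    """Coriolis parameter f = 2 * Omega * sin(latitude)."""
    return 2.0 * OMEGA * np.sin(np.deg2rad(lat_deg))


def gradient_wind(r_m, dp_dr, f, rho=RHO_AIR):
    """Tangential wind that satisfies V^2/r + f*V = (1/rho) * dp/dr.

    Solves the quadratic in V and keeps the cyclonic root
    (assumes f > 0 and pressure increasing outward from the center).
    """
    half_fr = 0.5 * f * r_m
    return -half_fr + np.sqrt(half_fr**2 + r_m * dp_dr / rho)


def balance_residual(r_m, v_tan, pressure_pa, lat_deg, rho=RHO_AIR):
    """Simulated tangential wind minus the wind implied by the pressure field.

    Values near zero indicate the profile is close to gradient wind balance.
    """
    f = coriolis_parameter(lat_deg)
    dp_dr = np.gradient(pressure_pa, r_m)  # finite-difference radial gradient
    return v_tan - gradient_wind(r_m, dp_dr, f, rho)
```

In a check like this, large residuals at small radii would correspond to the kind of near-center deviation the researchers describe, although the thresholds and averaging choices used in the study are not given here.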
The team also found that both AI models tended to overestimate inner core size, especially in stronger storms. Such biases matter because cyclone impacts depend not only on track but also on how winds are organized — factors that shape projections of wind damage, rainfall and storm surge. Accurately capturing storm structure is therefore critical for risk assessment. When windfields are not physically consistent, downstream hazard and damage predictions can be affected.
Still, Gori said the findings point to where AI forecasting can improve rather than undermining its promise.
“Our work helps identify where bias corrections or additional interpretation may be necessary,” Gori said. “For instance, if a model systematically underestimates intensity, forecasters can adjust rather than relying on the raw output.”
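One common way to apply such an adjustment is empirical quantile mapping, sketched below. This is an illustration of the general idea and not a method described in the study; the function name and inputs are hypothetical. It maps each new model intensity onto the reference value that sits at the same position in a historical distribution. As the researchers note, a reference such as ERA5 can itself underestimate peak intensity, so a correction of this kind is only as good as the reference it targets.

```python
import numpy as np


def quantile_map(model_hist, reference_hist, model_new):
    """Empirical quantile mapping (illustrative sketch).

    model_hist     : past model intensities (e.g. peak wind in m/s)
    reference_hist : reference intensities for the same period
    model_new      : new model intensities to adjust
    """
    model_sorted = np.sort(np.asarray(model_hist, dtype=float))
    ref_sorted = np.sort(np.asarray(reference_hist, dtype=float))
    probs_model = np.linspace(0.0, 1.0, model_sorted.size)
    probs_ref = np.linspace(0.0, 1.0, ref_sorted.size)

    # Position of each new value within the model's own distribution.
    # Values outside the historical range are clamped to its ends.
    p = np.interp(model_new, model_sorted, probs_model)

    # Reference value found at that same position.
    return np.interp(p, probs_ref, ref_sorted)
```

For example, if a model's historical peak winds were consistently 20% lower than the reference, passing a new model value through `quantile_map` would raise it toward the reference value at the same quantile.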
Beyond specific model biases, the researchers also emphasized a broader lesson: AI tools still depend on field expertise.
“These systems are extraordinarily powerful, but they are not self-validating,” Gori said. “Close collaboration between atmospheric scientists and AI developers is essential to ensure that model outputs remain physically meaningful, and advancing these technologies responsibly will require continuous input and refinement by the scientific community.”
In other words, as weather forecasting continues to embrace AI models, those models should be used as a complement to, and not a replacement for, human expertise and understanding.
This research was supported by the National Science Foundation.
Journal of Geophysical Research: Atmospheres
Climatological Benchmarking of AI-Generated Tropical Cyclones
21-Jan-2026