HÀ NỘI – The concept of inference scaling is being hailed as a transformative approach in artificial intelligence (AI) at the AI and Semiconductor International Conference 2025 (AISC 2025) in Hà Nội.
Held from March 12–16 in Hà Nội, with additional sessions in Đà Nẵng, AISC 2025 has attracted over 1,000 technology experts and industry leaders from around the world. Co-organised by Việt Nam’s National Innovation Centre (NIC) and US-based AI firm Aitomatic, the conference explores how AI and semiconductor advancements are reshaping the future of computing.
A key theme at the conference is the shift towards allocating computational power during inference, rather than primarily during training. According to Azalia Mirhoseini, a professor at Stanford University and an AI researcher at Google DeepMind, inference scaling represents a new axis for AI performance enhancement.
She likened it to an “infinite monkey” approach, where an AI model can generate multiple outputs and eventually arrive at the correct solution given enough attempts. This contrasts with traditional AI development, which prioritises pre-training and fine-tuning as the main scaling strategies.
AI inference scaling and its impact on accuracy
Emerging research shared at AISC 2025 highlights that allowing AI to generate multiple solutions and selecting the best one can dramatically improve accuracy. A proposed framework, "Large Language Monkeys," demonstrated that running a large language model (LLM) multiple times on the same prompt—while an automated verifier assesses each output—can significantly enhance the likelihood of correct responses.
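As a rough illustration of the repeated-sampling idea described above, the sketch below shows a toy verifier-in-the-loop setup: a simulated model produces mostly wrong candidate answers, an automated check accepts or rejects each one, and sampling continues until a candidate passes. The function names and toy task are hypothetical assumptions and are not drawn from the Large Language Monkeys framework itself.

```python
import random

# Toy stand-in for the repeated-sampling idea: the "model" produces noisy
# candidate answers, an automated verifier accepts or rejects each one, and
# sampling continues until a candidate passes. Function names are
# hypothetical, not taken from the Large Language Monkeys framework.

def generate_candidate() -> int:
    """Pretend model call: a mostly wrong guess at the reference answer 42."""
    return random.choice([42] + [random.randint(0, 100) for _ in range(9)])

def verifier_accepts(candidate: int) -> bool:
    """Automated check, standing in for unit tests or a proof checker."""
    return candidate == 42

def repeated_sampling(attempts: int) -> int | None:
    """Sample repeatedly and return the first candidate the verifier accepts."""
    for _ in range(attempts):
        candidate = generate_candidate()
        if verifier_accepts(candidate):
            return candidate
    return None  # no verified answer within the attempt budget

print(repeated_sampling(attempts=1))    # frequently None with a single attempt
print(repeated_sampling(attempts=250))  # almost always 42 with many attempts
```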
Across reasoning and programming tasks, researchers observed that the probability of obtaining the correct answer, termed coverage, increases predictably with the number of inference attempts.
According to Mirhoseini, this follows an inference-time scaling law, similar to well-established training scaling laws. She explained that in fields where automated verification is possible—such as unit testing for software or mathematical proof verification—this approach directly enhances problem-solving success.
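As a simplified, hedged approximation of why coverage rises with the number of attempts: if each attempt were assumed to succeed independently with probability p, the coverage after k attempts would follow the expression below. Real samples from a single model are correlated, so this is only an illustrative model, not the empirical scaling law presented at the conference.

```latex
% Simplified independence model: if each of k attempts succeeds with
% probability p, the chance that at least one attempt is correct is
\mathrm{coverage}(k) = 1 - (1 - p)^{k}
```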
For instance, in software development benchmarks, an AI-based code generator solved 15.9 per cent of coding problems with a single attempt. However, when given 250 attempts, its accuracy increased to 56 per cent, surpassing the previous best-in-class model, which achieved 43 per cent in a single-shot scenario.
Even a smaller 70-billion-parameter open-source model, when given sufficient inference runs, could match or outperform larger models like GPT-4 on specific coding and reasoning tasks. These findings suggest that computational effort during inference can compensate for smaller model sizes or limited training data, making advanced AI capabilities more accessible without requiring massive models.
Applications
At AISC 2025, researchers showcased several real-world applications of inference scaling across software engineering, hardware programming and semiconductor design.
In software development, a prototype system called 'CodeMonkeys' applies inference scaling to programming tasks. The AI generates multiple candidate code edits and bug fixes in parallel, each evaluated automatically using unit tests.
According to Mirhoseini, this process enables the AI to refine its output iteratively, selecting the best candidate solution. The key insight is that allocating more computational power at the inference stage, rather than during initial training, enhances AI's ability to write, debug and optimise code.
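The sketch below illustrates, in simplified form, the parallel generate-and-test pattern described here: several candidate edits are produced at once and a small unit-test check plays the role of the automated verifier. The candidate pool, function names and toy task are illustrative assumptions rather than the actual CodeMonkeys implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy sketch of a parallel generate-and-test loop. The "model" emits candidate
# implementations of a tiny function, most of them buggy; a unit test acts as
# the automated verifier. Names and the toy task are illustrative only, not
# the actual CodeMonkeys implementation.

CANDIDATE_SNIPPETS = [
    "def add(a, b):\n    return a - b",   # buggy candidate
    "def add(a, b):\n    return a * b",   # buggy candidate
    "def add(a, b):\n    return a + b",   # correct candidate
]

def generate_candidate(_seed: int) -> str:
    """Pretend model call: returns one candidate code edit."""
    return random.choice(CANDIDATE_SNIPPETS)

def passes_unit_tests(source: str) -> bool:
    """Execute the candidate and run a small test suite against it."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5 and namespace["add"](-1, 1) == 0
    except Exception:
        return False

def best_of_n(n: int = 16) -> str | None:
    """Generate n candidates in parallel; return the first that passes its tests."""
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(generate_candidate, range(n)))
    return next((c for c in candidates if passes_unit_tests(c)), None)

print(best_of_n())
```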
In hardware programming, researchers introduced 'KernelBench', a tool leveraging inference scaling to automate low-level programming tasks. Writing optimised kernel code, which is critical for high-performance computing, traditionally requires extensive manual effort.
KernelBench enables AI models to generate kernel code, receive compiler feedback and performance metrics, and refine their output over multiple iterations. This iterative process allows AI to automate complex programming tasks that would otherwise require significant human expertise and time.
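A simplified sketch of such a refine-with-feedback loop is shown below: a simulated model proposes a kernel, a stand-in compiler and profiler return feedback and a latency figure, and the loop keeps the fastest candidate. All names and numbers are illustrative assumptions and do not reflect the KernelBench tool itself.

```python
import random

# Toy sketch of an iterative refine-with-feedback loop: the "model" proposes a
# kernel, a stand-in compiler/profiler returns feedback (success flag, latency,
# message), and the next proposal would normally be conditioned on that
# feedback. All names are illustrative, not from the KernelBench tool.

def propose_kernel(feedback: str | None) -> str:
    """Pretend model call; in practice the feedback is fed back as context."""
    latency = random.uniform(1.0, 10.0)
    return f"// candidate kernel, simulated latency {latency:.2f} ms"

def compile_and_profile(kernel: str) -> tuple[bool, float, str]:
    """Stand-in compiler and profiler: (compiled?, latency_ms, message)."""
    latency = float(kernel.split()[-2])
    return True, latency, f"compiled ok, ran in {latency:.2f} ms"

def refine(iterations: int = 8, target_ms: float = 2.0) -> str | None:
    """Iterate, keeping the fastest compiling kernel found so far."""
    best_kernel, best_latency, feedback = None, float("inf"), None
    for _ in range(iterations):
        kernel = propose_kernel(feedback)
        ok, latency, feedback = compile_and_profile(kernel)
        if ok and latency < best_latency:
            best_kernel, best_latency = kernel, latency
        if best_latency <= target_ms:
            break
    return best_kernel

print(refine())
```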
The conference also underscored AI’s growing impact on chip design, with Google’s AlphaChip project serving as a standout example. AlphaChip employs deep reinforcement learning to automate chip floorplanning, an essential step in semiconductor design.
According to Google, its AI-generated chip layouts are comparable to or superior to human designs across all performance metrics, while requiring significantly less time. A floorplan that would take months for human engineers to finalise can be generated by AlphaChip’s AI in under six hours. Mirhoseini noted that inference scaling could further enhance AI-driven chip design, enabling rapid evaluation of thousands of design variations to improve efficiency and performance.
The emergence of inference scaling marks a fundamental shift in AI development. Traditionally, AI research has focused on increasing model size and dataset volume to improve performance. However, the findings presented at AISC 2025 suggest that redistributing computational resources to inference may unlock latent AI capabilities without requiring ever-larger models.
This shift also presents new challenges for hardware and software infrastructure. As AI inference workloads grow, developing specialised AI chips that optimise for high-throughput inference will be crucial.
Discussions at AISC 2025 highlighted next-generation AI accelerators and parallel processing techniques aimed at reducing computational costs associated with inference scaling.
Experts at the conference expressed optimism that inference scaling will become a cost-effective and practical approach to AI deployment. By combining advanced inference strategies, automated verification, and high-performance hardware, AI systems may soon tackle problems previously considered too complex or computationally expensive. — VNS