25 minutes
Written: 2026-03-19 00:00 +0000
GTC 2026
- GTC Highlights
- Talks
- Scaling Out and Across: Networking Innovations for Giga-Scale AI Systems [S81561]
- Build a High-Performance Research Cluster [S81731]
- Inside NVIDIA DGX AI Factory: Accelerating Networking for AI Across Cloud, Core, and Edge [S81856]
- Achieve Truly Serverless GPUs With libfuse, CRIU, and CUDA-Checkpoint [S81424]
- Accelerate Cloud Platforms for the Next Era of AI [S81788]
- How We Scaled Kimi K2.5 [S81695]
- vLLM in 2026: Architectural Challenges and Performance Optimizations [S82059]
- General Networking
- Building, Measuring, and Using AI Scientists [S81694]
- An AI-Driven Autonomous Lab of the Future for Chemistry and Materials Science [S81790]
- The State of Open Source AI [S81791]
- A New Paradigm: Verifiable AI [S81489]
- Optimize KV Caches for LLM Inference: Dynamo KVBM, FlexKV, LMCache [S82033]
- MLOps 202: From Models to Production AI Systems at Scale [S81662]
- Science and Engineering With AI Physics and Kit-CAE [S81781]
- Conclusion
GTC Highlights
GTC 2026 is in the books. I did my best to make the rounds. Below is a lengthy list of talks and conversations I had on a wide range of relevant subject material. I figured I’d give a quick summary for those less interested in the details.
GTC is Nvidia’s conference, and so the central aspect is their hardware. This might sound strange, but Nvidia is really just starting to do AI hardware, and it won’t even be available this year. Vera Rubin is Nvidia’s new platform that will be available some time in 2027. Possibly later if you don’t have blackmail on an Nvidia exec. The Hopper architecture came out in 2022 and wasn’t that far removed from a typical graphics card. Blackwell came a couple of years later and began to really target AI-centric improvements, and supplies are still scarce enough that Hoppers remain relevant. Vera Rubin claims to offer ~1000x inference performance over the Hopper architecture. It should be thought provoking that we are this far into AI and the hardware hasn’t even really gotten started yet, with existing inference capacity being such a small fraction of what will be out in a few years.
The next layer, and the topic of many of the talks, is how we manage all this hardware: data center buildouts, complex networks, and tools to squeeze every drop out of the existing hardware. Moonshot AI founder Zhilin Yang noted that “increased efficiency is increased intelligence”: during a hardware crunch like this, anything you can do to get more from what you have directly moves the bar higher.
The layer above that is what to do with these resources once we have them. Model training and inference is still an answer, but there is growing industry specificity. There is a lot of interest in using these GPU-heavy applications for modeling and simulation of physical processes, think real time wind tunnel simulation, or navigating industrial processes. I was particularly interested in some of the AI scientist talks as it reminded me of some of the bad old days of research.
No conference is complete without as much networking as you can stomach. The Nvidia experts are always helpful, and the conference goers had plenty to say. A lot of companies are engaged in a two-front war on AI. First, they have to get it into their product somehow; attendees at GTC were at least well positioned to do so. Secondly, they are expected to begin using AI internally to improve their own processes, and this is a much more nebulous operation. Generally it involves taking all the unstructured data (corporate communications, documentation, engineering processes), dumping it into a shared location, and then hoping an AI can figure it out. There isn’t an AI on the market that would perform well in that scenario, so keep those condemned to last year’s models or Copilot in your thoughts. I think it’s common knowledge that most AI efforts fail, and right now a lot of companies aren’t giving themselves a chance to succeed.
I think that’s really the high level summary. We are very, very early in the AI process. The level of AI consumption and production will be orders of magnitude higher in just a few years. Hopefully people will even be doing fewer disastrous AI rollout efforts as best practices become more firmly established, but I wouldn’t hold my breath for that last one.
Talks
Note that the sessions are viewable on the GTC website in about a week for non-attendees. You can use the session code to find them.
Scaling Out and Across: Networking Innovations for Giga-Scale AI Systems [S81561]
This first talk was centered around Nvidia’s DGX Cloud offering and the many open source libraries they are shipping with it. Targeting 1M GPUs under DGX cloud deployment it’s at least a competitive offering when you look at how other cloud companies are struggling to get customers anything more than a consumer graphics card.
There’s also a mountain of open source projects being put forward by Nvidia. Definitely in their “let one hundred flowers bloom” stage. I suppose part of experience is just being around long enough to witness the repetition of cycles. AWS had an almost identical explosion of services and offerings at this stage in their growth cycle. Fast forward a few years and the herd had been severely culled, much to the detriment of anyone who locked in on anything too precarious. Which ones are precarious you might ask? Good luck.
We can hope that both Nvidia and the open sourced projects in general will be more resilient to this cycle. A GitHub repo being less subject to termination than a full cloud service. Unfortunately a “no longer supported” notice can be as much a death penalty to a project as a deprecated cloud service.
That being said there were a number of offerings I found interesting. Whether that leads to lifted ideas or direct implementation remains to be seen.
NVSentinel: Closed loop remediation for infrastructure issues. Easier said than done, but I think in the age of less human intervention and more AI consumption some form of this is inevitable. Recovering from a botched AI intervention is likely to be somewhat of a nightmare.
Fleet Intelligence: Tracking the health of GPUs and other resources. Likely something I should give a thorough review to make sure I’m not missing any useful metrics in my own health systems, even if I don’t plan to implement it directly.
Nvidia Exemplar: Cloud performance recipes to validate that what is being deployed is at least somewhat performance optimal. Seems the most directly useful as it doesn’t require long term maintenance but just allows you to make sure your environment wasn’t set up inefficiently.
Build a High-Performance Research Cluster [S81731]
This was a phenomenal talk put on by Chris Turpin and Jerome Vienne from Qube, a London Quant firm. They did a full data center buildout in Iceland to bring racks of GPUs online.
While very useful to me, I won’t go too deep in the weeds for this recap. Lots of direct commands that you can run to validate each layer of the data center install. The other thing they did that I found interesting was the creation of XML graphs to visualize effective health checks and data transfer processes. Definitely something I plan to use myself, and I’ll have to write more about that as it rolls out.
They also use Slurm for scheduling, which, for a modern data center in 2026, has to be one of the more European things I’ve ever heard. Slurm is good for this use case as it allows for complex topology-aware deployments, a nuance I think is missing from other solutions currently on the market.
A very good talk, and I think also thought provoking on the creation of the Icelandic data center. Given the unenviable UK energy prices, and data center creation rules, it does seem reasonable to deal with the challenges they mentioned of six foot blizzards and lava eating their fiber cables. It’s worth thinking about for anyone doing a data center buildout. What’s your edge? How are you getting below market energy pricing to support these massively consuming GPUs?
I joke that I talk to the cloud providers and I want to go on prem, and then I talk to on prem data centers and hardware vendors and I want to go to the cloud. It’s not really a joke. For my current enterprise use cases we are going to need both, so I’ll get both forms of suffering. I expect both forms of pain to get better with time as best practices and scoped production finally begins to line up.
I think it’s important to keep in mind that the AI hardware space is really just getting started, which leads to so much of the pain. To oversimplify, the Hopper series that most people are still on is basically just a graphics card for AI. The Blackwell series is starting to be adapted specifically for the nuances of AI and has much better performance, as hard as it is to get right now. By the time Vera Rubin really hits mass market in a couple of years, we’ll be seeing levels of inference that make what we have now look like single digit percentages.
Inside NVIDIA DGX AI Factory: Accelerating Networking for AI Across Cloud, Core, and Edge [S81856]
This talk was primarily centered around the new BlueField 4 network architecture, now properly moving from HDN to SDN (hardware-defined to software-defined networking). This is just being announced, but there are some significant changes here. The fundamental offer of SDN with this new structure is to allow for multi-tenancy. I’m not currently aware of anyone who tries to offer true GPU multi-tenancy with the current network offerings. You need that fully compliant and isolated SDN to allow, say, secure containers to run on the same GPU without the ability to interfere with each other.
Provided that rolls out as they expect, it will be quite a sea change for GPU offerings, possibly enabling more of a virtualization-style GPU share, as you’d expect from a KVM CPU/memory setup.
BlueField also enables DOCA MemOS for KV caches.
Achieve Truly Serverless GPUs With libfuse, CRIU, and CUDA-Checkpoint [S81424]
Another great talk with direct relevance to me. I got to meet fellow Kubernetes hater and Modal technical staff Charles Frye.
Modal is a company that worked to reinvent the wheel for a lot of modern execution concepts, centered around extremely low latency execution and efficient use of resources. These techniques are particularly useful in the AI era. Kubernetes is very popular for standing services, durability, and managed networking, but when you are trying to run disparate Docker images for execution rather than services, especially at extreme scale and in GPU applications, you really see the age of Kubernetes start to show.
The platform I am working on is effectively a more omnivorous take on an execution platform. Modal focuses on bringing customers to their cloud, as opposed to our approach of handling diverse high regulation (annoying) customer needs. Tradeoffs abound, but a lot of the core challenges from Modal are relevant, and they do excellent technical blogs.
This particular session was focused on drastically lowering GPU initialization times. Anyone who has used cloud GPUs knows just how long it takes to get one ready if you aren’t using optimizations. Even requesting one alone can easily take half an hour or more.
The talk was centered around four parts:
- Buffers of ready instances
- Lazy loading filesystems
- Restoring Linux processes
- Restoring CUDA context
The buffer aspect is straightforward, but easier said than done. You have the minimum amount of allocated GPUs, either on prem or persisted cloud instances. Beyond that you use a buffer of on demand + spot instances depending on the use case, ensuring that you always have enough buffer to avoid the painful spike of waiting for a new GPU to spin up. That requires a good understanding of your workloads and potential incoming spikes.
If you are stuck dealing with multiple non-standard pools, the problem compounds. Predicting smaller, more erratic workloads across non-compatible environments makes all of this harder. I expect we’ll have to solve most of it by just throwing more buffer and baseline at the problem, especially when strict regional compliance regulations further increase the difficulty of getting on-demand instances on the fly.
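As a rough illustration of the buffer math, here is a minimal sizing sketch. This is my own back-of-the-envelope model, not Modal's actual policy; all function names and numbers are invented.

```python
# Hypothetical warm-pool sizing: keep enough ready GPUs to absorb a
# forecast demand spike while cold instances spin up. Numbers invented.

def warm_pool_target(baseline: int, forecast_peak: int,
                     spinup_minutes: float, spike_rate_per_min: float) -> int:
    """GPUs to keep warm beyond the persistent baseline.

    spike_rate_per_min: how fast demand can grow, in GPUs per minute.
    spinup_minutes: how long a cold instance takes to become usable.
    """
    # Worst-case demand growth during the window where cold capacity
    # has been requested but is not yet usable.
    spike_during_spinup = int(spike_rate_per_min * spinup_minutes)
    # Never hold more headroom than the forecast peak would require.
    return min(spike_during_spinup, max(forecast_peak - baseline, 0))

def split_spot_on_demand(buffer_size: int, spot_fraction: float) -> tuple[int, int]:
    """Split the warm buffer between cheaper spot and reliable on-demand."""
    spot = int(buffer_size * spot_fraction)
    return spot, buffer_size - spot

if __name__ == "__main__":
    buffer = warm_pool_target(baseline=8, forecast_peak=40,
                              spinup_minutes=20, spike_rate_per_min=1.0)
    spot, on_demand = split_spot_on_demand(buffer, spot_fraction=0.5)
    print(buffer, spot, on_demand)  # 20 warm GPUs: 10 spot, 10 on-demand
```

The real version of this is mostly in the inputs: getting `spike_rate_per_min` right requires the workload understanding mentioned above, and it is per-pool once regional compliance fragments your capacity.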
Lazy loading of file systems is something Modal has put a lot of time into in order to speed up startup times. They use a custom FUSE filesystem with predictive capabilities for workloads and a maintained cache, using AZ cache for particularly large items like model weights. The key is avoiding CDNs/blob storage if latencies are to be kept low.
There’s a lot to think about there from my side. SOCI helps on the platform to do lazy loading of the images themselves, giving a helpful startup boost. Definitely worth thinking about the best way to acquire large assets to the container on startup.
Restoring Linux processes and CUDA context is the final performance step and centers around gVisor. Primarily built for Kubernetes, gVisor is a great container isolation system that is already a state machine. This lets us get snapshotting of memory states built in. It can even capture CUDA checkpointing. Helpful for vLLM as well. This involves directly adjusting client code deployment to make sure that the right things are being snapshotted if needed, and large items like KV caches are omitted. Modal has some further plans on model weight caching that are likely to be useful.
To link back to the prior talk, they don’t do any shared GPU usage between customers, so TBD if BlueField enables this for them. Seems like a highly valuable offering, but I imagine the pioneer on that is going to have to really sand some rough edges.
Some referenced blogs for further reading:
- modal.com/blog/gpu-utilization-guide
- modal.com/blog/gpu-health
- modal.com/blog/reducto-case-study
We talked afterwards on how Modal manages their caches for minimal latency. Using local caches that are synced in the background to a central store, allowing for distribution while still having speed. Caches being non-guaranteed as the worst outcome is a slower startup. (Informally and allegedly.) I have something somewhat similar in the platform currently, but definitely need to make it more elegant.
Thanks again to Charles and Modal. It’s always good to hear the latest from people trying to get away from the Kubernetes shackles.
Accelerate Cloud Platforms for the Next Era of AI [S81788]
Oracle Cloud was there. They gave a talk.
How We Scaled Kimi K2.5 [S81695]
This talk was by Moonshot AI’s founder Zhilin Yang. I’m hard to impress, and it was a very impressive talk.
The core sales pitch of Kimi came from his statement “democratizing intelligence with open models”: making intelligence available to everyone, with weights that can be used for whatever anyone wants. It’s a shame that America doesn’t have some concept of a company that was open about AI.
One of the main things Zhilin emphasized was that more efficiency means more intelligence. When everything is hardware constrained and scaling laws apply, the ability to get more out of the hardware you are stuck with has direct implications on the level of intelligence you get.
Challenges at scale led to Kimi releasing a number of new open solutions. OK-Clip, Kimi Linear, Delta Attention. He talked a lot about prior art and the importance of contributing to the next generation of improvements. There was a very academic aspect to it, which interestingly I didn’t even get from any of the talks by academics at the conference.
There was a segment on the mathematical improvements that it took to get the improvements for Kimi K2.5. It involved an elaborate matrix inversion that had to be mathematically equivalent. I’m not sure it was well received by the large auditorium he was speaking at, but it was very interesting.
Another highlight was the early integration with visual that Kimi used. Generally integration with visual training has a negative impact on language learning. By introducing visual at the very start and having special language like training for it, they found that it actually improved both language and vision. This points to a synthesis of operating in the world where helping the model to more fully encounter the world leads to increased intelligence. I talked about something similar in my temporarily defunct Emacs project. Effectively, the more you let the model be fully engaged with what it is doing, the better results you get, so it was interesting to see very large scale validation of the concept.
Zhilin gave a final note on academic papers, especially the state of modern academia. He called out a lack of rigor, lack of verifiable results and replication, and pointed out that this is the reason that new progress is so slow. Definitely a direct hit on American academia, and thought provoking on the America/China competition and collaboration in the AI race.
vLLM in 2026: Architectural Challenges and Performance Optimizations [S82059]
The co-creator of vLLM, Woosuk Kwon, gave a talk on what’s next for vLLM. Rather deep in the weeds, but the short version is lots of good stuff: a V2 of the core model runner that effectively solves some of their scale issues, NIXL for KV cache usage, and a vLLM router that is an alternative to Nvidia’s Dynamo. This will give better support for the long-context, multi-turn workloads needed to support agents.
They also plan to transition the vLLM codebase to one that is agent ready. This is a topic particularly interesting to me, so I’ll be observing their refactor. They want to put together clear and verifiable system contracts. The core goal of this is to allow the complexity of model support and optimizations to be handled automatically for new models. Having done a stubbed out custom model integrated with vLLM internally, there’s really a surprising amount of work that goes into it, even without doing proper optimizations for an enterprise deployment. If they can properly refactor to allow for full agentic management of these processes, that’s a major progress milestone.
General Networking
I’m somewhat old fashioned and try to talk to some people instead of staring at my phone for the entire conference. I gave out some personal business cards with a picture of a condor on the back. It’s a very limited run so enjoy them if you got them, I definitely need to improve the design a lot before the next print. I’d also note, don’t print black business cards. You probably want to write on them.
A lot of the people I talked to were companies involved with AI, but also doing their own AI enterprise initiatives. There’s a recursive element to that. You need to deliver your specific AI value prop to customers on top of your old business, and you need to make your business AI enabled to better serve that value. The key difference is that adding features to their business offering is a core competency, but understanding how knowledge stores and collaboration have to change is not. It accounts for the frequently seen phenomenon of the AI enabled company that has a new button on their product, and yet somehow works the same if not worse behind the scenes.
It made me particularly glad to be part of a true AI startup when I heard about some of the enterprise restrictions. A lot of companies are built on unstructured siloed data. We’re familiar with this, and have set up document pipelines with RAG, layered agents to solve problems, etc. One group I talked to was in the process of solving their silos by centralizing their data and then letting Copilot rummage through it. Copilot mandated from on high of course. That describes the situation at a lot of companies and it’s unfortunate because it likely means a very bad initial AI experience, setting the company back significantly. Don’t let the sales team see I only had a personal card to give them or they’ll shoot me at the next onsite.
Building, Measuring, and Using AI Scientists [S81694]
A talk by Andrew White from Edison Scientific. Very interesting.
The opening idea of the talk is that scientific research offers an inexhaustible sink for resources. He used the comparison of cab drivers. If you automate cab driving, you have a finite usage of cabs, even accounting for second order effects.
Edison Scientific is working on an AI Scientist. Essentially an agent that has a world model to allow for hypothesis generation and experimentation. That agent can then use predictive models, APIs, and assign tasks to labs. They recently came out with LabBench2 as what he thinks is the final scientist evaluator. Essentially a stack of complex questions to verify if the agent can be as good or better than a human scientist.
Their primary model is Kosmos. It makes use of iteration and consensus sampling to guide the model through the proper scientific process. A Kosmos run involves 120 sandbox environments and ~3k papers, runs for ~24-48 hours, and produces about 4TB of data. It is not a trivial process to fully interact with that much data.
They have to take some interesting steps to handle paper reliability. Essentially part of what makes the run so difficult is that they can’t treat academic papers as reliable. They distill papers into a bullet point set of claims, and then research if any of those claims have been disputed elsewhere. Given the state of modern academia you can see how this would be quite painful.
I asked a question of my own on overfitting and p-hacking. The overfitting is combated by the consensus modeling and multi-response approaches, allowing the worst of overfitting to be avoided. P-hacking was a tougher subject, effectively being a known hard to avoid problem.
If you are less familiar with academic research, p-hacking (p-value hacking) is when unscrupulous labs (of which there are many) neither release their full data nor preregister their experiments. Because of that they have access to a much larger set of data, which they can then selectively curate to present a seemingly statistically significant finding that is not supported by the larger data.
p<0.05 effectively claims there is a less than 5% chance of seeing a result at least this extreme if it were purely statistical variance, and so the result should be considered scientifically credible. You can see the issue with what happens at massive data scale: all of a sudden these 5% chances are going to happen all the time and completely muddy the waters. Fortunately plenty of science is less sensitive to variance, but there are many profitable areas such as drug effectiveness and other clinical trials that rely heavily on fine margins of statistical significance.
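To make the scale problem concrete, here is a quick simulation sketch of my own (not from the talk): run many experiments where nothing real is happening, and watch how often p<0.05 shows up anyway.

```python
# Simulate many null experiments (a fair coin, so there is no real effect)
# and count how often we would still declare p < 0.05 "significance".
import math
import random

def p_value_fair_coin(heads: int, n: int) -> float:
    """Two-sided p-value for observing `heads` out of `n` fair flips,
    using the normal approximation to the binomial."""
    z = abs(heads - n / 2) / math.sqrt(n / 4)
    # Phi(z) via erf; two-sided p = 2 * (1 - Phi(|z|))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def false_positive_rate(experiments: int, flips: int, seed: int = 0) -> float:
    """Fraction of pure-noise experiments that hit p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if p_value_fair_coin(heads, flips) < 0.05:
            hits += 1
    return hits / experiments

if __name__ == "__main__":
    # With no real effect at all, roughly 1 in 20 experiments "succeeds".
    print(false_positive_rate(experiments=2000, flips=1000))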
Hopefully Edison Scientific can solve those issues. I think it’s particularly challenging for them given the current state of academia and research. There’s just a lot of unethical and unreliable behavior going on, and building a model to navigate that is much more than just a technical problem. It speaks to the same issues Zhilin Yang talked about, where the next steps are being held back by prior poor performance. Andrew seemed to think that AI would be the new factor that let us break through the diminishing returns of prior research, but time will tell.
An AI-Driven Autonomous Lab of the Future for Chemistry and Materials Science [S81790]
This was a panel talk, and I do not like panel talks as a general rule. Having 5 people take up a 30 minute time slot mandates the minimum possible depth. That being said, there was still some good material here. Focused primarily on the lab of the future, as in the prior talk, these new AI scientists need to be able to run experiments. The problem that they find is that people don’t work well for AI. Observation data is often stuck in notebooks, and communication can be incomplete. Narrow focus is often a particular issue when the AI wants as much broad data as possible.
Enter the concept of the autonomous lab. Essentially Lab as Code. Think Terraform or other Infrastructure as Code, but for lab structure and creation. An interesting point that was made is that once the labs are fully autonomous, human scale is no longer required. Most wet labs now are set up to work around human hands. Specific sized test tubes, research spaces, and so on. There’s no reason if it’s entirely code defined that it can’t be miniaturized, or massively expanded. I suspect this is easier said than done as the experiments begin to encounter micro and macro transport phenomena, but still interesting.
One of the more vocal panelists was the head of Radical AI, detailing the need for such labs, and the need to treat failures as the most important data points for model refining. They have open sourced some of their research models, as has Orbital Materials with Orb.
One of the novel applications this kind of research allows is the possibility for concurrent engineering. A space that our company is working towards as well. Right now companies rely on pregenerated materials to engineer around and solve problems. If you can produce novel materials fast enough, that allows you to partner material design and sourcing alongside the standard engineering problem.
The main takeaway here is the importance of closed loop systems. Humans needing to get away from being on the critical loop and into the observational stage. That doesn’t mean humans aren’t still essential to the process and operations, but that AI systems need to be able to recursively iterate without them.
The State of Open Source AI [S81791]
Another panel. Not much to note here. There’s a technical report on Nemotron 3, which is well worth reviewing for an end to end model build including data. The report linked here.
AI2 also released some data sets. Interestingly, they found simulated data to be more useful than real-world data for 3D applications. This makes sense to me. 3D sims are accurate enough and allow for precise instrumentation and observation, as opposed to awkward, limited observations of real-world applications. Not to mention much cheaper.
Hugging Face released some skills that allow for models to understand how to fine tune other models.
A New Paradigm: Verifiable AI [S81489]
Not much to say on this one. Very buzzword heavy. Expect some form of AI verification to roll out. Giving certs to agents that attest their hardware, network, capabilities, etc seems reasonable. Will it be in the specific methodology proposed in the talk? We’ll see.
It was similar with the Sovereign AI half of the talk: lots of navigating legal and international compliance. Strict attestation allows agents to work in more data-hostile environments.
Optimize KV Caches for LLM Inference: Dynamo KVBM, FlexKV, LMCache [S82033]
Supposedly the longest talk in GTC history. 90 minutes of KV cache deep dives.
KV caching is a rapidly developing field critical to effective inference. Caching allows for computations to be reused saving compute via memory.
Somewhat easier said than done. Here are some numbers from the talk:
- One B300 GPU can generate 92.5 TB of KV cache per day
- 10% of tokens can be reused 2.8 times
- 1% of tokens are reused 30+ times
One speaker called KV cache the new big data of AI.
Because of this, there’s a lot of interest in both academia and industry to determine the best way to alleviate the weight of these cache issues. vLLM defaults to PagedAttention and SGLang has RadixAttention. Nvidia’s KVBM with NIXL is also an option here. I’d note that vLLM is now making use of NIXL in their latest release, so there’s some dating here.
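A toy sketch of the core idea these systems share, block-level prefix caching: KV blocks are keyed by the token prefix leading up to them, so requests that share a prompt prefix (a system prompt, earlier conversation turns) reuse the same cached blocks. The block size and hashing here are heavily simplified and not any particular engine's real implementation.

```python
# Toy block-level prefix cache. KV blocks are keyed by a hash of the
# full token prefix up to that block, so identical prompt prefixes
# across requests map to the same cached blocks. Greatly simplified
# relative to real systems like PagedAttention or RadixAttention.
BLOCK = 4  # tokens per KV block (real systems use e.g. 16)

class PrefixKVCache:
    def __init__(self):
        self.blocks = {}   # prefix hash -> simulated KV block id
        self.hits = 0
        self.misses = 0

    def lookup_or_insert(self, tokens: list[int]) -> list[int]:
        """Return block ids for a prompt, reusing any block whose
        entire token prefix has been seen before."""
        ids = []
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = hash(tuple(tokens[:end]))  # keyed by the whole prefix
            if key in self.blocks:
                self.hits += 1                     # KV reused, no compute
            else:
                self.misses += 1                   # must run prefill
                self.blocks[key] = len(self.blocks)  # "allocate" a block
            ids.append(self.blocks[key])
        return ids

if __name__ == "__main__":
    cache = PrefixKVCache()
    system_prompt = list(range(8))  # an 8-token prefix shared by requests
    cache.lookup_or_insert(system_prompt + [100, 101, 102, 103])
    cache.lookup_or_insert(system_prompt + [200, 201, 202, 203])
    # The second request reuses both system-prompt blocks and only
    # recomputes its final, unique block.
    print(cache.hits, cache.misses)  # 2 4
```

The talk's numbers (10% of tokens reused ~3x, 1% reused 30+ times) are what make this worthwhile: a small, hot set of prefixes absorbs a disproportionate share of prefill compute, which is also why spilling those blocks to CPU memory and SSD tiers is such an active area.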
Some tools that came up as relevant:
- AIPerf for benchmarking and mimicking KV cache usage
- Mooncake Trace to determine cache hit rates
- Deployed FlexKV instances, compatible with vLLM
- NitroFS for improved SSD performance with cache
- CacheGen for compression of KV cache during transfer/distribution
MLOps 202: From Models to Production AI Systems at Scale [S81662]
This talk was about two things. First, the new Vera Rubin platform is very good, and no, you may not have one. The other aspect was about how Nvidia regards Kubernetes as the operating system of AI. Now you might ask, as the number one Kubernetes hater, do I have an issue with that? For the record, I’m not the number one, I’m top 10% at best. Additionally, I think it’s a much more appropriate use case in this scenario than you frequently see.
Networked GPU clusters need a lot of support. A lot. Many moving parts need to be consistently versioned, deployed, and verified. They also don’t need to worry about spreading across networks, or about resource overhead, considering the insignificant dent Kubernetes would make on a GPU pod.
The essential idea is, as they say, cattle not pets. Having disposable Kubernetes installs that keep complex service manifests in line is great. You can layer alternate orchestration systems on top where needed.
Nvidia offers a few new services. Dynamo is now a stable 1.0 API. There’s a GPU Operator, Network Operator, DOCA Platform Framework DPU Operator, and a NIM Operator.
This new platform along with the BlueField improvements allows for further GPU network share. Making use of GPU Memory Swap and Model Streamer to allow for better shared usage of GPUs.
Science and Engineering With AI Physics and Kit-CAE [S81781]
Not too much to add on this one. Just a view of some of the different physics applications Nvidia is currently supporting.
You can see it as a flow. Nvidia CUDA -> Simulation with HPC SDK -> PhysicsNeMo -> Digital Twins Omniverse with Kit-CAE and OpenUSD -> AI supported design efforts.
Also including tools like Nvidia Warp for computational physics.
Seems like a decent starting point, although it’s heavily focused on wind tunnel/basic flow applications right now as far as I can tell. There’s more on build.nvidia.com
https://build.nvidia.com/nvidia/digital-twins-for-fluid-simulation
Conclusion
Due to some health issues I haven’t made it out to a conference in a very long time. It was nice to get back out there, as much as I don’t like leaving the house these days.
It probably says something about how many hats I’m wearing at my current job that so many of these talks were directly relevant to my work. Lots of food for thought on improvements to make and possible platform additions, plenty to prototype and scope in the physical AI space, as well as getting prepared for further data center buildouts with extensive validation and scale concerns.
Thank you to everyone I met, especially those who suffered through my many questions, and if you made it this far, thank you for reading.