At the AI Infra Summit, it’s hard to miss the grand swathe of hardware accelerators coming to the market. It’s an interesting mix of established players and new entrants, almost solely focused on the compute engine at the heart of machine learning. There are a few companies looking into solving the networking story at datacenter and enterprise scale, and fewer still considering the control node of the modern AI server – the CPU.
At the show, as part of a pair of interviews, I sat down with Moshe Tanach, CEO of a startup called NeuReality. Their first generation NR1 product was designed as a head-node CPU replacement, but instead of focusing purely on CPU duties, the chip has accelerators designed to improve the dataflow directly into the AI accelerators. As you’ll read in my discussion with Moshe, the orchestration of data into these systems is often an afterthought, and traditional off-the-shelf CPU solutions are simply not optimized for it. The NR1 is designed to be the first stage in solving that issue.
At the show, NeuReality also announced its roadmap – a chiplet-based platform called NR2, the first part of which is an east-west SuperNIC that works with the system to improve utilization.
You can watch the full interview here, or read the transcript below.
Introduction and Problem Statement
Ian Cutress: So, here’s the thing about AI right now. If you’ve only looked at the headlines, you might believe that GPUs and AI accelerators are the entire story. They dominate the conversation, they drive stock markets, and they are the reason entire data centers are being reimagined. They deserve that spotlight. They’re incredibly powerful at the heavy math that sits at the center of modern AI models, and they have enabled the progress that everyone is talking about.
What does not get much attention is the fact that even the largest and most advanced hardware still spends too much time waiting, sometimes up to 70-80% of the time. It waits on a lot of different things, most notably data in the right format. The reason it waits comes down to two parts: where the data is, and how the data is orchestrated. On one side you need fast memory, and on the other you need a really good CPU orchestrating that data plane. In a server, CPUs remain essential. They are versatile, flexible, and at the core of almost every computing task you can name. If you need to run an operating system, schedule workloads, or manage complex general-purpose jobs, the CPU is unmatched.
But AI inference has evolved into a workload with very different demands. It’s a messy mix of token preparation, data shuffling, networking, and orchestration. A regular CPU can handle those jobs, but there’s a trade-off between flexibility and optimization: general-purpose CPUs were never optimized for the pace required by large-scale AI inference. That mismatch creates an odd situation. A data center may invest millions in racks of accelerators, but much of that investment goes underused because the hardware sits waiting while the hosts are busy doing management tasks. It’s like hiring the best chef in the world, but then asking them to also seat customers, clean dishes, and manage the till. The chef is still brilliant, but the kitchen never runs at full speed.
The flip side of this is the networking. Most companies focus on speed above all else: get the data to where the compute needs it to go. That’s why InfiniBand and Ethernet exist. There are two ways to approach this: dumb networking, which just moves the data, or smart networking, which helps compute on the data. Solutions for the latter are complex and often problem-dependent. Between the two, the industry’s main response so far has been to keep scaling the GPU and pushing the software to ever more parallelism. Every new generation arrives larger, faster, and more capable. That progress is real, but it does not fix the imbalance. The CPU is still managing orchestration, still tied to jobs that were never its specialty. And as the network shuttles that dumb data around, the accelerators are still waiting.
This has led some to ask whether there should be a set of chips designed specifically for those orchestration tasks. A dedicated chip that takes on the data preparation, the networking, and the flow control in order to let accelerators flourish. CPUs will continue to do what they do best and GPUs will keep pushing the boundaries of the mathematics, but inference at scale might benefit from a processor whose entire role is managing the pipeline around the workload. That is the idea behind the AI CPU, and it is the concept that NeuReality has been developing. Their chip, the NR1, was built from the ground up to serve inference. The goal is to let the accelerators run flat out on the mathematics while this new processor handles the entire orchestration around them. It’s similar to adding a conductor to an orchestra. The musicians are already talented, the instruments already world-class, but without coordination you’ll never hear the full performance. In an orchestra it’s one-to-many; the modern inference workload requires one conductor per chip. We’ve seen demonstrations with eight of these NR1 chips paired with eight Qualcomm Cloud AI 100 Ultra cards.
Now, to dig into this, we met up with Moshe Tanach, the CEO and co-founder of NeuReality, at a recent conference. I initially went into this interview expecting to talk to him about the NR1, but as the conference started, the company announced their new AI SuperNIC product, something that we weren’t expecting. For context here, Moshe spent years at Intel and Marvell, and a lot of the team at NeuReality hails from big networking players. We asked him to walk through the NeuReality concept and why they see a growing market for their hardware.
The Problem of GPU Latency and Communication
Moshe Tanach: When we founded the company, we were looking at the data center as something that was going to transform. GPUs are becoming stronger and stronger, XPUs do a better job in some cases, and they need a lot of data. They need a lot of data because they are very powerful and can run a lot of request processing, but also because when we reach larger problems like huge LLMs – DeepSeek, ChatGPT and the like – while you train or while you run inference there’s a lot of east-west communication between those GPUs.
So the first thing we went after is this: if you look at the server, it has the CPU and the front-end NIC defined as the “head node” that manages the server node. Then you have the GPUs that are doing most of the processing, and then you have the scale-out NICs. So we went after the AI head node, and we kind of fused the two functions together.
Ian Cutress: I remember the first time we met, it was Supercomputing 2022/2023, and you had some of the demo systems. I said, “Where’s your CPU?” You were partnered with, I think, AMD and Qualcomm and IBM at the time showing off some things, and you were saying, “You don’t need one, because we’ve got this thing that does it for you”.
Moshe Tanach: Exactly. Well, actually we have embedded CPUs. The first generation had the Neoverse N1 from Arm running. You have to give the same look and feel to the developers and to the IT guys. So we have an Arm server on a chip.
Ian Cutress: So you’re an Arm merchant silicon provider?
Moshe Tanach: I am, I am!
So the first problem we went after is how you couple the GPU with something that can bring data very efficiently, but also process it and prepare it for the GPU. Many times when you enable multiple engines inside a GPU, they interfere with each other: you enable the video decoder, and your neural net processing drops. So we built this AI head node that does the data fetch and the data processing. It can reach the storage, it can get data from clients. And in the last few days we announced the second generation, which addresses the east-west problem for large training pods. Hyperscalers need it; other big semiconductor companies that don’t have networking need it. So it’s a very important thing. I can share with you some of the research we’ve done with DeepSeek. You know they released their article about how they trained and how they did this load balancing that you mentioned.
What they did was very smart: they split the work and did it on a tile basis – how do you transform some of the data while you compute, so you don’t lose GPU active time? They’ve done amazing work, but the only reason they succeeded is that they used a downgraded GPU – the H800.
Ian Cutress: If you fit somebody into a box, they’ll try and innovate out of the box.
Moshe Tanach: If they had used the H200, which is much more powerful and which they don’t have access to – let alone Blackwell and Rubin – then the 100% active time they achieved would drop to around 83% active time with the H200. If you go to Rubin, it drops to less than 20%. So you need a much more powerful scale-out NIC to allow them to shorten the transfer time between GPUs. And when you get to 800G or 1.6T, latency becomes much more of an issue, because it consumes a larger share of the transfer time.
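[Editor’s note: a rough back-of-the-envelope sketch of that last point, assuming an illustrative message size and a fixed ~2 microsecond network latency. As the link gets faster, the time spent actually pushing bytes shrinks, so the fixed latency eats a larger share of each transfer.]

```python
# Illustrative only: how a fixed network latency grows as a share of each
# transfer when the link speed increases. Message size and latency are assumed.

def transfer_time_us(message_bytes: float, link_gbps: float) -> float:
    """Serialization time for message_bytes over an ideal link, in microseconds."""
    return (message_bytes * 8) / (link_gbps * 1e3)  # 1 Gb/s = 1e3 bits per microsecond

MESSAGE_BYTES = 8 * 1024 * 1024   # assume an 8 MB slice of activations/gradients
LATENCY_US = 2.0                  # assume ~2 us of fixed end-to-end network latency

for gbps in (400, 800, 1600):
    t = transfer_time_us(MESSAGE_BYTES, gbps)
    share = LATENCY_US / (t + LATENCY_US) * 100
    print(f"{gbps:>5} Gb/s: transfer {t:6.1f} us, fixed latency = {share:4.1f}% of total")
```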
Ian Cutress: Is that because we’re also sharding the workload differently as we scale up the networking?
Moshe Tanach: Yeah. “Time to first token” became a very important metric. Agentic AI will continue to grow, and you will have agents using LLMs 50 or 100 times for every request, so latency is king. So we’re splitting the prefill between 64 GPUs today just to shorten the time to first token. NVIDIA even announced that they have a new GPU for it. It’s amazing – it’s GDDR, because the prefill is compute-bound while the decode is memory-bound.
Ian Cutress: But what you’re saying is that even though it’s a compute-bound workload, if you have to split it across many compute chips, you’re still network-bound at the end of the day.
Moshe Tanach: Yeah. Because the network is becoming the connectivity – it’s just at data-center scale instead of on one board.
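[Editor’s note: a deliberately simplified sketch of the compute-bound versus memory-bound distinction Moshe refers to. The model size, precision, and prompt length below are assumed for illustration, and attention/KV-cache traffic is ignored; the point is only the ratio of math to weight bytes moved.]

```python
# Why prefill tends to be compute-bound and decode memory-bound (simplified).

PARAMS = 70e9           # assume a ~70B-parameter model
BYTES_PER_PARAM = 2     # fp16/bf16 weights
PROMPT_TOKENS = 4096    # assume a 4k-token prompt

weight_bytes = PARAMS * BYTES_PER_PARAM
flops_per_token = 2 * PARAMS   # ~2 FLOPs per parameter per token (matmul rule of thumb)

# Prefill: every prompt token passes through the weights in one batch, so each
# byte of weights streamed from memory feeds thousands of tokens' worth of math.
prefill_intensity = (flops_per_token * PROMPT_TOKENS) / weight_bytes

# Decode: one new token per step, but the full weights still have to be streamed.
decode_intensity = flops_per_token / weight_bytes

print(f"prefill: ~{prefill_intensity:,.0f} FLOPs per weight byte (compute-bound)")
print(f"decode:  ~{decode_intensity:,.0f} FLOP per weight byte (memory-bound)")
```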
Ian Cutress: I do wonder why Nvidia didn’t add NVLink to that chip then.
Moshe Tanach: I think they are looking at it as another chip in a server that has Blackwell or Rubin with NVLink. It’s like a companion chip that will do a very efficient prefill, but still, if you want to split it between multiple of these, you will go through NVLink or InfiniBand, or in our case UEC, Ultra Ethernet.
Ian Cutress: I mean not everybody’s going to buy an Nvidia system at the end of the day.
Moshe Tanach: And not everybody can.
Ian Cutress: You need to be able to supply the market who either can’t afford or can’t wait for that hardware. We’re hearing that it takes 52 weeks to get an Nvidia GPU if you order one.
Moshe Tanach: You see a lot of investment in the hyperscalers doing their own XPUs. Google has a very successful one in the seventh generation already. All of them need networking solutions.
Ian Cutress: So, what are your customers asking from you?
Moshe Tanach: So in the first generation it was mostly enterprise and neocloud CSPs that wanted to squeeze more out of smaller models. In this generation we’re working closely with the hyperscalers, and they’re asking for the highest performance. If you read the announcement, it’s going to be 1.6 Terabits per second, while others are developing 800 gigabit per second, because they just need more. They’re not happy with the amount of scale-out networking they get, and latency is everything. So we took the AI-NIC we had in the first generation as an embedded NIC, boosted it up to 1.6T, and really redesigned the front-end packet processing and the transport for Ultra Ethernet for the theoretical limit of latency. We want to reach two microseconds end-to-end, which means 500 nanoseconds for the NIC itself.
Ian Cutress: That works not only for language models but also for things like recommendation engines, where every millisecond and every percentage of accuracy really matters. So describe how the conversations evolved from the do-everything NR1 into the networking-focused NR2.
Moshe Tanach: When you get the customers excited, they start to share with you what they really need. NR1 was conceived based on our data center experience; NR2 is actually what customers are asking for. A lot of it is around the balance between flexibility – like a DPU gives you, where you can program any transport, any acceleration, but it costs you latency because you go through the CPU or other accelerators – and low latency. So we’re building multiple transport layers into the NIC. The main one, the Ultra Ethernet one, is custom, very efficient, very low latency. Beside that, there’s a flexible transport engine for the extras, where you are willing to pay more latency, but you need more sophistication, more programmability, and more future-proof evolution.
Ian Cutress: Is that things like security or is that things like inline compute?
Moshe Tanach: First and foremost, it’s networking – just different transports like RoCE or UEC, and TCP sometimes. The second one is in-network compute: collective acceleration, non-math, like others have promoted. But we also took the AI hypervisor that we developed for the first generation and did a small version of it for the NIC. There are also DSP capabilities and custom compute – it’s very simple mathematics.
When you collect the data, you can offload some of the compute that the GPU does today. The impact is not to free up the GPU; the impact is to lower the amount of data that you need to bring, because if I need to collect thousands of small tensors and compute on them, it’s a lot of traffic. If I can compute on them first, I reduce the amount of output before I transfer, so I shorten the transfer time, and at the system level I improve efficiency.
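[Editor’s note: a minimal sketch of the reduce-before-transfer idea described above. The tensor counts and sizes are assumed for illustration; the code only models the data volume, not any real fabric or NIC.]

```python
import numpy as np

N_TENSORS = 4096        # assume thousands of small partial results to collect
TENSOR_ELEMS = 1024     # assume 1k fp16 elements per tensor
BYTES_PER_ELEM = 2

partials = [np.ones(TENSOR_ELEMS, dtype=np.float16) for _ in range(N_TENSORS)]

# Naive path: ship every partial tensor across the fabric, reduce at the destination.
bytes_naive = N_TENSORS * TENSOR_ELEMS * BYTES_PER_ELEM

# In-network-compute path: reduce near the source first, then ship one tensor.
reduced = np.sum(np.stack(partials), axis=0)
bytes_reduced = reduced.size * BYTES_PER_ELEM

print(f"send everything, reduce later: {bytes_naive / 1e6:7.2f} MB on the wire")
print(f"reduce first, then send:       {bytes_reduced / 1e6:7.4f} MB on the wire "
      f"({bytes_naive // bytes_reduced}x less)")
```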
Ian Cutress: So speak to me about this AI hypervisor. It seems like an interesting concept.
Moshe Tanach: The way most of the industry works today is that if you’re using a PyTorch runtime, or Python, or vLLM, or anything like that, the sequencing between the different steps of compute – the control flow – is usually managed by the CPU. The AI hypervisor allows you to understand this data flow, and the control flow attached to it, and offload it to a hypervisor. What is it? It’s a set of queue management capabilities, scheduling in hardware, and dispatching. And now the data movement and the programming of the different engines – whether it’s the GPU, the DSP engines, or the in-network compute in the NIC – are offloaded to a piece of hardware that can parallelize everything.
Ian Cutress: So it’s a hypervisor that allows you to do some workload detection and management, and then intercept where needed, dispatching to the engines you have?
Moshe Tanach: And you natively expose it into Python with libraries, or, if you’re willing to invest the time, there’s an offline pre-compiled flow where you can just compile your use case and create the artifact for the hypervisor to run the sequence.
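[Editor’s note: to make the concept concrete, here is a toy, self-contained model of what Moshe describes – the host describes the pipeline once, and a scheduler walks the queues and dispatches each step to its engine. All names, and the Python loop standing in for the hardware scheduler, are purely illustrative; this is not NeuReality’s actual SDK or API.]

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    engine: str                      # "nic", "dsp", "gpu", ...
    run: Callable[[bytes], bytes]    # stand-in for programming a real engine

def tokenize(x: bytes) -> bytes: return x.upper()       # placeholder pre-processing
def forward(x: bytes) -> bytes: return x[::-1]          # placeholder accelerator work
def post(x: bytes) -> bytes: return b"result:" + x      # placeholder post-processing

# Compiled once, ahead of time - the "artifact" in the offline flow.
PIPELINE: List[Step] = [
    Step("fetch+tokenize", "nic/dsp", tokenize),
    Step("model forward",  "gpu",     forward),
    Step("postprocess",    "dsp",     post),
]

def hypervisor_dispatch(request: bytes) -> bytes:
    """Walk the precompiled sequence; in hardware each step would be enqueued to
    its engine without the host CPU touching the control flow per request."""
    data = request
    for step in PIPELINE:
        data = step.run(data)
    return data

if __name__ == "__main__":
    print(hypervisor_dispatch(b"hello"))   # b'result:OLLEH'
```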
Ian Cutress: How easy is it to just pick up and play? We’re here at the AI Infrastructure Summit, and with the hundreds of companies in front of us, a lot of them speak about performance, latency, uptime, and tokens per second. But it’s not always easy to use.
Moshe Tanach: First of all, I have to admit that in a sense, their work is much harder. The ones that are trying to do another XPU are coming from behind, and they need to compete with the amazing software stack that NVIDIA gives you. We don’t have that problem. We are hosting those XPUs. So in many cases we hide their deficiencies, because we optimize the use case for the customer and they get a complete server experience.
But in the second generation, the NR2 AI SuperNIC, the problem is even simpler because we’re not going after the complete problem. We’re going after the networking and the in-network compute problem. It’s plug-and-play. You just drop in the replacement NIC. Interfaces are very clear. You need to be very efficient in transporting the data. The in-network compute is just supporting the different CCL libraries and making sure that the library configures you, but the data path is completely offloaded to hardware.
Ian Cutress: So all those CCL libraries – these are the libraries that help all the networking talk to each other. They’re all ready to go from all the different vendors, and is that because it’s all UEC compliant?
Moshe Tanach: They can run on TCP or RoCEv2 or InfiniBand or UEC – they’re agnostic at the upper level. It’s up to the vendor to implement the support for the different protocols. It’s open-source, and a lot of people in the industry contribute to it. Our responsibility is to make sure that we’re compliant with the CCL library that the specific customer has chosen. If it’s with NVIDIA, then it will be their promoted CCL.
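[Editor’s note: a minimal sketch of why code written against a CCL is transport-agnostic. The application only calls the collective API; the CCL implementation picks the transport (TCP, RoCEv2, InfiniBand, UEC) underneath, typically via plugin or environment configuration. This uses the standard PyTorch/NCCL path and assumes the usual launcher environment; it is not specific to NeuReality’s hardware.]

```python
import torch
import torch.distributed as dist

def all_reduce_example() -> None:
    # Standard PyTorch distributed setup; assumes RANK, WORLD_SIZE, MASTER_ADDR,
    # and MASTER_PORT are provided by the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")
    device = f"cuda:{dist.get_rank() % torch.cuda.device_count()}"
    t = torch.ones(1024, device=device)
    # The same call runs whether the bytes move over TCP, RoCEv2, or UEC;
    # the NIC and transport choice live below this API.
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    dist.destroy_process_group()

if __name__ == "__main__":
    all_reduce_example()
```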
Ian Cutress: You showed off NeuReality’s roadmap at a presentation yesterday, and it’s chiplets. The new NR2 SuperNIC is the first chiplet of the family that you’re going to offer to others as well. Tell me what that allows you to do from a design and product perspective.
Moshe Tanach: A modular approach [between companies] is hard, but it’s doable in my opinion. Back in 1997 every company would develop all the IP by themselves – today it’s a big business for Cadence and Synopsys, for Arm and others. Just moving to modular will take time, because everybody’s saying “I’m supporting UCIe”, but when you get to the details, those two UCIe implementations are developed by different companies and don’t connect.
For us, as a company with the NR1, the first market we went after is actually the datacenter CPU market. Replacing the CPU with a more advanced heterogeneous compute solution. When the market asked us to do the scale-out NIC alone, we said this is another product line, this is another market segment. It’s the NIC market in the data center. Should we do a monolithic solution for each one of the product lines?
We understood that the next generation of the AI CPU for us is going to be based on Arm’s Neoverse V3 CSS, and it’s also going to need the same NIC. So first and foremost we went modular to simplify our development flow and be able to do a tick-tock cycle. You can also choose different process technologies for different tiles or different chiplets. So the first chip where we’re going modular is the AI SuperNIC, and it’s going to be a component in the second generation of the AI CPU.
It opened another thing for us. You see a lot of companies, like Rebellions for instance, that have implemented their compute tile and have an I/O die. This I/O die is the piece that scales out or scales up. So it’s another business opportunity for us: a hyperscaler can integrate the AI SuperNIC, the NR2, as a die.
We’re going to sell dies. We’re going to sell chips. It’s a new thing for us. We’re going to sell dies for integration in a package, and chips for modules like the Blackwell rack that has a 1U with a CPU, a ConnectX-8 NIC, and two GPUs – so there’s a ConnectX-8 NIC there. We’re selling chips, and we’re going to sell PCIe cards, for OCP or CEM form factors. So it’s going to be a variety of solutions for customers.
Ian Cutress: Does this mean the business has expanded? Sounds like you need another 150 people to deal with that.
Moshe Tanach: Yeah, you’re right. Are you looking to invest?
Ian Cutress: So what you’re saying is you’re going through a funding round?
Moshe Tanach: Always!
Ian Cutress: Describe a little bit about NeuReality. I know you guys are based in Israel, you’re what, six or seven years old now?
Moshe Tanach: We’re actually almost six. We started in 2020 when we raised the first money. We have a very good engineering site in Krakow, Poland, which we’re very happy with. And the business team is here (in Silicon Valley).
Ian Cutress: Whenever somebody says to me, “Ian, you track so many startups – where are the customers?”, what answer do you have for that?
Moshe Tanach: Today they’re very big enterprises with their own datacenters, and smaller cloud service providers that are looking to boost their longevity. I would say they’re selling metal as a service today, fighting on margins on top of what they buy from NVIDIA. They want to start selling more inference as a service – access to Llama or Mistral or fine-tuned models. That gives them more margin, and it gives them the ability to invest in optimizing at the infrastructure level, because it opens up more margin for them. The price of the token is determined by the market, but the more efficient you are on the infrastructure, the more earnings you can take.
Ian Cutress: If we look at, for example, the neocloud market that’s booming right now, they’re also looking for customers for their “tokens as a service”. Do you find them willing to take risks, given the nature of the competitive market they’re in?
Moshe Tanach: On the inference as a service, yes.
On the huge deals that you read about – like Nebius and CoreWeave and Lambda – it has to be NVIDIA. This is why Microsoft will make a $19 billion deal for, I don’t know, five years, and run training and inference there. They’re taking fewer risks.
But I’m very optimistic about NR2. Customers typically ask hyperscalers about which GPU they have and what the software stack is. A NIC is ‘somewhat’ an easier sell. You have a lot of gross margin there because it’s a high-end NIC, but the sales funnel and the integration are less complex. I believe that NVIDIA will continue to lead, but we’re going to see more success stories like the TPU from Google. A lot of money is invested there. At the end of the day, when you own the hardware you can do all kinds of optimizations, and hyperscalers will get there – even the next hyperscalers, the OpenAIs of the world, Anthropic and those. I think this is the right path for the company: to open this second product line and really invest in allowing all these XPUs to connect well together and achieve the best efficiency they can.
Ian Cutress: We’re here at the AI Infra Summit, and I can imagine the launch of the NR2 SuperNIC was the high priority for you. What else do you get out of an event like this?
Moshe Tanach: Well, we meet partners – we go to market with our partners; every system we sell today has a GPU from Qualcomm, or soon from another GPU vendor – and we meet customers. Events like this bring specific kinds of customers. For me, it’s another opportunity to be in the US. Usually I do a two-week tour and go visit customers across the States.
But another announcement that we’ve made, for anyone who uses our existing systems, is that we now run UEC on the NR1 [with a firmware update]. It just shows you the level of flexibility we have embedded in our AI-NIC. We designed the network engine to be able to run our AI-over-Fabric protocol – our own proprietary protocol – but also standards like RoCEv2 and TCP. So in boosting it to UEC, there’s more we can do in custom acceleration.
But we have this flexibility, like a DPU, to run anything. It shows good advantages in latency. So when we do prefill disaggregation today with NR1 and Qualcomm, it already brings value to customers.
Ian Cutress: What does the next six months look like to you?
Moshe Tanach: Hard work! We’re bringing the NR2 to market. We’re planning to have it in customers’ hands in the second half of next year (2H26), and we want to make our customers happy. A lot of them are dealing with new upcoming LLMs and agents that introduce new APIs, so we work hard on exposing new APIs that serve our customers. Customers that buy the NR1 inference system don’t have to deal with an open box and do everything themselves – we do it for them, and we expose all capabilities through development APIs, deployment APIs, and serving APIs. So that’s where engineering is focused. And the business side of things.
