Senior Software Engineer, DGX Cloud AI Infrastructure

Austin

Friday, 05 June 2026

Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
Own root-cause analysis of complex failures (hangs, performance regressions, topology sensitivity) in large distributed environments.
Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows for new platforms.
Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

What we need to see:
Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
8 years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
Expertise debugging and triaging AI applications across the full stack, from the application layer down to the hardware.
Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
Proven track record of architecting, debugging, and scaling large-scale distributed systems.
Expert-level Python and C/C++ programming skills.
Experience operating workloads in scheduled, containerized cluster environments.
Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

Ways to stand out from the crowd:
Demonstrated experience debugging and optimizing AI workloads at large scale.
Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric).
Strong knowledge of GPU cluster fabrics and topology, including NVLink, NVSwitch, PCIe, RoCE, and InfiniBand.
Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms.
Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you’re creative, autonomous, and love a challenge, we want to hear from you.
