Software Platform Support Engineer - GPU Cloud

Santa Clara

Tuesday, 21 April 2026

The NVIDIA DGX Cloud organization is looking for passionate software support engineers to partner closely with our internal customers and support them on our internal platforms. This partnership requires you to gain a deep understanding of customer needs and how their applications work, assist them in troubleshooting issues, and create documentation that makes it easier for users to troubleshoot issues themselves in an ambiguous, fast-moving environment. The support you provide will help our users have a better experience and help shape our platform. We expect you to have knowledge of supporting cloud-based deployments across compute, storage, and networking environments.

What will you be doing:

- Partner with multiple internal teams to provide Tier 1 support for complex cloud platforms.
- Define and improve operational workflows (runbooks, escalation paths, support processes).
- Triage and investigate the root cause of customer issues, and escalate as needed.
- File bugs and report issues while working closely with the Site Reliability team.
- Build tooling to improve the customer support process and visibility.
- Deeply understand user workloads and use cases.
- Partner with multiple internal teams to give feedback to engineering teams and develop solutions to aid in their success.
- Be part of an on-call rotation to support production systems.

What we need to see:

- BS/MS degree in Computer Science or related areas (or equivalent experience).
- 2 years of experience supporting distributed software systems and end-user software platforms, and experience with Linux.
- Experience with Kubernetes, AWS, Azure, OCI, and GCP.
- Background in infrastructure, networking, storage, and DevOps scripting/tooling.
- Understanding of data storage technologies (databases, file, block, blob).
- Customer service/support experience.
- Willingness to work up and down the stack as well as across multiple teams.
- Strong troubleshooting and communication skills.

Ways to stand out from the crowd:

- Experience with MLOps workflows or ML infrastructure.
- Familiarity with GPU workloads or distributed training systems.
- Previous SLURM or HPC experience.
- Strong drive to work with internal customers and make them successful.
- A drive to improve processes, with strong organizational skills.
