Company: Nerdleveltech
Location: Irvine
Closing Date: 29/06/2026
Hours: Full Time
Type: Permanent
Job Description
Note: This is a remote position open to candidates in the USA.
CyberCoders is creating the next generation of AI-optimized data center infrastructure. The Staff AI Support Operations Engineer will lead the Ops team, architecting and deploying AI compute clusters while providing expert support and building operational standards.
Responsibilities
- Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments
- Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained
- Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes
- Serve as the highest technical escalation point for customer and internal issues before they are handed to the Platform or Network/Undercloud teams
- Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard"
- Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales
Skills
- Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and Supermicro platforms, including IPMI, BMC, and iDRAC workflows, plus familiarity with Redfish-based management
- Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management
- Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level
- Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures
- Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments
- Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures
- High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data
- Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure
Benefits
- Bonus
- RSUs