[Remote] AI Support Operations Engineer

Company: Nerdleveltech

Location: Irvine

Closing Date: 29/06/2026

Hours: Full Time

Type: Permanent

Job Description

Note: This is a remote position open to candidates in the USA. CyberCoders is creating the next generation of AI-optimized data center infrastructure. The Staff AI Support Operations Engineer will lead the Ops team, architecting and deploying AI compute clusters while providing expert support and building operational standards.

Responsibilities

Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments
Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained
Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes
Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams
Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard"
Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales

Skills

Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and Supermicro platforms, including IPMI, BMC, and iDRAC workflows, and familiarity with Redfish-based management
Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management
Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level
Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures
Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments
Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures
High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data
Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure