Company:
Bright Vision Technologies
Location: Remote
Closing Date: 19/06/2026
Hours: Full Time
Type: Permanent
Job Description:
- Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations
- Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams
- Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering
- Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near line rate
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication
- Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics
- Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale
- Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing
- Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently
- Partner with research and applied ML teams to plan capacity for upcoming training runs
- Implement security controls, isolation, and access management for multi-tenant AI infrastructure
- Drive automation across cluster provisioning, lifecycle management, and configuration enforcement
- Maintain runbooks, capacity dashboards, and operational documentation for the AI platform
- Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling
Requirements:
- Bachelor’s or Master’s degree in Computer Science or a related field
- Six or more years of experience in infrastructure, platform, or HPC engineering
- Hands-on experience operating GPU clusters or large-scale ML training infrastructure
- Strong proficiency in Python and at least one systems language such as Go or C++
- Deep understanding of distributed training, accelerator architectures, and collective communication
- Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads
- Strong understanding of Linux internals, networking, and high-performance storage
- Experience with at least one major cloud provider’s ML infrastructure offerings
- Strong software engineering practices including testing, CI/CD, and code review
- Excellent communication and cross-functional collaboration skills
Benefits:
- Comprehensive benefits
- Competitive compensation packages
- Supportive work-life balance