Remote Senior HPC AI Cluster Engineer Job in UK NVIDIA

Senior HPC AI Cluster Engineer NVIDIA
Computer Science, Artificial Intelligence, GPFS, InfiniBand, Kernel-Based Virtual Machine, Lustre, Natural Language Processing, Python, RDMA, Slurm Workload Manager
Senior (5-8 years) - Expert (9+ years)
UK
33175+ employees
Hardware

Open for applications

Role


Who you are

  • A degree in Computer Science, Engineering, or a related field and 5+ years of experience
  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
  • Excellent knowledge of Windows and Linux networking and internals, ACLs, OS-level security protection and common protocols
  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS
  • Python programming and Bash scripting experience
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet and Chef
  • Deep knowledge of networking technologies such as InfiniBand and Ethernet
  • Deep understanding of and experience with virtualization platforms, e.g. VMware, Hyper-V, KVM or Citrix
  • Familiarity with cloud computing platforms, e.g. AWS, Azure or Google Cloud

Desirables

  • Knowledge of CPU and/or GPU architecture
  • Knowledge of Kubernetes and container-related microservice technologies
  • Experience with GPU-focused hardware/software (DGX, CUDA)
  • Background with RDMA (InfiniBand or RoCE) fabrics



What the job involves

  • Design, implement and maintain large scale HPC/AI clusters with monitoring, logging and alerting
  • Manage Linux job/workload schedulers and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
  • Deploy monitoring solutions for the servers, network and storage
  • Troubleshoot from the bottom up, across bare metal, operating system, software stack and application level
  • As a technical resource, develop, redefine and document standard methodologies to share with internal teams
  • Support Research & Development activities and engage in POCs/POVs for future improvements

Company


Company mission

NVIDIA is a pioneer in accelerated computing, reshaping the industry with GPUs and data-center-scale offerings that drive AI innovations and power the metaverse.




Company benefits

  • Equal opportunity employer
  • Diversity valued
  • Reasonable accommodation provided



Company values

  • Innovation
  • Excellence
  • Diversity
  • Collaboration



Company HQ

Santa Clara



Founders

Jensen Huang (CEO)