Overview
Remote
Full Time
Skills
High Performance Computing
Reliability Engineering
Incident Management
HTC
Collaboration
Research
Dashboard
IT Operations
Terraform
HPC
Linux Administration
TCP/IP
Computer Networking
Storage
Computer Cluster Management
LSF
GRID
Configuration Management
Ansible
Puppet
Agile
DevOps
Grafana
Scripting
Bash
OpenStack
Data Analysis
Management
Docker
Kubernetes
Cloud Computing
NOMAD
Programming Languages
Java
C++
Python
Ruby
Perl
File Systems
Weka
IBM GPFS
Leadership Development
Soft Skills
Google Cloud Platform
Google Cloud
Microsoft Azure
Amazon Web Services
LinkedIn
English
Job Details
We are seeking a Senior DevOps Engineer to enhance our high-performance computing services and collaborate closely with the scientific community to optimize research computing.
Join our team to build and operate cutting-edge HPC capabilities using automation and infrastructure-as-code. Apply now to contribute to innovative computational solutions in a dynamic environment.
To discover more about Cloud practice at EPAM Georgia, visit this page.
This position offers a remote setup with the flexibility to work from anywhere in Georgia: your home, our well-equipped offices in Tbilisi and Batumi, or a coworking space in Kutaisi.
RESPONSIBILITIES
- Design, implement, and maintain robust platform infrastructure using Infrastructure as Code tools such as Terraform
- Develop, deliver, and operate research computing services and applications
- Apply Site Reliability Engineering principles to manage HPC service deployment, monitoring, and incident response
- Solve complex technical problems related to HPC services and user applications
- Manage large-scale HPC, HTC, or BC computing environments for optimal performance
- Collaborate with scientific users to tailor HPC resources to research needs
- Automate deployment processes to ensure consistency across HPC infrastructure
- Maintain and administer large-scale cluster and server computing software such as Slurm, LSF, or Grid Engine
- Develop and maintain monitoring dashboards using tools like Grafana and Prometheus
- Work within a DevOps team environment following agile methodologies
- Operate and utilize virtualized private cloud resources such as OpenStack
- Administer large-scale parallel filesystems including Weka, GPFS, or Lustre
- Use configuration management tools like Ansible, Salt, or Puppet to manage IT operations
- Develop scripts and tools for HPC and DevOps platform operations using Bash and Python
REQUIREMENTS
- 3+ years of experience with DevOps processes and automation using Infrastructure as Code tools such as Terraform
- Hands-on experience operating or engineering large-scale HPC or similar computing environments
- Proven expertise in Linux system administration including TCP/IP networking and storage subsystems
- Experience administering large-scale cluster management software such as Slurm, LSF, or Grid Engine
- Knowledge of configuration management tools like Ansible, Salt, or Puppet
- Experience working in agile DevOps teams
- Ability to develop and maintain monitoring tools such as Grafana and Prometheus
- Experience with scripting languages such as Bash and Python for automation and tool development
- Strong experience managing virtualized private cloud environments like OpenStack
- Scientific degree or equivalent experience in computationally intensive scientific data analysis
- Proven ability to manage relationships with third-party suppliers
- Upper-intermediate proficiency in English (B2+)
NICE TO HAVE
- Experience with container technologies such as LXD, Singularity, Docker, or Kubernetes
- Operation and configuration experience with public cloud platforms like AWS, Azure, or Google Cloud Platform
- Experience with HashiCorp tools such as Vault, Consul, and Nomad
- Development experience with programming languages such as Java, C++, Python, Ruby, or Perl
- Experience with parallel filesystems like Weka, GPFS, or Lustre
WE OFFER
- We connect like-minded people:
  - Delivering innovative solutions to industry leaders, making a global impact
  - Enjoyable working environment, whether it is the vibrant office or the comfort of your own home
  - Opportunity to work abroad for up to two months per year
  - Relocation opportunities within our offices in 55+ countries
  - Corporate and social events
- We invest in your growth:
  - Leadership development, career advising, soft skills and well-being programs
  - Certifications, including Google Cloud Platform, Azure and AWS
  - Unlimited access to LinkedIn Learning and Get Abstract
  - Free English classes with certified teachers
- We cover it all:
  - Participation in the Employee Stock Purchase Plan
  - Monetary bonuses for engaging in the referral program
  - Comprehensive medical & family care package
  - Five trust days per year (sick leave without a medical certificate)
  - Benefits package (sports activities, a variety of stores and services)