Job Details
Must-have skills: AWS, Terraform, Cassandra, IAM, Java/Python/Ansible.
Job Responsibilities:
Good understanding of infrastructure architecture including servers, storage, network, database, MQ and application components.
Infrastructure as code and automation: use Terraform, Ansible, and Jenkins CI/CD for automation, containerize our environments (Kubernetes, Helm charts), and leverage PCF private cloud technologies.
Implements and manages security best practices, including identity and access management (IAM), encryption, and network security.
Automates infrastructure provisioning, configuration, and deployment processes.
Manages infrastructure for high-volume, transaction-based systems.
Investigates infrastructure-related issues and performance bottlenecks, takes the lead in identifying the root cause, and applies the best solution.
Uses industry-standard monitoring tools (Dynatrace, Splunk, Prometheus, Grafana, etc.) to develop and maintain dashboards, alerting, and reporting.
Experience setting up end-to-end infrastructure monitoring, metrics, and tracing for enterprise systems.
Understanding of Site Reliability Engineering concepts (SLOs, SLIs, and error budgets).
Demonstrated expertise in Incident/Problem/Change management processes and procedures.
Able to troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents.
Required qualifications, capabilities, and skills:
Formal training or certification in software engineering and 6+ years of applied experience in infrastructure and database systems.
Develops and maintains infrastructure as code (IaC) using tools such as Terraform and AWS CloudFormation.
Scripting skills for automating configurations and environments; comfortable working with shell/Python scripts and Ansible playbooks.
Hands-on experience with scheduler software such as Control-M or AutoSys.
Experience in disaster recovery planning and test execution.
Familiarity with cloud-based database systems (CockroachDB, Cassandra) and infrastructure platforms such as PCF.
Experience creating dashboards and reports for infrastructure systems, optimization, and capacity planning.
Hands-on practical experience in Site Reliability Engineering and operational stability practices.
Solid understanding of agile methodologies and related practices such as CI/CD, application resiliency, and security.
Hands-on experience in at least one of the following: Java, Python, or Ansible scripting.
Knowledge of Cloud & DevOps tools: Pivotal Cloud Foundry (PCF), Kubernetes, Terraform, Ansible.