Job Details
Title: Cloud Security Engineer / Site Reliability Engineer
Location: Alpharetta, GA or Berkeley Heights, NJ - Onsite
Duration: Long Term Contract
Open Positions: 2
1. Technical Skills
Programming and Scripting: Strong proficiency in languages like Python, Go, Bash, or Ruby. SREs often need to write automation scripts and build tooling.
Systems Administration: Deep understanding of operating systems (Linux/Unix), file systems, processes, and system configurations.
Infrastructure as Code (IaC): Experience with IaC tools like Terraform, Ansible, or Chef to manage infrastructure.
Cloud Computing: Knowledge of cloud platforms such as AWS, Azure, or Google Cloud Platform, including compute and storage services (for example, EC2 and S3 on AWS), managed Kubernetes, and serverless functions.
Containers and Orchestration: Expertise in containerization (Docker) and container orchestration (Kubernetes, OpenShift).
Networking: Understanding of networking concepts, including DNS, firewalls, load balancing, and VPNs.
Monitoring and Observability: Experience with monitoring and observability tools like Prometheus, Grafana, Datadog, or New Relic. Ability to set up and maintain monitoring dashboards, alerts, and logs.
Continuous Integration/Continuous Deployment (CI/CD): Familiarity with CI/CD tools like Jenkins, GitLab CI, GitHub Actions, or CircleCI.
A strong understanding of HashiCorp Vault and Terraform will make you stand out.
2. Problem-Solving and Troubleshooting
Incident Management: Ability to manage and respond to incidents, perform root cause analysis, and conduct post-mortem reviews.
Automation: Focus on automating repetitive tasks to improve efficiency and reduce human error.
Performance Tuning: Skills in identifying and resolving performance bottlenecks in systems and applications.
3. Collaboration and Communication
Teamwork: Ability to work closely with cross-functional teams, including software engineers, product managers, and DevOps teams.
Documentation: Skill in creating clear and comprehensive documentation for systems, processes, and incident reports.
Communication: Effective communication skills for interacting with stakeholders and explaining technical concepts to non-technical audiences.
4. Reliability and Scalability
Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs): Understanding of how to set, monitor, and maintain SLOs and SLAs for system reliability.
Scalability: Knowledge of best practices for designing and scaling systems to handle increased loads and demands.
Redundancy and Resilience: Experience in designing systems with redundancy and fault tolerance to minimize downtime.
5. Security and Compliance
Security Best Practices: Understanding of security principles, such as access control, data encryption, and secure coding practices.
Compliance: Familiarity with compliance standards like GDPR, HIPAA, or PCI-DSS, depending on the industry.