Lead DevOps Engineer
100% Remote
12+ months
Candidates should be located in one of these states (OH, FL, NC, GA, SC, PA, WI, TX, MI). Northwoods would be open to candidates outside these states, but is avoiding these states (CA, WA, OR, CT, NY).
Client: Northwoods
Must have an active LinkedIn profile.
Virtual interview. They are looking for a middle-aged Josh or Joe.
Must have 10/10 communication skills and be very sharp. Must have experience with all required tech skills.
At this point, we are going to pass on both candidates.
First candidate came across as polished and likely a strong culture fit, but the team did not leave with enough confidence in his technical depth or hands-on ownership experience for this level of role. His answers often stayed high level without enough concrete execution examples.
Second candidate demonstrated stronger technical depth and more hands-on examples, but there were concerns around communication style, collaboration, and overall team fit. Ultimately, the team did not feel confident in the long-term fit for the environment and leadership style needed in this role.
A major focus for us moving forward is increasing confidence in the candidate’s technical depth and hands-on capability, as well as their ability to adapt quickly, take ownership, and grow into the long-term leader we need in this space.
We are also refining our interview approach. The interview rounds will remain the same: an initial screening with me, a group technical/culture interview, and a final interview with Mitch as the hiring manager. After each round, the interview team is documenting summaries, concerns, and targeted follow-up questions so we are intentionally digging deeper into gaps or concerns in the next round rather than repeating the same conversations across interviews.
My goal moving forward is to provide much more specific feedback on both resumes and interviews to help everyone. This has been a challenging search, and while we have seen strengths in different candidates, we have not yet found the right overall fit for what the team needs long term.
Lastly, Ben also updated the job description to better reflect the level and scope of the role we are actually trying to hire for. The previous version was more of a traditional senior DevOps leadership role focused on infrastructure, deployment, and reliability.
The updated version shifts more toward a principal-level platform engineering and DevOps leadership role that:
Owns developer platform and infrastructure strategy
Helps integrate AI/LLM tooling into engineering workflows
Drives observability, scalability, and operational clarity
Raises engineering and platform standards across the organization
The DevOps Engineer Lead owns the clarity, reliability, security, and repeatability of how our systems are built, deployed, and operated. This role designs and maintains automated, scalable, secure, and cost-effective infrastructure across production, development, and test environments.
This is a deeply hands-on role responsible for executing and improving deployments, observability, and core operational practices to reduce risk caused by opaque processes, undocumented knowledge, and single points of failure. The DevOps Engineer Lead transforms deployment and infrastructure from siloed knowledge into understandable, well-documented, and observable systems that engineers can confidently use and improve.
The role leads through practice by mentoring engineers, establishing standards, improving processes, and removing operational obstacles. Working independently and in close partnership with Engineering, this role reduces operational burden, increases delivery confidence, and builds platforms that scale reliably. This role provides technical leadership through ownership and execution and does not include formal people-management responsibilities.
Position Responsibilities:
Platform, Deployment, and Operations Ownership
Own day-to-day DevOps operations, including infrastructure health, monitoring, logging, patching, security posture, and maintenance, ensuring systems are observable and failures are diagnosable through strong metrics, logging, root-cause visibility, and effective incident response.
Own and execute deployment processes end-to-end, ensuring they are secure, repeatable, transparent, and well documented with clear failure signals and automated rollback strategies.
Design, build, and maintain automated, scalable, secure, and cost-effective infrastructure across production, development, and test environments
Build, operate, and continuously improve CI/CD pipelines with clear failure signals, recovery paths, and rollback strategies
Own application-level networking and infrastructure concerns, including network configuration, access controls, and connectivity required to support development and production environments.
Own all infrastructure and networking concerns, including the configuration and troubleshooting of site-to-site VPNs, firewall rules, and secure connectivity required for county-level integrations and remote access.
Security, Reliability, and Standards:
Access Analysis & Least Privilege: Perform regular access analysis across all systems, managing secrets, credentials, and IAM roles to ensure strict adherence to security best practices.
Audit Readiness & Evidence: Proactively support compliance requirements (such as SOC 2) by maintaining auditable operational practices and generating technical evidence/reports for software and security audits.
Vulnerability Management: Enforce security posture through proactive patching, encryption, and vulnerability management across web servers, load balancers, and data stores
Enablement, Leadership, and Continuous Improvement
Partner with software engineers during deployments and operational work to build shared understanding and enable safe, independent troubleshooting
Deploy, manage, and scale web and application servers, load balancers, queues, and caches through automated, repeatable workflows.
Identify, prioritize, and deliver improvements that reduce operational risk, remove bottlenecks, improve efficiency, and increase delivery confidence
Document systems and processes with a focus on explaining both how they work and why
Take proactive ownership of workload while ensuring strong coordination and transparency across the team
Perform other job-related duties as assigned to support departmental goals and continuous improvement initiatives.
Required Skills
Strong ability to understand and operate systems end-to-end, including application architecture, infrastructure, and deployment workflows
Proven ability to troubleshoot and resolve complex production issues across infrastructure, CI/CD pipelines, Kubernetes, and runtime environments
Strong understanding of observability practices, including metrics, centralized logging, alerting, and root cause analysis
Deeply hands-on operator with sound technical judgment; able to assess situations quickly and clearly recommend solutions (“what we should do and why”)
Strong sense of ownership and accountability, with the ability to prioritize work that improves reliability and reduces risk, and to follow through on it
Ability to collaborate effectively with software engineers and communicate clearly with both technical and non-technical stakeholders
Ability to lead through influence by pairing, mentoring, documenting, and establishing practical standards
Self-starter comfortable operating in environments where structure must be built, not inherited, with a focus on clarity and execution
Strong security mindset, with hands-on experience in secrets management, access controls, encryption, patching, and vulnerability management
Hands-on experience with network topology, including the ability to configure and troubleshoot site-to-site VPNs, firewall rules, and hybrid-cloud connectivity.
Education and Experience:
10 years of hands-on experience in DevOps, infrastructure, or platform engineering supporting production systems at scale
Strong, hands-on experience operating workloads in AWS, with responsibility for reliability, security, and day-to-day operations
Proven production experience with Kubernetes, including deploying, operating, and troubleshooting containerized workloads
Strong programming experience in Python (or a similar language), with the ability to work fluently in Python and to write and maintain production automation
Deep hands-on expertise with Terraform and infrastructure as code practices, with experience using broader DevOps tooling such as CloudFormation and Ansible
Strong proficiency with Git-based source control, including code reviews, collaborative workflows, and infrastructure/code ownership
Extensive experience building, operating, and improving CI/CD pipelines for provisioning, deployment, and scaling
Strong Linux/Unix expertise, including administration, scripting, troubleshooting, and operational monitoring in production environments
Hands-on experience implementing monitoring and log aggregation platforms (ELK, Graylog, Graphite, Prometheus, etc.)
Experience implementing test automation and AI-assisted tooling to improve deployment quality, reliability, and operational efficiency
Experience deploying and managing application infrastructure such as web or application servers, load balancers, queues, and caches, with an emphasis on scalability and resiliency
Must be authorized to work in the U.S.
Nice to Have:
Hands-on experience with networking concepts such as VPNs, firewall rules, or hybrid cloud connectivity
Security experience in regulated or compliance-driven environments (e.g., SOC 2 familiarity)
Database administration experience
Experience supporting or deploying 12-Factor applications