We are searching for a Site Reliability Engineer to join the team responsible for the overall coordination and control of all infrastructure systems required to keep the business operational. The Site Reliability Engineer develops software systems and automated solutions for operational tasks, helps drive the team's technology roadmap, and ensures all systems run as efficiently as possible.
Site Reliability Engineering revolves around monitoring, alerting, and automating. A Site Reliability Engineer monitors and helps stabilize services in production, sets and maintains acceptable performance and availability thresholds, and writes code that automates repetitive tasks. Additionally, SREs keep an eye on our systems' capacity and performance. Development efforts will focus on optimizing existing systems, building infrastructure as code, and eliminating manual work through automation.
Job Responsibilities
- Mentor and train other team members.
- Lead architectural and roadmap initiatives from conception to execution.
- Lead initiatives in any specifically assigned discipline listed below from conception to execution.
- Evaluate all existing services to improve design, security, stability, performance, or operational efficiency.
- Develop, plan, and implement architecture.
- Research technology trends and vendor products; test new solutions and assist with roadmap development.
- Monitor the health of all servers and infrastructure components; perform break/fix troubleshooting and periodic preventive maintenance.
- Develop and maintain technical documentation.
- Independently find and diagnose problems and design solutions.
- Track all demand in the bank’s ITSM tool.
- Communicate changes to management and end users as appropriate.
Required Skills & Qualifications
Expert-level knowledge in at least four of the following disciplines and working-level knowledge in all of them:
- Design, implementation and support of monitoring and alerting platforms – Prometheus and Grafana.
- Design, implementation and support of automation platforms – Jenkins, Terraform, Ansible and Azure DevOps.
- Design, implementation and support of Disaster Recovery strategies and high availability solutions.
- Design, implementation and support of server virtualization platforms – VMware and Microsoft Azure.
- Design, implementation and support of converged server infrastructure.
- Design, implementation and support of SAN administration using Fibre Channel.
- Design, implementation and support of backup/restore software and methodology.
- Design, implementation and support of Microsoft network management solutions such as Active Directory, Group Policy, DNS, DFS, and certificate services.
- Reliability support of enterprise-level electronic communication systems including Exchange, email archiving, message encryption and instant messaging.
- Design, implementation and support of Data Loss/Leak Prevention technology.
- Security best practices in a Microsoft Windows environment including NTFS permissions, system patching/hardening and anti-malware solutions.
- Design, implementation and support of server management including monitoring, alerting, patching, application deployments, imaging, and encryption.
- Reliability support of datacenter systems such as environmental controls and cabling standards.