Full Time
Austin, TX, US
Posted 9 months ago

We are searching for a Site Reliability Engineer. The Site Reliability Engineer is part of a team responsible for the overall coordination and control of all infrastructure systems required to keep the business operational. The Site Reliability Engineer develops software systems and automated solutions for operational tasks, helps drive the team's technology roadmap, and ensures all systems run as efficiently as possible.
Site Reliability Engineering revolves around monitoring, alerting, and automating. A Site Reliability Engineer monitors and helps stabilize services in production, sets and maintains acceptable performance and availability thresholds, and writes code that automates repetitive tasks. Additionally, SREs will keep an eye on our systems' capacity and performance. Development efforts will focus on optimizing existing systems, building infrastructure-as-code, and eliminating work through automation.

Apply for Job

Job Responsibilities

Mentor and train other team members.
Lead architectural and roadmap initiatives from conception to execution.
Lead initiatives across any specifically assigned discipline listed below from conception to execution.
Evaluate all existing services to improve design, security, stability, performance, or operational efficiency.
Develop, plan, and implement architecture.
Research technology trends and vendor products.
Test new solutions and assist with roadmap development.
Monitor health of all servers and infrastructure components; perform break-fix troubleshooting and periodic preventive maintenance.
Develop and maintain technical documentation.
Independently find and diagnose problems and design solutions.
Track all demand in the bank’s ITSM tool.
Communicate changes to management and end-users as appropriate.

Required Skills & Qualifications

Expert-level knowledge in at least four of the following disciplines and working-level knowledge in all of them:

Design, implementation and support of monitoring and alerting platforms – Prometheus and Grafana.
Design, implementation and support of automation platforms – Jenkins, Terraform, Ansible and Azure DevOps.
Design, implementation and support of Disaster Recovery strategies and high availability solutions.
Design, implementation and support of server virtualization platforms – VMware and Microsoft Azure.
Design, implementation and support of converged server infrastructure.
Design, implementation and support of SAN administration using Fibre Channel.
Design, implementation and support of backup/restore software and methodology.
Design, implementation and support of Microsoft network management solutions such as Active Directory, Group Policy, DNS, DFS, and certificate services.
Reliability support of enterprise-level electronic communication systems including Exchange, email archiving, message encryption and instant messaging.
Design, implementation and support of Data Loss/Leak Prevention technology.
Security best practices in a Microsoft Windows environment including NTFS permissions, system patching/hardening and anti-malware solutions.
Design, implementation and support of server management including monitoring, alerting, patching, application deployments, imaging, and encryption.
Reliability support of datacenter systems such as environmental controls and cabling standards.

info@ppaac.com

512-750-0778

SRE IV
