Sr. Manager, Site Reliability Engineering, Digital Home
- Responsible for recruiting, developing, and managing a team of engineers who are focused on Site and Service Reliability. This team is well versed in the technologies and Platforms used to deliver Content to a wide range of devices, from PCs to iOS/Android devices to traditional Set Top Boxes.
- The SRE group is focused on improving the availability and responsiveness of internal and external components and Platforms through the application of engineering best practices, advances in tooling and instrumentation, and cross-organizational coordination. The SRE team helps drive efforts to improve triage time and bring down MTTR (Mean Time to Repair), and provides follow-up support to help mitigate similar issues in the future.
- This individual will manage a team which may include exempt and non-exempt employees. They will provide subject matter guidance to employees as required and serve as a point of escalation.
- Passionate and driven to improve the customer experience by solving problems that impede reliability, resiliency and responsiveness.
- Develops processes and procedures to drive departmental efficiencies; assists in developing and meeting the departmental budget.
- Participates in the management of full life cycle product development, including analysis and planning related to product development, launch and deployment. Assists peer product development organizations with launch readiness and post-launch triage and analysis.
- Being proactive, evaluating multiple options and considering our customers' experience are key to our success.
- Drive the adoption and implementation of Operational Best Practices and associated tooling to improve resiliency and reliability. Improve tooling and instrumentation to accelerate triage and remediation.
- Conduct War Game exercises to simulate adverse conditions and situations in a controlled fashion. Collect and communicate learnings and areas of opportunity to advance Operational Stance and Preparedness.
- Document and detail areas of improvement to bolster architecture, design, technical requirements and service specifications. Present architecture, design, and technical choices to internal audiences.
- Demonstrates technical leadership and mentoring on the application of new technologies and systems management methodologies. Can tailor and adapt approach to build consensus and alignment across peer Operational and Engineering Groups.
- Monitors technical and engineering progress to ensure strategies, goals and objectives are met. Aligns operational plans with business objectives. Communicates changes to all affected personnel.
- Presents periodic updates to Senior Management on impairments, mitigation opportunities and progress.
- Evaluates various architectural solutions and implementations and supports development and deployment of solutions as determined by the SRE team.
- Establishes and maintains productive relationships with peer organizations and equipment and software vendors.
- Identifies trends, services and/or capabilities that may be beneficial to product offerings. Manages and forecasts resource needs to meet departmental objectives. Recommends action plans or solutions.
- Participates consultatively in the testing and certification process as needed.
- Ensures effective implementation of the department budget. Prepares financial statements and monthly forecasts and reports. Prepares and analyzes monthly financial performance and makes budget and new technology recommendations.
- Consistent exercise of independent judgment and discretion in matters of significance.
- Regular, consistent and punctual attendance.
- Other duties and responsibilities as assigned.
- 8+ years in a software development role, operations role, or closely related position
- 3+ years as a manager or team lead
- Experience administering Linux systems in a production environment
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
- Bachelor's Degree in Computer Science or a related field, or relevant work experience
- Excellent problem-solving skills with strong attention to detail
- Experience with distributed version control systems like Git or Mercurial
- Ability to dive deep into complex technical problems
- Experience with IaaS and PaaS providers such as AWS, OpenStack, Heroku, and CloudFoundry
- A sense of ownership, initiative, and drive
- Experience with enterprise monitoring solutions like InfluxDB, AppDynamics, Graphite, Racon, Grafana, Nagios, and Splunk
- Familiarity with continuous integration/deployment processes and tools such as Artifactory, Gerrit, Git, Jenkins, Maven and Nexus
- Experience with configuration management tools such as Ansible, CFEngine, Chef and Puppet
- Experience building tools for automation (building, testing, releasing, monitoring and alarming)
Comcast is an EOE/Veterans/Disabled/LGBT employer