Director, Site Reliability Engineering

ID: 2024-10770
Date Posted: 4 months ago (6/24/2024 9:53 AM)
Company: Berkley Technology Services LLC
Primary Location: US-VA-Manassas
Category: Information Technology

Company Details

Company URL: https://www.berkleytechnologyservices.com

Berkley Technology Services (BTS) is the dynamic technology solution for W. R. Berkley Corporation, a Fortune 500 Commercial Lines Insurance Company. With key locations in Urbandale, IA and Wilmington, DE, BTS provides innovative and customer-focused IT solutions to the majority of WRBC’s 60+ operating units across the globe. BTS’s wide reach ensures that ideas and opinions are considered at every level of the organization to guarantee we find the best solutions possible.

Driven by a commitment to collaboration, BTS acts as consultants to our customers and Operating Units by providing comprehensive solutions that not only address the challenge at hand, but proactively plan for the “What’s Next” in our industry and beyond.

With a culture centered on innovation and entrepreneurial spirit, BTS stands as a community of technology leaders with eyes toward the future -- leaders who truly care about growing not only their team members but also themselves, and who take pride in employees who shine. BTS offers endless ways to get involved and the chance to grow your career into a wide range of roles you never knew existed. Come join us as we push forward into the future of industry-leading technological solutions.

Berkley Technology Services: Right Team, Right Technology, Simple and Secure.

Responsibilities

The Sr Director, Site Reliability Engineering (SRE) is responsible for developing and implementing a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements. The role will align SRE objectives with overall business goals and technology roadmaps. It will foster a spirit of continuous improvement within the SRE function and position that function to advance organizational objectives across the Berkley Corporation.

The person in this role is responsible for overseeing SRE team operations and ensuring the reliability and availability of key applications and supporting infrastructure. This role will work with Service Management to enforce best practices for system reliability, monitoring, capacity planning, incident response, problem management, disaster recovery, change management, and workflow automation. They will also own and administer the tools and technologies necessary to generate a complete view of SRE metrics and improvement areas, including (but not limited to) monitoring, logging, notification, dashboarding, and AIOps.

Team Performance Management:

Instantiate and build a robust SRE team over time and integrate SRE into Berkley’s product development and operational process.
Recruit, mentor, and develop a high-performing team of SRE professionals.
Monitor ongoing staff performance; identify and communicate opportunities for improvement.
Provide leadership and support to ensure projects are staffed appropriately and timelines are met.

Collaboration and Relationship Building:

Collaborate with the BTS IT Leadership Teams and other groups across the IT organization to drive a unified approach to site reliability that reduces downtime and minimizes outage business impact.
Foster strong relationships with delivery organization leadership to align SRE efforts with organizational goals. Work collaboratively with other business and IT leaders to ensure cross-functional problems are addressed cohesively across the organization.
Work cross-functionally in partnership with software development teams to guide product development in creating resilient and durable software systems.
Collaborate with EA to institute design patterns for resilient systems and mechanisms for scoring applications against industry-recognized configurations (including active-active, active-passive, recover-from-scratch, and data replication scenarios).

Execution, Project, and Work Management:

Define and track reliability and observability OKRs for infrastructure and key systems.
Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
Work closely with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand. Anticipate growth and scalability requirements.
Establish and oversee effective high-severity incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.
Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
Oversee the disaster recovery program for both on-premise and cloud-based Berkley solutions.
Perform other duties as assigned.

Qualifications

A passion for technology and innovation in the end user computing space.
8+ years of experience in building/leading strong and flexible teams, managing large scale systems consumed by tens/hundreds of thousands of users.
8+ years of experience in Site Reliability Engineering and DevOps.
4+ years of experience in Disaster Recovery and/or Business Continuity.
Strong understanding of cloud computing platforms (Azure preferred), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).
Strong understanding of and experience with automation tools and programming/scripting languages used to develop and implement automated system reliability and performance solutions, including infrastructure automation and configuration management tools (Ansible, Chef, Puppet).
Strong understanding of observability, monitoring, alerting, and logging tools and ability to design and implement effective monitoring and logging strategies.
Experience in designing and implementing on-premise, cloud, and hybrid resiliency solutions, disaster recovery, and business continuity planning.
Ability to drive critical issues and system design discussions and moderate between multiple technology teams.
Solid understanding of security best practices in on-premise, cloud, and hybrid environments, along with network technologies.
Working knowledge of CI/CD, preferably GitHub workflows and Actions.
Working knowledge of IaC automation tools (Terraform, Ansible, etc.).
Experience with Kubernetes and other auto-scaling tools and technologies.
Skilled at assessing and developing IT talent across multiple time zones and multiple business domains.
Exceptional written and verbal communication skills.
Ability to work independently in a fast-paced environment.
Bachelor's degree and 8 years of experience, or an associate degree and 11 years of experience.
Travel Requirement: Up to 25%

BTS Leadership Values:

Agile
Customer Centric
Ownership Mindset
Sense of Urgency
Servant Leadership
1BTS

Leadership Behavioral Attributes:

Flexibility
Customer Service Oriented
Operational Effectiveness
Personal Ownership
Quick Decision Making
Team Builder
Transformational Leadership

This company is an equal employment opportunity employer
