Incident Response Manager - Data Center

San Jose

Thursday, 23 April 2026

- Serve as the first responder in the IRC Operation Center, detecting and responding to events across infrastructure and facilities using tools such as Server Automation, Data Center Infrastructure Management, network monitoring, Grafana, and related systems.
- Respond promptly to events including but not limited to:
  - Environmental systems (e.g. high temperature, humidity, power fluctuations or failures)
  - IT infrastructure (e.g. server performance issues, network outages, system failures)
  - Facility and environmental alerts relevant to operations
  - External-facing services (e.g. colocation maintenance notices, service requests from CDN partners, and critical notifications)
- Conduct detailed investigations to diagnose the root cause of events, assess their impact, and determine appropriate response actions.
- Monitor and analyze detected events, accurately classify incidents based on potential or actual customer impact, and proactively communicate risks.
- Coordinate timely escalations by notifying and collaborating with relevant support teams to ensure swift incident resolution.
- Monitor incident response performance against agreed SLAs, ensuring timely alerts and notifications.
- Manage incidents calmly and efficiently, performing in-depth investigations to determine root causes and impacts, while promptly engaging and coordinating with the designated resolver teams to facilitate timely resolution.
- Draft detailed incident reports and conduct post-mortem reviews to document lessons learned.
- Generate regular reports to deliver comprehensive insights into the effectiveness of incident response and recovery processes.
- Analyze trends and patterns in events to identify opportunities for improvement and optimization.
- Own and drive the Incident, Problem, and Change Management processes in alignment with ITIL or internal ITSM frameworks.
- Develop and maintain a comprehensive library of Standard Operating Procedures (SOPs), Methods of Procedure (MOPs), runbooks, and operational guides to ensure consistency and readiness across teams.
- Lead or support continuous improvement projects aimed at enhancing incident response capabilities, operational security, system reliability, and overall infrastructure performance. Collaborate with cross-functional teams to implement engineering solutions and process optimizations.
- Provide technical and operational leadership to the incident response center team, ensuring consistent performance and adherence to best practices.

Requirements:

Minimum Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field.
- Strong technical background, with priority given to experience in Data Center Facility Operations Center (DC FOC) management. Experience in IT infrastructure, network operations, or systems monitoring is also desirable.
- Proven ability to analyze complex systems, investigate incidents, and identify root causes effectively.
- Familiarity with monitoring and alerting tools such as Grafana, Nagios, or similar platforms.
- Experience in incident and problem management processes, with the ability to drive corrective actions and coordinate cross-functional teams.
- Excellent troubleshooting skills and the ability to work in fast-paced environments during critical incidents.
- Strong communication skills to draft reports, conduct reviews, and liaise with technical and non-technical stakeholders.

Preferred Qualifications
- 5 years of experience in IT environments (such as data centers or enterprise systems), combined with hands-on incident and problem management experience.
- Proactive mindset with a focus on continuous improvement and operational excellence.
- Proven ability to perform effectively within tight time constraints to resolve issues and meet deliverables.
- Hands-on experience with ticketing systems, monitoring tools such as Grafana, server infrastructure, and data center systems.
- Working knowledge of and/or certifications in one or more of the following:
  - ITIL Foundation
  - CompTIA Server+
  - Schneider Electric Data Center Certified Associate (DCCA)
  - Cisco Certified Network Associate (CCNA)
  - Project Management Professional (PMP)
  - Data analytics and visualization tools or methodologies
- Demonstrated experience in driving or contributing to improvement projects focused on operational efficiency, security enhancements, or infrastructure reliability.
- Ability to manage multiple tasks and projects, ensuring timely delivery and alignment with organizational goals.
- This position is part of a team that provides 24/7 support and requires working scheduled shifts, which may include holidays.
