Sr. Engineer II, EPICS, NG-SIEM (Hybrid)

Austin

Friday, 29 May 2026

Our mission is to make all of our customers' security-relevant data continuously available for automated detection and response, threat hunting, and other Falcon platform use cases. To enable this, the systems behind NG-SIEM (next-generation security information and event management) are growing. The NG-SIEM platform comprises many decoupled components interacting across complex pipelines. As we scale, ensuring end-to-end health across ingest, search, and workflow execution requires deep cross-service expertise and coordinated action. You will be the engineer who builds the observability, automation, and scaling systems that keep the entire platform performing, not just individual components. You will join a distributed team of high-ownership technical leaders who share a strong passion for our mission: to stop breaches.

This is a hybrid opportunity, with the expectation to be in our Austin, TX office 2-3 times a week.

What You'll Do:
End-to-end observability: Design, build, and maintain monitoring and synthetic test suites that provide deep visibility into the health of the entire NG-SIEM pipeline, from ingest through search and workflow execution, enabling rapid root cause analysis across component boundaries.
Coordinated scaling: Engineer orchestrated scaling solutions that treat the NG-SIEM pipeline as a unified system, proportionally increasing resources across all dependent components (Kafka, ingest pipelines, downstream services) to eliminate cascading bottleneck patterns.
Incident response engineering: Serve as a subject matter expert during platform-wide incidents (P2 and above), applying cross-service knowledge to diagnose and resolve multi-component failures. Participate in follow-the-sun on-call rotations, providing incident commander coordination for critical platform-wide events.
Capacity planning and cost management: Build and refine models for end-to-end capacity forecasting that account for all pipeline dimensions, including partner team dependencies (data services, GPS). Develop tooling to continuously track and surface cost drivers across the platform.
Automation and runbooks: Transform manual standard operating procedures into automated remediation workflows (including pipeline-wide scaling responses, CID rebalancing, and infrastructure healing), with the goal of resolving issues before customers are impacted.
Cross-team collaboration: Partner with cell-level teams, product engineering, GDI/3PI, and external stakeholders (e.g., CSM) to triage SLO breaches, drive problem management for large reliability efforts, and ensure consistent communication during incidents.
Platform improvements: Use your broad NG-SIEM knowledge to identify and drive systemic improvements across teams, contributing to the platform's long-term resilience and efficiency.

What You'll Need:
A passion for reliability engineering and curiosity about how large-scale running systems behave under pressure;
10 years of experience in software engineering, site reliability engineering, or platform engineering, with significant time spent on large-scale distributed systems, and the ability to make pragmatic tradeoffs between short-term delivery needs and long-term platform goals;
Strong proficiency in at least one systems programming language (Go, Java, Rust, or C++) and one scripting language (Python, Bash);
Deep experience with end-to-end observability: building monitoring pipelines, defining SLIs/SLOs, and creating dashboards that drive actionable insights across multi-service architectures;
Demonstrated ability to diagnose and resolve complex incidents spanning multiple distributed components operating 24/7;
Experience with coordinated capacity planning and scaling for systems with significant infrastructure footprints;
Hands-on experience with streaming platforms (Kafka or similar) and understanding of back pressure, partition management, and consumer group dynamics at scale;
Familiarity with infrastructure-as-code, CI/CD pipelines, and automated deployment practices;
A can-do attitude: you thrive collaborating in a team and are not afraid to take on responsibilities;
Strong written and verbal communication skills: you will lead incident communications and produce post-incident analyses that drive lasting improvements;
Comfort working across time zones with globally distributed teams.

Bonus Points:
Experience in a similar reliability or platform engineering role at a hyperscaler (AWS, Azure, GCP) or large-scale SaaS provider;
Track record of building automated remediation and self-healing infrastructure;
Experience with cost modeling and unit economics for large compute and storage footprints;
Familiarity with cloud-native architectures and serverless computing paradigms;
Hands-on experience operating platforms processing over 1 trillion events per day or more than 10 PB of data per day;
Exposure to or experience with Log Management, cybersecurity products, or security operations workflows;
Experience with disaster recovery planning and execution for multi-region systems.

This role will require the candidate to periodically undergo and pass additional background and fingerprint check(s) consistent with government customer requirements.
