Harness Incident Response (IR) Module
Overview
Harness Incident Response (IR) is an incident management system that helps teams detect, respond to, and resolve incidents. It integrates with monitoring, alerting, and collaboration tools so that detection, response, and resolution run as a single workflow.
Core Objects
AI Incident Response Agent
- AI-driven agent for detecting, analyzing, and responding to incidents.
- Analyzes voice and chat conversations to discover and capture key events, then summarizes them.
Alerts
- Alerts originate from webhooks; the webhook sources that send them are referred to as Integrations.
- They serve as the primary trigger for incident detection and response.
- Alerts can be deduplicated, normalized, and correlated to prevent redundant alert creation.
- Alerts can be linked to On-Call Escalations and Alert Rules to determine response criteria.
- Alerts can also trigger Runbooks without escalating into an incident.
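The normalization and deduplication described above can be illustrated with a minimal sketch. It is not the Harness implementation: the Alert fields, the payload keys (source, service, title), and the fingerprint rule are all assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Alert:
    source: str      # hypothetical: name of the sending integration
    service: str     # hypothetical: affected service
    title: str
    count: int = 1   # how many raw events were folded into this alert


def normalize(payload: dict) -> Alert:
    """Map a raw webhook payload onto one common shape (payload keys are assumed)."""
    return Alert(
        source=str(payload.get("source", "unknown")).strip().lower(),
        service=str(payload.get("service", "unknown")).strip().lower(),
        title=str(payload.get("title", "")).strip(),
    )


def fingerprint(alert: Alert) -> str:
    """Stable key: alerts with the same source, service, and title count as duplicates."""
    raw = "|".join((alert.source, alert.service, alert.title.lower()))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def ingest(payloads: list[dict]) -> list[Alert]:
    """Normalize each payload and fold duplicates into one alert with a counter."""
    seen: dict[str, Alert] = {}
    for payload in payloads:
        alert = normalize(payload)
        key = fingerprint(alert)
        if key in seen:
            seen[key].count += 1
        else:
            seen[key] = alert
    return list(seen.values())
```

With this rule, two payloads that differ only in letter case or surrounding whitespace collapse into one alert, while the same title arriving from a different source stays separate.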
Incidents
- Incidents can be created in three ways:
- From Alerts – when an alert meets predefined criteria.
- Manually from the Web UI – allowing direct user intervention.
- Via Slack Command (/ir new) – enabling quick incident creation from Slack.
- Incidents link to Services, Runbooks, Fire Drills, and Change Events.
Actions
- Actions are automation tasks provided by Harness.
- Some require Delegates (e.g., ServiceNow), while others are built-in (e.g., GitHub actions).
- Examples of Actions:
- Post to Slack Channel – Configurable to specify message content and channel.
- Create Incident Slack Channel – Automatically creates a dedicated Slack channel for an incident.
- Create Microsoft Teams Meeting – Initiates an incident resolution meeting.
Runbooks
- A Runbook is an automated playbook consisting of one or more Actions.
- Used for structured incident response workflows.
- Example: Major Incident Response Runbook
- Create an Incident-specific Slack Channel.
- Post an Incident Notification.
- Create a Zoom Bridge.
- Post bridge details in Slack.
- Use Services Impacted to invite on-call resources.
- Page the Service Team via PagerDuty.
- Runbooks can execute process actions or API interactive actions.
- Runbooks can also trigger Harness Pipelines, for example to roll back a deployment.
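To make the idea of a Runbook as an ordered list of Actions concrete, here is a small sketch. The Action and Runbook classes, the shared context dictionary, and the three stand-in functions are hypothetical; they call no real Slack or Zoom API, and the bridge URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable

Context = dict[str, str]


@dataclass
class Action:
    name: str
    run: Callable[[Context], Context]  # reads the shared context, returns values to merge


@dataclass
class Runbook:
    name: str
    actions: list[Action]

    def execute(self, context: Context) -> Context:
        """Run each action in order, merging its outputs into the context."""
        result = dict(context)
        for action in self.actions:
            result.update(action.run(result))
        return result


def create_channel(ctx: Context) -> Context:
    # Stand-in for a real Slack call; it only derives a channel name.
    return {"channel": f"#inc-{ctx['incident_id']}"}


def create_bridge(ctx: Context) -> Context:
    # Stand-in for a real Zoom call; the URL is a placeholder.
    return {"bridge_url": f"https://example.invalid/bridge/{ctx['incident_id']}"}


def post_bridge(ctx: Context) -> Context:
    # Uses the outputs of the two earlier actions.
    return {"last_message": f"{ctx['channel']}: join {ctx['bridge_url']}"}


major_incident = Runbook(
    "Major Incident Response",
    [
        Action("Create Incident Slack Channel", create_channel),
        Action("Create a Zoom Bridge", create_bridge),
        Action("Post bridge details in Slack", post_bridge),
    ],
)
```

Calling major_incident.execute({"incident_id": "42"}) runs the three steps in order. Passing one context through the chain is a design choice of this sketch that lets a later step, such as posting the bridge details, reuse what earlier steps produced.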
Delegates
- Delegates facilitate secure execution of Actions that interact with external systems.
- Required for any Action that is not built-in (see Actions).
- Follows the Harness Delegate model to ensure scalability and security.
Fire Drills
- Fire Drills simulate real incidents to test team preparedness.
- Initiated manually or via chaos experiments.
- Used for training and proactive reliability testing.
- Can target application maps or specific services.
Application Maps
- Represents a group of interacting services.
- Enables users to manage and monitor services as a single entity.
- Supports testing, monitoring, deployment, and response workflows.
Change Events (Coming Soon)
- Captures system modifications that could impact reliability.
- Examples of Change Events:
- Code Changes (Git commits, pull requests, merges).
- Deployments (CI/CD executions, feature flag activations).
- Infrastructure Changes (Kubernetes updates, scaling events).
- Service Modifications (API changes, new dependencies).
- Third-Party Changes (Datadog alerts, ServiceNow updates).
On-Call (Coming Soon)
- Ensures availability of personnel for incident response.
- Includes:
- Schedules – Define rotations.
- Policies – Establish escalation rules.
- Notifications – Alert the right responders.
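Since On-Call is marked as coming soon, the following is only an illustration of how schedules and escalation policies typically fit together, not a description of the Harness feature. The round-robin rotation, the policy format of (delay in minutes, target) pairs, and all names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def on_call(responders: list[str], rotation_start: datetime,
            shift: timedelta, at: datetime) -> str:
    """Return who is on call at `at` for a simple round-robin schedule."""
    if at < rotation_start:
        raise ValueError("time is before the rotation started")
    shifts_elapsed = (at - rotation_start) // shift  # whole shifts since the start
    return responders[shifts_elapsed % len(responders)]


def escalation_targets(policy: list[tuple[int, str]],
                       minutes_unacknowledged: int) -> list[str]:
    """Return every policy target whose delay has already passed."""
    return [target for delay, target in policy if delay <= minutes_unacknowledged]


# Hypothetical example data.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
team = ["ana", "ben", "chi"]
policy = [(0, "primary on-call"), (15, "secondary on-call"), (30, "engineering manager")]
```

With weekly shifts, on_call(team, start, timedelta(days=7), start + timedelta(days=8)) falls in the second shift, and escalation_targets(policy, 20) lists everyone who should have been notified after 20 minutes without acknowledgement.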
Relationships Between Objects
- Alerts → Incidents – Alerts can escalate into incidents.
- Changes → Incidents – Change Events can be root causes of incidents.
- Fire Drills → Incidents – Fire drills simulate or trigger incidents.
- Runbooks → Incidents – Runbooks provide structured response actions.
- Services link to:
- Incidents (for impact assessment).
- Alerts (for ownership resolution).
- Fire Drills (for reference).
- Change Events (for tracking modifications).
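The relationships above can be sketched as a small data model. The class name, field names, and the lookup helper are hypothetical and exist only to show how an incident might reference the other objects.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    id: str
    services: list[str] = field(default_factory=list)       # impacted services
    alert_ids: list[str] = field(default_factory=list)      # alerts that escalated into it
    change_ids: list[str] = field(default_factory=list)     # suspected root-cause changes
    runbook_names: list[str] = field(default_factory=list)  # runbooks used in the response
    from_fire_drill: bool = False                            # True if a fire drill created it


def incidents_for_service(incidents: list[Incident], service: str,
                          include_drills: bool = False) -> list[Incident]:
    """Impact assessment: incidents that touched a service, optionally including drills."""
    return [
        incident for incident in incidents
        if service in incident.services
        and (include_drills or not incident.from_fire_drill)
    ]
```

Keeping fire-drill incidents out of the default result is a choice made for this sketch, so that simulated incidents do not distort impact figures for a service.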
Dashboards & Reporting
The IR Overview Dashboard provides key incident metrics:
- Active Incidents – Ongoing incidents count, with Mean Time to Resolve (MTTR) and its trend shown as a subtitle.
- Recent Alerts – Count of triggered alerts.
- SLO Breaches – Number of breached SLOs.
- System Uptime – Percentage uptime of monitored services.
- Mean Time Between Failures (MTBF) – Measures system stability.
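The dashboard metrics above follow common reliability definitions, which can be sketched as below. How Harness IR computes them is not stated here, so treat the formulas as assumptions: MTTR as the mean incident duration, uptime as the share of the window not spent in an incident, and MTBF as uptime divided by the number of incidents. The sketch also assumes incidents do not overlap and fall inside the reporting window.

```python
from datetime import datetime, timedelta

# Each incident is an (opened, resolved) pair of datetimes.
Incident = tuple[datetime, datetime]


def total_downtime(incidents: list[Incident]) -> timedelta:
    """Sum of incident durations (assumes incidents do not overlap)."""
    return sum((resolved - opened for opened, resolved in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Resolve: average incident duration."""
    return total_downtime(incidents) / len(incidents)


def uptime_percent(incidents: list[Incident],
                   window_start: datetime, window_end: datetime) -> float:
    """Share of the reporting window not spent in an incident, as a percentage."""
    window = window_end - window_start
    return 100 * ((window - total_downtime(incidents)) / window)


def mtbf(incidents: list[Incident],
         window_start: datetime, window_end: datetime) -> timedelta:
    """Mean Time Between Failures: uptime divided by the number of failures."""
    window = window_end - window_start
    return (window - total_downtime(incidents)) / len(incidents)
```

For a 24-hour window containing a 1-hour and a 3-hour incident, these definitions give an MTTR of 2 hours, an uptime of about 83.3 percent, and an MTBF of 10 hours.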
Integration Points
Harness IR integrates with various monitoring, alerting, and collaboration tools:
Webhook-Based Integrations:
- Monitoring & Alerting Systems:
- Datadog, New Relic, Splunk, CloudWatch, Dynatrace, Stackdriver, Grafana, OpsGenie.
- CI/CD & Development Tools:
- GitHub, GitLab, Jenkins, Bitbucket, Octopus Deploy, Harness SLO.
- ITSM & Incident Management:
- ServiceNow, Jira, PagerDuty, VictorOps (Splunk On-Call), BigPanda.
- Manual & Custom Alert Sources:
- Custom Webhooks, Manual Alert Entries.
API-Based Integrations:
- Communication & Collaboration:
- Slack, Microsoft Teams, Zoom.
- Incident Response Automation:
- PagerDuty, OpsGenie, Harness Pipelines.
- Feature Flagging & Deployment Control:
- Split, GitHub Actions, Jenkins, Harness Pipelines.
- Observability & Monitoring Enhancements:
- Datadog, Grafana Incident.
Together, these integrations, AI capabilities, and structured workflows let Harness IR automate incident response from the first alert through resolution.