Harness Incident Response (IR) Module

Overview

Harness Incident Response (IR) is a comprehensive incident management system that enables teams to detect, respond to, and resolve incidents efficiently. It integrates with various monitoring, alerting, and collaboration tools to provide a seamless incident resolution workflow.

Core Objects

AI Incident Response Agent

AI-driven agent for detecting, analyzing, and responding to incidents.
Supports voice and chat-based analysis to discover key events, capture them, and then provide summarization.

Alerts

Alerts originate from webhooks and are referred to as Integrations.
They serve as the primary trigger for incident detection and response.
Alerts can be deduplicated, normalized, and correlated to prevent redundant alert creation.
Alerts can be linked to On-Call Escalations and Alert Rules to determine response criteria.
Alerts can also trigger Runbooks without escalating into an incident.

Incidents

Incidents can be created in three ways:
1. From Alerts – when an alert meets predefined criteria.
2. Manually from the Web UI – allowing direct user intervention.
3. Via Slack Command (/ir new) – enabling quick incident creation from Slack.
Incidents link to Services, Runbooks, Fire Drills, and Change Events.

Actions

Actions are automation tasks provided by Harness.
Some require Delegates (e.g., ServiceNow), while others are built-in (e.g., GitHub actions).
Examples of Actions:
- Post to Slack Channel – Configurable to specify message content and channel.
- Create Incident Slack Channel – Automatically generates a Slack workspace for an incident.
- Create Microsoft Teams Meeting – Initiates an incident resolution meeting.

Runbooks

A Runbook is an automated playbook consisting of one or more Actions.
Used for structured incident response workflows.
Example: Major Incident Response Runbook
1. Create an Incident-specific Slack Channel.
2. Post an Incident Notification.
3. Create a Zoom Bridge.
4. Post bridge details in Slack.
5. Use Services Impacted to invite on-call resources.
6. Page the Service Team via PagerDuty.
Runbooks can execute process actions or API interactive actions.
Can trigger Harness Pipelines for rollback scenarios.

Delegates

Delegates facilitate secure execution of Actions that interact with external systems.
Required for most Actions to run successfully.
Follows the Harness Delegate model to ensure scalability and security.

Fire Drills

Fire Drills simulate real incidents to test team preparedness.
Initiated manually or via chaos experiments.
Used for training and proactive reliability testing.
Can target application maps or specific services.

Application Maps

Represents a group of interacting services.
Enables users to manage and monitor services as a single entity.
Supports testing, monitoring, deployment, and response workflows.

Change Events (Coming Soon)

Captures system modifications that could impact reliability.
Examples of Change Events:
- Code Changes (Git commits, pull requests, merges).
- Deployments (CI/CD executions, feature flag activations).
- Infrastructure Changes (Kubernetes updates, scaling events).
- Service Modifications (API changes, new dependencies).
- Third-Party Changes (Datadog alerts, ServiceNow updates).

On-Call (Coming Soon)

Ensures availability of personnel for incident response.
Includes:
- Schedules – Define rotations.
- Policies – Establish escalation rules.
- Notifications – Alert the right responders.

Relationships Between Objects

Alerts → Incidents – Alerts can escalate into incidents.
Changes → Incidents – Change Events can be root causes of incidents.
Fire Drills → Incidents – Fire drills simulate or trigger incidents.
Runbooks → Incidents – Runbooks provide structured response actions.
Services link to:
- Incidents (for impact assessment).
- Alerts (for ownership resolution).
- Fire Drills (for reference).
- Change Events (for tracking modifications).

Dashboards & Reporting

The IR Overview Dashboard provides key incident metrics:

Active Incidents – Ongoing incidents count.
- Subtitle: Mean Time to Resolve (MTTR) with trends.
Recent Alerts – Count of triggered alerts.
SLO Breaches – Number of breached SLOs.

System Uptime – Percentage uptime of monitored services.
Mean Time Between Failures (MTBF) – Measures system stability.

Integration Points

Harness IR integrates with various monitoring, alerting, and collaboration tools:

Webhook-Based Integrations:

Monitoring & Alerting Systems:
- Datadog, New Relic, Splunk, Cloudwatch, Dynatrace, Stackdriver, Grafana, OpsGenie.
CI/CD & Development Tools:
- GitHub, GitLab, Jenkins, Bitbucket, Octopus Deploy, Harness SLO.
ITSM & Incident Management:
- ServiceNow, Jira, PagerDuty, VictorOps (Splunk On-Call), BigPanda.
Manual & Custom Alert Sources:
- Custom Webhooks, Manual Alert Entries.

API-Based Integrations:

Communication & Collaboration:
- Slack, Microsoft Teams, Zoom.
Incident Response Automation:
- PagerDuty, OpsGenie, Harness Pipelines.
Feature Flagging & Deployment Control:
- Split, GitHub Actions, Jenkins, Harness Pipelines.
Observability & Monitoring Enhancements:
- Datadog, Grafana Incident.

Harness IR enables seamless, automated incident response through deep integrations, advanced AI capabilities, and structured workflows, ensuring rapid issue resolution and system reliability.

Overview​

Core Objects​

AI Incident Response Agent​

Alerts​

Incidents​

Actions​

Runbooks​

Delegates​

Fire Drills​

Application Maps​

Change Events (Coming Soon)​

On-Call (Coming Soon)​

Relationships Between Objects​

Dashboards & Reporting​

Integration Points​

Webhook-Based Integrations:​

API-Based Integrations:​