Incident Response Guide
Overview
This guide outlines our incident response procedures for data quality issues, service disruptions, and customer-reported problems.
Incident Classification
Severity Levels
Critical (P1)
- Definition: Complete service outage or data corruption affecting multiple customers
- Response Time: Immediate
- Resolution Target: 3 business days
- Examples:
- Prices off by order of magnitude (10x, 100x errors)
- API completely unavailable
- Authentication system failure
Major (P2)
- Definition: Significant functionality impaired or incorrect data for specific commodities
- Response Time: <30 minutes
- Resolution Target: <4 hours
- Examples:
- Single commodity feed showing incorrect values
- WebSocket connections dropping frequently
- Rate limiting not functioning correctly
Minor (P3)
- Definition: Non-critical issues with workarounds available
- Response Time: 3 business days
- Resolution Target: <24 hours
- Examples:
- Formatting inconsistencies
- Minor documentation errors
- Delayed data updates (within SLA)
Response Procedures
1. Detection & Triage
- Monitor automated alerts from PriceValidationService
- Review customer reports via [email protected]
- Check system health dashboards
- Classify incident severity
2. Initial Response
- Acknowledge incident to reporting customer
- Create incident ticket with:
- Timestamp of detection
- Affected systems/commodities
- Initial impact assessment
- Assigned responder
3. Investigation
- Identify root cause
- Document affected data ranges
- Determine correction approach
- Estimate resolution time
4. Resolution
- Create data backup before changes
- Apply fixes (automated or manual)
- Verify corrections across all systems
- Test related functionality
5. Communication
- Update affected customers
- Provide resolution details
- Share preventive measures
- Document lessons learned
Recent Incident Examples
UK Natural Gas Price Correction (August 2025)
Issue: Values inconsistent due to decimal parsing errors Detection: Customer report Resolution Time: 3 business days Actions Taken:
- Corrected 62,343 price records
- Implemented enhanced validation
- Added unit clarification to documentation Prevention: Real-time validation now active
Escalation Matrix
Severity | Primary Contact | Escalation (30 min) | Executive (1 hour) |
---|---|---|---|
Critical | On-call Engineer | Engineering Lead | CTO |
Major | Support Team | Senior Engineer | VP Engineering |
Minor | Support Team | Engineer on Duty | Team Lead |
Customer Communication Templates
Initial Response
Subject: [INCIDENT] Investigating reported issue with [COMMODITY/SERVICE]
We've received your report and are actively investigating the issue with [specific problem].
Incident ID: [XXX-YYYY]
Severity: [Critical/Major/Minor]
Estimated Resolution: [Time]
We'll update you within [30 minutes/1 hour] with our findings.
Resolution Notice
Subject: [RESOLVED] Issue with [COMMODITY/SERVICE] has been corrected
The issue affecting [specific problem] has been resolved.
Summary:
- Root Cause: [Brief explanation]
- Data Affected: [Time range and scope]
- Actions Taken: [Corrections applied]
- Prevention: [Measures implemented]
All systems are now operating normally. Please verify your data and contact us if you notice any remaining issues.
Data Correction Procedures
Pre-Correction Checklist
- [ ] Identify exact time range affected
- [ ] Document number of records to modify
- [ ] Create backup table with timestamp
- [ ] Notify stakeholders of planned correction
- [ ] Prepare rollback procedure
Correction Process
- Execute backup creation
- Run correction scripts with validation
- Verify spot prices
- Check derived averages (3-min, hourly, daily)
- Confirm WebSocket feed accuracy
- Update audit logs
Post-Correction Validation
- [ ] Compare before/after statistics
- [ ] Run consistency checks
- [ ] Verify customer-reported issues resolved
- [ ] Update documentation if needed
- [ ] Schedule follow-up review
Monitoring & Prevention
Automated Monitoring
- Price range validators per commodity
- Variance detection (>50% changes)
- Feed availability checks
- API response time monitoring
- Error rate tracking
Manual Reviews
- Daily: Spot check high-volume commodities
- Weekly: Review all validation alerts
- Monthly: Analyze incident trends
- Quarterly: Update validation rules
Tools & Resources
Internal Tools
- PriceValidationService: Real-time anomaly detection
- Data correction scripts:
/scripts/data_corrections/
- Monitoring dashboards: Internal admin panel
- Backup procedures: Automated via database snapshots
External Communication
- Status Page: status.oilpriceapi.com
- Email: [email protected]
- Webhook notifications for corrections
- API response headers with incident IDs
Best Practices
- Always Backup First: Never modify production data without backup
- Test in Staging: Validate corrections in test environment
- Communicate Early: Notify customers as soon as issue confirmed
- Document Everything: Maintain detailed incident logs
- Review & Improve: Post-incident reviews within 48 hours
Contact Information
Support Team
- Email: [email protected]
- Response Hours: 24/7 for Critical, Business hours for others
- SLA: Based on customer tier and incident severity
Engineering Escalation
- On-Call: Via PagerDuty
- Slack: #incidents channel
- War Room: Zoom link in incident ticket