OilPriceAPI Documentation
GitHub
GitHub
  • Guides

    • Authentication
    • Testing & Development
    • Error Codes Reference
    • Webhook Signature Verification
    • Production Deployment Checklist
    • Service Level Agreement (SLA)
    • Rate Limiting & Response Headers
    • Data Quality and Validation
    • Troubleshooting Guide
    • Incident Response Guide

Incident Response Guide

Overview

This guide outlines our incident response procedures for data quality issues, service disruptions, and customer-reported problems.

Incident Classification

Severity Levels

Critical (P1)

  • Definition: Complete service outage or data corruption affecting multiple customers
  • Response Time: Immediate
  • Resolution Target: 3 business days
  • Examples:
    • Prices off by order of magnitude (10x, 100x errors)
    • API completely unavailable
    • Authentication system failure

Major (P2)

  • Definition: Significant functionality impaired or incorrect data for specific commodities
  • Response Time: <30 minutes
  • Resolution Target: <4 hours
  • Examples:
    • Single commodity feed showing incorrect values
    • WebSocket connections dropping frequently
    • Rate limiting not functioning correctly

Minor (P3)

  • Definition: Non-critical issues with workarounds available
  • Response Time: 3 business days
  • Resolution Target: <24 hours
  • Examples:
    • Formatting inconsistencies
    • Minor documentation errors
    • Delayed data updates (within SLA)

Response Procedures

1. Detection & Triage

  • Monitor automated alerts from PriceValidationService
  • Review customer reports via [email protected]
  • Check system health dashboards
  • Classify incident severity

2. Initial Response

  • Acknowledge incident to reporting customer
  • Create incident ticket with:
    • Timestamp of detection
    • Affected systems/commodities
    • Initial impact assessment
    • Assigned responder

3. Investigation

  • Identify root cause
  • Document affected data ranges
  • Determine correction approach
  • Estimate resolution time

4. Resolution

  • Create data backup before changes
  • Apply fixes (automated or manual)
  • Verify corrections across all systems
  • Test related functionality

5. Communication

  • Update affected customers
  • Provide resolution details
  • Share preventive measures
  • Document lessons learned

Recent Incident Examples

UK Natural Gas Price Correction (August 2025)

Issue: Values inconsistent due to decimal parsing errors Detection: Customer report Resolution Time: 3 business days Actions Taken:

  • Corrected 62,343 price records
  • Implemented enhanced validation
  • Added unit clarification to documentation Prevention: Real-time validation now active

Escalation Matrix

SeverityPrimary ContactEscalation (30 min)Executive (1 hour)
CriticalOn-call EngineerEngineering LeadCTO
MajorSupport TeamSenior EngineerVP Engineering
MinorSupport TeamEngineer on DutyTeam Lead

Customer Communication Templates

Initial Response

Subject: [INCIDENT] Investigating reported issue with [COMMODITY/SERVICE]

We've received your report and are actively investigating the issue with [specific problem].

Incident ID: [XXX-YYYY]
Severity: [Critical/Major/Minor]
Estimated Resolution: [Time]

We'll update you within [30 minutes/1 hour] with our findings.

Resolution Notice

Subject: [RESOLVED] Issue with [COMMODITY/SERVICE] has been corrected

The issue affecting [specific problem] has been resolved.

Summary:
- Root Cause: [Brief explanation]
- Data Affected: [Time range and scope]
- Actions Taken: [Corrections applied]
- Prevention: [Measures implemented]

All systems are now operating normally. Please verify your data and contact us if you notice any remaining issues.

Data Correction Procedures

Pre-Correction Checklist

  • [ ] Identify exact time range affected
  • [ ] Document number of records to modify
  • [ ] Create backup table with timestamp
  • [ ] Notify stakeholders of planned correction
  • [ ] Prepare rollback procedure

Correction Process

  1. Execute backup creation
  2. Run correction scripts with validation
  3. Verify spot prices
  4. Check derived averages (3-min, hourly, daily)
  5. Confirm WebSocket feed accuracy
  6. Update audit logs

Post-Correction Validation

  • [ ] Compare before/after statistics
  • [ ] Run consistency checks
  • [ ] Verify customer-reported issues resolved
  • [ ] Update documentation if needed
  • [ ] Schedule follow-up review

Monitoring & Prevention

Automated Monitoring

  • Price range validators per commodity
  • Variance detection (>50% changes)
  • Feed availability checks
  • API response time monitoring
  • Error rate tracking

Manual Reviews

  • Daily: Spot check high-volume commodities
  • Weekly: Review all validation alerts
  • Monthly: Analyze incident trends
  • Quarterly: Update validation rules

Tools & Resources

Internal Tools

  • PriceValidationService: Real-time anomaly detection
  • Data correction scripts: /scripts/data_corrections/
  • Monitoring dashboards: Internal admin panel
  • Backup procedures: Automated via database snapshots

External Communication

  • Status Page: status.oilpriceapi.com
  • Email: [email protected]
  • Webhook notifications for corrections
  • API response headers with incident IDs

Best Practices

  1. Always Backup First: Never modify production data without backup
  2. Test in Staging: Validate corrections in test environment
  3. Communicate Early: Notify customers as soon as issue confirmed
  4. Document Everything: Maintain detailed incident logs
  5. Review & Improve: Post-incident reviews within 48 hours

Contact Information

Support Team

  • Email: [email protected]
  • Response Hours: 24/7 for Critical, Business hours for others
  • SLA: Based on customer tier and incident severity

Engineering Escalation

  • On-Call: Via PagerDuty
  • Slack: #incidents channel
  • War Room: Zoom link in incident ticket
Prev
Troubleshooting Guide