stock-bot/apps/stock/data-ingestion/docs/OPERATION_TRACKER_ENHANCEMENTS.md
2025-07-01 15:35:56 -04:00

Operation Tracker Enhancements for Intraday Crawling

Summary of Changes

This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.

Changes Made

1. Enhanced CrawlState Interface (types.ts)

Added new fields to track crawl progress:

  • newestDateReached: Track the most recent date processed
  • lastProcessedDate: For resumption after interruption
  • totalDaysProcessed: Progress tracking
  • targetOldestDate: The goal date to reach
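
The new fields can be sketched in TypeScript as follows. The field names come from the list above; the exact types, optionality, and the pre-existing `finished`/`oldestDateReached` fields are assumptions about the actual interface in `types.ts`:

```typescript
// Sketch of the enhanced CrawlState (field types are assumptions).
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;    // oldest bar date fetched so far
  newestDateReached?: Date;    // most recent date processed
  lastProcessedDate?: Date;    // resume point after an interruption
  totalDaysProcessed?: number; // progress counter
  targetOldestDate?: Date;     // goal date for the backward crawl
}

// A crawl is complete once the backward pass reaches (or passes) its target.
function isIntradayCrawlComplete(state: CrawlState): boolean {
  if (!state.oldestDateReached || !state.targetOldestDate) return false;
  return state.oldestDateReached.getTime() <= state.targetOldestDate.getTime();
}
```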

2. Updated OperationTracker (OperationTracker.ts)

  • Modified updateSymbolOperation to handle new crawl state fields
  • Updated bulkUpdateSymbolOperations for proper Date handling
  • Enhanced markCrawlFinished to track both oldest and newest dates
  • Added getSymbolsForIntradayCrawl: Specialized method for intraday crawls with gap detection
  • Added isIntradayCrawlComplete: Check if a crawl has reached its target
  • Added new indexes for efficient querying on crawl state fields
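
The gap check behind getSymbolsForIntradayCrawl might look like the following pure predicate. This is an illustrative sketch, not the tracker's actual query logic; the function name and shape are ours:

```typescript
// A symbol qualifies for an intraday crawl when it has never been crawled,
// has a forward gap (newestDateReached is behind yesterday), or its
// backward historical crawl is still unfinished.
interface SymbolCrawlState {
  finished: boolean;
  newestDateReached?: Date;
}

function needsIntradayCrawl(
  state: SymbolCrawlState | undefined,
  now: Date
): boolean {
  if (!state) return true; // never crawled
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  const forwardGap =
    !state.newestDateReached ||
    state.newestDateReached.getTime() < yesterday.getTime();
  const backwardGap = !state.finished;
  return forwardGap || backwardGap;
}
```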

3. New Intraday Crawl Action (intraday-crawl.action.ts)

Created a sophisticated crawl system with:

  • Bidirectional crawling: Handles both forward (new data) and backward (historical) gaps
  • Batch processing: Processes data in weekly batches by default
  • Resumption logic: Can resume from where it left off if interrupted
  • Gap detection: Automatically identifies missing date ranges
  • Completion tracking: Knows when the full history has been fetched
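
The backward batching can be sketched as a range calculator. The 7-day default matches the weekly batches mentioned above; the helper name and signature are illustrative, not the action's actual API:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Given the oldest date fetched so far, return the next [start, end] range
// to fetch, stepping backward in time and clamping at the target date.
// Returns null once the target has been reached.
function nextBackwardBatch(
  oldestReached: Date,
  targetOldest: Date,
  batchDays = 7
): { start: Date; end: Date } | null {
  const end = new Date(oldestReached.getTime() - DAY_MS); // day before what we have
  if (end.getTime() < targetOldest.getTime()) return null; // nothing left to fetch
  const start = new Date(
    Math.max(end.getTime() - (batchDays - 1) * DAY_MS, targetOldest.getTime())
  );
  return { start, end };
}
```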

4. Integration with QM Handler

  • Added new operations: crawl-intraday-data and schedule-intraday-crawls
  • Added a scheduled operation that automatically processes incomplete crawls every 4 hours
  • Integrated with the existing operation registry system

5. Testing and Monitoring Tools

  • Created test script to verify crawl functionality
  • Created status checking script to monitor crawl progress
  • Added comprehensive documentation

How It Works

Initial Crawl Flow

  1. Symbol starts with no crawl state
  2. First crawl fetches today's data and sets newestDateReached
  3. Subsequent batches crawl backward in time
  4. Each batch updates oldestDateReached and lastProcessedDate
  5. When oldestDateReached <= targetOldestDate, the crawl is marked finished
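
The five steps above can be sketched as a driver loop, with the actual data fetching stubbed out. All names here are illustrative, and the loop assumes calendar days rather than trading days:

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface State {
  oldestDateReached: Date;
  newestDateReached: Date;
  totalDaysProcessed: number;
  finished: boolean;
}

function runInitialCrawl(today: Date, targetOldest: Date, batchDays = 7): State {
  // Step 2: the first fetch covers today and sets newestDateReached.
  const state: State = {
    oldestDateReached: today,
    newestDateReached: today,
    totalDaysProcessed: 1,
    finished: false,
  };
  // Steps 3-4: walk backward one batch at a time, updating progress fields.
  while (state.oldestDateReached.getTime() > targetOldest.getTime()) {
    const next = new Date(
      Math.max(
        state.oldestDateReached.getTime() - batchDays * MS_PER_DAY,
        targetOldest.getTime()
      )
    );
    state.totalDaysProcessed += Math.round(
      (state.oldestDateReached.getTime() - next.getTime()) / MS_PER_DAY
    );
    state.oldestDateReached = next;
  }
  // Step 5: target reached, mark the crawl finished.
  state.finished = true;
  return state;
}
```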

Resumption Flow

  1. Check if newestDateReached < yesterday (forward gap)
  2. If yes, fetch new data first to stay current
  3. Check if finished = false (backward gap)
  4. If yes, continue backward crawl from lastProcessedDate
  5. Process in batches until complete
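
The ordering in steps 1-4 (forward gap first, then backward) can be expressed as a pure decision function; the names are illustrative, not the action's actual API:

```typescript
type GapKind = 'forward' | 'backward';

function gapsToProcess(
  state: { finished: boolean; newestDateReached: Date },
  yesterday: Date
): GapKind[] {
  const gaps: GapKind[] = [];
  // Steps 1-2: close the forward gap first so the symbol stays current.
  if (state.newestDateReached.getTime() < yesterday.getTime()) gaps.push('forward');
  // Steps 3-4: then keep filling history while the backward crawl is unfinished.
  if (!state.finished) gaps.push('backward');
  return gaps;
}
```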

Daily Update Flow

  1. For finished crawls, only check for forward gaps
  2. Fetch data from newestDateReached + 1 to today
  3. Update newestDateReached so the data stays current
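
The forward range in step 2 is a small calculation; this helper is a sketch with an illustrative name:

```typescript
const ONE_DAY = 24 * 60 * 60 * 1000;

// Compute the forward-fill range from newestDateReached + 1 to today,
// or null when the symbol is already current.
function dailyUpdateRange(
  newestDateReached: Date,
  today: Date
): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + ONE_DAY);
  if (from.getTime() > today.getTime()) return null; // already current
  return { from, to: today };
}
```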

Benefits

  1. Resilient: Can handle interruptions gracefully
  2. Efficient: Avoids re-fetching data
  3. Trackable: Clear progress visibility
  4. Scalable: Can handle thousands of symbols
  5. Flexible: Configurable batch sizes and target dates

Usage Examples

Check if symbol X needs crawling:

```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```

Start crawl for symbol X only:

```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```

Schedule crawls for never-run symbols:

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```

Next Steps

  1. Monitor the crawl progress using the provided scripts
  2. Adjust batch sizes based on API rate limits
  3. Consider adding more sophisticated retry logic for failed batches
  4. Implement data validation to ensure quality