stock-bot/apps/stock/data-ingestion/docs/OPERATION_TRACKER_ENHANCEMENTS.md
2025-07-01 15:35:56 -04:00

Operation Tracker Enhancements for Intraday Crawling

Summary of Changes

This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.

Changes Made

1. Enhanced CrawlState Interface (types.ts)

Added new fields to track crawl progress:

  • newestDateReached: Track the most recent date processed
  • lastProcessedDate: For resumption after interruption
  • totalDaysProcessed: Progress tracking
  • targetOldestDate: The goal date to reach
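
The new fields can be sketched in TypeScript as follows. The field names come from the list above; the exact types, optionality, and the pre-existing `finished`/`oldestDateReached` fields are assumptions about the actual interface in `types.ts`:

```typescript
// Sketch of the enhanced CrawlState (field types are assumptions).
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;    // oldest bar date fetched so far
  newestDateReached?: Date;    // most recent date processed
  lastProcessedDate?: Date;    // resume point after an interruption
  totalDaysProcessed?: number; // progress counter
  targetOldestDate?: Date;     // goal date for the backward crawl
}

// A crawl is complete once the backward pass reaches (or passes) its target.
function isIntradayCrawlComplete(state: CrawlState): boolean {
  if (!state.oldestDateReached || !state.targetOldestDate) return false;
  return state.oldestDateReached.getTime() <= state.targetOldestDate.getTime();
}
```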

2. Updated OperationTracker (OperationTracker.ts)

  • Modified updateSymbolOperation to handle new crawl state fields
  • Updated bulkUpdateSymbolOperations for proper Date handling
  • Enhanced markCrawlFinished to track both oldest and newest dates
  • Added getSymbolsForIntradayCrawl: Specialized method for intraday crawls with gap detection
  • Added isIntradayCrawlComplete: Check if a crawl has reached its target
  • Added new indexes for efficient querying on crawl state fields
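
The gap check behind getSymbolsForIntradayCrawl might look like the following pure predicate. This is an illustrative sketch, not the tracker's actual query logic; the function name and shape are ours:

```typescript
// A symbol qualifies for an intraday crawl when it has never been crawled,
// has a forward gap (newestDateReached is behind yesterday), or its
// backward historical crawl is still unfinished.
interface SymbolCrawlState {
  finished: boolean;
  newestDateReached?: Date;
}

function needsIntradayCrawl(
  state: SymbolCrawlState | undefined,
  now: Date
): boolean {
  if (!state) return true; // never crawled
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  const forwardGap =
    !state.newestDateReached ||
    state.newestDateReached.getTime() < yesterday.getTime();
  const backwardGap = !state.finished;
  return forwardGap || backwardGap;
}
```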

3. New Intraday Crawl Action (intraday-crawl.action.ts)

Created a sophisticated crawl system with:

  • Bidirectional crawling: Handles both forward (new data) and backward (historical) gaps
  • Batch processing: Processes data in weekly batches by default
  • Resumption logic: Can resume from where it left off if interrupted
  • Gap detection: Automatically identifies missing date ranges
  • Completion tracking: Knows when the full history has been fetched
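
The backward batching can be sketched as a range calculator. The 7-day default matches the weekly batches mentioned above; the helper name and signature are illustrative, not the action's actual API:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Given the oldest date fetched so far, return the next [start, end] range
// to fetch, stepping backward in time and clamping at the target date.
// Returns null once the target has been reached.
function nextBackwardBatch(
  oldestReached: Date,
  targetOldest: Date,
  batchDays = 7
): { start: Date; end: Date } | null {
  const end = new Date(oldestReached.getTime() - DAY_MS); // day before what we have
  if (end.getTime() < targetOldest.getTime()) return null; // nothing left to fetch
  const start = new Date(
    Math.max(end.getTime() - (batchDays - 1) * DAY_MS, targetOldest.getTime())
  );
  return { start, end };
}
```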

4. Integration with QM Handler

  • Added new operations: crawl-intraday-data and schedule-intraday-crawls
  • Added a scheduled operation that automatically processes incomplete crawls every 4 hours
  • Integrated with the existing operation registry system

5. Testing and Monitoring Tools

  • Created test script to verify crawl functionality
  • Created status checking script to monitor crawl progress
  • Added comprehensive documentation

How It Works

Initial Crawl Flow

  1. Symbol starts with no crawl state
  2. First crawl fetches today's data and sets newestDateReached
  3. Subsequent batches crawl backward in time
  4. Each batch updates oldestDateReached and lastProcessedDate
  5. When oldestDateReached <= targetOldestDate, the crawl is marked finished
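
The five steps above can be sketched as a driver loop, with the actual data fetching stubbed out. All names here are illustrative, and the loop assumes calendar days rather than trading days:

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface State {
  oldestDateReached: Date;
  newestDateReached: Date;
  totalDaysProcessed: number;
  finished: boolean;
}

function runInitialCrawl(today: Date, targetOldest: Date, batchDays = 7): State {
  // Step 2: the first fetch covers today and sets newestDateReached.
  const state: State = {
    oldestDateReached: today,
    newestDateReached: today,
    totalDaysProcessed: 1,
    finished: false,
  };
  // Steps 3-4: walk backward one batch at a time, updating progress fields.
  while (state.oldestDateReached.getTime() > targetOldest.getTime()) {
    const next = new Date(
      Math.max(
        state.oldestDateReached.getTime() - batchDays * MS_PER_DAY,
        targetOldest.getTime()
      )
    );
    state.totalDaysProcessed += Math.round(
      (state.oldestDateReached.getTime() - next.getTime()) / MS_PER_DAY
    );
    state.oldestDateReached = next;
  }
  // Step 5: target reached, mark the crawl finished.
  state.finished = true;
  return state;
}
```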

Resumption Flow

  1. Check if newestDateReached < yesterday (forward gap)
  2. If yes, fetch new data first to stay current
  3. Check if finished = false (backward gap)
  4. If yes, continue backward crawl from lastProcessedDate
  5. Process in batches until complete
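
The ordering in steps 1-4 (forward gap first, then backward) can be expressed as a pure decision function; the names are illustrative, not the action's actual API:

```typescript
type GapKind = 'forward' | 'backward';

function gapsToProcess(
  state: { finished: boolean; newestDateReached: Date },
  yesterday: Date
): GapKind[] {
  const gaps: GapKind[] = [];
  // Steps 1-2: close the forward gap first so the symbol stays current.
  if (state.newestDateReached.getTime() < yesterday.getTime()) gaps.push('forward');
  // Steps 3-4: then keep filling history while the backward crawl is unfinished.
  if (!state.finished) gaps.push('backward');
  return gaps;
}
```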

Daily Update Flow

  1. For finished crawls, only check for forward gaps
  2. Fetch data from newestDateReached + 1 to today
  3. Update newestDateReached so the data stays current
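
The forward range in step 2 is a small calculation; this helper is a sketch with an illustrative name:

```typescript
const ONE_DAY = 24 * 60 * 60 * 1000;

// Compute the forward-fill range from newestDateReached + 1 to today,
// or null when the symbol is already current.
function dailyUpdateRange(
  newestDateReached: Date,
  today: Date
): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + ONE_DAY);
  if (from.getTime() > today.getTime()) return null; // already current
  return { from, to: today };
}
```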

Benefits

  1. Resilient: Can handle interruptions gracefully
  2. Efficient: Avoids re-fetching data
  3. Trackable: Clear progress visibility
  4. Scalable: Can handle thousands of symbols
  5. Flexible: Configurable batch sizes and target dates

Usage Examples

Check if symbol X needs crawling:

```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```

Start crawl for symbol X only:

```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```

Schedule crawls for never-run symbols:

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```

Next Steps

  1. Monitor the crawl progress using the provided scripts
  2. Adjust batch sizes based on API rate limits
  3. Consider adding more sophisticated retry logic for failed batches
  4. Implement data validation to ensure quality