# Operation Tracker Enhancements for Intraday Crawling

## Summary of Changes

This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.

## Changes Made

### 1. Enhanced CrawlState Interface (`types.ts`)

Added new fields to track crawl progress:
- `newestDateReached`: track the most recent date processed
- `lastProcessedDate`: for resumption after interruption
- `totalDaysProcessed`: progress tracking
- `targetOldestDate`: the goal date to reach
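The fields above can be sketched as an extension of the `CrawlState` interface. This is a hypothetical shape, not the actual contents of `types.ts`; the `finished` and `oldestDateReached` fields are inferred from later sections of this document.

```typescript
// Hypothetical sketch of the extended CrawlState. `finished` and
// `oldestDateReached` are inferred from later sections, not listed above.
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;    // furthest historical date fetched so far
  newestDateReached?: Date;    // most recent date processed
  lastProcessedDate?: Date;    // resumption point after an interruption
  totalDaysProcessed?: number; // progress tracking
  targetOldestDate?: Date;     // the goal date to reach
}

// Example state for a symbol mid-crawl
const state: CrawlState = {
  finished: false,
  newestDateReached: new Date('2024-06-01'),
  lastProcessedDate: new Date('2023-02-10'),
  totalDaysProcessed: 120,
  targetOldestDate: new Date('2020-01-01'),
};
```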
### 2. Updated OperationTracker (`OperationTracker.ts`)
- Modified `updateSymbolOperation` to handle the new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper Date handling
- Enhanced `markCrawlFinished` to track both oldest and newest dates
- Added `getSymbolsForIntradayCrawl`: specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: checks whether a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields
### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)

Created a sophisticated crawl system with:
- **Bidirectional crawling**: handles both forward (new data) and backward (historical) gaps
- **Batch processing**: processes data in weekly batches by default
- **Resumption logic**: can resume from where it left off if interrupted
- **Gap detection**: automatically identifies missing date ranges
- **Completion tracking**: knows when the full history has been fetched
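As a sketch of the weekly batching described above, the helper below computes the next backward batch window, clamped to the target date. The name `nextBackwardBatch` is hypothetical, not the actual implementation in `intraday-crawl.action.ts`.

```typescript
// Hypothetical helper: given the oldest date reached so far, compute the
// next weekly window to crawl backward. Returns null once the target is hit.
function nextBackwardBatch(
  oldestReached: Date,
  targetOldest: Date,
  batchDays = 7,
): { start: Date; end: Date } | null {
  const DAY = 86_400_000;
  if (oldestReached <= targetOldest) return null; // crawl complete
  const end = new Date(oldestReached.getTime() - DAY); // day before what we have
  const start = new Date(end.getTime() - (batchDays - 1) * DAY);
  // Clamp the window so it never overshoots the target date
  return { start: start < targetOldest ? targetOldest : start, end };
}
```

Each call yields the next 7-day window walking back in time; the clamp on `start` keeps the final batch from fetching data older than `targetOldestDate`.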
### 4. Integration with QM Handler
- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added a scheduled operation to automatically process incomplete crawls every 4 hours
- Integrated with the existing operation registry system
### 5. Testing and Monitoring Tools
- Created test script to verify crawl functionality
- Created status checking script to monitor crawl progress
- Added comprehensive documentation
## How It Works

### Initial Crawl Flow
- Symbol starts with no crawl state
- First crawl fetches today's data and sets `newestDateReached`
- Subsequent batches crawl backward in time
- Each batch updates `oldestDateReached` and `lastProcessedDate`
- When `oldestDateReached <= targetOldestDate`, the crawl is marked finished
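The flow above can be simulated with a small loop. `runBackwardCrawl` and the commented-out fetch call are hypothetical stand-ins, but the termination rule matches the one stated above: the crawl is finished once `oldestDateReached <= targetOldestDate`.

```typescript
// Simulated initial crawl: walk backward in weekly steps until the
// target date is reached, updating state the way each batch would.
type Crawl = { oldestDateReached: Date; lastProcessedDate: Date; finished: boolean };

function runBackwardCrawl(startFrom: Date, target: Date, batchDays = 7): Crawl {
  const DAY = 86_400_000;
  let oldest = startFrom;
  while (oldest > target) {
    // Step one batch backward, never past the target date
    const batchStart = new Date(Math.max(oldest.getTime() - batchDays * DAY, target.getTime()));
    // fetchBatch(batchStart, oldest)  // <- the real action would fetch here
    oldest = batchStart;
  }
  return { oldestDateReached: oldest, lastProcessedDate: oldest, finished: oldest <= target };
}
```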
### Resumption Flow
- Check if `newestDateReached < yesterday` (forward gap)
- If yes, fetch new data first to stay current
- Check if `finished = false` (backward gap)
- If yes, continue the backward crawl from `lastProcessedDate`
- Process in batches until complete
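The decision order described above (forward gap first, then backward) can be sketched with a hypothetical `nextCrawlDirection` helper:

```typescript
// Hypothetical sketch: decide which gap to fill first on resumption.
// Staying current (forward gap) takes priority over historical backfill.
function nextCrawlDirection(
  state: { newestDateReached: Date; finished: boolean },
  today: Date,
): 'forward' | 'backward' | 'none' {
  const DAY = 86_400_000;
  const yesterday = new Date(today.getTime() - DAY);
  if (state.newestDateReached < yesterday) return 'forward'; // missing recent data
  if (!state.finished) return 'backward';                    // history incomplete
  return 'none';                                             // fully caught up
}
```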
### Daily Update Flow
- For finished crawls, only check for forward gaps
- Fetch data from `newestDateReached + 1` to today
- Update `newestDateReached` to maintain currency
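The `newestDateReached + 1` to today rule can be sketched as follows; `forwardGapRange` is a hypothetical helper, not the actual implementation:

```typescript
// Hypothetical sketch of the daily forward-gap range: from the day after
// newestDateReached up to today, or null when the symbol is already current.
function forwardGapRange(newestReached: Date, today: Date): { from: Date; to: Date } | null {
  const DAY = 86_400_000;
  const from = new Date(newestReached.getTime() + DAY); // newestDateReached + 1 day
  if (from > today) return null;                        // nothing to fetch
  return { from, to: today };
}
```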
## Benefits
- **Resilient**: handles interruptions gracefully
- **Efficient**: avoids re-fetching data
- **Trackable**: clear progress visibility
- **Scalable**: can handle thousands of symbols
- **Flexible**: configurable batch sizes and target dates
## Usage Examples

Check if symbol X needs crawling:

```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```
Start a crawl for symbol X only:

```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```
Schedule crawls for never-run symbols:

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```
## Next Steps
- Monitor the crawl progress using the provided scripts
- Adjust batch sizes based on API rate limits
- Consider adding more sophisticated retry logic for failed batches
- Implement data validation to ensure quality