stock-bot/apps/stock/data-ingestion/docs/OPERATION_TRACKER_ENHANCEMENTS.md
2025-07-01 15:35:56 -04:00


# Operation Tracker Enhancements for Intraday Crawling
## Summary of Changes
This document summarizes the enhancements made to the operation tracker to support resumable, gap-aware intraday data crawling.
## Changes Made
### 1. Enhanced CrawlState Interface (`types.ts`)
Added new fields to track crawl progress:
- `newestDateReached`: Track the most recent date processed
- `lastProcessedDate`: For resumption after interruption
- `totalDaysProcessed`: Progress tracking
- `targetOldestDate`: The goal date to reach
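Together with the pre-existing `oldestDateReached` and `finished` fields referenced later in this document, the extended interface can be sketched as follows (a hypothetical shape; the optional markers and exact types in `types.ts` may differ):

```typescript
// Hypothetical sketch of the extended CrawlState interface. Field names are
// taken from this document; optionality and exact types are assumptions.
interface CrawlState {
  oldestDateReached?: Date;   // pre-existing field used by the flows below
  newestDateReached?: Date;   // most recent date processed
  lastProcessedDate?: Date;   // resumption point after an interruption
  totalDaysProcessed: number; // progress counter
  targetOldestDate: Date;     // the goal date to reach
  finished: boolean;          // set once targetOldestDate is reached
}

// Example: a fresh state for a symbol that has never been crawled.
const freshState: CrawlState = {
  totalDaysProcessed: 0,
  targetOldestDate: new Date('2020-01-01'),
  finished: false,
};
```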
### 2. Updated OperationTracker (`OperationTracker.ts`)
- Modified `updateSymbolOperation` to handle new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper Date handling
- Enhanced `markCrawlFinished` to track both oldest and newest dates
- Added `getSymbolsForIntradayCrawl`: Specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: Check if a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields
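The completion check boils down to a date comparison. A minimal standalone sketch of the logic behind `isIntradayCrawlComplete` (the real tracker method presumably reads the stored crawl state; this free function is an assumption):

```typescript
// Standalone sketch of the completion test; not the actual tracker method.
function isCrawlComplete(
  oldestDateReached: Date | undefined,
  targetOldestDate: Date
): boolean {
  // A crawl is complete once the backward pass has reached or passed the target.
  return oldestDateReached !== undefined && oldestDateReached <= targetOldestDate;
}
```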
### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)
Created a sophisticated crawl system with:
- **Bidirectional crawling**: Handles both forward (new data) and backward (historical) gaps
- **Batch processing**: Processes data in weekly batches by default
- **Resumption logic**: Can resume from where it left off if interrupted
- **Gap detection**: Automatically identifies missing date ranges
- **Completion tracking**: Knows when the full history has been fetched
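The weekly batching can be pictured as slicing the remaining backward range into seven-day windows, newest first. An illustrative sketch (the helper name and exact window boundaries are assumptions, not the action's code):

```typescript
// Illustrative only: split a backward crawl into weekly [start, end] batches,
// newest first, clamped to the oldest date still needed.
function weeklyBatches(newest: Date, oldest: Date): Array<{ start: Date; end: Date }> {
  const DAY = 24 * 60 * 60 * 1000;
  const batches: Array<{ start: Date; end: Date }> = [];
  let end = newest;
  while (end > oldest) {
    // Each batch covers up to 7 calendar days, ending at `end`.
    const start = new Date(Math.max(end.getTime() - 6 * DAY, oldest.getTime()));
    batches.push({ start, end });
    end = new Date(start.getTime() - DAY); // next batch ends the day before
  }
  return batches;
}
```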
### 4. Integration with QM Handler
- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added scheduled operation to automatically process incomplete crawls every 4 hours
- Integrated with the existing operation registry system
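The registry wiring might look roughly like this; only the operation names and the 4-hour cadence come from this document, while the entry shape and cron syntax are assumptions:

```typescript
// Assumed registry-entry shape; the actual operation registry API may differ.
const intradayOperations = {
  'crawl-intraday-data': { handler: 'crawlIntradayData' },
  'schedule-intraday-crawls': {
    handler: 'scheduleIntradayCrawls',
    schedule: '0 */4 * * *', // every 4 hours, per the scheduled operation above
  },
};
```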
### 5. Testing and Monitoring Tools
- Created test script to verify crawl functionality
- Created status checking script to monitor crawl progress
- Added comprehensive documentation
## How It Works
### Initial Crawl Flow
1. Symbol starts with no crawl state
2. First crawl fetches today's data and sets `newestDateReached`
3. Subsequent batches crawl backward in time
4. Each batch updates `oldestDateReached` and `lastProcessedDate`
5. When `oldestDateReached <= targetOldestDate`, crawl is marked finished
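The steps above can be traced with a toy state object (the update logic is an assumed simplification of the tracker's behavior, not the production code):

```typescript
// Toy walk-through of the initial crawl flow; field names from this document.
interface State {
  newestDateReached?: Date;
  oldestDateReached?: Date;
  lastProcessedDate?: Date;
  finished: boolean;
}

const target = new Date('2020-01-01');
const state: State = { finished: false }; // step 1: no crawl state yet

// Step 2: the first crawl fetches today's data.
state.newestDateReached = new Date('2024-06-01');
state.oldestDateReached = new Date('2024-06-01');

// Steps 3-4: each backward batch moves oldestDateReached earlier.
function applyBackwardBatch(s: State, batchOldest: Date): void {
  s.oldestDateReached = batchOldest;
  s.lastProcessedDate = batchOldest;
  // Step 5: mark finished once the target date is reached or passed.
  if (batchOldest <= target) s.finished = true;
}

applyBackwardBatch(state, new Date('2022-01-01'));
applyBackwardBatch(state, new Date('2019-12-28'));
```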
### Resumption Flow
1. Check if `newestDateReached < yesterday` (forward gap)
2. If yes, fetch new data first to stay current
3. Check if `finished = false` (backward gap)
4. If yes, continue backward crawl from `lastProcessedDate`
5. Process in batches until complete
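The decision at the top of this flow can be sketched as a small pure function (the helper name and return values are hypothetical; only the ordering of checks comes from the steps above):

```typescript
// Sketch of the resumption decision: forward gaps are filled before the
// backward crawl continues.
type NextAction = 'fetch-forward' | 'continue-backward' | 'up-to-date';

function nextAction(
  newestDateReached: Date,
  finished: boolean,
  yesterday: Date
): NextAction {
  if (newestDateReached < yesterday) return 'fetch-forward'; // forward gap first
  if (!finished) return 'continue-backward';                 // backward gap remains
  return 'up-to-date';
}
```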
### Daily Update Flow
1. For finished crawls, only check for forward gaps
2. Fetch data from `newestDateReached + 1` to today
3. Update `newestDateReached` so the data stays current
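The fetch range for this flow reduces to one computation; a minimal sketch (the helper name is hypothetical):

```typescript
// Compute the daily-update fetch range: from the day after newestDateReached
// through today, or null if the data is already current.
function dailyUpdateRange(
  newestDateReached: Date,
  today: Date
): { from: Date; to: Date } | null {
  const DAY = 24 * 60 * 60 * 1000;
  const from = new Date(newestDateReached.getTime() + DAY);
  return from <= today ? { from, to: today } : null;
}
```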
## Benefits
1. **Resilient**: Can handle interruptions gracefully
2. **Efficient**: Avoids re-fetching data
3. **Trackable**: Clear progress visibility
4. **Scalable**: Can handle thousands of symbols
5. **Flexible**: Configurable batch sizes and target dates
## Usage Examples
### Check if symbol X needs crawling:
```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```
### Start crawl for symbol X only:
```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```
### Schedule crawls for never-run symbols:
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```
## Next Steps
1. Monitor the crawl progress using the provided scripts
2. Adjust batch sizes based on API rate limits
3. Consider adding more sophisticated retry logic for failed batches
4. Implement data validation to ensure quality