# Operation Tracker Enhancements for Intraday Crawling
## Summary of Changes

This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.

## Changes Made

### 1. Enhanced CrawlState Interface (`types.ts`)

Added new fields to track crawl progress:

- `newestDateReached`: Track the most recent date processed
- `lastProcessedDate`: For resumption after interruption
- `totalDaysProcessed`: Progress tracking
- `targetOldestDate`: The goal date to reach
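
For reference, the new fields fit into the interface roughly as follows. This is a sketch only: the pre-existing fields (`finished`, `oldestDateReached`) and the optionality of each field are inferred from the flows described later in this document, not taken from `types.ts` directly.

```typescript
// Sketch of the extended CrawlState. `finished` and `oldestDateReached`
// are assumed pre-existing fields; optionality is an assumption.
interface CrawlState {
  finished: boolean;           // true once targetOldestDate has been reached
  oldestDateReached?: Date;    // furthest point back in time fetched so far
  newestDateReached?: Date;    // most recent date processed
  lastProcessedDate?: Date;    // where to resume after an interruption
  totalDaysProcessed?: number; // progress tracking
  targetOldestDate?: Date;     // the goal date to reach
}

// Example of a crawl that is mid-way through its backward pass:
const example: CrawlState = {
  finished: false,
  newestDateReached: new Date('2024-01-05'),
  lastProcessedDate: new Date('2023-06-01'),
  totalDaysProcessed: 120,
  targetOldestDate: new Date('2020-01-01'),
};
```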

### 2. Updated OperationTracker (`OperationTracker.ts`)

- Modified `updateSymbolOperation` to handle new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper `Date` handling
- Enhanced `markCrawlFinished` to track both oldest and newest dates
- Added `getSymbolsForIntradayCrawl`: Specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: Check if a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields
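
The completion check reduces to a date comparison. The following is a hypothetical sketch of the logic behind `isIntradayCrawlComplete`; the real method presumably reads the stored crawl state from the tracker, so the parameter shape here is an assumption:

```typescript
// Hypothetical sketch: a crawl is complete once it is explicitly marked
// finished, or once the backward crawl has reached (or passed) the target.
function isIntradayCrawlComplete(state: {
  finished: boolean;
  oldestDateReached?: Date;
  targetOldestDate?: Date;
}): boolean {
  if (state.finished) return true;
  // Without both dates we cannot prove completion.
  if (!state.oldestDateReached || !state.targetOldestDate) return false;
  return state.oldestDateReached.getTime() <= state.targetOldestDate.getTime();
}
```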

### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)

Created a sophisticated crawl system with:

- **Bidirectional crawling**: Handles both forward (new data) and backward (historical) gaps
- **Batch processing**: Processes data in weekly batches by default
- **Resumption logic**: Can resume from where it left off if interrupted
- **Gap detection**: Automatically identifies missing date ranges
- **Completion tracking**: Knows when the full history has been fetched
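
The weekly batching can be illustrated as a pure date-window computation. `nextBackwardBatch` is a hypothetical helper, not the action's actual API; it shows how a backward batch might step back from the last processed date while being clamped to the target oldest date:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Illustrative sketch (names are assumptions): compute the next backward
// batch window, or null when the target date has already been reached.
function nextBackwardBatch(
  lastProcessedDate: Date,
  targetOldestDate: Date,
  batchDays = 7 // weekly batches by default
): { start: Date; end: Date } | null {
  // The next batch ends the day before the last processed date.
  const end = new Date(lastProcessedDate.getTime() - DAY_MS);
  if (end.getTime() < targetOldestDate.getTime()) return null; // nothing left
  // Step back one batch, but never past the target oldest date.
  const start = new Date(
    Math.max(end.getTime() - (batchDays - 1) * DAY_MS, targetOldestDate.getTime())
  );
  return { start, end };
}
```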

### 4. Integration with QM Handler

- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added a scheduled operation to automatically process incomplete crawls every 4 hours
- Integrated with the existing operation registry system

### 5. Testing and Monitoring Tools

- Created a test script to verify crawl functionality
- Created a status-checking script to monitor crawl progress
- Added comprehensive documentation

## How It Works

### Initial Crawl Flow

1. Symbol starts with no crawl state
2. First crawl fetches today's data and sets `newestDateReached`
3. Subsequent batches crawl backward in time
4. Each batch updates `oldestDateReached` and `lastProcessedDate`
5. When `oldestDateReached <= targetOldestDate`, crawl is marked finished
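
The per-batch bookkeeping in steps 4–5 can be sketched as a pure state transition, with persistence omitted. `applyBatch` and its parameter names are illustrative, not the action's actual API:

```typescript
// Minimal state carried between batches (a subset of the crawl state).
interface BatchState {
  oldestDateReached?: Date;
  lastProcessedDate?: Date;
  finished: boolean;
}

// After each backward batch, record how far back we got; once the batch
// reaches the target oldest date, mark the crawl finished (step 5).
function applyBatch(state: BatchState, batchOldest: Date, target: Date): BatchState {
  const next: BatchState = {
    ...state,
    oldestDateReached: batchOldest,
    lastProcessedDate: batchOldest,
  };
  if (batchOldest.getTime() <= target.getTime()) next.finished = true;
  return next;
}
```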

### Resumption Flow

1. Check if `newestDateReached < yesterday` (forward gap)
2. If yes, fetch new data first to stay current
3. Check if `finished = false` (backward gap)
4. If yes, continue backward crawl from `lastProcessedDate`
5. Process in batches until complete
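
The decision in steps 1–4 can be sketched as a small classifier. `classifyGap` is a hypothetical name; `yesterday` is passed in to keep the function pure. Note that the forward gap deliberately takes priority, so a symbol stays current before its history is backfilled:

```typescript
type GapKind = 'forward' | 'backward' | 'none';

// Hedged sketch of the resumption decision described above.
function classifyGap(
  state: { newestDateReached?: Date; finished: boolean },
  yesterday: Date
): GapKind {
  // Steps 1-2: missing or stale newest data means a forward gap.
  if (!state.newestDateReached || state.newestDateReached.getTime() < yesterday.getTime()) {
    return 'forward';
  }
  // Steps 3-4: an unfinished crawl still has history to fetch backward.
  if (!state.finished) return 'backward';
  return 'none';
}
```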

### Daily Update Flow

1. For finished crawls, only check for forward gaps
2. Fetch data from `newestDateReached + 1` to today
3. Update `newestDateReached` to maintain currency
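
The forward fetch range in step 2 is simple date arithmetic. This sketch assumes calendar-day granularity and UTC dates; the helper name is an assumption:

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Sketch: the daily forward-update range runs from the day after
// `newestDateReached` up to `today`, or is null when already current.
function forwardUpdateRange(
  newestDateReached: Date,
  today: Date
): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + MS_PER_DAY);
  if (from.getTime() > today.getTime()) return null; // nothing to fetch
  return { from, to: today };
}
```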

## Benefits

1. **Resilient**: Can handle interruptions gracefully
2. **Efficient**: Avoids re-fetching data
3. **Trackable**: Clear progress visibility
4. **Scalable**: Can handle thousands of symbols
5. **Flexible**: Configurable batch sizes and target dates

## Usage Examples

### Check if symbol X needs crawling

```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```

### Start a crawl for symbol X only

```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```

### Schedule crawls for never-run symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```

## Next Steps

1. Monitor the crawl progress using the provided scripts
2. Adjust batch sizes based on API rate limits
3. Consider adding more sophisticated retry logic for failed batches
4. Implement data validation to ensure quality