Commit 960daf4cad (parent 710577eb3d): work on qm filings. 17 changed files with 2319 additions and 32 deletions.
New file (107 lines): apps/stock/data-ingestion/docs/OPERATION_TRACKER_ENHANCEMENTS.md

# Operation Tracker Enhancements for Intraday Crawling

## Summary of Changes

This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.

## Changes Made

### 1. Enhanced CrawlState Interface (`types.ts`)

Added new fields to track crawl progress:

- `newestDateReached`: the most recent date processed
- `lastProcessedDate`: the resumption point after an interruption
- `totalDaysProcessed`: progress tracking
- `targetOldestDate`: the goal date the backward crawl must reach

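Given the field list above, the enhanced interface might look like the following sketch. The two pre-existing fields and the helper function are assumptions for illustration, not taken from the actual `types.ts`:

```typescript
// Hypothetical sketch of the enhanced CrawlState; only the four new
// fields come from this document, the rest is illustrative.
interface CrawlState {
  finished: boolean;            // assumed pre-existing field
  oldestDateReached?: Date;     // assumed pre-existing field
  newestDateReached?: Date;     // most recent date processed
  lastProcessedDate?: Date;     // resumption point after interruption
  totalDaysProcessed?: number;  // progress tracking
  targetOldestDate?: Date;      // the goal date to reach
}

// A crawl is complete once the backward pass has reached its target date.
function isComplete(state: CrawlState): boolean {
  return (
    state.oldestDateReached !== undefined &&
    state.targetOldestDate !== undefined &&
    state.oldestDateReached.getTime() <= state.targetOldestDate.getTime()
  );
}
```
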
### 2. Updated OperationTracker (`OperationTracker.ts`)

- Modified `updateSymbolOperation` to handle the new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper `Date` handling
- Enhanced `markCrawlFinished` to track both the oldest and newest dates reached
- Added `getSymbolsForIntradayCrawl`: a specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: checks whether a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields

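The selection logic behind a method like `getSymbolsForIntradayCrawl` reduces to a per-symbol predicate. This is a sketch only: the real method presumably runs an indexed database query, and `SymbolCrawlInfo` / `needsIntradayCrawl` are hypothetical names:

```typescript
// Hypothetical per-symbol view of the crawl state.
interface SymbolCrawlInfo {
  symbol: string;
  newestDateReached?: Date;
  finished: boolean;
}

// A symbol needs an intraday crawl when it has either a forward gap
// (newest data is behind "yesterday") or an unfinished backward crawl.
function needsIntradayCrawl(info: SymbolCrawlInfo, now: Date): boolean {
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  const forwardGap =
    info.newestDateReached === undefined ||
    info.newestDateReached.getTime() < yesterday.getTime();
  const backwardGap = !info.finished;
  return forwardGap || backwardGap;
}
```
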
### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)

Created a sophisticated crawl system with:

- **Bidirectional crawling**: handles both forward (new data) and backward (historical) gaps
- **Batch processing**: processes data in weekly batches by default
- **Resumption logic**: resumes from where it left off after an interruption
- **Gap detection**: automatically identifies missing date ranges
- **Completion tracking**: knows when the full history has been fetched

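The batch-processing idea can be sketched as a small date-arithmetic helper (hypothetical name; weekly batches as described above, clamped so the crawl never overshoots the target):

```typescript
// Compute the next backward batch of dates to fetch (weekly by default).
// Hypothetical helper; the name is not from the actual codebase.
function nextBackwardBatch(
  lastProcessedDate: Date,
  targetOldestDate: Date,
  batchDays = 7
): { start: Date; end: Date; done: boolean } | null {
  if (lastProcessedDate.getTime() <= targetOldestDate.getTime()) {
    return null; // target already reached; nothing left to fetch backward
  }
  const end = new Date(lastProcessedDate);
  end.setUTCDate(end.getUTCDate() - 1); // day before the last processed date
  const start = new Date(end);
  start.setUTCDate(start.getUTCDate() - (batchDays - 1));
  if (start.getTime() < targetOldestDate.getTime()) {
    start.setTime(targetOldestDate.getTime()); // clamp to the goal date
  }
  return { start, end, done: start.getTime() <= targetOldestDate.getTime() };
}
```
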
### 4. Integration with QM Handler

- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added a scheduled operation that automatically processes incomplete crawls every 4 hours
- Integrated with the existing operation registry system

### 5. Testing and Monitoring Tools

- Created a test script to verify crawl functionality
- Created a status-checking script to monitor crawl progress
- Added comprehensive documentation

## How It Works

### Initial Crawl Flow

1. A symbol starts with no crawl state
2. The first crawl fetches today's data and sets `newestDateReached`
3. Subsequent batches crawl backward in time
4. Each batch updates `oldestDateReached` and `lastProcessedDate`
5. When `oldestDateReached <= targetOldestDate`, the crawl is marked finished

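Steps 3-5 above can be simulated with a small state-transition function (hypothetical names; ISO date strings stand in for the real `Date` fields):

```typescript
// Minimal state for simulating the backward crawl; names are illustrative.
interface SimState {
  oldestDateReached?: string;
  lastProcessedDate: string;
  totalDaysProcessed: number;
  finished: boolean;
}

// One simulated batch: move `batchDays` further into the past and mark
// the crawl finished once the target oldest date has been reached.
function applyBackwardBatch(s: SimState, targetOldestDate: string, batchDays = 7): SimState {
  const d = new Date(s.lastProcessedDate);
  d.setUTCDate(d.getUTCDate() - batchDays);
  let reached = d.toISOString().slice(0, 10);
  if (reached < targetOldestDate) reached = targetOldestDate; // clamp to the goal
  return {
    oldestDateReached: reached,
    lastProcessedDate: reached,
    totalDaysProcessed: s.totalDaysProcessed + batchDays,
    finished: reached <= targetOldestDate,
  };
}
```
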
### Resumption Flow

1. Check whether `newestDateReached < yesterday` (forward gap)
2. If so, fetch new data first to stay current
3. Check whether `finished = false` (backward gap)
4. If so, continue the backward crawl from `lastProcessedDate`
5. Process in batches until complete

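The resumption decision above can be sketched as a pure function (hypothetical names; `today` is passed in for testability, and ISO date strings compare correctly as plain strings):

```typescript
// What a resumed crawl should do next, following the steps above.
type NextAction =
  | { kind: 'forward'; from: string; to: string }
  | { kind: 'backward'; resumeFrom: string }
  | { kind: 'complete' };

function addDays(iso: string, n: number): string {
  const d = new Date(iso);
  d.setUTCDate(d.getUTCDate() + n);
  return d.toISOString().slice(0, 10);
}

function nextAction(
  state: { newestDateReached?: string; lastProcessedDate?: string; finished: boolean },
  today: string
): NextAction {
  // Steps 1-2: close the forward gap first to stay current.
  if (!state.newestDateReached) {
    return { kind: 'forward', from: today, to: today }; // first crawl fetches today
  }
  if (state.newestDateReached < addDays(today, -1)) {
    return { kind: 'forward', from: addDays(state.newestDateReached, 1), to: today };
  }
  // Steps 3-4: continue the backward crawl from the resumption point.
  if (!state.finished && state.lastProcessedDate) {
    return { kind: 'backward', resumeFrom: state.lastProcessedDate };
  }
  return { kind: 'complete' };
}
```
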
### Daily Update Flow

1. For finished crawls, only check for forward gaps
2. Fetch data from `newestDateReached + 1` to today
3. Update `newestDateReached` so the data stays current

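The forward-gap range in step 2 can be sketched as follows (hypothetical helper name; ISO date strings in UTC):

```typescript
// Range a daily update should fetch for a finished crawl: the day after
// newestDateReached through today, or null when already current.
function dailyUpdateRange(
  newestDateReached: string,
  today: string
): { from: string; to: string } | null {
  if (newestDateReached >= today) return null; // nothing to fetch
  const d = new Date(newestDateReached);
  d.setUTCDate(d.getUTCDate() + 1);
  return { from: d.toISOString().slice(0, 10), to: today };
}
```
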
## Benefits

1. **Resilient**: handles interruptions gracefully
2. **Efficient**: avoids re-fetching data
3. **Trackable**: clear progress visibility
4. **Scalable**: handles thousands of symbols
5. **Flexible**: configurable batch sizes and target dates

## Usage Examples

### Check if symbol X needs crawling:

```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  targetOldestDate: new Date('2020-01-01')
});
// Symbol X needs crawling if it appears among the returned candidates.
const symbolX = symbols.find(s => s.symbol === 'X');
```

### Start crawl for symbol X only:

```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```

### Schedule crawls for never-run symbols:

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```

## Next Steps

1. Monitor crawl progress using the provided scripts
2. Adjust batch sizes based on API rate limits
3. Consider adding more sophisticated retry logic for failed batches
4. Implement data validation to ensure quality