work on qm filings
commit 960daf4cad (parent 710577eb3d)
17 changed files with 2319 additions and 32 deletions
apps/stock/data-ingestion/docs/INTRADAY_CRAWL.md (new file, 122 lines)
@@ -0,0 +1,122 @@
# Intraday Crawl System

## Overview

The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.
## Key Features

1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched
## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```
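As an illustration of how these fields drive the crawler's decisions, here is a minimal sketch of a hypothetical `remainingWork` helper (not part of the tracker API) that derives the two possible gaps from a `CrawlState`:

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
  totalDaysProcessed?: number;
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: derive which gaps remain from a CrawlState.
function remainingWork(state: CrawlState, now: Date = new Date()) {
  const yesterday = new Date(now.getTime() - DAY_MS);
  // Forward gap: the newest data we hold is older than yesterday.
  const forwardGap =
    state.newestDateReached !== undefined &&
    state.newestDateReached.getTime() < yesterday.getTime();
  // Backward gap: crawl not finished and target date not yet reached.
  const backwardGap =
    !state.finished &&
    state.oldestDateReached !== undefined &&
    state.targetOldestDate !== undefined &&
    state.oldestDateReached.getTime() > state.targetOldestDate.getTime();
  return { forwardGap, backwardGap };
}
```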
## How It Works

### Initial Crawl

1. Starts from today and fetches current data
2. Then begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks the crawl as finished when complete
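The backward phase of the steps above can be sketched as a simple loop (illustrative only; `fetchRange` stands in for the real per-batch fetch, which is not shown here):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Illustrative backward crawl loop: walk from `start` toward
// `targetOldestDate` in fixed-size batches. `fetchRange` is a stand-in
// for the real data fetch.
async function crawlBackward(
  fetchRange: (from: Date, to: Date) => Promise<void>,
  start: Date,
  targetOldestDate: Date,
  batchDays = 7,
): Promise<void> {
  let to = start;
  while (to.getTime() > targetOldestDate.getTime()) {
    const from = new Date(
      Math.max(to.getTime() - batchDays * DAY_MS, targetOldestDate.getTime()),
    );
    await fetchRange(from, to);
    to = from; // next batch ends where this one began
  }
}
```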
### Resumption After Interruption

1. Checks for a forward gap: if `newestDateReached < yesterday`, fetches new data first
2. Checks for a backward gap: if not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data
### Daily Updates

Once a symbol is fully crawled:

- Only needs to fetch new data (forward crawl)
- Much faster, as it's typically just 1-2 days of data
## Usage

### Manual Crawl for Single Symbol

```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```
### Schedule Crawls for Multiple Symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```
### Check Crawl Status

```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01')
);
```
### Get Symbols Needing Crawl

```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```
## Priority Modes

- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing
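The four modes can be read as predicates over a symbol's tracking record. A sketch using a hypothetical, simplified record shape (the real tracker stores more fields):

```typescript
type PriorityMode = 'never_run' | 'incomplete' | 'stale' | 'all';

// Simplified, hypothetical view of a symbol's tracking record.
interface SymbolRecord {
  hasRun: boolean;      // has any crawl ever run?
  finished: boolean;    // has the backward crawl reached its target?
  hasNewData: boolean;  // is there a forward gap?
}

function matchesMode(r: SymbolRecord, mode: PriorityMode): boolean {
  switch (mode) {
    case 'never_run':  return !r.hasRun;
    case 'incomplete': return r.hasRun && !r.finished;
    case 'stale':      return r.finished && r.hasNewData;
    case 'all':        return !r.hasRun || !r.finished || r.hasNewData;
  }
}
```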
## Scheduled Operations

The system includes scheduled operations:

- `schedule-intraday-crawls-batch`: Runs every 4 hours, processes incomplete crawls
## Monitoring

Use the provided scripts to monitor crawl progress:

```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```
## Performance Considerations

1. **Rate Limiting**: Delays between API calls to avoid rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: Configurable batch size (default 7 days) balances progress vs memory
4. **Priority Scheduling**: Higher priority for current data updates
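Weekend skipping (point 2) comes down to a weekday check; a minimal sketch that ignores market holidays:

```typescript
// Intraday bars only exist for trading days, so Saturdays and Sundays
// can be skipped without an API call. Market holidays are not handled here.
function isTradingDay(d: Date): boolean {
  const day = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
  return day !== 0 && day !== 6;
}
```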
## Error Handling

- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation
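The first point ("failed batches don't stop the entire crawl") amounts to per-batch error isolation; a hedged sketch of the idea:

```typescript
// Run batches sequentially; record failures instead of aborting.
// Partial success (some ok, some failed) is tracked separately.
async function runBatches(batches: Array<() => Promise<void>>) {
  let ok = 0;
  let failed = 0;
  for (const batch of batches) {
    try {
      await batch();
      ok++;
    } catch {
      failed++; // logged and stored in the real system; swallowed here
    }
  }
  return { ok, failed, partialSuccess: ok > 0 && failed > 0 };
}
```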
apps/stock/data-ingestion/docs/OPERATION_TRACKER_ENHANCEMENTS.md (new file, 107 lines)
@@ -0,0 +1,107 @@
# Operation Tracker Enhancements for Intraday Crawling

## Summary of Changes

This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.
## Changes Made

### 1. Enhanced CrawlState Interface (`types.ts`)

Added new fields to track crawl progress:

- `newestDateReached`: Track the most recent date processed
- `lastProcessedDate`: For resumption after interruption
- `totalDaysProcessed`: Progress tracking
- `targetOldestDate`: The goal date to reach
### 2. Updated OperationTracker (`OperationTracker.ts`)

- Modified `updateSymbolOperation` to handle new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper Date handling
- Enhanced `markCrawlFinished` to track both oldest and newest dates
- Added `getSymbolsForIntradayCrawl`: Specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: Check if a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields
### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)

Created a sophisticated crawl system with:

- **Bidirectional crawling**: Handles both forward (new data) and backward (historical) gaps
- **Batch processing**: Processes data in weekly batches by default
- **Resumption logic**: Can resume from where it left off if interrupted
- **Gap detection**: Automatically identifies missing date ranges
- **Completion tracking**: Knows when the full history has been fetched
### 4. Integration with QM Handler

- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added a scheduled operation to automatically process incomplete crawls every 4 hours
- Integrated with the existing operation registry system
### 5. Testing and Monitoring Tools

- Created a test script to verify crawl functionality
- Created a status checking script to monitor crawl progress
- Added comprehensive documentation
## How It Works

### Initial Crawl Flow

1. Symbol starts with no crawl state
2. First crawl fetches today's data and sets `newestDateReached`
3. Subsequent batches crawl backward in time
4. Each batch updates `oldestDateReached` and `lastProcessedDate`
5. When `oldestDateReached <= targetOldestDate`, the crawl is marked finished
### Resumption Flow

1. Check if `newestDateReached < yesterday` (forward gap)
2. If yes, fetch new data first to stay current
3. Check if `finished = false` (backward gap)
4. If yes, continue the backward crawl from `lastProcessedDate`
5. Process in batches until complete
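Steps 1-4 form a small decision ladder. An illustrative sketch (field names taken from `CrawlState`; the function itself is hypothetical):

```typescript
type NextAction = 'fetch_forward' | 'resume_backward' | 'done';

// Decide what a resuming crawl should do next, per the steps above.
function nextAction(
  state: { finished: boolean; newestDateReached?: Date },
  yesterday: Date,
): NextAction {
  if (state.newestDateReached && state.newestDateReached.getTime() < yesterday.getTime()) {
    return 'fetch_forward';   // close the forward gap first
  }
  if (!state.finished) {
    return 'resume_backward'; // continue from lastProcessedDate
  }
  return 'done';
}
```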
### Daily Update Flow

1. For finished crawls, only check for forward gaps
2. Fetch data from `newestDateReached + 1` to today
3. Update `newestDateReached` to stay current
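Step 2's fetch window can be computed directly from `newestDateReached`; a minimal sketch assuming day-granularity dates:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Compute the forward fetch window for a finished crawl, or null if
// the symbol is already current.
function forwardWindow(
  newestDateReached: Date,
  today: Date,
): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + DAY_MS);
  if (from.getTime() > today.getTime()) return null; // nothing to fetch
  return { from, to: today };
}
```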
## Benefits

1. **Resilient**: Can handle interruptions gracefully
2. **Efficient**: Avoids re-fetching data
3. **Trackable**: Clear progress visibility
4. **Scalable**: Can handle thousands of symbols
5. **Flexible**: Configurable batch sizes and target dates
## Usage Examples

### Check if symbol X needs crawling

```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```
### Start a crawl for symbol X only

```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```
### Schedule crawls for never-run symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```
## Next Steps

1. Monitor crawl progress using the provided scripts
2. Adjust batch sizes based on API rate limits
3. Consider adding more sophisticated retry logic for failed batches
4. Implement data validation to ensure quality