stock-bot/apps/stock/data-ingestion/docs/INTRADAY_CRAWL.md
2025-07-01 15:35:56 -04:00


# Intraday Crawl System
## Overview
The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.
## Key Features
1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched
## Crawl State Fields
The system tracks the following state for each symbol:
```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```
## How It Works
### Initial Crawl
1. Starts from today and fetches current data
2. Then begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks as finished when complete
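The weekly backward batching above can be sketched as a small planner that walks from a start date back to the target date. This is an illustrative sketch, not the actual implementation; `planBackwardBatches` and `BatchRange` are hypothetical names.

```typescript
// Hypothetical sketch of the backward crawl plan: step back in
// fixed-size batches until targetOldestDate is reached.
interface BatchRange {
  start: Date; // inclusive, older bound
  end: Date;   // inclusive, newer bound
}

function planBackwardBatches(
  from: Date,
  targetOldestDate: Date,
  batchSizeDays = 7
): BatchRange[] {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const batches: BatchRange[] = [];
  let end = new Date(from);
  while (end >= targetOldestDate) {
    // Each batch covers batchSizeDays days, clamped at the target.
    let start = new Date(end.getTime() - (batchSizeDays - 1) * DAY_MS);
    if (start < targetOldestDate) start = new Date(targetOldestDate);
    batches.push({ start, end });
    // The next batch ends one day before this one starts.
    end = new Date(start.getTime() - DAY_MS);
  }
  return batches;
}
```

Each batch is fetched and persisted before the next one is planned, so an interruption loses at most one batch of work.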
### Resumption After Interruption
1. Checks for forward gap: If `newestDateReached < yesterday`, fetches new data first
2. Checks for backward gap: If not finished and `oldestDateReached > targetOldestDate`, continues backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data
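The decision between the forward and backward gap checks can be sketched as a pure function over the crawl state. This is an illustrative sketch under the assumptions above; `nextCrawlAction` and `ResumeState` are hypothetical names.

```typescript
// Hypothetical sketch of the resumption decision: forward gaps are
// filled before the backward historical crawl continues.
type CrawlAction = 'forward' | 'backward' | 'done';

interface ResumeState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  targetOldestDate: Date;
}

function nextCrawlAction(state: ResumeState, yesterday: Date): CrawlAction {
  // 1. Forward gap: new data has accumulated since the last crawl.
  if (!state.newestDateReached || state.newestDateReached < yesterday) {
    return 'forward';
  }
  // 2. Backward gap: the historical crawl has not reached its target.
  if (
    !state.finished &&
    (!state.oldestDateReached || state.oldestDateReached > state.targetOldestDate)
  ) {
    return 'backward';
  }
  return 'done';
}
```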
### Daily Updates
Once a symbol is fully crawled:
- Only needs to fetch new data (forward crawl)
- Much faster, since it typically fetches only 1-2 days of data
## Usage
### Manual Crawl for Single Symbol
```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```
### Schedule Crawls for Multiple Symbols
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```
### Check Crawl Status
```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01')
);
```
### Get Symbols Needing Crawl
```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```
## Priority Modes
- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing
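The four modes above can be sketched as predicates over a symbol's crawl status. This is an illustrative sketch; `matchesMode` and `SymbolStatus` are hypothetical names, and the actual selection happens in the tracker's queries.

```typescript
// Hypothetical sketch of which symbols each priority mode selects.
type PriorityMode = 'never_run' | 'incomplete' | 'stale' | 'all';

interface SymbolStatus {
  symbol: string;
  hasEverRun: boolean; // A crawl has been started at least once
  finished: boolean;   // Full history has been fetched
  hasNewData: boolean; // New data is available since the last crawl
}

function matchesMode(s: SymbolStatus, mode: PriorityMode): boolean {
  switch (mode) {
    case 'never_run':
      return !s.hasEverRun;
    case 'incomplete':
      return s.hasEverRun && !s.finished;
    case 'stale':
      return s.finished && s.hasNewData;
    case 'all':
      return !s.hasEverRun || !s.finished || s.hasNewData;
  }
}
```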
## Scheduled Operations
The system includes scheduled operations:
- `schedule-intraday-crawls-batch`: Runs every 4 hours, processes incomplete crawls
## Monitoring
Use the provided scripts to monitor crawl progress:
```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```
## Performance Considerations
1. **Rate Limiting**: Delays between API calls keep requests under the provider's rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: Configurable batch size (default 7 days) balances progress vs memory
4. **Priority Scheduling**: Higher priority for current data updates
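The weekend-skipping point above can be sketched as a small date helper that steps a cursor back to the previous weekday. This is an illustrative sketch; `previousTradingDay` is a hypothetical name, and it deliberately ignores market holidays.

```typescript
// Hypothetical sketch of weekend skipping: step backward from a date,
// landing only on weekdays so no API calls are spent on weekends.
function previousTradingDay(d: Date): Date {
  const DAY_MS = 24 * 60 * 60 * 1000;
  let prev = new Date(d.getTime() - DAY_MS);
  // getUTCDay(): 0 = Sunday, 6 = Saturday
  while (prev.getUTCDay() === 0 || prev.getUTCDay() === 6) {
    prev = new Date(prev.getTime() - DAY_MS);
  }
  return prev;
}
```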
## Error Handling
- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation
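The per-batch error isolation described above can be sketched as a loop that records failures and keeps going. This is an illustrative sketch; `runBatches`, `fetchBatch`, and `BatchResult` are hypothetical names, and session rotation is omitted.

```typescript
// Hypothetical sketch of error isolation: a failed batch is recorded
// but does not abort the remaining batches, so partial success is
// distinguishable from complete failure.
interface BatchResult {
  batch: number;
  ok: boolean;
  error?: string;
}

async function runBatches(
  batches: number[],
  fetchBatch: (batch: number) => Promise<void>
): Promise<BatchResult[]> {
  const results: BatchResult[] = [];
  for (const batch of batches) {
    try {
      await fetchBatch(batch);
      results.push({ batch, ok: true });
    } catch (err) {
      // Record the failure and continue with the next batch.
      results.push({ batch, ok: false, error: String(err) });
    }
  }
  return results;
}
```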