# Intraday Crawl System

## Overview

The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.

## Key Features

1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched

## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```

## How It Works

### Initial Crawl

1. Starts from today and fetches current data
2. Then begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks the crawl as finished when complete
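
The backward phase above can be sketched as a batch-planning step. The `planBackwardBatches` helper and its `BatchRange` shape below are illustrative assumptions, not the actual implementation:

```typescript
// Sketch: compute the backward weekly batches for an initial crawl.
// `start` is today, `target` is the configured oldest date.
interface BatchRange {
  from: Date; // inclusive start of the batch
  to: Date;   // inclusive end of the batch
}

function planBackwardBatches(start: Date, target: Date, batchDays = 7): BatchRange[] {
  const batches: BatchRange[] = [];
  const dayMs = 24 * 60 * 60 * 1000;
  let to = new Date(start);
  while (to >= target) {
    // Clamp the batch start so we never overshoot the target oldest date.
    const from = new Date(Math.max(to.getTime() - (batchDays - 1) * dayMs, target.getTime()));
    batches.push({ from, to: new Date(to) });
    to = new Date(from.getTime() - dayMs); // step to the day before this batch
  }
  return batches;
}
```

Planning all batches up front makes progress reporting and resumption bookkeeping straightforward, at the cost of recomputing the plan on each run.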

### Resumption After Interruption

1. Checks for a forward gap: if `newestDateReached < yesterday`, fetches new data first
2. Checks for a backward gap: if the crawl is not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data
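
These gap checks can be sketched as a pure decision function over the stored state. `nextCrawlAction` and its return values are hypothetical names assumed for illustration; the real scheduler's logic may differ:

```typescript
// Sketch: decide what a resumed crawl should do next from stored state.
// Mirrors the relevant fields of the CrawlState interface above.
interface CrawlStateView {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
}

type NextAction = 'forward' | 'backward' | 'done';

function nextCrawlAction(state: CrawlStateView, yesterday: Date, target: Date): NextAction {
  // Forward gap: new data has appeared since the last run.
  if (!state.newestDateReached || state.newestDateReached < yesterday) return 'forward';
  // Backward gap: history is not yet complete down to the target date.
  if (!state.finished && (!state.oldestDateReached || state.oldestDateReached > target)) return 'backward';
  return 'done';
}
```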

### Daily Updates

Once a symbol is fully crawled:

- Only new data needs to be fetched (forward crawl)
- This is much faster, as it typically covers just 1-2 days of data

## Usage

### Manual Crawl for Single Symbol

```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```

### Schedule Crawls for Multiple Symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```

### Check Crawl Status

```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01')
);
```

### Get Symbols Needing Crawl

```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```

## Priority Modes

- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing
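
As a sketch, the modes can be modeled as filters over per-symbol crawl summaries. `SymbolCrawlInfo` and `selectSymbols` are hypothetical shapes assumed for illustration, not the tracker's actual query:

```typescript
// Sketch: map a priority mode to a filter over symbol crawl summaries.
type PriorityMode = 'never_run' | 'incomplete' | 'stale' | 'all';

interface SymbolCrawlInfo {
  symbol: string;
  hasEverRun: boolean;
  finished: boolean;
  hasNewData: boolean; // crawl is complete, but newer bars exist
}

function selectSymbols(all: SymbolCrawlInfo[], mode: PriorityMode): string[] {
  const match = (s: SymbolCrawlInfo): boolean => {
    switch (mode) {
      case 'never_run':  return !s.hasEverRun;
      case 'incomplete': return s.hasEverRun && !s.finished;
      case 'stale':      return s.finished && s.hasNewData;
      case 'all':        return !s.hasEverRun || !s.finished || s.hasNewData;
      default:           return false; // unreachable for valid modes
    }
  };
  return all.filter(match).map((s) => s.symbol);
}
```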

## Scheduled Operations

The system includes the following scheduled operation:

- `schedule-intraday-crawls-batch`: Runs every 4 hours and processes incomplete crawls

## Monitoring

Use the provided scripts to monitor crawl progress:

```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```

## Performance Considerations

1. **Rate Limiting**: Delays between API calls to avoid rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: Configurable batch size (default 7 days) balances progress vs. memory
4. **Priority Scheduling**: Higher priority for current data updates
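
Weekend skipping (point 2) can be sketched as expanding a batch into UTC weekdays only. `weekdaysInRange` is an illustrative helper, not the production code:

```typescript
// Sketch: expand a date range into the weekdays it should actually fetch,
// skipping Saturdays and Sundays to save API calls.
function weekdaysInRange(from: Date, to: Date): Date[] {
  const days: Date[] = [];
  const dayMs = 24 * 60 * 60 * 1000;
  for (let t = from.getTime(); t <= to.getTime(); t += dayMs) {
    const d = new Date(t);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) days.push(d);
  }
  return days;
}
```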
## Error Handling

- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation
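
Per-batch error isolation can be sketched as follows; `fetchBatch` and the result shape are assumptions for illustration, not the handler's real API:

```typescript
// Sketch: run batches sequentially, recording failures without aborting.
interface BatchResult {
  ok: boolean;
  error?: string;
}

async function crawlBatches(
  batches: string[],
  fetchBatch: (b: string) => Promise<void>,
): Promise<{ results: BatchResult[]; partialSuccess: boolean }> {
  const results: BatchResult[] = [];
  for (const b of batches) {
    try {
      await fetchBatch(b);
      results.push({ ok: true });
    } catch (e) {
      // Record and keep going: one bad batch must not stop the whole crawl.
      results.push({ ok: false, error: e instanceof Error ? e.message : String(e) });
    }
  }
  const okCount = results.filter((r) => r.ok).length;
  // Partial success: some batches succeeded, some failed.
  return { results, partialSuccess: okCount > 0 && okCount < results.length };
}
```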