# Intraday Crawl System

## Overview
The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.
## Key Features

- **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
- **Resumption Support**: Tracks progress and can resume from where it left off
- **Gap Detection**: Automatically detects gaps in data coverage
- **Batch Processing**: Processes data in configurable batches (default: 7 days)
- **Completion Tracking**: Knows when a symbol's full history has been fetched
## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether the crawl is complete
  oldestDateReached?: Date;     // Oldest date fetched so far
  newestDateReached?: Date;     // Newest date fetched so far
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target oldest date to reach
}
```
## How It Works

### Initial Crawl
- Starts from today and fetches current data
- Then begins crawling backward in weekly batches
- Continues until it reaches the target oldest date (default: 2020-01-01)
- Marks as finished when complete
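The backward crawl described above can be sketched as a batch-range calculator. This is an illustrative sketch, not the real implementation: the `nextBackwardBatch` name and `BatchRange` shape are assumptions, but the logic (step back `batchSize` days at a time, clamp at the target date, stop when the target is reached) follows the steps listed here.

```typescript
interface BatchRange { start: Date; end: Date }

const DAY_MS = 24 * 60 * 60 * 1000;

// Compute the next backward batch: `batchSize` days ending the day
// before `oldest`, clamped so we never step past `targetOldest`.
// Returns null when the target has been reached (crawl is finished).
function nextBackwardBatch(
  oldest: Date,
  targetOldest: Date,
  batchSize = 7,
): BatchRange | null {
  if (oldest <= targetOldest) return null; // target reached
  const end = new Date(oldest.getTime() - DAY_MS);
  const start = new Date(Math.max(
    end.getTime() - (batchSize - 1) * DAY_MS,
    targetOldest.getTime(),
  ));
  return { start, end };
}
```

Each completed batch would then update `oldestDateReached` to the batch's `start`, so an interrupted crawl naturally resumes from the right place.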
### Resumption After Interruption

- Checks for a forward gap: if `newestDateReached < yesterday`, fetches new data first
- Checks for a backward gap: if not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
- Resumes from `lastProcessedDate` to avoid re-fetching data
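The resumption decision can be sketched as a small pure function. The field names follow the `CrawlState` interface above, but `decideNextCrawl` itself is illustrative, not part of the documented API:

```typescript
interface ResumeState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  targetOldestDate?: Date;
}

type NextAction = 'forward' | 'backward' | 'idle';

// Decide which direction to crawl next, mirroring the bullets above:
// forward gaps are filled first, then any remaining backward gap.
function decideNextCrawl(s: ResumeState, now: Date): NextAction {
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  // Forward gap: new data has appeared since the last crawl
  if (!s.newestDateReached || s.newestDateReached < yesterday) return 'forward';
  // Backward gap: history is still incomplete
  if (!s.finished && s.oldestDateReached && s.targetOldestDate &&
      s.oldestDateReached > s.targetOldestDate) return 'backward';
  return 'idle';
}
```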
### Daily Updates
Once a symbol is fully crawled:
- Only needs to fetch new data (forward crawl)
- Runs much faster, since it typically fetches only 1-2 days of data
## Usage

### Manual Crawl for a Single Symbol
```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```
### Schedule Crawls for Multiple Symbols
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```
### Check Crawl Status
```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01')
);
```
### Get Symbols Needing Crawl
```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```
## Priority Modes

- `never_run`: Symbols that have never been crawled (highest priority)
- `incomplete`: Symbols with unfinished crawls
- `stale`: Symbols with complete crawls but new data available
- `all`: All symbols needing any processing
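The ranking implied by these modes can be sketched as a comparator. The `SymbolCrawlInfo` shape and the numeric weights are assumptions for illustration; only the ordering (never_run first, then incomplete, then stale) comes from the list above.

```typescript
interface SymbolCrawlInfo {
  symbol: string;
  everRun: boolean;    // has any crawl ever been scheduled?
  finished: boolean;   // is the historical backfill complete?
  hasNewData: boolean; // is fresh forward data available?
}

// Lower number = higher priority, matching the ordering above.
function priorityOf(s: SymbolCrawlInfo): number {
  if (!s.everRun) return 0;  // never_run: highest priority
  if (!s.finished) return 1; // incomplete
  if (s.hasNewData) return 2; // stale
  return 3;                   // nothing to do
}

function orderForCrawl(symbols: SymbolCrawlInfo[]): SymbolCrawlInfo[] {
  return [...symbols].sort((a, b) => priorityOf(a) - priorityOf(b));
}
```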
## Scheduled Operations

The system includes the following scheduled operation:

- `schedule-intraday-crawls-batch`: Runs every 4 hours and processes incomplete crawls
## Monitoring

Use the provided scripts to monitor crawl progress:

```shell
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for a specific symbol
bun run test/intraday-crawl.test.ts
```
## Performance Considerations
- **Rate Limiting**: Delays between API calls to avoid rate limits
- **Weekend Skipping**: Automatically skips weekends to save API calls
- **Batch Size**: Configurable batch size (default 7 days) balances progress against memory use
- **Priority Scheduling**: Higher priority for current data updates
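Weekend skipping can be sketched as a filter that expands a batch range into trading days only. This is an assumed helper for illustration (it ignores market holidays, which a real implementation would also need to handle):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Expand [start, end] (inclusive, UTC dates) into weekdays only,
// skipping Saturdays and Sundays to save API calls.
function tradingDays(start: Date, end: Date): Date[] {
  const days: Date[] = [];
  for (let t = start.getTime(); t <= end.getTime(); t += DAY_MS) {
    const d = new Date(t);
    const dow = d.getUTCDay();
    if (dow !== 0 && dow !== 6) days.push(d); // 0 = Sunday, 6 = Saturday
  }
  return days;
}
```

A 7-day batch therefore costs at most 5 API calls, which is where the weekend savings come from.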
## Error Handling
- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation
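The distinction between partial success and complete failure can be sketched as a summary over per-batch results. The `BatchResult` shape and `summarize` function are illustrative assumptions, not the documented API:

```typescript
interface BatchResult {
  ok: boolean;
  error?: string; // logged and stored in the operation status on failure
}

// A failed batch does not stop the crawl; after all batches run,
// classify the outcome so partial success is tracked separately.
function summarize(results: BatchResult[]): 'success' | 'partial' | 'failed' {
  const okCount = results.filter(r => r.ok).length;
  if (okCount === results.length) return 'success';
  return okCount > 0 ? 'partial' : 'failed';
}
```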