# Intraday Crawl System

## Overview

The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.

## Key Features

1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched

## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether the crawl is complete
  oldestDateReached?: Date;     // Oldest date fetched so far
  newestDateReached?: Date;     // Newest date fetched so far
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```

## How It Works

### Initial Crawl

1. Starts from today and fetches current data
2. Begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks the crawl as finished when complete

### Resumption After Interruption

1. Checks for a forward gap: if `newestDateReached < yesterday`, fetches new data first
2. Checks for a backward gap: if the crawl is not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
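The resumption checks above amount to a small decision function. As a minimal sketch (illustrative only — `planNextCrawl` and `ResumeState` are hypothetical names, not part of the actual implementation), the logic might look like:

```typescript
// Subset of the CrawlState fields the decision needs.
interface ResumeState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  targetOldestDate?: Date;
}

type CrawlPlan = 'forward' | 'backward' | 'done';

function planNextCrawl(state: ResumeState, today: Date): CrawlPlan {
  const yesterday = new Date(today);
  yesterday.setDate(yesterday.getDate() - 1);

  // Forward gap: new data has appeared since the last crawl.
  if (!state.newestDateReached || state.newestDateReached < yesterday) {
    return 'forward';
  }

  // Backward gap: history not yet backfilled to the target date.
  const target = state.targetOldestDate ?? new Date('2020-01-01');
  if (!state.finished && (!state.oldestDateReached || state.oldestDateReached > target)) {
    return 'backward';
  }

  return 'done';
}
```

A backward crawl chosen this way would then continue from `lastProcessedDate` rather than restarting from today.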
In both cases, the crawl resumes from `lastProcessedDate` to avoid re-fetching data.

### Daily Updates

Once a symbol is fully crawled:

- Only new data needs to be fetched (forward crawl)
- Updates are much faster, since they typically cover just 1-2 days of data

## Usage

### Manual Crawl for Single Symbol

```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```

### Schedule Crawls for Multiple Symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```

### Check Crawl Status

```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01')
);
```

### Get Symbols Needing Crawl

```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```

## Priority Modes

- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing

## Scheduled Operations

The system includes the following scheduled operation:

- `schedule-intraday-crawls-batch`: Runs every 4 hours and processes incomplete crawls

## Monitoring

Use the provided scripts to monitor crawl progress:

```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```

## Performance Considerations

1. **Rate Limiting**: Delays between API calls avoid provider rate limits
2. **Weekend Skipping**: Weekends are skipped automatically to save API calls
3. **Batch Size**: A configurable batch size (default: 7 days) balances progress against memory use
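The weekend-skipping, batched backward walk could be sketched like this (an illustrative helper, not the actual implementation — `backwardBatches` is a hypothetical name):

```typescript
function isWeekday(d: Date): boolean {
  const day = d.getUTCDay();
  return day !== 0 && day !== 6; // 0 = Sunday, 6 = Saturday
}

// Yields the weekday dates in each batch, newest batch first,
// walking backward from `start` until `target` is reached.
function* backwardBatches(start: Date, target: Date, batchSize = 7): Generator<Date[]> {
  const cursor = new Date(start);
  while (cursor >= target) {
    const batch: Date[] = [];
    for (let i = 0; i < batchSize && cursor >= target; i++) {
      if (isWeekday(cursor)) batch.push(new Date(cursor)); // copy: setUTCDate mutates
      cursor.setUTCDate(cursor.getUTCDate() - 1);
    }
    if (batch.length > 0) yield batch;
  }
}
```

Because weekends are dropped before any fetch happens, a 7-day batch costs at most 5 API calls.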
Additionally, **priority scheduling** gives current-data updates higher priority than historical backfill.

## Error Handling

- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation
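The batch-level error handling described above can be sketched as follows (a minimal illustration under assumed names — `processBatches`, `fetchBatch`, and `BatchResult` are hypothetical, not the actual API):

```typescript
type BatchResult = { batch: string; ok: boolean; error?: string };

async function processBatches(
  batches: string[],
  fetchBatch: (batch: string) => Promise<void>
): Promise<{ results: BatchResult[]; partialSuccess: boolean }> {
  const results: BatchResult[] = [];
  for (const batch of batches) {
    try {
      await fetchBatch(batch);
      results.push({ batch, ok: true });
    } catch (err) {
      // Record the failure in the operation status and keep going:
      // one bad batch must not abort the whole crawl.
      results.push({ batch, ok: false, error: String(err) });
    }
  }
  const succeeded = results.filter(r => r.ok).length;
  // Partial success (some batches ok) is distinct from complete failure.
  return { results, partialSuccess: succeeded > 0 && succeeded < results.length };
}
```

Tracking partial success separately lets the scheduler retry only the failed date ranges instead of re-crawling the whole symbol.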