stock-bot/apps/stock/data-ingestion/docs/INTRADAY_CRAWL.md

Intraday Crawl System

Overview

The intraday crawl system handles large-scale historical data collection with resumption support. For each symbol it tracks the oldest and newest dates reached, so an interrupted crawl can resume from where it left off.

Key Features

  1. Bidirectional Crawling: Can crawl both forward (for new data) and backward (for historical data)
  2. Resumption Support: Tracks progress and can resume from where it left off
  3. Gap Detection: Automatically detects gaps in data coverage
  4. Batch Processing: Processes data in configurable batches (default: 7 days)
  5. Completion Tracking: Knows when a symbol's full history has been fetched

Crawl State Fields

The system tracks the following state for each symbol:

interface CrawlState {
  finished: boolean;              // Whether crawl is complete
  oldestDateReached?: Date;       // Oldest date we've fetched
  newestDateReached?: Date;       // Newest date we've fetched  
  lastProcessedDate?: Date;       // Last date processed (for resumption)
  totalDaysProcessed?: number;    // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;        // Target date to reach
}
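For example, a symbol midway through its backward crawl might carry a state like this (all values are illustrative; the interface is repeated so the snippet stands alone):

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
  totalDaysProcessed?: number;
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;
}

// A symbol whose history has been fetched back to March 2022, but whose
// backward crawl toward 2020-01-01 is not yet finished.
const state: CrawlState = {
  finished: false,
  oldestDateReached: new Date('2022-03-01'),  // history fetched back to here
  newestDateReached: new Date('2024-06-10'),  // current data is up to date
  lastProcessedDate: new Date('2022-03-01'),  // resume point after interruption
  totalDaysProcessed: 830,
  lastCrawlDirection: 'backward',
  targetOldestDate: new Date('2020-01-01'),   // crawl is finished once reached
};
```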

How It Works

Initial Crawl

  1. Starts from today and fetches current data
  2. Then begins crawling backward in weekly batches
  3. Continues until it reaches the target oldest date (default: 2020-01-01)
  4. Marks as finished when complete
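Steps 2–3 amount to splitting the span between today and the target oldest date into weekly batches, newest first. A minimal sketch, with helper name and shapes as assumptions rather than the actual implementation:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface Batch { from: Date; to: Date }

// Walk backward from `startFrom` toward `targetOldest` in fixed-size batches.
// The first batch returned is the newest, matching the crawl order above.
function planBackwardCrawl(startFrom: Date, targetOldest: Date, batchDays = 7): Batch[] {
  const batches: Batch[] = [];
  let to = startFrom;
  while (to > targetOldest) {
    const from = new Date(Math.max(targetOldest.getTime(), to.getTime() - batchDays * DAY_MS));
    batches.push({ from, to });
    to = from;
  }
  return batches;
}
```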

Resumption After Interruption

  1. Checks for forward gap: If newestDateReached < yesterday, fetches new data first
  2. Checks for backward gap: If not finished and oldestDateReached > targetOldestDate, continues backward crawl
  3. Resumes from lastProcessedDate to avoid re-fetching data
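This decision logic can be sketched as a small pure function over a trimmed-down CrawlState carrying only the fields the check needs (the function name and return values are illustrative; forward gaps are checked before backward ones, mirroring steps 1–2):

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  targetOldestDate?: Date;
}

type NextAction = 'forward' | 'backward' | 'none';

function nextCrawlAction(state: CrawlState, now: Date): NextAction {
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  // Forward gap: the newest data we hold is older than yesterday (or missing).
  if (!state.newestDateReached || state.newestDateReached < yesterday) return 'forward';
  // Backward gap: history crawl unfinished and the target date not yet reached.
  if (!state.finished &&
      state.oldestDateReached && state.targetOldestDate &&
      state.oldestDateReached > state.targetOldestDate) return 'backward';
  return 'none';
}
```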

Daily Updates

Once a symbol is fully crawled:

  • Only needs to fetch new data (forward crawl)
  • Much faster, since it typically fetches only 1–2 days of data
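The forward update reduces to computing the missing span between the newest stored date and today; a sketch (the helper name is an assumption):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Span still missing after a completed crawl: the day after the newest stored
// date through today. Returns null when the symbol is already up to date.
function forwardUpdateRange(newestDateReached: Date, today: Date): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + DAY_MS);
  if (from > today) return null;
  return { from, to: today };
}
```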

Usage

Manual Crawl for Single Symbol

await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7  // Days per batch
});

Schedule Crawls for Multiple Symbols

await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete'  // or 'never_run', 'stale', 'all'
});

Check Crawl Status

const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete('AAPL', 'intraday_bars', new Date('2020-01-01'));

Get Symbols Needing Crawl

const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true  // Include symbols needing updates
});

Priority Modes

  • never_run: Symbols that have never been crawled (highest priority)
  • incomplete: Symbols with unfinished crawls
  • stale: Symbols with complete crawls but new data available
  • all: All symbols needing any processing
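One way to read these modes is as predicates over a per-symbol crawl summary (the field names here are assumptions, not the tracker's actual schema):

```typescript
type PriorityMode = 'never_run' | 'incomplete' | 'stale' | 'all';

// Assumed per-symbol summary; the real tracker schema may differ.
interface SymbolCrawlInfo {
  symbol: string;
  everRun: boolean;     // has any crawl been attempted?
  finished: boolean;    // is the full history fetched?
  hasNewData: boolean;  // is newer data available than what is stored?
}

function matchesMode(info: SymbolCrawlInfo, mode: PriorityMode): boolean {
  switch (mode) {
    case 'never_run': return !info.everRun;
    case 'incomplete': return info.everRun && !info.finished;
    case 'stale': return info.finished && info.hasNewData;
    case 'all': return !info.everRun || !info.finished || info.hasNewData;
  }
}
```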

Scheduled Operations

The system includes scheduled operations:

  • schedule-intraday-crawls-batch: Runs every 4 hours, processes incomplete crawls

Monitoring

Use the provided scripts to monitor crawl progress:

# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts

Performance Considerations

  1. Rate Limiting: Delays between API calls to avoid rate limits
  2. Weekend Skipping: Automatically skips weekends to save API calls
  3. Batch Size: Configurable batch size (default 7 days) balances progress vs memory
  4. Priority Scheduling: Higher priority for current data updates
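Weekend skipping (point 2) can be sketched as filtering a batch down to weekdays before any API call is issued (a sketch using UTC day-of-week, not the actual implementation):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// 0 = Sunday, 6 = Saturday in JavaScript's Date API.
function isWeekend(d: Date): boolean {
  const day = d.getUTCDay();
  return day === 0 || day === 6;
}

// Keep only weekdays in [from, to], so weekend dates never cost an API call.
function weekdaysIn(from: Date, to: Date): Date[] {
  const days: Date[] = [];
  for (let t = from.getTime(); t <= to.getTime(); t += DAY_MS) {
    const d = new Date(t);
    if (!isWeekend(d)) days.push(d);
  }
  return days;
}
```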

Error Handling

  • Failed batches don't stop the entire crawl
  • Errors are logged and stored in the operation status
  • Partial success is tracked separately from complete failure
  • Session failures trigger automatic session rotation
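The first three points can be sketched as per-batch error isolation (fetchBatch stands in for the real fetch call; all names are illustrative):

```typescript
interface BatchResult { from: Date; to: Date; ok: boolean; error?: string }

// Run batches sequentially; a failed batch is recorded but never aborts the
// crawl. Partial success (some batches ok, some failed) is reported separately.
async function crawlBatches(
  batches: Array<{ from: Date; to: Date }>,
  fetchBatch: (from: Date, to: Date) => Promise<void>,
): Promise<{ results: BatchResult[]; partialSuccess: boolean }> {
  const results: BatchResult[] = [];
  for (const { from, to } of batches) {
    try {
      await fetchBatch(from, to);
      results.push({ from, to, ok: true });
    } catch (e) {
      // Logged and kept in the operation status; the loop continues.
      results.push({ from, to, ok: false, error: String(e) });
    }
  }
  const succeeded = results.filter(r => r.ok).length;
  return { results, partialSuccess: succeeded > 0 && succeeded < results.length };
}
```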