stock-bot/apps/stock/data-ingestion/docs/INTRADAY_CRAWL.md

Intraday Crawl System

Overview

The intraday crawl system handles large-scale historical data collection with resumption support. For each symbol it tracks the oldest and newest dates reached, so an interrupted crawl can resume from where it left off.

Key Features

  1. Bidirectional Crawling: Can crawl both forward (for new data) and backward (for historical data)
  2. Resumption Support: Tracks progress and can resume from where it left off
  3. Gap Detection: Automatically detects gaps in data coverage
  4. Batch Processing: Processes data in configurable batches (default: 7 days)
  5. Completion Tracking: Knows when a symbol's full history has been fetched

Crawl State Fields

The system tracks the following state for each symbol:

interface CrawlState {
  finished: boolean;              // Whether crawl is complete
  oldestDateReached?: Date;       // Oldest date we've fetched
  newestDateReached?: Date;       // Newest date we've fetched  
  lastProcessedDate?: Date;       // Last date processed (for resumption)
  totalDaysProcessed?: number;    // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;        // Target date to reach
}
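For example, a symbol midway through its backward crawl might carry a state like this (all values are illustrative; the interface is repeated so the snippet stands alone):

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
  totalDaysProcessed?: number;
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;
}

// A symbol whose history has been fetched back to March 2022, but whose
// backward crawl toward 2020-01-01 is not yet finished.
const state: CrawlState = {
  finished: false,
  oldestDateReached: new Date('2022-03-01'),  // history fetched back to here
  newestDateReached: new Date('2024-06-10'),  // current data is up to date
  lastProcessedDate: new Date('2022-03-01'),  // resume point after interruption
  totalDaysProcessed: 830,
  lastCrawlDirection: 'backward',
  targetOldestDate: new Date('2020-01-01'),   // crawl is finished once reached
};
```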

How It Works

Initial Crawl

  1. Starts from today and fetches current data
  2. Then begins crawling backward in weekly batches
  3. Continues until it reaches the target oldest date (default: 2020-01-01)
  4. Marks as finished when complete
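Steps 2–3 amount to splitting the span between today and the target oldest date into weekly batches, newest first. A minimal sketch, with helper name and shapes as assumptions rather than the actual implementation:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface Batch { from: Date; to: Date }

// Walk backward from `startFrom` toward `targetOldest` in fixed-size batches.
// The first batch returned is the newest, matching the crawl order above.
function planBackwardCrawl(startFrom: Date, targetOldest: Date, batchDays = 7): Batch[] {
  const batches: Batch[] = [];
  let to = startFrom;
  while (to > targetOldest) {
    const from = new Date(Math.max(targetOldest.getTime(), to.getTime() - batchDays * DAY_MS));
    batches.push({ from, to });
    to = from;
  }
  return batches;
}
```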

Resumption After Interruption

  1. Checks for forward gap: If newestDateReached < yesterday, fetches new data first
  2. Checks for backward gap: If not finished and oldestDateReached > targetOldestDate, continues backward crawl
  3. Resumes from lastProcessedDate to avoid re-fetching data
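This decision logic can be sketched as a small pure function over a trimmed-down CrawlState carrying only the fields the check needs (the function name and return values are illustrative; forward gaps are checked before backward ones, mirroring steps 1–2):

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  targetOldestDate?: Date;
}

type NextAction = 'forward' | 'backward' | 'none';

function nextCrawlAction(state: CrawlState, now: Date): NextAction {
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  // Forward gap: the newest data we hold is older than yesterday (or missing).
  if (!state.newestDateReached || state.newestDateReached < yesterday) return 'forward';
  // Backward gap: history crawl unfinished and the target date not yet reached.
  if (!state.finished &&
      state.oldestDateReached && state.targetOldestDate &&
      state.oldestDateReached > state.targetOldestDate) return 'backward';
  return 'none';
}
```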

Daily Updates

Once a symbol is fully crawled:

  • Only needs to fetch new data (forward crawl)
  • Much faster, since it typically fetches only 1–2 days of data
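The forward update reduces to computing the missing span between the newest stored date and today; a sketch (the helper name is an assumption):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Span still missing after a completed crawl: the day after the newest stored
// date through today. Returns null when the symbol is already up to date.
function forwardUpdateRange(newestDateReached: Date, today: Date): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + DAY_MS);
  if (from > today) return null;
  return { from, to: today };
}
```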

Usage

Manual Crawl for Single Symbol

await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7  // Days per batch
});

Schedule Crawls for Multiple Symbols

await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete'  // or 'never_run', 'stale', 'all'
});

Check Crawl Status

const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete('AAPL', 'intraday_bars', new Date('2020-01-01'));

Get Symbols Needing Crawl

const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true  // Include symbols needing updates
});

Priority Modes

  • never_run: Symbols that have never been crawled (highest priority)
  • incomplete: Symbols with unfinished crawls
  • stale: Symbols with complete crawls but new data available
  • all: All symbols needing any processing
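One way to read these modes is as predicates over a per-symbol crawl summary (the field names here are assumptions, not the tracker's actual schema):

```typescript
type PriorityMode = 'never_run' | 'incomplete' | 'stale' | 'all';

// Assumed per-symbol summary; the real tracker schema may differ.
interface SymbolCrawlInfo {
  symbol: string;
  everRun: boolean;     // has any crawl been attempted?
  finished: boolean;    // is the full history fetched?
  hasNewData: boolean;  // is newer data available than what is stored?
}

function matchesMode(info: SymbolCrawlInfo, mode: PriorityMode): boolean {
  switch (mode) {
    case 'never_run': return !info.everRun;
    case 'incomplete': return info.everRun && !info.finished;
    case 'stale': return info.finished && info.hasNewData;
    case 'all': return !info.everRun || !info.finished || info.hasNewData;
  }
}
```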

Scheduled Operations

The system includes scheduled operations:

  • schedule-intraday-crawls-batch: Runs every 4 hours, processes incomplete crawls

Monitoring

Use the provided scripts to monitor crawl progress:

# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts

Performance Considerations

  1. Rate Limiting: Delays between API calls to avoid rate limits
  2. Weekend Skipping: Automatically skips weekends to save API calls
  3. Batch Size: Configurable batch size (default 7 days) balances progress vs memory
  4. Priority Scheduling: Higher priority for current data updates
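Weekend skipping (point 2) can be sketched as filtering a batch down to weekdays before any API call is issued (a sketch using UTC day-of-week, not the actual implementation):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// 0 = Sunday, 6 = Saturday in JavaScript's Date API.
function isWeekend(d: Date): boolean {
  const day = d.getUTCDay();
  return day === 0 || day === 6;
}

// Keep only weekdays in [from, to], so weekend dates never cost an API call.
function weekdaysIn(from: Date, to: Date): Date[] {
  const days: Date[] = [];
  for (let t = from.getTime(); t <= to.getTime(); t += DAY_MS) {
    const d = new Date(t);
    if (!isWeekend(d)) days.push(d);
  }
  return days;
}
```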

Error Handling

  • Failed batches don't stop the entire crawl
  • Errors are logged and stored in the operation status
  • Partial success is tracked separately from complete failure
  • Session failures trigger automatic session rotation
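The first three points can be sketched as per-batch error isolation (fetchBatch stands in for the real fetch call; all names are illustrative):

```typescript
interface BatchResult { from: Date; to: Date; ok: boolean; error?: string }

// Run batches sequentially; a failed batch is recorded but never aborts the
// crawl. Partial success (some batches ok, some failed) is reported separately.
async function crawlBatches(
  batches: Array<{ from: Date; to: Date }>,
  fetchBatch: (from: Date, to: Date) => Promise<void>,
): Promise<{ results: BatchResult[]; partialSuccess: boolean }> {
  const results: BatchResult[] = [];
  for (const { from, to } of batches) {
    try {
      await fetchBatch(from, to);
      results.push({ from, to, ok: true });
    } catch (e) {
      // Logged and kept in the operation status; the loop continues.
      results.push({ from, to, ok: false, error: String(e) });
    }
  }
  const succeeded = results.filter(r => r.ok).length;
  return { results, partialSuccess: succeeded > 0 && succeeded < results.length };
}
```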