# Intraday Crawl System

## Overview

The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.

## Key Features

1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched

## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```

## How It Works

### Initial Crawl

1. Starts from today and fetches current data
2. Then begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks the crawl as finished when complete
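
The backward phase above can be sketched as a batch-planning step. The `planBackwardBatches` helper and its `BatchRange` shape below are illustrative assumptions, not the actual implementation:

```typescript
// Sketch: compute the backward weekly batches for an initial crawl.
// `start` is today, `target` is the configured oldest date.
interface BatchRange {
  from: Date; // inclusive start of the batch
  to: Date;   // inclusive end of the batch
}

function planBackwardBatches(start: Date, target: Date, batchDays = 7): BatchRange[] {
  const batches: BatchRange[] = [];
  const dayMs = 24 * 60 * 60 * 1000;
  let to = new Date(start);
  while (to >= target) {
    // Clamp the batch start so we never overshoot the target oldest date.
    const from = new Date(Math.max(to.getTime() - (batchDays - 1) * dayMs, target.getTime()));
    batches.push({ from, to: new Date(to) });
    to = new Date(from.getTime() - dayMs); // step to the day before this batch
  }
  return batches;
}
```

Planning all batches up front makes progress reporting and resumption bookkeeping straightforward, at the cost of recomputing the plan on each run.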

### Resumption After Interruption

1. Checks for a forward gap: if `newestDateReached < yesterday`, fetches new data first
2. Checks for a backward gap: if the crawl is not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data
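
These gap checks can be sketched as a pure decision function over the stored state. `nextCrawlAction` and its return values are hypothetical names assumed for illustration; the real scheduler's logic may differ:

```typescript
// Sketch: decide what a resumed crawl should do next from stored state.
// Mirrors the relevant fields of the CrawlState interface above.
interface CrawlStateView {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
}

type NextAction = 'forward' | 'backward' | 'done';

function nextCrawlAction(state: CrawlStateView, yesterday: Date, target: Date): NextAction {
  // Forward gap: new data has appeared since the last run.
  if (!state.newestDateReached || state.newestDateReached < yesterday) return 'forward';
  // Backward gap: history is not yet complete down to the target date.
  if (!state.finished && (!state.oldestDateReached || state.oldestDateReached > target)) return 'backward';
  return 'done';
}
```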

### Daily Updates

Once a symbol is fully crawled:

- Only new data needs to be fetched (forward crawl)
- This is much faster, as it typically covers just 1-2 days of data

## Usage

### Manual Crawl for Single Symbol

```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```

### Schedule Crawls for Multiple Symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```

### Check Crawl Status

```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01')
);
```

### Get Symbols Needing Crawl

```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```

## Priority Modes

- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing
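
As a sketch, the modes can be modeled as filters over per-symbol crawl summaries. `SymbolCrawlInfo` and `selectSymbols` are hypothetical shapes assumed for illustration, not the tracker's actual query:

```typescript
// Sketch: map a priority mode to a filter over symbol crawl summaries.
type PriorityMode = 'never_run' | 'incomplete' | 'stale' | 'all';

interface SymbolCrawlInfo {
  symbol: string;
  hasEverRun: boolean;
  finished: boolean;
  hasNewData: boolean; // crawl is complete, but newer bars exist
}

function selectSymbols(all: SymbolCrawlInfo[], mode: PriorityMode): string[] {
  const match = (s: SymbolCrawlInfo): boolean => {
    switch (mode) {
      case 'never_run':  return !s.hasEverRun;
      case 'incomplete': return s.hasEverRun && !s.finished;
      case 'stale':      return s.finished && s.hasNewData;
      case 'all':        return !s.hasEverRun || !s.finished || s.hasNewData;
      default:           return false; // unreachable for valid modes
    }
  };
  return all.filter(match).map((s) => s.symbol);
}
```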

## Scheduled Operations

The system includes the following scheduled operation:

- `schedule-intraday-crawls-batch`: Runs every 4 hours and processes incomplete crawls

## Monitoring

Use the provided scripts to monitor crawl progress:

```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```

## Performance Considerations

1. **Rate Limiting**: Delays between API calls to avoid rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: Configurable batch size (default 7 days) balances progress vs. memory
4. **Priority Scheduling**: Higher priority for current data updates
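
Weekend skipping (point 2) can be sketched as expanding a batch into UTC weekdays only. `weekdaysInRange` is an illustrative helper, not the production code:

```typescript
// Sketch: expand a date range into the weekdays it should actually fetch,
// skipping Saturdays and Sundays to save API calls.
function weekdaysInRange(from: Date, to: Date): Date[] {
  const days: Date[] = [];
  const dayMs = 24 * 60 * 60 * 1000;
  for (let t = from.getTime(); t <= to.getTime(); t += dayMs) {
    const d = new Date(t);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) days.push(d);
  }
  return days;
}
```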
## Error Handling

- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation
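
Per-batch error isolation can be sketched as follows; `fetchBatch` and the result shape are assumptions for illustration, not the handler's real API:

```typescript
// Sketch: run batches sequentially, recording failures without aborting.
interface BatchResult {
  ok: boolean;
  error?: string;
}

async function crawlBatches(
  batches: string[],
  fetchBatch: (b: string) => Promise<void>,
): Promise<{ results: BatchResult[]; partialSuccess: boolean }> {
  const results: BatchResult[] = [];
  for (const b of batches) {
    try {
      await fetchBatch(b);
      results.push({ ok: true });
    } catch (e) {
      // Record and keep going: one bad batch must not stop the whole crawl.
      results.push({ ok: false, error: e instanceof Error ? e.message : String(e) });
    }
  }
  const okCount = results.filter((r) => r.ok).length;
  // Partial success: some batches succeeded, some failed.
  return { results, partialSuccess: okCount > 0 && okCount < results.length };
}
```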