work on qm filings

parent 710577eb3d · commit 960daf4cad
17 changed files with 2319 additions and 32 deletions
apps/stock/data-ingestion/docs/INTRADAY_CRAWL.md (new file, 122 lines)

# Intraday Crawl System

## Overview

The intraday crawl system handles large-scale historical data collection with resumption support. It tracks the oldest and newest dates reached, so an interrupted crawl can resume from where it left off.

## Key Features

1. **Bidirectional Crawling**: Crawls forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and resumes from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched

## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```

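These fields interact whenever a batch completes. As an illustrative sketch only (the `mergeBatchIntoState` helper is hypothetical, not part of the system; the interface is repeated so the snippet stands alone), the state might be folded forward like this:

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
  totalDaysProcessed?: number;
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;
}

// Hypothetical helper: fold one processed batch into the crawl state.
function mergeBatchIntoState(
  state: CrawlState,
  batchStart: Date,
  batchEnd: Date,
  daysProcessed: number,
  direction: 'forward' | 'backward',
): CrawlState {
  const oldest = state.oldestDateReached;
  const newest = state.newestDateReached;
  const next: CrawlState = {
    ...state,
    // Widen the covered date range in whichever direction we moved.
    oldestDateReached: !oldest || batchStart < oldest ? batchStart : oldest,
    newestDateReached: !newest || batchEnd > newest ? batchEnd : newest,
    // Resumption point: the trailing edge of the batch just processed.
    lastProcessedDate: direction === 'backward' ? batchStart : batchEnd,
    totalDaysProcessed: (state.totalDaysProcessed ?? 0) + daysProcessed,
    lastCrawlDirection: direction,
  };
  // The crawl is finished once the oldest fetched date reaches the target.
  if (
    next.targetOldestDate &&
    next.oldestDateReached &&
    next.oldestDateReached <= next.targetOldestDate
  ) {
    next.finished = true;
  }
  return next;
}
```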
## How It Works

### Initial Crawl

1. Starts from today and fetches current data
2. Then crawls backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks the crawl as finished when complete

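Steps 2-3 above can be sketched as batch planning (the `planBackwardBatches` helper is an assumption for illustration; the real implementation may differ):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Sketch: walk from a start date back toward targetOldestDate in
// fixed-size batches (default 7 days), clamping the last batch.
function planBackwardBatches(
  start: Date,
  targetOldestDate: Date,
  batchSizeDays = 7,
): Array<{ from: Date; to: Date }> {
  const batches: Array<{ from: Date; to: Date }> = [];
  let end = start;
  while (end > targetOldestDate) {
    const from = new Date(
      Math.max(end.getTime() - batchSizeDays * DAY_MS, targetOldestDate.getTime()),
    );
    batches.push({ from, to: end });
    end = from; // next batch continues where this one began
  }
  return batches;
}
```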
### Resumption After Interruption

1. Forward gap check: if `newestDateReached < yesterday`, fetches new data first
2. Backward gap check: if not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data

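The two gap checks above can be condensed into one decision function. This is a sketch under simplifying assumptions (no market-calendar handling; the `nextCrawlDirection` name is hypothetical):

```typescript
type Direction = 'forward' | 'backward' | 'done';

// Decide what an interrupted crawl should do next, given its saved state.
function nextCrawlDirection(
  state: {
    finished: boolean;
    oldestDateReached?: Date;
    newestDateReached?: Date;
    targetOldestDate?: Date;
  },
  now = new Date(),
): Direction {
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  // 1. Forward gap: new data has appeared since the last crawl.
  if (state.newestDateReached && state.newestDateReached < yesterday) {
    return 'forward';
  }
  // 2. Backward gap: the historical crawl never reached its target.
  if (
    !state.finished &&
    state.targetOldestDate &&
    state.oldestDateReached &&
    state.oldestDateReached > state.targetOldestDate
  ) {
    return 'backward';
  }
  return 'done';
}
```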
### Daily Updates

Once a symbol is fully crawled:

- Only new data needs fetching (forward crawl)
- Much faster, since this is typically just 1-2 days of data

## Usage

### Manual Crawl for Single Symbol

```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```

### Schedule Crawls for Multiple Symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```

### Check Crawl Status

```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01'),
);
```

### Get Symbols Needing Crawl

```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```

## Priority Modes

- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing

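As a sketch of how these modes might be derived from a symbol's crawl state (the `classify` helper is an assumption for illustration, not the scheduler's actual code):

```typescript
type PriorityMode = 'never_run' | 'incomplete' | 'stale';

// Map a symbol's saved crawl state onto a priority mode.
// Returns null when the symbol needs no processing at all.
function classify(
  state?: { finished: boolean; newestDateReached?: Date },
  now = new Date(),
): PriorityMode | null {
  if (!state) return 'never_run';           // never crawled: highest priority
  if (!state.finished) return 'incomplete'; // historical crawl still in progress
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  if (!state.newestDateReached || state.newestDateReached < yesterday) {
    return 'stale';                         // complete, but new data available
  }
  return null;                              // fully up to date
}
```

Under this reading, `all` would simply be the union of the three non-null classifications.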
## Scheduled Operations

The system includes one scheduled operation:

- `schedule-intraday-crawls-batch`: Runs every 4 hours and processes incomplete crawls

## Monitoring

Use the provided scripts to monitor crawl progress:

```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for a specific symbol
bun run test/intraday-crawl.test.ts
```

## Performance Considerations

1. **Rate Limiting**: Delays between API calls avoid provider rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: The configurable batch size (default: 7 days) balances progress against memory use
4. **Priority Scheduling**: Current-data updates get higher priority

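Weekend skipping can be illustrated with a small helper (a sketch; the real system may also account for market holidays):

```typescript
// Saturday/Sunday have no intraday bars, so those days can be
// dropped before any API call is made.
function isWeekend(d: Date): boolean {
  const day = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
  return day === 0 || day === 6;
}

// Expand a batch's date range into the weekdays it actually covers.
function tradingDaysInBatch(from: Date, to: Date): Date[] {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const days: Date[] = [];
  for (let t = from.getTime(); t <= to.getTime(); t += DAY_MS) {
    const d = new Date(t);
    if (!isWeekend(d)) days.push(d);
  }
  return days;
}
```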
## Error Handling

- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation

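The per-batch policy above can be sketched as follows (`fetchBatch` is a hypothetical stand-in for the actual batch fetcher; status names are illustrative):

```typescript
// Run all batches, recording failures without aborting, and report
// whether the crawl fully succeeded, partially succeeded, or failed.
async function runBatches(
  batches: string[],
  fetchBatch: (b: string) => Promise<void>,
): Promise<{
  succeeded: number;
  failed: number;
  errors: string[];
  status: 'success' | 'partial' | 'failed';
}> {
  let succeeded = 0;
  const errors: string[] = [];
  for (const batch of batches) {
    try {
      await fetchBatch(batch);
      succeeded++;
    } catch (err) {
      // Log and continue: one bad batch must not stop the whole crawl.
      errors.push(`${batch}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  const failed = errors.length;
  const status = failed === 0 ? 'success' : succeeded === 0 ? 'failed' : 'partial';
  return { succeeded, failed, errors, status };
}
```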