work on qm filings

Boki 2025-07-01 15:35:56 -04:00
parent 710577eb3d
commit 960daf4cad
17 changed files with 2319 additions and 32 deletions

# Intraday Crawl System
## Overview
The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.
## Key Features
1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched
## Crawl State Fields
The system tracks the following state for each symbol:
```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```
## How It Works
### Initial Crawl
1. Starts from today and fetches current data
2. Then begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks as finished when complete
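The backward weekly batching described above can be sketched as a date-window computation. The function and field names here are illustrative, not the actual implementation; dates are treated as UTC day boundaries:

```typescript
// Sketch of the backward batch loop: each step yields the next window of
// days to fetch, moving back in time until the target oldest date.
const DAY_MS = 86_400_000; // one day in milliseconds

interface BatchWindow {
  from: Date; // inclusive, older bound
  to: Date;   // inclusive, newer bound
}

// Compute the next backward batch given the oldest date reached so far.
// Returns null once the target oldest date has been reached.
function nextBackwardBatch(
  oldestDateReached: Date,
  targetOldestDate: Date,
  batchSize = 7 // days per batch, matching the default above
): BatchWindow | null {
  if (oldestDateReached.getTime() <= targetOldestDate.getTime()) return null;
  const to = new Date(oldestDateReached.getTime() - DAY_MS); // day before oldest
  let from = new Date(to.getTime() - (batchSize - 1) * DAY_MS);
  if (from.getTime() < targetOldestDate.getTime()) from = targetOldestDate; // clamp
  return { from, to };
}
```

After each batch, the real system would persist `oldestDateReached = from` so the next call (or a resumed crawl) continues from the right place.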
### Resumption After Interruption
1. Checks for forward gap: If `newestDateReached < yesterday`, fetches new data first
2. Checks for backward gap: If not finished and `oldestDateReached > targetOldestDate`, continues backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data
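The resumption checks above amount to a small decision function. This is a hypothetical sketch: the state fields mirror the `CrawlState` interface, while the action type is invented for illustration:

```typescript
const DAY_MS = 86_400_000; // one day in milliseconds

interface ResumeState {
  finished: boolean;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
}

type CrawlAction =
  | { kind: 'forward'; from: Date }        // fill the gap up to yesterday
  | { kind: 'backward'; resumeFrom: Date } // continue the historical crawl
  | { kind: 'done' };

function nextCrawlAction(state: ResumeState, now = new Date()): CrawlAction {
  const yesterday = new Date(now.getTime() - DAY_MS);
  // 1. Forward gap: we are behind on recent data, so catch up first.
  if (state.newestDateReached && state.newestDateReached.getTime() < yesterday.getTime()) {
    return { kind: 'forward', from: new Date(state.newestDateReached.getTime() + DAY_MS) };
  }
  // 2. Backward gap: historical crawl not finished, resume where we stopped.
  if (!state.finished && state.lastProcessedDate) {
    return { kind: 'backward', resumeFrom: state.lastProcessedDate };
  }
  return { kind: 'done' };
}
```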
### Daily Updates
Once a symbol is fully crawled:
- Only new data needs to be fetched (forward crawl)
- Updates are much faster, typically covering just 1-2 days of data
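The forward-only daily update reduces to computing the missing recent range. The helper below is hypothetical; the actual fetch is performed by the QM handler:

```typescript
const DAY_MS = 86_400_000; // one day in milliseconds

// Compute the range of days still missing at the forward edge,
// or null if the symbol is already current.
function forwardUpdateRange(
  newestDateReached: Date,
  today = new Date()
): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + DAY_MS); // day after newest
  // Truncate "today" to a UTC day boundary for comparison.
  const to = new Date(Date.UTC(
    today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate()
  ));
  return from.getTime() > to.getTime() ? null : { from, to };
}
```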
## Usage
### Manual Crawl for Single Symbol
```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```
### Schedule Crawls for Multiple Symbols
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```
### Check Crawl Status
```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete('AAPL', 'intraday_bars', new Date('2020-01-01'));
```
### Get Symbols Needing Crawl
```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```
## Priority Modes
- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing
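The priority modes above imply an ordering that can be sketched as a ranking function. The `SymbolCrawlInfo` shape here is invented for illustration, not the tracker's actual record type:

```typescript
// Hypothetical per-symbol summary used to rank crawl candidates.
interface SymbolCrawlInfo {
  symbol: string;
  hasCrawlState: boolean;  // false => never run
  finished: boolean;       // backward crawl complete
  hasForwardGap: boolean;  // new data available
}

// Lower number = higher priority, matching the order listed above.
function crawlPriority(s: SymbolCrawlInfo): number {
  if (!s.hasCrawlState) return 0; // never_run: highest priority
  if (!s.finished) return 1;      // incomplete
  if (s.hasForwardGap) return 2;  // stale
  return 3;                       // nothing to do
}

const queue = (symbols: SymbolCrawlInfo[]) =>
  [...symbols].sort((a, b) => crawlPriority(a) - crawlPriority(b));
```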
## Scheduled Operations
The system includes one scheduled operation:
- `schedule-intraday-crawls-batch`: Runs every 4 hours, processes incomplete crawls
## Monitoring
Use the provided scripts to monitor crawl progress:
```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```
## Performance Considerations
1. **Rate Limiting**: Delays between API calls to avoid rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: Configurable batch size (default 7 days) balances progress vs memory
4. **Priority Scheduling**: Higher priority for current data updates
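The weekend-skip optimisation can be sketched as a trading-day filter. This helper is illustrative only, using UTC day-of-week; market holidays are out of scope here:

```typescript
const DAY_MS = 86_400_000; // one day in milliseconds

// Enumerate weekdays in [from, to], skipping Saturdays and Sundays
// so no API calls are spent on days with no intraday data.
function tradingDays(from: Date, to: Date): Date[] {
  const days: Date[] = [];
  for (let t = from.getTime(); t <= to.getTime(); t += DAY_MS) {
    const d = new Date(t);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) days.push(d);
  }
  return days;
}
```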
## Error Handling
- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation

# Operation Tracker Enhancements for Intraday Crawling
## Summary of Changes
This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.
## Changes Made
### 1. Enhanced CrawlState Interface (`types.ts`)
Added new fields to track crawl progress:
- `newestDateReached`: Track the most recent date processed
- `lastProcessedDate`: For resumption after interruption
- `totalDaysProcessed`: Progress tracking
- `targetOldestDate`: The goal date to reach
### 2. Updated OperationTracker (`OperationTracker.ts`)
- Modified `updateSymbolOperation` to handle new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper Date handling
- Enhanced `markCrawlFinished` to track both oldest and newest dates
- Added `getSymbolsForIntradayCrawl`: Specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: Check if a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields
### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)
Created a sophisticated crawl system with:
- **Bidirectional crawling**: Handles both forward (new data) and backward (historical) gaps
- **Batch processing**: Processes data in weekly batches by default
- **Resumption logic**: Can resume from where it left off if interrupted
- **Gap detection**: Automatically identifies missing date ranges
- **Completion tracking**: Knows when the full history has been fetched
### 4. Integration with QM Handler
- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added scheduled operation to automatically process incomplete crawls every 4 hours
- Integrated with the existing operation registry system
### 5. Testing and Monitoring Tools
- Created test script to verify crawl functionality
- Created status checking script to monitor crawl progress
- Added comprehensive documentation
## How It Works
### Initial Crawl Flow
1. Symbol starts with no crawl state
2. First crawl fetches today's data and sets `newestDateReached`
3. Subsequent batches crawl backward in time
4. Each batch updates `oldestDateReached` and `lastProcessedDate`
5. When `oldestDateReached <= targetOldestDate`, crawl is marked finished
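The completion rule in step 5 reduces to a one-line predicate. This is a sketch of the check, not the tracker's `isIntradayCrawlComplete` implementation:

```typescript
// A crawl is complete once the oldest date reached is at or before the
// target oldest date; a symbol with no crawl state is never complete.
function isCrawlComplete(
  oldestDateReached: Date | undefined,
  targetOldestDate: Date
): boolean {
  return oldestDateReached !== undefined &&
    oldestDateReached.getTime() <= targetOldestDate.getTime();
}
```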
### Resumption Flow
1. Check if `newestDateReached < yesterday` (forward gap)
2. If yes, fetch new data first to stay current
3. Check if `finished = false` (backward gap)
4. If yes, continue backward crawl from `lastProcessedDate`
5. Process in batches until complete
### Daily Update Flow
1. For finished crawls, only check for forward gaps
2. Fetch data from `newestDateReached + 1` to today
3. Update `newestDateReached` to maintain currency
## Benefits
1. **Resilient**: Can handle interruptions gracefully
2. **Efficient**: Avoids re-fetching data
3. **Trackable**: Clear progress visibility
4. **Scalable**: Can handle thousands of symbols
5. **Flexible**: Configurable batch sizes and target dates
## Usage Examples
### Check if symbol X needs crawling:
```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```
### Start crawl for symbol X only:
```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```
### Schedule crawls for never-run symbols:
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```
## Next Steps
1. Monitor the crawl progress using the provided scripts
2. Adjust batch sizes based on API rate limits
3. Consider adding more sophisticated retry logic for failed batches
4. Implement data validation to ensure quality