work on qm filings

Boki 2025-07-01 15:35:56 -04:00
parent 710577eb3d
commit 960daf4cad
17 changed files with 2319 additions and 32 deletions

# Intraday Crawl System
## Overview
The intraday crawl system is designed to handle large-scale historical data collection with proper resumption support. It tracks the oldest and newest dates reached, allowing it to resume from where it left off if interrupted.
## Key Features
1. **Bidirectional Crawling**: Can crawl both forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and can resume from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched
## Crawl State Fields
The system tracks the following state for each symbol:
```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```
## How It Works
### Initial Crawl
1. Starts from today and fetches current data
2. Then begins crawling backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks as finished when complete
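The backward weekly batching described above can be sketched as a date-window computation. The function and field names here are illustrative, not the actual implementation; dates are treated as UTC day boundaries:

```typescript
// Sketch of the backward batch loop: each step yields the next window of
// days to fetch, moving back in time until the target oldest date.
const DAY_MS = 86_400_000; // one day in milliseconds

interface BatchWindow {
  from: Date; // inclusive, older bound
  to: Date;   // inclusive, newer bound
}

// Compute the next backward batch given the oldest date reached so far.
// Returns null once the target oldest date has been reached.
function nextBackwardBatch(
  oldestDateReached: Date,
  targetOldestDate: Date,
  batchSize = 7 // days per batch, matching the default above
): BatchWindow | null {
  if (oldestDateReached.getTime() <= targetOldestDate.getTime()) return null;
  const to = new Date(oldestDateReached.getTime() - DAY_MS); // day before oldest
  let from = new Date(to.getTime() - (batchSize - 1) * DAY_MS);
  if (from.getTime() < targetOldestDate.getTime()) from = targetOldestDate; // clamp
  return { from, to };
}
```

After each batch, the real system would persist `oldestDateReached = from` so the next call (or a resumed crawl) continues from the right place.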
### Resumption After Interruption
1. Checks for forward gap: If `newestDateReached < yesterday`, fetches new data first
2. Checks for backward gap: If not finished and `oldestDateReached > targetOldestDate`, continues backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data
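The resumption checks above amount to a small decision function. This is a hypothetical sketch: the state fields mirror the `CrawlState` interface, while the action type is invented for illustration:

```typescript
const DAY_MS = 86_400_000; // one day in milliseconds

interface ResumeState {
  finished: boolean;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
}

type CrawlAction =
  | { kind: 'forward'; from: Date }        // fill the gap up to yesterday
  | { kind: 'backward'; resumeFrom: Date } // continue the historical crawl
  | { kind: 'done' };

function nextCrawlAction(state: ResumeState, now = new Date()): CrawlAction {
  const yesterday = new Date(now.getTime() - DAY_MS);
  // 1. Forward gap: we are behind on recent data, so catch up first.
  if (state.newestDateReached && state.newestDateReached.getTime() < yesterday.getTime()) {
    return { kind: 'forward', from: new Date(state.newestDateReached.getTime() + DAY_MS) };
  }
  // 2. Backward gap: historical crawl not finished, resume where we stopped.
  if (!state.finished && state.lastProcessedDate) {
    return { kind: 'backward', resumeFrom: state.lastProcessedDate };
  }
  return { kind: 'done' };
}
```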
### Daily Updates
Once a symbol is fully crawled:
- Only new data needs to be fetched (forward crawl)
- Updates are much faster, typically covering just 1-2 days of data
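The forward-only daily update reduces to computing the missing recent range. The helper below is hypothetical; the actual fetch is performed by the QM handler:

```typescript
const DAY_MS = 86_400_000; // one day in milliseconds

// Compute the range of days still missing at the forward edge,
// or null if the symbol is already current.
function forwardUpdateRange(
  newestDateReached: Date,
  today = new Date()
): { from: Date; to: Date } | null {
  const from = new Date(newestDateReached.getTime() + DAY_MS); // day after newest
  // Truncate "today" to a UTC day boundary for comparison.
  const to = new Date(Date.UTC(
    today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate()
  ));
  return from.getTime() > to.getTime() ? null : { from, to };
}
```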
## Usage
### Manual Crawl for Single Symbol
```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```
### Schedule Crawls for Multiple Symbols
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```
### Check Crawl Status
```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete('AAPL', 'intraday_bars', new Date('2020-01-01'));
```
### Get Symbols Needing Crawl
```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```
## Priority Modes
- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing
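The priority modes above imply an ordering that can be sketched as a ranking function. The `SymbolCrawlInfo` shape here is invented for illustration, not the tracker's actual record type:

```typescript
// Hypothetical per-symbol summary used to rank crawl candidates.
interface SymbolCrawlInfo {
  symbol: string;
  hasCrawlState: boolean;  // false => never run
  finished: boolean;       // backward crawl complete
  hasForwardGap: boolean;  // new data available
}

// Lower number = higher priority, matching the order listed above.
function crawlPriority(s: SymbolCrawlInfo): number {
  if (!s.hasCrawlState) return 0; // never_run: highest priority
  if (!s.finished) return 1;      // incomplete
  if (s.hasForwardGap) return 2;  // stale
  return 3;                       // nothing to do
}

const queue = (symbols: SymbolCrawlInfo[]) =>
  [...symbols].sort((a, b) => crawlPriority(a) - crawlPriority(b));
```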
## Scheduled Operations
The system includes one scheduled operation:
- `schedule-intraday-crawls-batch`: Runs every 4 hours, processes incomplete crawls
## Monitoring
Use the provided scripts to monitor crawl progress:
```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for specific symbol
bun run test/intraday-crawl.test.ts
```
## Performance Considerations
1. **Rate Limiting**: Delays between API calls to avoid rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: Configurable batch size (default 7 days) balances progress vs memory
4. **Priority Scheduling**: Higher priority for current data updates
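The weekend-skip optimisation can be sketched as a trading-day filter. This helper is illustrative only, using UTC day-of-week; market holidays are out of scope here:

```typescript
const DAY_MS = 86_400_000; // one day in milliseconds

// Enumerate weekdays in [from, to], skipping Saturdays and Sundays
// so no API calls are spent on days with no intraday data.
function tradingDays(from: Date, to: Date): Date[] {
  const days: Date[] = [];
  for (let t = from.getTime(); t <= to.getTime(); t += DAY_MS) {
    const d = new Date(t);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) days.push(d);
  }
  return days;
}
```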
## Error Handling
- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation

# Operation Tracker Enhancements for Intraday Crawling
## Summary of Changes
This document summarizes the enhancements made to the operation tracker to support sophisticated intraday data crawling with resumption capabilities.
## Changes Made
### 1. Enhanced CrawlState Interface (`types.ts`)
Added new fields to track crawl progress:
- `newestDateReached`: Track the most recent date processed
- `lastProcessedDate`: For resumption after interruption
- `totalDaysProcessed`: Progress tracking
- `targetOldestDate`: The goal date to reach
### 2. Updated OperationTracker (`OperationTracker.ts`)
- Modified `updateSymbolOperation` to handle new crawl state fields
- Updated `bulkUpdateSymbolOperations` for proper Date handling
- Enhanced `markCrawlFinished` to track both oldest and newest dates
- Added `getSymbolsForIntradayCrawl`: Specialized method for intraday crawls with gap detection
- Added `isIntradayCrawlComplete`: Check if a crawl has reached its target
- Added new indexes for efficient querying on crawl state fields
### 3. New Intraday Crawl Action (`intraday-crawl.action.ts`)
Created a sophisticated crawl system with:
- **Bidirectional crawling**: Handles both forward (new data) and backward (historical) gaps
- **Batch processing**: Processes data in weekly batches by default
- **Resumption logic**: Can resume from where it left off if interrupted
- **Gap detection**: Automatically identifies missing date ranges
- **Completion tracking**: Knows when the full history has been fetched
### 4. Integration with QM Handler
- Added new operations: `crawl-intraday-data` and `schedule-intraday-crawls`
- Added scheduled operation to automatically process incomplete crawls every 4 hours
- Integrated with the existing operation registry system
### 5. Testing and Monitoring Tools
- Created test script to verify crawl functionality
- Created status checking script to monitor crawl progress
- Added comprehensive documentation
## How It Works
### Initial Crawl Flow
1. Symbol starts with no crawl state
2. First crawl fetches today's data and sets `newestDateReached`
3. Subsequent batches crawl backward in time
4. Each batch updates `oldestDateReached` and `lastProcessedDate`
5. When `oldestDateReached <= targetOldestDate`, crawl is marked finished
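The completion rule in step 5 reduces to a one-line predicate. This is a sketch of the check, not the tracker's `isIntradayCrawlComplete` implementation:

```typescript
// A crawl is complete once the oldest date reached is at or before the
// target oldest date; a symbol with no crawl state is never complete.
function isCrawlComplete(
  oldestDateReached: Date | undefined,
  targetOldestDate: Date
): boolean {
  return oldestDateReached !== undefined &&
    oldestDateReached.getTime() <= targetOldestDate.getTime();
}
```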
### Resumption Flow
1. Check if `newestDateReached < yesterday` (forward gap)
2. If yes, fetch new data first to stay current
3. Check if `finished = false` (backward gap)
4. If yes, continue backward crawl from `lastProcessedDate`
5. Process in batches until complete
### Daily Update Flow
1. For finished crawls, only check for forward gaps
2. Fetch data from `newestDateReached + 1` to today
3. Update `newestDateReached` to maintain currency
## Benefits
1. **Resilient**: Can handle interruptions gracefully
2. **Efficient**: Avoids re-fetching data
3. **Trackable**: Clear progress visibility
4. **Scalable**: Can handle thousands of symbols
5. **Flexible**: Configurable batch sizes and target dates
## Usage Examples
### Check if symbol X needs crawling:
```typescript
const tracker = operationRegistry.getTracker('qm');
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 1,
  targetOldestDate: new Date('2020-01-01')
});
const symbolX = symbols.find(s => s.symbol === 'X');
```
### Start crawl for symbol X only:
```typescript
await handler.crawlIntradayData({
  symbol: 'X',
  symbolId: symbolData.symbolId,
  qmSearchCode: symbolData.qmSearchCode,
  targetOldestDate: '2020-01-01'
});
```
### Schedule crawls for never-run symbols:
```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  priorityMode: 'never_run',
  targetOldestDate: '2020-01-01'
});
```
## Next Steps
1. Monitor the crawl progress using the provided scripts
2. Adjust batch sizes based on API rate limits
3. Consider adding more sophisticated retry logic for failed batches
4. Implement data validation to ensure quality