work on qm filings

parent 710577eb3d · commit 960daf4cad
17 changed files with 2319 additions and 32 deletions
apps/stock/data-ingestion/docs/INTRADAY_CRAWL.md (new file, 122 lines)

# Intraday Crawl System

## Overview

The intraday crawl system handles large-scale historical data collection with resumption support. It tracks the oldest and newest dates reached, so an interrupted crawl can resume from where it left off.

## Key Features

1. **Bidirectional Crawling**: Crawls forward (for new data) and backward (for historical data)
2. **Resumption Support**: Tracks progress and resumes from where it left off
3. **Gap Detection**: Automatically detects gaps in data coverage
4. **Batch Processing**: Processes data in configurable batches (default: 7 days)
5. **Completion Tracking**: Knows when a symbol's full history has been fetched

## Crawl State Fields

The system tracks the following state for each symbol:

```typescript
interface CrawlState {
  finished: boolean;            // Whether crawl is complete
  oldestDateReached?: Date;     // Oldest date we've fetched
  newestDateReached?: Date;     // Newest date we've fetched
  lastProcessedDate?: Date;     // Last date processed (for resumption)
  totalDaysProcessed?: number;  // Total days processed so far
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;      // Target date to reach
}
```

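These fields interact whenever a batch completes. As an illustrative sketch only (the `mergeBatchIntoState` helper is hypothetical, not part of the system; the interface is repeated so the snippet stands alone), the state might be folded forward like this:

```typescript
interface CrawlState {
  finished: boolean;
  oldestDateReached?: Date;
  newestDateReached?: Date;
  lastProcessedDate?: Date;
  totalDaysProcessed?: number;
  lastCrawlDirection?: 'forward' | 'backward';
  targetOldestDate?: Date;
}

// Hypothetical helper: fold one processed batch into the crawl state.
function mergeBatchIntoState(
  state: CrawlState,
  batchStart: Date,
  batchEnd: Date,
  daysProcessed: number,
  direction: 'forward' | 'backward',
): CrawlState {
  const oldest = state.oldestDateReached;
  const newest = state.newestDateReached;
  const next: CrawlState = {
    ...state,
    // Widen the covered date range in whichever direction we moved.
    oldestDateReached: !oldest || batchStart < oldest ? batchStart : oldest,
    newestDateReached: !newest || batchEnd > newest ? batchEnd : newest,
    // Resumption point: the trailing edge of the batch just processed.
    lastProcessedDate: direction === 'backward' ? batchStart : batchEnd,
    totalDaysProcessed: (state.totalDaysProcessed ?? 0) + daysProcessed,
    lastCrawlDirection: direction,
  };
  // The crawl is finished once the oldest fetched date reaches the target.
  if (
    next.targetOldestDate &&
    next.oldestDateReached &&
    next.oldestDateReached <= next.targetOldestDate
  ) {
    next.finished = true;
  }
  return next;
}
```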
## How It Works

### Initial Crawl

1. Starts from today and fetches current data
2. Then crawls backward in weekly batches
3. Continues until it reaches the target oldest date (default: 2020-01-01)
4. Marks the crawl as finished when complete

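Steps 2-3 above can be sketched as batch planning (the `planBackwardBatches` helper is an assumption for illustration; the real implementation may differ):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Sketch: walk from a start date back toward targetOldestDate in
// fixed-size batches (default 7 days), clamping the last batch.
function planBackwardBatches(
  start: Date,
  targetOldestDate: Date,
  batchSizeDays = 7,
): Array<{ from: Date; to: Date }> {
  const batches: Array<{ from: Date; to: Date }> = [];
  let end = start;
  while (end > targetOldestDate) {
    const from = new Date(
      Math.max(end.getTime() - batchSizeDays * DAY_MS, targetOldestDate.getTime()),
    );
    batches.push({ from, to: end });
    end = from; // next batch continues where this one began
  }
  return batches;
}
```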
### Resumption After Interruption

1. Forward gap check: if `newestDateReached < yesterday`, fetches new data first
2. Backward gap check: if not finished and `oldestDateReached > targetOldestDate`, continues the backward crawl
3. Resumes from `lastProcessedDate` to avoid re-fetching data

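The two gap checks above can be condensed into one decision function. This is a sketch under simplifying assumptions (no market-calendar handling; the `nextCrawlDirection` name is hypothetical):

```typescript
type Direction = 'forward' | 'backward' | 'done';

// Decide what an interrupted crawl should do next, given its saved state.
function nextCrawlDirection(
  state: {
    finished: boolean;
    oldestDateReached?: Date;
    newestDateReached?: Date;
    targetOldestDate?: Date;
  },
  now = new Date(),
): Direction {
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  // 1. Forward gap: new data has appeared since the last crawl.
  if (state.newestDateReached && state.newestDateReached < yesterday) {
    return 'forward';
  }
  // 2. Backward gap: the historical crawl never reached its target.
  if (
    !state.finished &&
    state.targetOldestDate &&
    state.oldestDateReached &&
    state.oldestDateReached > state.targetOldestDate
  ) {
    return 'backward';
  }
  return 'done';
}
```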
### Daily Updates

Once a symbol is fully crawled:

- Only new data needs fetching (forward crawl)
- Much faster, since this is typically just 1-2 days of data

## Usage

### Manual Crawl for Single Symbol

```typescript
await handler.crawlIntradayData({
  symbol: 'AAPL',
  symbolId: 12345,
  qmSearchCode: 'AAPL',
  targetOldestDate: '2020-01-01',
  batchSize: 7 // Days per batch
});
```

### Schedule Crawls for Multiple Symbols

```typescript
await handler.scheduleIntradayCrawls({
  limit: 50,
  targetOldestDate: '2020-01-01',
  priorityMode: 'incomplete' // or 'never_run', 'stale', 'all'
});
```

### Check Crawl Status

```typescript
const tracker = handler.operationRegistry.getTracker('qm');
const isComplete = await tracker.isIntradayCrawlComplete(
  'AAPL',
  'intraday_bars',
  new Date('2020-01-01'),
);
```

### Get Symbols Needing Crawl

```typescript
const symbols = await tracker.getSymbolsForIntradayCrawl('intraday_bars', {
  limit: 100,
  targetOldestDate: new Date('2020-01-01'),
  includeNewDataGaps: true // Include symbols needing updates
});
```

## Priority Modes

- **never_run**: Symbols that have never been crawled (highest priority)
- **incomplete**: Symbols with unfinished crawls
- **stale**: Symbols with complete crawls but new data available
- **all**: All symbols needing any processing

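As a sketch of how these modes might be derived from a symbol's crawl state (the `classify` helper is an assumption for illustration, not the scheduler's actual code):

```typescript
type PriorityMode = 'never_run' | 'incomplete' | 'stale';

// Map a symbol's saved crawl state onto a priority mode.
// Returns null when the symbol needs no processing at all.
function classify(
  state?: { finished: boolean; newestDateReached?: Date },
  now = new Date(),
): PriorityMode | null {
  if (!state) return 'never_run';           // never crawled: highest priority
  if (!state.finished) return 'incomplete'; // historical crawl still in progress
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  if (!state.newestDateReached || state.newestDateReached < yesterday) {
    return 'stale';                         // complete, but new data available
  }
  return null;                              // fully up to date
}
```

Under this reading, `all` would simply be the union of the three non-null classifications.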
## Scheduled Operations

The system includes one scheduled operation:

- `schedule-intraday-crawls-batch`: Runs every 4 hours and processes incomplete crawls

## Monitoring

Use the provided scripts to monitor crawl progress:

```bash
# Check overall status
bun run scripts/check-intraday-status.ts

# Test crawl for a specific symbol
bun run test/intraday-crawl.test.ts
```

## Performance Considerations

1. **Rate Limiting**: Delays between API calls avoid provider rate limits
2. **Weekend Skipping**: Automatically skips weekends to save API calls
3. **Batch Size**: The configurable batch size (default: 7 days) balances progress against memory use
4. **Priority Scheduling**: Current-data updates get higher priority

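Weekend skipping can be illustrated with a small helper (a sketch; the real system may also account for market holidays):

```typescript
// Saturday/Sunday have no intraday bars, so those days can be
// dropped before any API call is made.
function isWeekend(d: Date): boolean {
  const day = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
  return day === 0 || day === 6;
}

// Expand a batch's date range into the weekdays it actually covers.
function tradingDaysInBatch(from: Date, to: Date): Date[] {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const days: Date[] = [];
  for (let t = from.getTime(); t <= to.getTime(); t += DAY_MS) {
    const d = new Date(t);
    if (!isWeekend(d)) days.push(d);
  }
  return days;
}
```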
## Error Handling

- Failed batches don't stop the entire crawl
- Errors are logged and stored in the operation status
- Partial success is tracked separately from complete failure
- Session failures trigger automatic session rotation

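The per-batch policy above can be sketched as follows (`fetchBatch` is a hypothetical stand-in for the actual batch fetcher; status names are illustrative):

```typescript
// Run all batches, recording failures without aborting, and report
// whether the crawl fully succeeded, partially succeeded, or failed.
async function runBatches(
  batches: string[],
  fetchBatch: (b: string) => Promise<void>,
): Promise<{
  succeeded: number;
  failed: number;
  errors: string[];
  status: 'success' | 'partial' | 'failed';
}> {
  let succeeded = 0;
  const errors: string[] = [];
  for (const batch of batches) {
    try {
      await fetchBatch(batch);
      succeeded++;
    } catch (err) {
      // Log and continue: one bad batch must not stop the whole crawl.
      errors.push(`${batch}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  const failed = errors.length;
  const status = failed === 0 ? 'success' : succeeded === 0 ? 'failed' : 'partial';
  return { succeeded, failed, errors, status };
}
```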