Commit Graph

6 Commits

Author SHA1 Message Date
Soldier
c395a57b38 Add recurring job scheduling with frequency
Add frequency_minutes field to schedule recurring jobs. Jobs with frequency_minutes > 0 run repeatedly at the specified interval, automatically rescheduling after each execution. One-time jobs (frequency_minutes = 0) remain unchanged. Status of recurring jobs transitions from pending to active.
2025-11-16 09:17:30 +00:00
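
The rescheduling rule in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the `Job` struct, the `Finish` function, and the `NextRun` field are hypothetical names, and the exact point at which a recurring job moves from pending to active is assumed to be the end of an execution. Only `frequency_minutes` and the status values come from the commit messages.

```go
package main

import (
	"fmt"
	"time"
)

// Job is a hypothetical in-memory view of a row in the jobs table.
type Job struct {
	ID               int
	Status           string    // "pending", "active", or "done"
	FrequencyMinutes int       // frequency_minutes; 0 means one-time
	NextRun          time.Time // assumed column, not named in the commit
}

// Finish returns the job's state after one execution that ended at now.
// Recurring jobs become "active" and are rescheduled; one-time jobs are
// marked "done" and get no further run time.
func Finish(j Job, now time.Time) Job {
	if j.FrequencyMinutes > 0 {
		j.Status = "active"
		j.NextRun = now.Add(time.Duration(j.FrequencyMinutes) * time.Minute)
		return j
	}
	j.Status = "done"
	return j
}

func main() {
	now := time.Date(2025, 11, 16, 9, 0, 0, 0, time.UTC)

	recurring := Finish(Job{ID: 1, Status: "pending", FrequencyMinutes: 15}, now)
	fmt.Println(recurring.Status, recurring.NextRun.Format(time.RFC3339))

	oneTime := Finish(Job{ID: 2, Status: "pending"}, now)
	fmt.Println(oneTime.Status)
}
```

Computing the next run from the completion time, as assumed here, lets the interval drift by the job's duration; scheduling from the previous planned start would avoid that at the cost of possible overlap.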
Soldier
985d340855 Add raw HTML archiving for historical re-parsing
Store complete HTML response in raw_html column before extraction. Enables re-running selectors on historical scrapes when sites change their DOM structure or CSS classes.
2025-11-16 08:43:46 +00:00
Soldier
405f9ca173 Add flexible CSS selector extraction
Replace hardcoded title extraction with user-defined CSS selectors using goquery. Users specify a selector in the job JSON to extract any HTML elements. Worker extracts text content plus src/href attributes. Webhook payload includes the extracted content and the URL.
2025-11-16 08:33:19 +00:00
Soldier
1ce45cfe97 Add URL scraping with ethical web crawling
Replace sleep with actual URL fetching. Worker scrapes HTML title from URLs, respects robots.txt, and includes proper User-Agent headers. Scraped titles stored in SQLite and sent via webhook callback.
2025-11-16 08:18:31 +00:00
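
A rough sketch of the "check robots.txt, then fetch with a User-Agent" flow described in this commit, using only the Go standard library. It rests on several assumptions: `Allowed`, `Fetch`, `ErrDisallowed`, and the `example-bot/0.1` agent string are hypothetical names, and the robots.txt handling is deliberately simplified (it honours only `Disallow` prefixes in `User-agent: *` groups and ignores `Allow`, wildcards, and agent-specific groups). The repository may do this differently.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// ErrDisallowed is returned when robots.txt blocks the requested path.
var ErrDisallowed = errors.New("blocked by robots.txt")

// Allowed reports whether path is permitted by the "User-agent: *" groups
// of a robots.txt body. Simplified: only plain Disallow prefixes are checked.
func Allowed(robots, path string) bool {
	inStar, lastWasAgent := false, false
	for _, line := range strings.Split(robots, "\n") {
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop trailing comment
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)
		if key == "user-agent" {
			if !lastWasAgent {
				inStar = false // a new group starts here
			}
			if val == "*" {
				inStar = true
			}
			lastWasAgent = true
			continue
		}
		lastWasAgent = false
		if key == "disallow" && inStar && val != "" && strings.HasPrefix(path, val) {
			return false
		}
	}
	return true
}

// get performs a GET with the given User-Agent and returns status and body.
func get(client *http.Client, url, ua string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("User-Agent", ua)
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(b), err
}

// Fetch consults base+"/robots.txt" first; a non-200 robots response is
// treated as "no rules" in this sketch.
func Fetch(client *http.Client, base, path, ua string) (string, error) {
	status, robots, err := get(client, base+"/robots.txt", ua)
	if err != nil {
		return "", err
	}
	if status == http.StatusOK && !Allowed(robots, path) {
		return "", ErrDisallowed
	}
	_, body, err := get(client, base+path, ua)
	return body, err
}

// demo runs Fetch against a local test server and returns the body of an
// allowed page plus the error produced for a disallowed path.
func demo() (string, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "User-agent: *\nDisallow: /private\n")
	})
	mux.HandleFunc("/page", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "UA="+r.UserAgent())
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	body, err := Fetch(srv.Client(), srv.URL, "/page", "example-bot/0.1")
	if err != nil {
		return "", err
	}
	_, blocked := Fetch(srv.Client(), srv.URL, "/private/x", "example-bot/0.1")
	return body, blocked
}

func main() {
	body, blocked := demo()
	fmt.Println(body)
	fmt.Println(blocked)
}
```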
Soldier
018d699e31 Add webhook callback support
Add webhook_url column to jobs table. POST /jobs endpoint accepts JSON payload with optional webhook_url. After job completion, worker POSTs to webhook with status and duration.
2025-11-16 08:01:53 +00:00
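
The callback step in this commit might look roughly like the sketch below. The commit only states that the worker POSTs status and duration to an optional `webhook_url`; the `WebhookPayload` struct, its JSON field names, and the `Notify` function are assumptions made for illustration, and the receiver is a local `httptest` server rather than a real endpoint.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// WebhookPayload is a hypothetical callback body; field names are assumed.
type WebhookPayload struct {
	JobID      int    `json:"job_id"`
	Status     string `json:"status"`
	DurationMS int64  `json:"duration_ms"`
}

// Notify POSTs the payload as JSON. An empty webhookURL is a no-op,
// because webhook_url is optional on job creation.
func Notify(client *http.Client, webhookURL string, p WebhookPayload) error {
	if webhookURL == "" {
		return nil
	}
	body, err := json.Marshal(p)
	if err != nil {
		return err
	}
	resp, err := client.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}

// demo starts a local receiver, sends one callback, and returns the
// request body the receiver saw.
func demo() (string, error) {
	received := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		received <- string(b)
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	err := Notify(client, srv.URL, WebhookPayload{JobID: 1, Status: "done", DurationMS: 2000})
	if err != nil {
		return "", err
	}
	return <-received, nil
}

func main() {
	body, err := demo()
	if err != nil {
		fmt.Println("webhook failed:", err)
		return
	}
	fmt.Println(body)
}
```

A client timeout is set so that a slow receiver cannot stall the worker indefinitely; whether the original worker retries failed callbacks is not stated in the commit.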
Soldier
40d194beb1 Add SQLite persistence and worker
Add jobs table with ID, status, and created_at fields. POST /jobs endpoint creates pending jobs in SQLite. Worker polls every 5s for pending jobs, processes each with a 2s delay, and marks it as done.
2025-11-16 07:50:59 +00:00
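
The polling worker from this first commit can be approximated with a ticker loop. This is a minimal sketch under assumptions: an in-memory `Store` stands in for the SQLite jobs table, the names `Store`, `RunWorker`, and `demo` are hypothetical, and the poll interval and work delay are parameters (the commit uses 5s and 2s; the demo uses much shorter values so it finishes quickly).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Job mirrors the columns named in the commit: ID, status, created_at.
type Job struct {
	ID        int
	Status    string // "pending" or "done"
	CreatedAt time.Time
}

// Store is a hypothetical in-memory stand-in for the SQLite jobs table.
type Store struct {
	mu   sync.Mutex
	jobs []*Job
}

// Add inserts a new pending job and returns its ID.
func (s *Store) Add() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	j := &Job{ID: len(s.jobs) + 1, Status: "pending", CreatedAt: time.Now()}
	s.jobs = append(s.jobs, j)
	return j.ID
}

// PendingIDs returns the IDs of all jobs still marked pending.
func (s *Store) PendingIDs() []int {
	s.mu.Lock()
	defer s.mu.Unlock()
	var ids []int
	for _, j := range s.jobs {
		if j.Status == "pending" {
			ids = append(ids, j.ID)
		}
	}
	return ids
}

// SetStatus updates the status of the job with the given ID.
func (s *Store) SetStatus(id int, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, j := range s.jobs {
		if j.ID == id {
			j.Status = status
		}
	}
}

// Status returns the job's status, or "" if the ID is unknown.
func (s *Store) Status(id int) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, j := range s.jobs {
		if j.ID == id {
			return j.Status
		}
	}
	return ""
}

// RunWorker polls for pending jobs every pollEvery until ctx is cancelled.
// workDelay is a placeholder for real work, like the 2s delay in the commit.
func RunWorker(ctx context.Context, s *Store, pollEvery, workDelay time.Duration) {
	ticker := time.NewTicker(pollEvery)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, id := range s.PendingIDs() {
				time.Sleep(workDelay)
				s.SetStatus(id, "done")
			}
		}
	}
}

// demo creates one job, runs the worker briefly, and returns the final status.
func demo() string {
	s := &Store{}
	id := s.Add()
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	RunWorker(ctx, s, 10*time.Millisecond, 20*time.Millisecond)
	return s.Status(id)
}

func main() {
	fmt.Println(demo())
}
```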