# AlpenQueue: A lightweight, self-hosted task queue in Go

A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.

## Features

- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
- **Ethical Web Crawling** - Respects robots.txt and includes proper User-Agent headers
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POST extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Run jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - Single endpoint to create and manage jobs

## Quick Start

### Prerequisites

- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)

### Installation

```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```

### Running the Server

```bash
./alpenqueue
# Server starts on :8080
```

Or run directly with Go:

```bash
go run ./cmd/alpenqueue
```

### Your First Job

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "selector": "title"
  }'
```

The worker will scrape the page, extract the title, and store the result in SQLite.

## API Documentation

### POST /jobs

Create a new scraping job.

**Request Body:**

```json
{
  "url": "https://example.com",
  "selector": "title",
  "webhook_url": "https://my-site.com/webhook",
  "frequency_minutes": 20
}
```

**Fields:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: `"title"`) |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |

**Response:**

```
HTTP/1.1 201 Created
Job 1 created
```

### Webhook Payload

When a job completes, AlpenQueue POSTs to your `webhook_url`:

```json
{
  "status": "ok",
  "took": "1.2s",
  "url": "https://example.com",
  "content": "Example Domain"
}
```

**Status values:**

- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
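Your endpoint only needs to accept a JSON POST in this shape. As a starting point, here is a minimal sketch of a Go receiver that logs each completed job; the `/webhook` path and `:9090` port are placeholders for your own service, not anything AlpenQueue prescribes.

```go
// webhook_receiver.go - a minimal sketch of an endpoint that accepts
// AlpenQueue webhook callbacks. The payload fields mirror the
// "Webhook Payload" section above; the path and port are placeholders.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// WebhookPayload mirrors the JSON AlpenQueue POSTs on job completion.
type WebhookPayload struct {
	Status  string `json:"status"`  // "ok", "blocked", or "error"
	Took    string `json:"took"`    // scrape duration, e.g. "1.2s"
	URL     string `json:"url"`     // the URL that was scraped
	Content string `json:"content"` // extracted content
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var p WebhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		// Replace this log line with whatever your application does
		// with scrape results (alerting, storage, diffing, ...).
		log.Printf("job finished: status=%s took=%s url=%s content=%q",
			p.Status, p.Took, p.URL, p.Content)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```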
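You can exercise a receiver like this locally by replaying the documented payload by hand:

```bash
curl -X POST http://localhost:9090/webhook \
  -H "Content-Type: application/json" \
  -d '{"status": "ok", "took": "1.2s", "url": "https://example.com", "content": "Example Domain"}'
```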
## Usage Examples

### One-Time Scraping

Extract the main heading from a page:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "selector": ".titleline > a"
  }'
```

### Recurring Jobs

Monitor a price every 20 minutes:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.com/product",
    "selector": ".price",
    "frequency_minutes": 20,
    "webhook_url": "https://my-api.com/price-alert"
  }'
```

### Extract Images

Get all image sources:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/gallery",
    "selector": "img"
  }'
```

The worker extracts both text content and `src`/`href` attributes automatically.

### Extract Links

Get all download links:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://releases.example.com",
    "selector": "a.download-link"
  }'
```

## Architecture

```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```

### Job Lifecycle

1. **pending** - Newly created, waiting for worker
2. **active** - Recurring job, scheduled for next run
3. **done** - One-time job completed

### Worker Behavior

- Polls the database every 5 seconds for jobs where `next_run_at <= now`
- Fetches the URL with a 30s timeout
- Checks robots.txt before each request
- Saves the complete HTML to the `raw_html` column
- Extracts content using the goquery CSS selector
- POSTs the webhook if configured
- For recurring jobs: updates `next_run_at` and sets status to `active`
- For one-time jobs: marks status as `done`

## Development

### Project Structure

```
AlpenQueue/
├── cmd/
│   └── alpenqueue/
│       └── main.go        # HTTP server and routes
├── pkg/
│   ├── db/
│   │   └── db.go          # SQLite schema and queries
│   └── worker/
│       └── worker.go      # Job processing and scraping
├── go.mod
└── README.md
```

### Building from Source

```bash
go build -o alpenqueue ./cmd/alpenqueue
```

### Dependencies

- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing

## Configuration

### Database Location

By default, the database is created at `./alpenqueue.db` in the working directory. Modify in `main.go`:

```go
database, err := db.Init("./alpenqueue.db")
```

### Server Port

The default port is `:8080`. Change it in `main.go`:

```go
http.ListenAndServe(":8080", nil)
```

### Worker Poll Interval

The worker polls every 5 seconds. Modify in `worker.go`:

```go
time.Sleep(5 * time.Second)
```

### User-Agent

Set in `worker.go`:

```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```

## Ethical Web Scraping

AlpenQueue follows ethical scraping practices:

- **robots.txt compliance** - Checks and respects disallow rules
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - Jobs are processed on a configurable poll interval rather than in rapid bursts
- **Timeout enforcement** - A 30-second request timeout prevents hanging connections

Always ensure you have permission to scrape websites and comply with their terms of service.

## License

MIT License - see the LICENSE file for details.

## Contributing

Contributions welcome! Please open an issue or pull request.