# AlpenQueue: A lightweight, self-hosted task queue in Go
A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.
## Features

- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
- **Ethical Web Crawling** - Respects robots.txt and includes proper User-Agent headers
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POST extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Run jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - Single endpoint to create and manage jobs
## Quick Start

### Prerequisites
- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)
### Installation

```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```
### Running the Server

```bash
./alpenqueue
# Server starts on :8080
```

Or run directly with Go:

```bash
cd cmd/alpenqueue
go run main.go
```
### Your First Job

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "selector": "title"
  }'
```
The worker will scrape the page, extract the title, and store the result in SQLite.
## API Documentation

### POST /jobs
Create a new scraping job.
**Request Body:**

```json
{
  "url": "https://example.com",
  "selector": "title",
  "webhook_url": "https://my-site.com/webhook",
  "frequency_minutes": 20
}
```
**Fields:**

| Field | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: `"title"`) |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |
**Response:**

```
HTTP/1.1 201 Created

Job 1 created
```
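For reference, the same request can be made from Go. This is a minimal client sketch against the documented endpoint, not code from the project:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// JobRequest mirrors the documented POST /jobs body.
type JobRequest struct {
	URL              string `json:"url"`
	Selector         string `json:"selector,omitempty"`
	WebhookURL       string `json:"webhook_url,omitempty"`
	FrequencyMinutes int    `json:"frequency_minutes,omitempty"`
}

func main() {
	body, err := json.Marshal(JobRequest{
		URL:      "https://example.com",
		Selector: "title",
	})
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:8080/jobs",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect "201 Created"
}
```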
### Webhook Payload

When a job completes, AlpenQueue POSTs to your `webhook_url`:
```json
{
  "status": "ok",
  "took": "1.2s",
  "url": "https://example.com",
  "content": "Example Domain"
}
```
**Status values:**

- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
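A receiving service just needs to accept this JSON via POST. A minimal sketch of a webhook handler (your own endpoint, not part of AlpenQueue):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Payload mirrors the documented webhook body.
type Payload struct {
	Status  string `json:"status"`
	Took    string `json:"took"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p Payload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job for %s: status=%s content=%q (took %s)",
			p.URL, p.Status, p.Content, p.Took)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```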
## Usage Examples

### One-Time Scraping

Extract the story title links from the Hacker News front page:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "selector": ".titleline > a"
  }'
```
### Recurring Jobs

Monitor a price every 20 minutes:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.com/product",
    "selector": ".price",
    "frequency_minutes": 20,
    "webhook_url": "https://my-api.com/price-alert"
  }'
```
### Extract Images

Get all image sources:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/gallery",
    "selector": "img"
  }'
```
The worker extracts both text content and `src`/`href` attributes automatically.
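To illustrate, attribute extraction with goquery looks roughly like this; a standalone sketch, not the worker's exact code:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com/gallery") // hypothetical page
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// For image matches, pull the src attribute.
	doc.Find("img").Each(func(i int, s *goquery.Selection) {
		if src, ok := s.Attr("src"); ok {
			fmt.Println(src)
		}
	})
	// For link matches, pull both the text and the href attribute.
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			fmt.Printf("%s -> %s\n", s.Text(), href)
		}
	})
}
```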
### Extract Links

Get all download links:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://releases.example.com",
    "selector": "a.download-link"
  }'
```
## Architecture

```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```
### Job Lifecycle

- `pending` - Newly created, waiting for the worker
- `active` - Recurring job, scheduled for its next run
- `done` - One-time job completed
### Worker Behavior

- Polls the database every 5 seconds for jobs where `next_run_at <= now`
- Fetches the URL with a 30s timeout
- Checks robots.txt before each request
- Saves the complete HTML to the `raw_html` column
- Extracts content using the goquery CSS selector
- POSTs the webhook if configured
- For recurring jobs: updates `next_run_at` and sets status to `active`
- For one-time jobs: marks status as `done`
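Put together, the poll cycle looks roughly like the sketch below. Table and column names follow the architecture notes above; treat this as an illustration, not the actual worker.go:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

type dueJob struct {
	id        int64
	url       string
	selector  string
	frequency int
}

// pollLoop is a simplified version of the cycle described above. The real
// worker also archives raw HTML, checks robots.txt, and fires webhooks.
func pollLoop(db *sql.DB) {
	for {
		rows, err := db.Query(
			`SELECT id, url, selector, frequency_minutes
			   FROM jobs
			  WHERE status IN ('pending', 'active') AND next_run_at <= ?`,
			time.Now())
		if err != nil {
			log.Println("poll:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		var due []dueJob
		for rows.Next() {
			var j dueJob
			if err := rows.Scan(&j.id, &j.url, &j.selector, &j.frequency); err == nil {
				due = append(due, j)
			}
		}
		rows.Close()

		for _, j := range due {
			// ... fetch j.url with a 30s timeout, extract with j.selector,
			// save raw HTML, POST the webhook ...
			if j.frequency > 0 {
				// Recurring: schedule the next run and keep the job active.
				db.Exec(`UPDATE jobs SET status = 'active', next_run_at = ? WHERE id = ?`,
					time.Now().Add(time.Duration(j.frequency)*time.Minute), j.id)
			} else {
				// One-time: mark the job done.
				db.Exec(`UPDATE jobs SET status = 'done' WHERE id = ?`, j.id)
			}
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	db, err := sql.Open("sqlite3", "./alpenqueue.db")
	if err != nil {
		log.Fatal(err)
	}
	pollLoop(db)
}
```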
## Development

### Project Structure

```
AlpenQueue/
├── cmd/
│   └── alpenqueue/
│       └── main.go        # HTTP server and routes
├── pkg/
│   ├── db/
│   │   └── db.go          # SQLite schema and queries
│   └── worker/
│       └── worker.go      # Job processing and scraping
├── go.mod
└── README.md
```
### Building from Source

```bash
go build -o alpenqueue ./cmd/alpenqueue
```
### Dependencies

- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing
## Configuration

### Database Location

By default, the database is created at `./alpenqueue.db` in the working directory. Modify in `main.go`:

```go
database, err := db.Init("./alpenqueue.db")
```
### Server Port

The default port is `:8080`. Change it in `main.go`:

```go
http.ListenAndServe(":8080", nil)
```
### Worker Poll Interval

The worker polls every 5 seconds. Modify in `worker.go`:

```go
time.Sleep(5 * time.Second)
```
### User-Agent

Set in `worker.go`:

```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```
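All four settings are compile-time values. If you want runtime configuration instead, one option is to read them from environment variables; the variable names and `envOr` helper below are hypothetical, not part of the codebase:

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// envOr is a hypothetical helper: return the env var if set, else a default.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	dbPath := envOr("ALPENQUEUE_DB", "./alpenqueue.db")
	addr := envOr("ALPENQUEUE_ADDR", ":8080")
	log.Printf("using db=%s addr=%s", dbPath, addr)
	// ... db.Init(dbPath), start the worker, register routes ...
	log.Fatal(http.ListenAndServe(addr, nil))
}
```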
## Ethical Web Scraping

AlpenQueue follows ethical scraping practices:

- **robots.txt compliance** - Checks and respects disallow rules
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - The configurable poll interval prevents hammering servers
- **Timeout enforcement** - The 30-second request timeout prevents hanging
Always ensure you have permission to scrape websites and comply with their terms of service.
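For reference, the robots.txt check with the `github.com/temoto/robotstxt` dependency looks roughly like this (a standalone sketch, not the worker's exact code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/temoto/robotstxt"
)

const userAgent = "AlpenQueue/1.0"

func main() {
	resp, err := http.Get("https://example.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		log.Fatal(err)
	}
	// TestAgent reports whether this path may be fetched by our agent.
	fmt.Println(robots.TestAgent("/gallery", userAgent))
}
```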
## License

MIT License - see the LICENSE file for details.
## Contributing

Contributions welcome! Please open an issue or pull request.