# AlpenQueue: A lightweight, self-hosted task queue in Go

A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.

## Features
- **Flexible CSS Selector Extraction** - Extracts any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
- **Ethical Web Crawling** - Respects robots.txt and sends a proper User-Agent header
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POSTs extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Runs jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - A single endpoint to create and manage jobs

## Quick Start
### Prerequisites

- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)

### Installation

```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```

### Running the Server

```bash
./alpenqueue
# Server starts on :8080
```
Or run directly with Go:

```bash
go run ./cmd/alpenqueue
```
### Your First Job

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "selector": "title"
  }'
```

The worker will scrape the page, extract the title, and store the result in SQLite.
## API Documentation

### POST /jobs

Create a new scraping job.

**Request Body:**

```json
{
  "url": "https://example.com",
  "selector": "title",
  "webhook_url": "https://my-site.com/webhook",
  "frequency_minutes": 20
}
```

**Fields:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: `title`) |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |

**Response:**

```
HTTP/1.1 201 Created

Job 1 created
```
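
Under the hood, the handler for this endpoint boils down to decode, validate, insert, respond. Here is a minimal sketch of that flow, using hypothetical names (`jobRequest`, `insertJob`, `handleCreateJob`) rather than AlpenQueue's actual identifiers:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// jobRequest mirrors the documented request body.
type jobRequest struct {
	URL              string `json:"url"`
	Selector         string `json:"selector"`
	WebhookURL       string `json:"webhook_url"`
	FrequencyMinutes int    `json:"frequency_minutes"`
}

// insertJob is a stand-in for the real SQLite insert; it returns a
// fake ID so the sketch compiles on its own.
func insertJob(req jobRequest) int64 { return 1 }

func handleCreateJob(w http.ResponseWriter, r *http.Request) {
	var req jobRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.URL == "" {
		http.Error(w, "invalid request body", http.StatusBadRequest)
		return
	}
	if req.Selector == "" {
		req.Selector = "title" // documented default
	}
	id := insertJob(req)
	w.WriteHeader(http.StatusCreated)
	fmt.Fprintf(w, "Job %d created", id)
}

func main() {
	http.HandleFunc("/jobs", handleCreateJob)
	http.ListenAndServe(":8080", nil)
}
```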
### Webhook Payload

When a job completes, AlpenQueue POSTs to your `webhook_url`:

```json
{
  "status": "ok",
  "took": "1.2s",
  "url": "https://example.com",
  "content": "Example Domain"
}
```

**Status values:**

- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
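
To consume these callbacks, any HTTP endpoint that accepts a small JSON POST will do. A minimal receiver sketch, assuming only the payload fields shown above:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// webhookPayload matches the documented callback body.
type webhookPayload struct {
	Status  string `json:"status"`
	Took    string `json:"took"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job for %s finished: status=%s took=%s content=%q",
			p.URL, p.Status, p.Took, p.Content)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```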
## Usage Examples

### One-Time Scraping

Extract the story title links from the Hacker News front page:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "selector": ".titleline > a"
  }'
```
### Recurring Jobs

Monitor a price every 20 minutes:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.com/product",
    "selector": ".price",
    "frequency_minutes": 20,
    "webhook_url": "https://my-api.com/price-alert"
  }'
```
### Extract Images

Get all image sources:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/gallery",
    "selector": "img"
  }'
```

The worker extracts both text content and `src`/`href` attributes automatically.
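
For a sense of how that works, here is an illustrative extraction pass with goquery that collects element text plus `src`/`href` attributes. This is a sketch, not the worker's actual code:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// extract runs a CSS selector over raw HTML and returns text content
// plus any src/href attribute values it finds.
func extract(rawHTML, selector string) ([]string, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(rawHTML))
	if err != nil {
		return nil, err
	}
	var out []string
	doc.Find(selector).Each(func(_ int, s *goquery.Selection) {
		if text := strings.TrimSpace(s.Text()); text != "" {
			out = append(out, text)
		}
		// Also capture src/href so img and a selectors stay useful.
		for _, attr := range []string{"src", "href"} {
			if v, ok := s.Attr(attr); ok {
				out = append(out, v)
			}
		}
	})
	return out, nil
}

func main() {
	html := `<a class="download-link" href="/v1.0.zip">v1.0</a>`
	vals, _ := extract(html, "a.download-link")
	fmt.Println(vals) // [v1.0 /v1.0.zip]
}
```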
### Extract Links

Get all download links:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://releases.example.com",
    "selector": "a.download-link"
  }'
```
## Architecture

```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```
### Job Lifecycle

1. **pending** - Newly created, waiting for worker
2. **active** - Recurring job, scheduled for next run
3. **done** - One-time job completed
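
Concretely, this lifecycle lives in a single SQLite table. A sketch of what that schema implies, with column names taken from the architecture diagram above rather than from `db.go` itself:

```go
package db

// schemaSketch is inferred from this README's diagram and field list —
// consult pkg/db/db.go for the authoritative schema.
const schemaSketch = `
CREATE TABLE IF NOT EXISTS jobs (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    url               TEXT NOT NULL,
    selector          TEXT NOT NULL DEFAULT 'title',
    webhook_url       TEXT,
    frequency_minutes INTEGER NOT NULL DEFAULT 0,
    status            TEXT NOT NULL DEFAULT 'pending', -- pending | active | done
    raw_html          TEXT,
    extracted_content TEXT,
    next_run_at       TIMESTAMP
);`
```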
### Worker Behavior

- Polls the database every 5 seconds for jobs where `next_run_at <= now` (the loop is sketched below)
- Checks robots.txt before each request
- Fetches the URL with a 30s timeout
- Saves the complete HTML to the `raw_html` column
- Extracts content using the goquery CSS selector
- POSTs the webhook if configured
- For recurring jobs: updates `next_run_at` and sets status to `active`
- For one-time jobs: marks status as `done`
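
The same behavior as a condensed Go sketch — the helper names are invented for illustration, and the real logic lives in `pkg/worker/worker.go`:

```go
package worker

import (
	"database/sql"
	"time"
)

type job struct {
	ID               int64
	URL, Selector    string
	WebhookURL       string
	FrequencyMinutes int
}

// Stubs so the sketch compiles; the real versions hit SQLite and the network.
func dueJobs(db *sql.DB) []job                      { return nil } // next_run_at <= now
func scrape(j job) (status, content string)         { return "ok", "" }
func postWebhook(url, status, content string)       {}
func reschedule(db *sql.DB, j job, d time.Duration) {} // sets next_run_at, status 'active'
func markDone(db *sql.DB, j job)                    {} // sets status 'done'

// Run is the poll loop: claim due jobs, scrape, notify, reschedule.
func Run(db *sql.DB) {
	for {
		for _, j := range dueJobs(db) {
			status, content := scrape(j) // robots.txt check, 30s fetch, goquery extract
			if j.WebhookURL != "" {
				postWebhook(j.WebhookURL, status, content)
			}
			if j.FrequencyMinutes > 0 {
				reschedule(db, j, time.Duration(j.FrequencyMinutes)*time.Minute)
			} else {
				markDone(db, j)
			}
		}
		time.Sleep(5 * time.Second) // worker poll interval
	}
}
```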
## Development

### Project Structure

```
AlpenQueue/
├── cmd/
│   └── alpenqueue/
│       └── main.go        # HTTP server and routes
├── pkg/
│   ├── db/
│   │   └── db.go          # SQLite schema and queries
│   └── worker/
│       └── worker.go      # Job processing and scraping
├── go.mod
└── README.md
```
### Building from Source

```bash
go build -o alpenqueue ./cmd/alpenqueue
```

### Dependencies

- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing
## Configuration

### Database Location

By default, the database is created at `./alpenqueue.db` in the working directory. Modify it in `main.go`:

```go
database, err := db.Init("./alpenqueue.db")
```

### Server Port

The default port is `:8080`. Change it in `main.go`:

```go
http.ListenAndServe(":8080", nil)
```

### Worker Poll Interval

The worker polls every 5 seconds. Modify the interval in `worker.go`:

```go
time.Sleep(5 * time.Second)
```

### User-Agent

Set in `worker.go`:

```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```
## Ethical Web Scraping

AlpenQueue follows ethical scraping practices:

- **robots.txt compliance** - Checks and respects disallow rules (see the sketch below)
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - The configurable poll interval keeps the worker from hammering servers
- **Timeout enforcement** - A 30-second request timeout prevents hanging

Always ensure you have permission to scrape websites and comply with their terms of service.
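
As a reference point, a robots.txt gate built on `github.com/temoto/robotstxt` (already a dependency) can be this small. A sketch under the documented User-Agent, not the worker's exact code:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"github.com/temoto/robotstxt"
)

const userAgent = "AlpenQueue/1.0"

// allowed reports whether target may be fetched under its host's robots.txt.
func allowed(target string) (bool, error) {
	u, err := url.Parse(target)
	if err != nil {
		return false, err
	}
	resp, err := http.Get(u.Scheme + "://" + u.Host + "/robots.txt")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false, err
	}
	return robots.FindGroup(userAgent).Test(u.Path), nil
}

func main() {
	ok, err := allowed("https://example.com/")
	fmt.Println(ok, err)
}
```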
## License

MIT License - See LICENSE file for details

## Contributing

Contributions welcome! Please open an issue or pull request.