# AlpenQueue: A lightweight, self-hosted task queue in Go
A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.
## Features

- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
- **Ethical Web Crawling** - Respects robots.txt and includes proper User-Agent headers
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POST extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Run jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - Single endpoint to create and manage jobs
## Quick Start

### Prerequisites
- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)
### Installation

```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```
### Running the Server

```bash
./alpenqueue
# Server starts on :8080
```

Or run directly with Go:

```bash
cd cmd/alpenqueue
go run main.go
```
### Your First Job

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "selector": "title"
  }'
```
The worker will scrape the page, extract the title, and store the result in SQLite.
## API Documentation

### POST /jobs
Create a new scraping job.
**Request Body:**

```json
{
  "url": "https://example.com",
  "selector": "title",
  "webhook_url": "https://my-site.com/webhook",
  "frequency_minutes": 20
}
```
**Fields:**

| Field | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: `"title"`) |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |
**Response:**

```
HTTP/1.1 201 Created

Job 1 created
```
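For reference, the same request can be made from Go. This is a minimal client sketch against the documented endpoint, not code from the project:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// JobRequest mirrors the documented POST /jobs body.
type JobRequest struct {
	URL              string `json:"url"`
	Selector         string `json:"selector,omitempty"`
	WebhookURL       string `json:"webhook_url,omitempty"`
	FrequencyMinutes int    `json:"frequency_minutes,omitempty"`
}

func main() {
	body, err := json.Marshal(JobRequest{
		URL:      "https://example.com",
		Selector: "title",
	})
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:8080/jobs",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect "201 Created"
}
```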
### Webhook Payload

When a job completes, AlpenQueue POSTs to your `webhook_url`:
```json
{
  "status": "ok",
  "took": "1.2s",
  "url": "https://example.com",
  "content": "Example Domain"
}
```
**Status values:**

- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
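A receiving service just needs to accept this JSON via POST. A minimal sketch of a webhook handler (your own endpoint, not part of AlpenQueue):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Payload mirrors the documented webhook body.
type Payload struct {
	Status  string `json:"status"`
	Took    string `json:"took"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p Payload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job for %s: status=%s content=%q (took %s)",
			p.URL, p.Status, p.Content, p.Took)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```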
## Usage Examples

### One-Time Scraping

Extract the story title links from the Hacker News front page:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "selector": ".titleline > a"
  }'
```
### Recurring Jobs

Monitor a price every 20 minutes:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.com/product",
    "selector": ".price",
    "frequency_minutes": 20,
    "webhook_url": "https://my-api.com/price-alert"
  }'
```
### Extract Images

Get all image sources:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/gallery",
    "selector": "img"
  }'
```
The worker extracts both text content and `src`/`href` attributes automatically.
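To illustrate, attribute extraction with goquery looks roughly like this; a standalone sketch, not the worker's exact code:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com/gallery") // hypothetical page
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// For image matches, pull the src attribute.
	doc.Find("img").Each(func(i int, s *goquery.Selection) {
		if src, ok := s.Attr("src"); ok {
			fmt.Println(src)
		}
	})
	// For link matches, pull both the text and the href attribute.
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			fmt.Printf("%s -> %s\n", s.Text(), href)
		}
	})
}
```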
### Extract Links

Get all download links:
```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://releases.example.com",
    "selector": "a.download-link"
  }'
```
## Architecture

```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```
### Job Lifecycle

- `pending` - Newly created, waiting for the worker
- `active` - Recurring job, scheduled for its next run
- `done` - One-time job completed
### Worker Behavior

- Polls the database every 5 seconds for jobs where `next_run_at <= now`
- Fetches the URL with a 30s timeout
- Checks robots.txt before each request
- Saves the complete HTML to the `raw_html` column
- Extracts content using the goquery CSS selector
- POSTs the webhook if configured
- For recurring jobs: updates `next_run_at` and sets status to `active`
- For one-time jobs: marks status as `done`
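Put together, the poll cycle looks roughly like the sketch below. Table and column names follow the architecture notes above; treat this as an illustration, not the actual worker.go:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

type dueJob struct {
	id        int64
	url       string
	selector  string
	frequency int
}

// pollLoop is a simplified version of the cycle described above. The real
// worker also archives raw HTML, checks robots.txt, and fires webhooks.
func pollLoop(db *sql.DB) {
	for {
		rows, err := db.Query(
			`SELECT id, url, selector, frequency_minutes
			   FROM jobs
			  WHERE status IN ('pending', 'active') AND next_run_at <= ?`,
			time.Now())
		if err != nil {
			log.Println("poll:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		var due []dueJob
		for rows.Next() {
			var j dueJob
			if err := rows.Scan(&j.id, &j.url, &j.selector, &j.frequency); err == nil {
				due = append(due, j)
			}
		}
		rows.Close()

		for _, j := range due {
			// ... fetch j.url with a 30s timeout, extract with j.selector,
			// save raw HTML, POST the webhook ...
			if j.frequency > 0 {
				// Recurring: schedule the next run and keep the job active.
				db.Exec(`UPDATE jobs SET status = 'active', next_run_at = ? WHERE id = ?`,
					time.Now().Add(time.Duration(j.frequency)*time.Minute), j.id)
			} else {
				// One-time: mark the job done.
				db.Exec(`UPDATE jobs SET status = 'done' WHERE id = ?`, j.id)
			}
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	db, err := sql.Open("sqlite3", "./alpenqueue.db")
	if err != nil {
		log.Fatal(err)
	}
	pollLoop(db)
}
```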
## Development

### Project Structure

```
AlpenQueue/
├── cmd/
│   └── alpenqueue/
│       └── main.go        # HTTP server and routes
├── pkg/
│   ├── db/
│   │   └── db.go          # SQLite schema and queries
│   └── worker/
│       └── worker.go      # Job processing and scraping
├── go.mod
└── README.md
```
### Building from Source

```bash
go build -o alpenqueue ./cmd/alpenqueue
```
### Dependencies

- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing
## Configuration

### Database Location

By default, the database is created at `./alpenqueue.db` in the working directory. Modify in `main.go`:

```go
database, err := db.Init("./alpenqueue.db")
```
### Server Port

The default port is `:8080`. Change it in `main.go`:

```go
http.ListenAndServe(":8080", nil)
```
### Worker Poll Interval

The worker polls every 5 seconds. Modify in `worker.go`:

```go
time.Sleep(5 * time.Second)
```
### User-Agent

Set in `worker.go`:

```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```
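All four settings are compile-time values. If you want runtime configuration instead, one option is to read them from environment variables; the variable names and `envOr` helper below are hypothetical, not part of the codebase:

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// envOr is a hypothetical helper: return the env var if set, else a default.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	dbPath := envOr("ALPENQUEUE_DB", "./alpenqueue.db")
	addr := envOr("ALPENQUEUE_ADDR", ":8080")
	log.Printf("using db=%s addr=%s", dbPath, addr)
	// ... db.Init(dbPath), start the worker, register routes ...
	log.Fatal(http.ListenAndServe(addr, nil))
}
```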
## Ethical Web Scraping

AlpenQueue follows ethical scraping practices:

- **robots.txt compliance** - Checks and respects disallow rules
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - The configurable poll interval prevents hammering servers
- **Timeout enforcement** - The 30-second request timeout prevents hanging
Always ensure you have permission to scrape websites and comply with their terms of service.
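For reference, the robots.txt check with the `github.com/temoto/robotstxt` dependency looks roughly like this (a standalone sketch, not the worker's exact code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/temoto/robotstxt"
)

const userAgent = "AlpenQueue/1.0"

func main() {
	resp, err := http.Get("https://example.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		log.Fatal(err)
	}
	// TestAgent reports whether this path may be fetched by our agent.
	fmt.Println(robots.TestAgent("/gallery", userAgent))
}
```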
## License

MIT License - see the LICENSE file for details.
## Contributing

Contributions welcome! Please open an issue or pull request.