Add comprehensive README documentation

Complete documentation including features, quick start guide, API reference, usage examples, architecture diagrams, and configuration options. Fixed typo in title (lightweigt → lightweight).
This commit is contained in:
Soldier 2025-11-16 09:22:42 +00:00
parent c395a57b38
commit e4ebe2da6b

README.md
# AlpenQueue: A lightweight, self-hosted task queue in Go
A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.
## Features
- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (title, .class, img[src], etc.)
- **Ethical Web Crawling** - Respects robots.txt and includes proper User-Agent headers
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POST extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Run jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - Single endpoint to create and manage jobs
## Quick Start
### Prerequisites
- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)
### Installation
```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```
### Running the Server
```bash
./alpenqueue
# Server starts on :8080
```
Or run directly with Go:
```bash
go run ./cmd/alpenqueue
```
### Your First Job
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"selector": "title"
}'
```
The worker will scrape the page, extract the title, and store the result in SQLite.
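To inspect the stored result, you can query the SQLite database directly. Below is a minimal sketch in Go; the table and column names follow the architecture section further down and are assumptions about the exact schema, which lives in `pkg/db/db.go`:
```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver AlpenQueue already depends on
)

func main() {
	// Open the same database file the server writes to.
	db, err := sql.Open("sqlite3", "./alpenqueue.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Column names mirror the architecture section; adjust if the real schema differs.
	rows, err := db.Query(`SELECT id, url, COALESCE(extracted_content, ''), status FROM jobs`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var url, content, status string
		if err := rows.Scan(&id, &url, &content, &status); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("#%d %s [%s]: %s\n", id, url, status, content)
	}
}
```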
## API Documentation
### POST /jobs
Create a new scraping job.
**Request Body:**
```json
{
"url": "https://example.com",
"selector": "title",
"webhook_url": "https://my-site.com/webhook",
"frequency_minutes": 20
}
```
**Fields:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: "title") |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |
**Response:**
```
HTTP/1.1 201 Created
Job 1 created
```
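If you prefer calling the API from Go instead of curl, a minimal client sketch (the struct fields mirror the request body documented above):
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// JobRequest mirrors the POST /jobs body documented above.
type JobRequest struct {
	URL              string `json:"url"`
	Selector         string `json:"selector,omitempty"`
	WebhookURL       string `json:"webhook_url,omitempty"`
	FrequencyMinutes int    `json:"frequency_minutes,omitempty"`
}

func main() {
	body, err := json.Marshal(JobRequest{
		URL:              "https://example.com",
		Selector:         "title",
		FrequencyMinutes: 20,
	})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:8080/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	fmt.Println("status:", resp.Status) // expect 201 Created
}
```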
### Webhook Payload
When a job completes, AlpenQueue POSTs to your webhook_url:
```json
{
"status": "ok",
"took": "1.2s",
"url": "https://example.com",
"content": "Example Domain"
}
```
**Status values:**
- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
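A minimal webhook receiver in Go that decodes this payload might look like the sketch below; the handler path and port are illustrative:
```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// WebhookPayload mirrors the JSON AlpenQueue POSTs on completion.
type WebhookPayload struct {
	Status  string `json:"status"` // "ok", "blocked", or "error"
	Took    string `json:"took"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p WebhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job for %s finished: status=%s took=%s content=%q", p.URL, p.Status, p.Took, p.Content)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```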
## Usage Examples
### One-Time Scraping
Extract story title links from the Hacker News front page:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"selector": ".titleline > a"
}'
```
### Recurring Jobs
Monitor a price every 20 minutes:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://store.com/product",
"selector": ".price",
"frequency_minutes": 20,
"webhook_url": "https://my-api.com/price-alert"
}'
```
### Extract Images
Get all image sources:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/gallery",
"selector": "img"
}'
```
The worker extracts both text content and `src`/`href` attributes automatically.
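As an illustration, selector-based extraction with goquery typically looks like the sketch below; this is a standalone example, not necessarily the exact code in `pkg/worker`:
```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<div class="gallery"><img src="/a.png" alt="A"><a href="/file.zip">Download</a></div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("img, a").Each(func(_ int, s *goquery.Selection) {
		// Text content first, then fall back to src/href attributes.
		if text := strings.TrimSpace(s.Text()); text != "" {
			fmt.Println("text:", text)
		}
		if src, ok := s.Attr("src"); ok {
			fmt.Println("src:", src)
		}
		if href, ok := s.Attr("href"); ok {
			fmt.Println("href:", href)
		}
	})
}
```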
### Extract Links
Get all download links:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://releases.example.com",
"selector": "a.download-link"
}'
```
## Architecture
```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```
### Job Lifecycle
1. **pending** - Newly created, waiting for worker
2. **active** - Recurring job, scheduled for next run
3. **done** - One-time job completed
### Worker Behavior
- Polls database every 5 seconds for jobs where `next_run_at <= now`
- Fetches URL with 30s timeout
- Checks robots.txt before each request
- Saves complete HTML to `raw_html` column
- Extracts content using goquery CSS selector
- POSTs webhook if configured
- For recurring jobs: updates `next_run_at` and status to `active`
- For one-time jobs: marks status as `done`
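Put together, the scheduling rules look roughly like the sketch below. This is a simplified in-memory illustration of the lifecycle, not the actual `pkg/worker` code, which reads jobs from SQLite and performs the robots.txt check, HTTP fetch, extraction, and webhook POST:
```go
package main

import (
	"fmt"
	"time"
)

// Job is a simplified, in-memory stand-in for a row in the jobs table.
type Job struct {
	ID               int
	URL              string
	FrequencyMinutes int
	Status           string // pending, active, done
	NextRunAt        time.Time
}

func main() {
	jobs := []*Job{
		{ID: 1, URL: "https://example.com", Status: "pending", NextRunAt: time.Now()},
		{ID: 2, URL: "https://store.com/product", FrequencyMinutes: 20, Status: "pending", NextRunAt: time.Now()},
	}

	for i := 0; i < 3; i++ { // a few poll cycles for demonstration
		now := time.Now()
		for _, job := range jobs {
			if job.Status == "done" || job.NextRunAt.After(now) {
				continue // not due yet
			}
			// The real worker fetches the URL, extracts content, and POSTs the webhook here.
			fmt.Printf("processing job %d (%s)\n", job.ID, job.URL)

			if job.FrequencyMinutes > 0 {
				job.Status = "active" // recurring: reschedule
				job.NextRunAt = now.Add(time.Duration(job.FrequencyMinutes) * time.Minute)
			} else {
				job.Status = "done" // one-time: finished
			}
		}
		time.Sleep(5 * time.Second) // poll interval
	}
}
```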
## Development
### Project Structure
```
AlpenQueue/
├── cmd/
│ └── alpenqueue/
│ └── main.go # HTTP server and routes
├── pkg/
│ ├── db/
│ │ └── db.go # SQLite schema and queries
│ └── worker/
│ └── worker.go # Job processing and scraping
├── go.mod
└── README.md
```
### Building from Source
```bash
go build -o alpenqueue ./cmd/alpenqueue
```
### Dependencies
- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing
## Configuration
### Database Location
By default, the database is created at `./alpenqueue.db` in the working directory. Modify in `main.go`:
```go
database, err := db.Init("./alpenqueue.db")
```
### Server Port
Default port is `:8080`. Change in `main.go`:
```go
http.ListenAndServe(":8080", nil)
```
### Worker Poll Interval
Worker polls every 5 seconds. Modify in `worker.go`:
```go
time.Sleep(5 * time.Second)
```
### User-Agent
Set in `worker.go`:
```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```
## Ethical Web Scraping
AlpenQueue follows ethical scraping practices:
- **robots.txt compliance** - Checks and respects disallow rules
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - Configurable poll interval prevents hammering servers
- **Timeout enforcement** - 30-second request timeout prevents hanging
Always ensure you have permission to scrape websites and comply with their terms of service.
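For reference, a robots.txt check with `github.com/temoto/robotstxt` typically looks like the sketch below; the actual logic lives in `pkg/worker` and may differ in detail:
```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/temoto/robotstxt"
)

const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"

// allowedByRobots reports whether the target URL may be fetched by our User-Agent.
func allowedByRobots(target string) (bool, error) {
	u, err := url.Parse(target)
	if err != nil {
		return false, err
	}

	resp, err := http.Get(u.Scheme + "://" + u.Host + "/robots.txt")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	// FromResponse handles a missing robots.txt (a 404 means everything is allowed).
	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false, err
	}
	return robots.TestAgent(u.Path, userAgent), nil
}

func main() {
	ok, err := allowedByRobots("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allowed:", ok)
}
```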
## License
MIT License - See LICENSE file for details
## Contributing
Contributions welcome! Please open an issue or pull request.