From e4ebe2da6b83c3bdec74be1e6e4395c13e60843b Mon Sep 17 00:00:00 2001
From: Soldier
Date: Sun, 16 Nov 2025 09:22:42 +0000
Subject: [PATCH] Add comprehensive README documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Complete documentation including features, quick start guide, API
reference, usage examples, architecture diagrams, and configuration
options. Fixed typo in title (lightweigt → lightweight).
---
 README.md | 292 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 291 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9aea7f5..14dbdef 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,291 @@
-# AlpenQueue: A lightweigt, self-hosted task queue in Go
+# AlpenQueue: A lightweight, self-hosted task queue in Go
+
+A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.
+
+## Features
+
+- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
+- **Ethical Web Crawling** - Respects robots.txt and sends a proper User-Agent header
+- **Raw HTML Archiving** - Stores the complete HTML for historical re-parsing when sites change
+- **Webhook Callbacks** - POSTs extracted content to your endpoint when jobs complete
+- **Recurring Job Scheduling** - Runs jobs at specified intervals (every N minutes)
+- **SQLite Persistence** - All jobs are stored in SQLite for reliability and historical access
+- **Simple HTTP API** - A single endpoint to create and manage jobs
+
+## Quick Start
+
+### Prerequisites
+
+- Go 1.25 or later
+- SQLite (bundled with the go-sqlite3 driver, so no separate install is needed)
+
+### Installation
+
+```bash
+git clone https://github.com/yourusername/alpenqueue.git
+cd alpenqueue
+go build -o alpenqueue ./cmd/alpenqueue
+```
+
+### Running the Server
+
+```bash
+./alpenqueue
+# Server starts on :8080
+```
+
+Or run directly with Go:
+
+```bash
+go run ./cmd/alpenqueue
+```
+
+### Your First Job
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "selector": "title"
+  }'
+```
+
+The worker will scrape the page, extract the title, and store the result in SQLite.
+
+## API Documentation
+
+### POST /jobs
+
+Create a new scraping job.
+
+**Request Body:**
+
+```json
+{
+  "url": "https://example.com",
+  "selector": "title",
+  "webhook_url": "https://my-site.com/webhook",
+  "frequency_minutes": 20
+}
+```
+
+**Fields:**
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `url` | string | Yes | URL to scrape |
+| `selector` | string | No | CSS selector (default: "title") |
+| `webhook_url` | string | No | URL to POST results to |
+| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |
+
+**Response:**
+
+```
+HTTP/1.1 201 Created
+Job 1 created
+```
+
+### Webhook Payload
+
+When a job completes, AlpenQueue POSTs to your `webhook_url`:
+
+```json
+{
+  "status": "ok",
+  "took": "1.2s",
+  "url": "https://example.com",
+  "content": "Example Domain"
+}
+```
+
+**Status values:**
+
+- `ok` - Successfully scraped
+- `blocked` - Blocked by robots.txt
+- `error` - HTTP or parsing error
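+
+If you need a quick endpoint to receive these callbacks, a minimal receiver might look like the sketch below. It is illustrative only and not part of AlpenQueue: the `:9090` port, the `/webhook` route, and the `payload` struct are placeholders whose fields simply mirror the payload documented above.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"log"
+	"net/http"
+)
+
+// payload mirrors the webhook body documented above (hypothetical name).
+type payload struct {
+	Status  string `json:"status"`
+	Took    string `json:"took"`
+	URL     string `json:"url"`
+	Content string `json:"content"`
+}
+
+func main() {
+	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
+		var p payload
+		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
+			http.Error(w, "bad payload", http.StatusBadRequest)
+			return
+		}
+		log.Printf("job for %s finished with status %q: %s", p.URL, p.Status, p.Content)
+		w.WriteHeader(http.StatusNoContent)
+	})
+	log.Fatal(http.ListenAndServe(":9090", nil))
+}
+```
+
+Run it, then create a job with `"webhook_url": "http://your-host:9090/webhook"` to see results arrive.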
+
+## Usage Examples
+
+### One-Time Scraping
+
+Extract the story title links from the Hacker News front page:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://news.ycombinator.com",
+    "selector": ".titleline > a"
+  }'
+```
+
+### Recurring Jobs
+
+Monitor a price every 20 minutes:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://store.com/product",
+    "selector": ".price",
+    "frequency_minutes": 20,
+    "webhook_url": "https://my-api.com/price-alert"
+  }'
+```
+
+### Extract Images
+
+Get all image sources:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/gallery",
+    "selector": "img"
+  }'
+```
+
+The worker extracts both text content and `src`/`href` attributes automatically.
+
+### Extract Links
+
+Get all download links:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://releases.example.com",
+    "selector": "a.download-link"
+  }'
+```
+
+## Architecture
+
+```
+┌─────────────┐
+│  HTTP API   │  POST /jobs
+│    :8080    │
+└──────┬──────┘
+       │
+       ▼
+┌─────────────┐  jobs table (pending/active/done)
+│   SQLite    │  - url, selector, frequency
+│  Database   │  - raw_html, extracted_content
+└──────┬──────┘  - next_run_at
+       │
+       ▼
+┌─────────────┐  Polls every 5s
+│   Worker    │  - Fetches pending jobs
+│  Goroutine  │  - Respects robots.txt
+└──────┬──────┘  - Extracts content with goquery
+       │         - POSTs webhook
+       ▼         - Updates next_run_at or marks done
+┌─────────────┐
+│  Webhooks   │
+└─────────────┘
+```
+
+### Job Lifecycle
+
+1. **pending** - Newly created, waiting for the worker
+2. **active** - Recurring job, scheduled for its next run
+3. **done** - One-time job completed
+
+### Worker Behavior
+
+- Polls the database every 5 seconds for jobs where `next_run_at <= now`
+- Fetches the URL with a 30s timeout
+- Checks robots.txt before each request
+- Saves the complete HTML to the `raw_html` column
+- Extracts content using the goquery CSS selector
+- POSTs the webhook if configured
+- For recurring jobs: updates `next_run_at` and sets status to `active`
+- For one-time jobs: marks status as `done`
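+
+The diagram and lifecycle above imply a `jobs` table along these lines. This is a hypothetical sketch for orientation only; the actual schema lives in `pkg/db/db.go` and may differ:
+
+```go
+// Illustrative schema sketch: column names follow the architecture
+// diagram and API fields above, not the real db.go definition.
+const schema = `
+CREATE TABLE IF NOT EXISTS jobs (
+    id                INTEGER PRIMARY KEY AUTOINCREMENT,
+    url               TEXT NOT NULL,
+    selector          TEXT NOT NULL DEFAULT 'title',
+    webhook_url       TEXT,
+    frequency_minutes INTEGER NOT NULL DEFAULT 0,
+    status            TEXT NOT NULL DEFAULT 'pending', -- pending | active | done
+    raw_html          TEXT,
+    extracted_content TEXT,
+    next_run_at       TIMESTAMP
+);`
+```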
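+
+Condensed into code, a single scrape roughly follows the steps above. The sketch below is illustrative rather than a copy of `worker.go` (the `scrape` function and its error handling are simplified), but the robots.txt check, the 30s timeout, and the goquery text/attribute extraction match the documented behavior:
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"net/url"
+	"strings"
+	"time"
+
+	"github.com/PuerkitoBio/goquery"
+	"github.com/temoto/robotstxt"
+)
+
+const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
+
+// scrape is an illustrative stand-in for one worker pass over a job.
+func scrape(pageURL, selector string) (string, error) {
+	client := &http.Client{Timeout: 30 * time.Second}
+
+	u, err := url.Parse(pageURL)
+	if err != nil {
+		return "", err
+	}
+
+	// Check robots.txt before touching the page itself.
+	if resp, rerr := client.Get(u.Scheme + "://" + u.Host + "/robots.txt"); rerr == nil {
+		robots, perr := robotstxt.FromResponse(resp)
+		resp.Body.Close()
+		if perr == nil && !robots.FindGroup(userAgent).Test(u.Path) {
+			return "", fmt.Errorf("blocked by robots.txt")
+		}
+	}
+
+	// Fetch the page with an identifying User-Agent.
+	req, err := http.NewRequest(http.MethodGet, pageURL, nil)
+	if err != nil {
+		return "", err
+	}
+	req.Header.Set("User-Agent", userAgent)
+	resp, err := client.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+
+	// Extract text content plus src/href attributes for every match.
+	doc, err := goquery.NewDocumentFromReader(resp.Body)
+	if err != nil {
+		return "", err
+	}
+	var parts []string
+	doc.Find(selector).Each(func(_ int, s *goquery.Selection) {
+		if t := strings.TrimSpace(s.Text()); t != "" {
+			parts = append(parts, t)
+		}
+		if v, ok := s.Attr("src"); ok {
+			parts = append(parts, v)
+		}
+		if v, ok := s.Attr("href"); ok {
+			parts = append(parts, v)
+		}
+	})
+	return strings.Join(parts, "\n"), nil
+}
+
+func main() {
+	content, err := scrape("https://example.com", "title")
+	if err != nil {
+		fmt.Println("scrape failed:", err)
+		return
+	}
+	fmt.Println(content)
+}
+```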
+
+## Development
+
+### Project Structure
+
+```
+AlpenQueue/
+├── cmd/
+│   └── alpenqueue/
+│       └── main.go        # HTTP server and routes
+├── pkg/
+│   ├── db/
+│   │   └── db.go          # SQLite schema and queries
+│   └── worker/
+│       └── worker.go      # Job processing and scraping
+├── go.mod
+└── README.md
+```
+
+### Building from Source
+
+```bash
+go build -o alpenqueue ./cmd/alpenqueue
+```
+
+### Dependencies
+
+- `github.com/mattn/go-sqlite3` - SQLite driver
+- `github.com/PuerkitoBio/goquery` - CSS selector parsing
+- `github.com/temoto/robotstxt` - robots.txt parsing
+- `golang.org/x/net/html` - HTML parsing
+
+## Configuration
+
+### Database Location
+
+By default, the database is created at `./alpenqueue.db` in the working directory. Change the path in `main.go`:
+
+```go
+database, err := db.Init("./alpenqueue.db")
+```
+
+### Server Port
+
+The default port is `:8080`. Change it in `main.go`:
+
+```go
+http.ListenAndServe(":8080", nil)
+```
+
+### Worker Poll Interval
+
+The worker polls every 5 seconds. Adjust the interval in `worker.go`:
+
+```go
+time.Sleep(5 * time.Second)
+```
+
+### User-Agent
+
+Set in `worker.go`:
+
+```go
+const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
+```
+
+## Ethical Web Scraping
+
+AlpenQueue follows ethical scraping practices:
+
+- **robots.txt compliance** - Checks and respects disallow rules
+- **User-Agent identification** - Clearly identifies itself in every request
+- **Request pacing** - The configurable poll interval spaces out requests rather than hammering servers
+- **Timeout enforcement** - A 30-second request timeout prevents hung connections
+
+Always ensure you have permission to scrape websites and comply with their terms of service.
+
+## License
+
+MIT License - See LICENSE file for details.
+
+## Contributing
+
+Contributions welcome! Please open an issue or pull request.