From e4ebe2da6b83c3bdec74be1e6e4395c13e60843b Mon Sep 17 00:00:00 2001
From: Soldier
Date: Sun, 16 Nov 2025 09:22:42 +0000
Subject: [PATCH] Add comprehensive README documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Complete documentation including features, quick start guide, API
reference, usage examples, architecture diagrams, and configuration
options. Fixed typo in title (lightweigt → lightweight).
---
 README.md | 292 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 291 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9aea7f5..14dbdef 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,291 @@
-# AlpenQueue: A lightweigt, self-hosted task queue in Go
+# AlpenQueue: A lightweight, self-hosted task queue in Go
+
+A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.
+
+## Features
+
+- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
+- **Ethical Web Crawling** - Respects robots.txt and sends a proper User-Agent header
+- **Raw HTML Archiving** - Stores the complete HTML for historical re-parsing when sites change
+- **Webhook Callbacks** - POSTs extracted content to your endpoint when jobs complete
+- **Recurring Job Scheduling** - Runs jobs at specified intervals (every N minutes)
+- **SQLite Persistence** - All jobs are stored in SQLite for reliability and historical access
+- **Simple HTTP API** - A single endpoint to create and manage jobs
+
+## Quick Start
+
+### Prerequisites
+
+- Go 1.25 or later
+- SQLite (bundled with the go-sqlite3 driver, so no separate install is needed)
+
+### Installation
+
+```bash
+git clone https://github.com/yourusername/alpenqueue.git
+cd alpenqueue
+go build -o alpenqueue ./cmd/alpenqueue
+```
+
+### Running the Server
+
+```bash
+./alpenqueue
+# Server starts on :8080
+```
+
+Or run directly with Go:
+
+```bash
+go run ./cmd/alpenqueue
+```
+
+### Your First Job
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "selector": "title"
+  }'
+```
+
+The worker will scrape the page, extract the title, and store the result in SQLite.
+
+## API Documentation
+
+### POST /jobs
+
+Create a new scraping job.
+
+**Request Body:**
+
+```json
+{
+  "url": "https://example.com",
+  "selector": "title",
+  "webhook_url": "https://my-site.com/webhook",
+  "frequency_minutes": 20
+}
+```
+
+**Fields:**
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `url` | string | Yes | URL to scrape |
+| `selector` | string | No | CSS selector (default: "title") |
+| `webhook_url` | string | No | URL to POST results to |
+| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |
+
+**Response:**
+
+```
+HTTP/1.1 201 Created
+Job 1 created
+```
+
+### Webhook Payload
+
+When a job completes, AlpenQueue POSTs to your `webhook_url`:
+
+```json
+{
+  "status": "ok",
+  "took": "1.2s",
+  "url": "https://example.com",
+  "content": "Example Domain"
+}
+```
+
+**Status values:**
+
+- `ok` - Successfully scraped
+- `blocked` - Blocked by robots.txt
+- `error` - HTTP or parsing error
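+
+If you need a quick endpoint to receive these callbacks, a minimal receiver might look like the sketch below. It is illustrative only and not part of AlpenQueue: the `:9090` port, the `/webhook` route, and the `payload` struct are placeholders whose fields simply mirror the payload documented above.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"log"
+	"net/http"
+)
+
+// payload mirrors the webhook body documented above (hypothetical name).
+type payload struct {
+	Status  string `json:"status"`
+	Took    string `json:"took"`
+	URL     string `json:"url"`
+	Content string `json:"content"`
+}
+
+func main() {
+	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
+		var p payload
+		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
+			http.Error(w, "bad payload", http.StatusBadRequest)
+			return
+		}
+		log.Printf("job for %s finished with status %q: %s", p.URL, p.Status, p.Content)
+		w.WriteHeader(http.StatusNoContent)
+	})
+	log.Fatal(http.ListenAndServe(":9090", nil))
+}
+```
+
+Run it, then create a job with `"webhook_url": "http://your-host:9090/webhook"` to see results arrive.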
+
+## Usage Examples
+
+### One-Time Scraping
+
+Extract the story title links from the Hacker News front page:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://news.ycombinator.com",
+    "selector": ".titleline > a"
+  }'
+```
+
+### Recurring Jobs
+
+Monitor a price every 20 minutes:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://store.com/product",
+    "selector": ".price",
+    "frequency_minutes": 20,
+    "webhook_url": "https://my-api.com/price-alert"
+  }'
+```
+
+### Extract Images
+
+Get all image sources:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/gallery",
+    "selector": "img"
+  }'
+```
+
+The worker extracts both text content and `src`/`href` attributes automatically.
+
+### Extract Links
+
+Get all download links:
+
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://releases.example.com",
+    "selector": "a.download-link"
+  }'
+```
+
+## Architecture
+
+```
+┌─────────────┐
+│  HTTP API   │  POST /jobs
+│    :8080    │
+└──────┬──────┘
+       │
+       ▼
+┌─────────────┐  jobs table (pending/active/done)
+│   SQLite    │  - url, selector, frequency
+│  Database   │  - raw_html, extracted_content
+└──────┬──────┘  - next_run_at
+       │
+       ▼
+┌─────────────┐  Polls every 5s
+│   Worker    │  - Fetches pending jobs
+│  Goroutine  │  - Respects robots.txt
+└──────┬──────┘  - Extracts content with goquery
+       │         - POSTs webhook
+       ▼         - Updates next_run_at or marks done
+┌─────────────┐
+│  Webhooks   │
+└─────────────┘
+```
+
+### Job Lifecycle
+
+1. **pending** - Newly created, waiting for the worker
+2. **active** - Recurring job, scheduled for its next run
+3. **done** - One-time job completed
+
+### Worker Behavior
+
+- Polls the database every 5 seconds for jobs where `next_run_at <= now`
+- Fetches the URL with a 30s timeout
+- Checks robots.txt before each request
+- Saves the complete HTML to the `raw_html` column
+- Extracts content using the goquery CSS selector
+- POSTs the webhook if configured
+- For recurring jobs: updates `next_run_at` and sets status to `active`
+- For one-time jobs: marks status as `done`
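+
+The diagram and lifecycle above imply a `jobs` table along these lines. This is a hypothetical sketch for orientation only; the actual schema lives in `pkg/db/db.go` and may differ:
+
+```go
+// Illustrative schema sketch: column names follow the architecture
+// diagram and API fields above, not the real db.go definition.
+const schema = `
+CREATE TABLE IF NOT EXISTS jobs (
+    id                INTEGER PRIMARY KEY AUTOINCREMENT,
+    url               TEXT NOT NULL,
+    selector          TEXT NOT NULL DEFAULT 'title',
+    webhook_url       TEXT,
+    frequency_minutes INTEGER NOT NULL DEFAULT 0,
+    status            TEXT NOT NULL DEFAULT 'pending', -- pending | active | done
+    raw_html          TEXT,
+    extracted_content TEXT,
+    next_run_at       TIMESTAMP
+);`
+```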
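+
+Condensed into code, a single scrape roughly follows the steps above. The sketch below is illustrative rather than a copy of `worker.go` (the `scrape` function and its error handling are simplified), but the robots.txt check, the 30s timeout, and the goquery text/attribute extraction match the documented behavior:
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"net/url"
+	"strings"
+	"time"
+
+	"github.com/PuerkitoBio/goquery"
+	"github.com/temoto/robotstxt"
+)
+
+const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
+
+// scrape is an illustrative stand-in for one worker pass over a job.
+func scrape(pageURL, selector string) (string, error) {
+	client := &http.Client{Timeout: 30 * time.Second}
+
+	u, err := url.Parse(pageURL)
+	if err != nil {
+		return "", err
+	}
+
+	// Check robots.txt before touching the page itself.
+	if resp, rerr := client.Get(u.Scheme + "://" + u.Host + "/robots.txt"); rerr == nil {
+		robots, perr := robotstxt.FromResponse(resp)
+		resp.Body.Close()
+		if perr == nil && !robots.FindGroup(userAgent).Test(u.Path) {
+			return "", fmt.Errorf("blocked by robots.txt")
+		}
+	}
+
+	// Fetch the page with an identifying User-Agent.
+	req, err := http.NewRequest(http.MethodGet, pageURL, nil)
+	if err != nil {
+		return "", err
+	}
+	req.Header.Set("User-Agent", userAgent)
+	resp, err := client.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+
+	// Extract text content plus src/href attributes for every match.
+	doc, err := goquery.NewDocumentFromReader(resp.Body)
+	if err != nil {
+		return "", err
+	}
+	var parts []string
+	doc.Find(selector).Each(func(_ int, s *goquery.Selection) {
+		if t := strings.TrimSpace(s.Text()); t != "" {
+			parts = append(parts, t)
+		}
+		if v, ok := s.Attr("src"); ok {
+			parts = append(parts, v)
+		}
+		if v, ok := s.Attr("href"); ok {
+			parts = append(parts, v)
+		}
+	})
+	return strings.Join(parts, "\n"), nil
+}
+
+func main() {
+	content, err := scrape("https://example.com", "title")
+	if err != nil {
+		fmt.Println("scrape failed:", err)
+		return
+	}
+	fmt.Println(content)
+}
+```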
+
+## Development
+
+### Project Structure
+
+```
+AlpenQueue/
+├── cmd/
+│   └── alpenqueue/
+│       └── main.go        # HTTP server and routes
+├── pkg/
+│   ├── db/
+│   │   └── db.go          # SQLite schema and queries
+│   └── worker/
+│       └── worker.go      # Job processing and scraping
+├── go.mod
+└── README.md
+```
+
+### Building from Source
+
+```bash
+go build -o alpenqueue ./cmd/alpenqueue
+```
+
+### Dependencies
+
+- `github.com/mattn/go-sqlite3` - SQLite driver
+- `github.com/PuerkitoBio/goquery` - CSS selector parsing
+- `github.com/temoto/robotstxt` - robots.txt parsing
+- `golang.org/x/net/html` - HTML parsing
+
+## Configuration
+
+### Database Location
+
+By default, the database is created at `./alpenqueue.db` in the working directory. Change the path in `main.go`:
+
+```go
+database, err := db.Init("./alpenqueue.db")
+```
+
+### Server Port
+
+The default port is `:8080`. Change it in `main.go`:
+
+```go
+http.ListenAndServe(":8080", nil)
+```
+
+### Worker Poll Interval
+
+The worker polls every 5 seconds. Adjust the interval in `worker.go`:
+
+```go
+time.Sleep(5 * time.Second)
+```
+
+### User-Agent
+
+Set in `worker.go`:
+
+```go
+const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
+```
+
+## Ethical Web Scraping
+
+AlpenQueue follows ethical scraping practices:
+
+- **robots.txt compliance** - Checks and respects disallow rules
+- **User-Agent identification** - Clearly identifies itself in every request
+- **Request pacing** - The configurable poll interval spaces out requests rather than hammering servers
+- **Timeout enforcement** - A 30-second request timeout prevents hung connections
+
+Always ensure you have permission to scrape websites and comply with their terms of service.
+
+## License
+
+MIT License - See LICENSE file for details.
+
+## Contributing
+
+Contributions welcome! Please open an issue or pull request.