Add comprehensive README documentation

Complete documentation including features, quick start guide, API reference, usage examples, architecture diagrams, and configuration options. Fixed typo in title (lightweigt → lightweight).
This commit is contained in:
Soldier 2025-11-16 09:22:42 +00:00
parent c395a57b38
commit e4ebe2da6b

README.md
# AlpenQueue: A lightweight, self-hosted task queue in Go
A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.
## Features
- **Flexible CSS Selector Extraction** - Extract any content using CSS selectors (title, .class, img[src], etc.)
- **Ethical Web Crawling** - Respects robots.txt and includes proper User-Agent headers
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POST extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Run jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - Single endpoint to create and manage jobs
## Quick Start
### Prerequisites
- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)
### Installation
```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```
### Running the Server
```bash
./alpenqueue
# Server starts on :8080
```
Or run directly with Go:
```bash
go run ./cmd/alpenqueue
```
### Your First Job
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"selector": "title"
}'
```
The worker will scrape the page, extract the title, and store the result in SQLite.
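To inspect the stored result, you can query the SQLite database directly. Below is a minimal sketch in Go; the table and column names follow the architecture section further down and are assumptions about the exact schema, which lives in `pkg/db/db.go`:
```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver AlpenQueue already depends on
)

func main() {
	// Open the same database file the server writes to.
	db, err := sql.Open("sqlite3", "./alpenqueue.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Column names mirror the architecture section; adjust if the real schema differs.
	rows, err := db.Query(`SELECT id, url, COALESCE(extracted_content, ''), status FROM jobs`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var url, content, status string
		if err := rows.Scan(&id, &url, &content, &status); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("#%d %s [%s]: %s\n", id, url, status, content)
	}
}
```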
## API Documentation
### POST /jobs
Create a new scraping job.
**Request Body:**
```json
{
"url": "https://example.com",
"selector": "title",
"webhook_url": "https://my-site.com/webhook",
"frequency_minutes": 20
}
```
**Fields:**
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: "title") |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |
**Response:**
```
HTTP/1.1 201 Created
Job 1 created
```
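If you prefer calling the API from Go instead of curl, a minimal client sketch (the struct fields mirror the request body documented above):
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// JobRequest mirrors the POST /jobs body documented above.
type JobRequest struct {
	URL              string `json:"url"`
	Selector         string `json:"selector,omitempty"`
	WebhookURL       string `json:"webhook_url,omitempty"`
	FrequencyMinutes int    `json:"frequency_minutes,omitempty"`
}

func main() {
	body, err := json.Marshal(JobRequest{
		URL:              "https://example.com",
		Selector:         "title",
		FrequencyMinutes: 20,
	})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:8080/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	fmt.Println("status:", resp.Status) // expect 201 Created
}
```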
### Webhook Payload
When a job completes, AlpenQueue POSTs to your webhook_url:
```json
{
"status": "ok",
"took": "1.2s",
"url": "https://example.com",
"content": "Example Domain"
}
```
**Status values:**
- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
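A minimal webhook receiver in Go that decodes this payload might look like the sketch below; the handler path and port are illustrative:
```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// WebhookPayload mirrors the JSON AlpenQueue POSTs on completion.
type WebhookPayload struct {
	Status  string `json:"status"` // "ok", "blocked", or "error"
	Took    string `json:"took"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p WebhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job for %s finished: status=%s took=%s content=%q", p.URL, p.Status, p.Took, p.Content)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```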
## Usage Examples
### One-Time Scraping
Extract story title links from the Hacker News front page:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"selector": ".titleline > a"
}'
```
### Recurring Jobs
Monitor a price every 20 minutes:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://store.com/product",
"selector": ".price",
"frequency_minutes": 20,
"webhook_url": "https://my-api.com/price-alert"
}'
```
### Extract Images
Get all image sources:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/gallery",
"selector": "img"
}'
```
The worker extracts both text content and `src`/`href` attributes automatically.
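As an illustration, selector-based extraction with goquery typically looks like the sketch below; this is a standalone example, not necessarily the exact code in `pkg/worker`:
```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<div class="gallery"><img src="/a.png" alt="A"><a href="/file.zip">Download</a></div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("img, a").Each(func(_ int, s *goquery.Selection) {
		// Text content first, then fall back to src/href attributes.
		if text := strings.TrimSpace(s.Text()); text != "" {
			fmt.Println("text:", text)
		}
		if src, ok := s.Attr("src"); ok {
			fmt.Println("src:", src)
		}
		if href, ok := s.Attr("href"); ok {
			fmt.Println("href:", href)
		}
	})
}
```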
### Extract Links
Get all download links:
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"url": "https://releases.example.com",
"selector": "a.download-link"
}'
```
## Architecture
```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```
### Job Lifecycle
1. **pending** - Newly created, waiting for worker
2. **active** - Recurring job, scheduled for next run
3. **done** - One-time job completed
### Worker Behavior
- Polls database every 5 seconds for jobs where `next_run_at <= now`
- Fetches URL with 30s timeout
- Checks robots.txt before each request
- Saves complete HTML to `raw_html` column
- Extracts content using goquery CSS selector
- POSTs webhook if configured
- For recurring jobs: updates `next_run_at` and status to `active`
- For one-time jobs: marks status as `done`
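Put together, the scheduling rules look roughly like the sketch below. This is a simplified in-memory illustration of the lifecycle, not the actual `pkg/worker` code, which reads jobs from SQLite and performs the robots.txt check, HTTP fetch, extraction, and webhook POST:
```go
package main

import (
	"fmt"
	"time"
)

// Job is a simplified, in-memory stand-in for a row in the jobs table.
type Job struct {
	ID               int
	URL              string
	FrequencyMinutes int
	Status           string // pending, active, done
	NextRunAt        time.Time
}

func main() {
	jobs := []*Job{
		{ID: 1, URL: "https://example.com", Status: "pending", NextRunAt: time.Now()},
		{ID: 2, URL: "https://store.com/product", FrequencyMinutes: 20, Status: "pending", NextRunAt: time.Now()},
	}

	for i := 0; i < 3; i++ { // a few poll cycles for demonstration
		now := time.Now()
		for _, job := range jobs {
			if job.Status == "done" || job.NextRunAt.After(now) {
				continue // not due yet
			}
			// The real worker fetches the URL, extracts content, and POSTs the webhook here.
			fmt.Printf("processing job %d (%s)\n", job.ID, job.URL)

			if job.FrequencyMinutes > 0 {
				job.Status = "active" // recurring: reschedule
				job.NextRunAt = now.Add(time.Duration(job.FrequencyMinutes) * time.Minute)
			} else {
				job.Status = "done" // one-time: finished
			}
		}
		time.Sleep(5 * time.Second) // poll interval
	}
}
```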
## Development
### Project Structure
```
AlpenQueue/
├── cmd/
│ └── alpenqueue/
│ └── main.go # HTTP server and routes
├── pkg/
│ ├── db/
│ │ └── db.go # SQLite schema and queries
│ └── worker/
│ └── worker.go # Job processing and scraping
├── go.mod
└── README.md
```
### Building from Source
```bash
go build -o alpenqueue ./cmd/alpenqueue
```
### Dependencies
- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing
## Configuration
### Database Location
By default, the database is created at `./alpenqueue.db` in the working directory. Modify in `main.go`:
```go
database, err := db.Init("./alpenqueue.db")
```
### Server Port
Default port is `:8080`. Change in `main.go`:
```go
http.ListenAndServe(":8080", nil)
```
### Worker Poll Interval
Worker polls every 5 seconds. Modify in `worker.go`:
```go
time.Sleep(5 * time.Second)
```
### User-Agent
Set in `worker.go`:
```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```
## Ethical Web Scraping
AlpenQueue follows ethical scraping practices:
- **robots.txt compliance** - Checks and respects disallow rules
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - Configurable poll interval prevents hammering servers
- **Timeout enforcement** - 30-second request timeout prevents hanging
Always ensure you have permission to scrape websites and comply with their terms of service.
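For reference, a robots.txt check with `github.com/temoto/robotstxt` typically looks like the sketch below; the actual logic lives in `pkg/worker` and may differ in detail:
```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/temoto/robotstxt"
)

const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"

// allowedByRobots reports whether the target URL may be fetched by our User-Agent.
func allowedByRobots(target string) (bool, error) {
	u, err := url.Parse(target)
	if err != nil {
		return false, err
	}

	resp, err := http.Get(u.Scheme + "://" + u.Host + "/robots.txt")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	// FromResponse handles a missing robots.txt (a 404 means everything is allowed).
	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false, err
	}
	return robots.TestAgent(u.Path, userAgent), nil
}

func main() {
	ok, err := allowedByRobots("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allowed:", ok)
}
```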
## License
MIT License - See LICENSE file for details
## Contributing
Contributions welcome! Please open an issue or pull request.