# AlpenQueue: A lightweight, self-hosted task queue in Go

A simple, self-hosted web scraping task queue with SQLite persistence, flexible CSS selectors, and webhook callbacks. Perfect for monitoring websites, extracting content, and scheduling recurring scrapes.

## Features
- **Flexible CSS Selector Extraction** - Extracts any content using CSS selectors (`title`, `.class`, `img[src]`, etc.)
- **Ethical Web Crawling** - Respects robots.txt and sends a proper User-Agent header
- **Raw HTML Archiving** - Stores complete HTML for historical re-parsing when sites change
- **Webhook Callbacks** - POSTs extracted content to your endpoint when jobs complete
- **Recurring Job Scheduling** - Runs jobs at specified intervals (every N minutes)
- **SQLite Persistence** - All jobs stored in SQLite for reliability and historical access
- **Simple HTTP API** - A single endpoint to create and manage jobs

## Quick Start
### Prerequisites

- Go 1.25 or later
- SQLite (automatically included via go-sqlite3)

### Installation

```bash
git clone https://github.com/yourusername/alpenqueue.git
cd alpenqueue
go build -o alpenqueue ./cmd/alpenqueue
```

### Running the Server

```bash
./alpenqueue
# Server starts on :8080
```
Or run directly with Go:

```bash
go run ./cmd/alpenqueue
```
### Your First Job

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "selector": "title"
  }'
```

The worker will scrape the page, extract the title, and store the result in SQLite.
## API Documentation

### POST /jobs

Create a new scraping job.

**Request Body:**

```json
{
  "url": "https://example.com",
  "selector": "title",
  "webhook_url": "https://my-site.com/webhook",
  "frequency_minutes": 20
}
```

**Fields:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape |
| `selector` | string | No | CSS selector (default: `title`) |
| `webhook_url` | string | No | URL to POST results to |
| `frequency_minutes` | int | No | Recurring interval in minutes (0 = one-time) |

**Response:**

```
HTTP/1.1 201 Created

Job 1 created
```
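
Under the hood, the handler for this endpoint boils down to decode, validate, insert, respond. Here is a minimal sketch of that flow, using hypothetical names (`jobRequest`, `insertJob`, `handleCreateJob`) rather than AlpenQueue's actual identifiers:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// jobRequest mirrors the documented request body.
type jobRequest struct {
	URL              string `json:"url"`
	Selector         string `json:"selector"`
	WebhookURL       string `json:"webhook_url"`
	FrequencyMinutes int    `json:"frequency_minutes"`
}

// insertJob is a stand-in for the real SQLite insert; it returns a
// fake ID so the sketch compiles on its own.
func insertJob(req jobRequest) int64 { return 1 }

func handleCreateJob(w http.ResponseWriter, r *http.Request) {
	var req jobRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.URL == "" {
		http.Error(w, "invalid request body", http.StatusBadRequest)
		return
	}
	if req.Selector == "" {
		req.Selector = "title" // documented default
	}
	id := insertJob(req)
	w.WriteHeader(http.StatusCreated)
	fmt.Fprintf(w, "Job %d created", id)
}

func main() {
	http.HandleFunc("/jobs", handleCreateJob)
	http.ListenAndServe(":8080", nil)
}
```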
### Webhook Payload

When a job completes, AlpenQueue POSTs to your `webhook_url`:

```json
{
  "status": "ok",
  "took": "1.2s",
  "url": "https://example.com",
  "content": "Example Domain"
}
```

**Status values:**

- `ok` - Successfully scraped
- `blocked` - Blocked by robots.txt
- `error` - HTTP or parsing error
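
To consume these callbacks, any HTTP endpoint that accepts a small JSON POST will do. A minimal receiver sketch, assuming only the payload fields shown above:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// webhookPayload matches the documented callback body.
type webhookPayload struct {
	Status  string `json:"status"`
	Took    string `json:"took"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job for %s finished: status=%s took=%s content=%q",
			p.URL, p.Status, p.Took, p.Content)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```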
## Usage Examples

### One-Time Scraping

Extract the story title links from the Hacker News front page:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "selector": ".titleline > a"
  }'
```
### Recurring Jobs

Monitor a price every 20 minutes:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.com/product",
    "selector": ".price",
    "frequency_minutes": 20,
    "webhook_url": "https://my-api.com/price-alert"
  }'
```
### Extract Images

Get all image sources:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/gallery",
    "selector": "img"
  }'
```

The worker extracts both text content and `src`/`href` attributes automatically.
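
For a sense of how that works, here is an illustrative extraction pass with goquery that collects element text plus `src`/`href` attributes. This is a sketch, not the worker's actual code:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// extract runs a CSS selector over raw HTML and returns text content
// plus any src/href attribute values it finds.
func extract(rawHTML, selector string) ([]string, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(rawHTML))
	if err != nil {
		return nil, err
	}
	var out []string
	doc.Find(selector).Each(func(_ int, s *goquery.Selection) {
		if text := strings.TrimSpace(s.Text()); text != "" {
			out = append(out, text)
		}
		// Also capture src/href so img and a selectors stay useful.
		for _, attr := range []string{"src", "href"} {
			if v, ok := s.Attr(attr); ok {
				out = append(out, v)
			}
		}
	})
	return out, nil
}

func main() {
	html := `<a class="download-link" href="/v1.0.zip">v1.0</a>`
	vals, _ := extract(html, "a.download-link")
	fmt.Println(vals) // [v1.0 /v1.0.zip]
}
```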
### Extract Links

Get all download links:

```bash
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://releases.example.com",
    "selector": "a.download-link"
  }'
```
## Architecture

```
┌─────────────┐
│  HTTP API   │  POST /jobs
│    :8080    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   SQLite    │  jobs table (pending/active/done)
│  Database   │  - url, selector, frequency
└──────┬──────┘  - raw_html, extracted_content
       │         - next_run_at
       ▼
┌─────────────┐
│   Worker    │  Polls every 5s
│  Goroutine  │  - Fetches pending jobs
└──────┬──────┘  - Respects robots.txt
       │         - Extracts content with goquery
       ▼         - POSTs webhook
┌─────────────┐  - Updates next_run_at or marks done
│  Webhooks   │
└─────────────┘
```
### Job Lifecycle

1. **pending** - Newly created, waiting for worker
2. **active** - Recurring job, scheduled for next run
3. **done** - One-time job completed
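
Concretely, this lifecycle lives in a single SQLite table. A sketch of what that schema implies, with column names taken from the architecture diagram above rather than from `db.go` itself:

```go
package db

// schemaSketch is inferred from this README's diagram and field list —
// consult pkg/db/db.go for the authoritative schema.
const schemaSketch = `
CREATE TABLE IF NOT EXISTS jobs (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    url               TEXT NOT NULL,
    selector          TEXT NOT NULL DEFAULT 'title',
    webhook_url       TEXT,
    frequency_minutes INTEGER NOT NULL DEFAULT 0,
    status            TEXT NOT NULL DEFAULT 'pending', -- pending | active | done
    raw_html          TEXT,
    extracted_content TEXT,
    next_run_at       TIMESTAMP
);`
```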
### Worker Behavior

- Polls the database every 5 seconds for jobs where `next_run_at <= now` (the loop is sketched below)
- Checks robots.txt before each request
- Fetches the URL with a 30s timeout
- Saves the complete HTML to the `raw_html` column
- Extracts content using the goquery CSS selector
- POSTs the webhook if configured
- For recurring jobs: updates `next_run_at` and sets status to `active`
- For one-time jobs: marks status as `done`
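
The same behavior as a condensed Go sketch — the helper names are invented for illustration, and the real logic lives in `pkg/worker/worker.go`:

```go
package worker

import (
	"database/sql"
	"time"
)

type job struct {
	ID               int64
	URL, Selector    string
	WebhookURL       string
	FrequencyMinutes int
}

// Stubs so the sketch compiles; the real versions hit SQLite and the network.
func dueJobs(db *sql.DB) []job                      { return nil } // next_run_at <= now
func scrape(j job) (status, content string)         { return "ok", "" }
func postWebhook(url, status, content string)       {}
func reschedule(db *sql.DB, j job, d time.Duration) {} // sets next_run_at, status 'active'
func markDone(db *sql.DB, j job)                    {} // sets status 'done'

// Run is the poll loop: claim due jobs, scrape, notify, reschedule.
func Run(db *sql.DB) {
	for {
		for _, j := range dueJobs(db) {
			status, content := scrape(j) // robots.txt check, 30s fetch, goquery extract
			if j.WebhookURL != "" {
				postWebhook(j.WebhookURL, status, content)
			}
			if j.FrequencyMinutes > 0 {
				reschedule(db, j, time.Duration(j.FrequencyMinutes)*time.Minute)
			} else {
				markDone(db, j)
			}
		}
		time.Sleep(5 * time.Second) // worker poll interval
	}
}
```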
## Development

### Project Structure

```
AlpenQueue/
├── cmd/
│   └── alpenqueue/
│       └── main.go        # HTTP server and routes
├── pkg/
│   ├── db/
│   │   └── db.go          # SQLite schema and queries
│   └── worker/
│       └── worker.go      # Job processing and scraping
├── go.mod
└── README.md
```
### Building from Source

```bash
go build -o alpenqueue ./cmd/alpenqueue
```

### Dependencies

- `github.com/mattn/go-sqlite3` - SQLite driver
- `github.com/PuerkitoBio/goquery` - CSS selector parsing
- `github.com/temoto/robotstxt` - robots.txt parsing
- `golang.org/x/net/html` - HTML parsing
## Configuration

### Database Location

By default, the database is created at `./alpenqueue.db` in the working directory. Modify it in `main.go`:

```go
database, err := db.Init("./alpenqueue.db")
```

### Server Port

The default port is `:8080`. Change it in `main.go`:

```go
http.ListenAndServe(":8080", nil)
```

### Worker Poll Interval

The worker polls every 5 seconds. Modify the interval in `worker.go`:

```go
time.Sleep(5 * time.Second)
```

### User-Agent

Set in `worker.go`:

```go
const userAgent = "AlpenQueue/1.0 (+https://github.com/yourusername/alpenqueue)"
```
## Ethical Web Scraping

AlpenQueue follows ethical scraping practices:

- **robots.txt compliance** - Checks and respects disallow rules (see the sketch below)
- **User-Agent identification** - Clearly identifies itself in requests
- **Rate limiting** - The configurable poll interval keeps the worker from hammering servers
- **Timeout enforcement** - A 30-second request timeout prevents hanging

Always ensure you have permission to scrape websites and comply with their terms of service.
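
As a reference point, a robots.txt gate built on `github.com/temoto/robotstxt` (already a dependency) can be this small. A sketch under the documented User-Agent, not the worker's exact code:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"github.com/temoto/robotstxt"
)

const userAgent = "AlpenQueue/1.0"

// allowed reports whether target may be fetched under its host's robots.txt.
func allowed(target string) (bool, error) {
	u, err := url.Parse(target)
	if err != nil {
		return false, err
	}
	resp, err := http.Get(u.Scheme + "://" + u.Host + "/robots.txt")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false, err
	}
	return robots.FindGroup(userAgent).Test(u.Path), nil
}

func main() {
	ok, err := allowed("https://example.com/")
	fmt.Println(ok, err)
}
```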
## License

MIT License - See LICENSE file for details

## Contributing

Contributions welcome! Please open an issue or pull request.