# OffersExtractor Flask API

Production-ready Flask application for extracting internship offers from PDF files using DeepSeek AI.

## 🚀 Features

- **PDF Processing**: Extract structured internship offers from PDF documents
- **AI-Powered**: Uses DeepSeek API for intelligent text extraction
- **Production Ready**: Configured with Gunicorn, Nginx, and Supervisor
- **Secure**: HTTPS, security headers, file validation
- **Scalable**: Multi-worker Gunicorn setup (4 workers by default) with 300s timeouts for long extractions
- **RESTful API**: Clean JSON responses
- **Health Checks**: Built-in health monitoring endpoint

## 📋 Prerequisites

- Python 3.9+
- pip
- virtualenv
- Nginx (production)
- Supervisor (production)
- DeepSeek API key

## 🔧 Installation

### Development Setup

1. **Navigate to the project directory** (after cloning the repository):
```bash
cd OffersExtractorFlask
```

2. **Create virtual environment**:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Configure environment**:
```bash
cp .env.example .env
# Edit .env and add your DEEPSEEK_API_KEY
```

5. **Run development server**:
```bash
python app.py
```

The API will be available at `http://localhost:5100`

### Production Deployment

1. **Deploy to server** (e.g., `/var/www/offers-extractor`):
```bash
# On server
cd /var/www
git clone <repository> offers-extractor
cd offers-extractor
```

2. **Setup virtual environment**:
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

3. **Configure environment**:
```bash
cp .env.example .env
nano .env  # Add DEEPSEEK_API_KEY
```

4. **Create log directories**:
```bash
sudo mkdir -p /var/log/offers-extractor
sudo chown www-data:www-data /var/log/offers-extractor
```

5. **Setup Supervisor**:
```bash
sudo cp supervisor.conf /etc/supervisor/conf.d/offers-extractor.conf
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl start offers-extractor
```

6. **Setup Nginx**:
```bash
sudo cp nginx.conf /etc/nginx/sites-available/extractor.stagi-edu.com
sudo ln -s /etc/nginx/sites-available/extractor.stagi-edu.com /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```

7. **Setup SSL with Let's Encrypt**:
```bash
sudo certbot --nginx -d extractor.stagi-edu.com
```

8. **Verify deployment**:
```bash
curl https://extractor.stagi-edu.com/health
```

## 📡 API Endpoints

### Health Check
```http
GET /health
```

**Response**:
```json
{
  "status": "healthy",
  "service": "OffersExtractor",
  "deepseek_configured": true
}
```

### Extract Offers
```http
POST /extract_offers
Content-Type: multipart/form-data
```

**Request**:
- Field: `pdf` (file)
- Max size: 15MB
- Type: PDF only

**Response** (Success):
```json
{
  "offers": [
    {
      "title": "Stage Full Stack Developer",
      "description": "Développement d'applications web...",
      "skills": ["Angular", "Laravel", "MySQL"],
      "duration_months": 6,
      "tags": ["Web", "Full Stack", "Angular"],
      "is_paid": true
    }
  ],
  "count": 1
}
```

**Response** (Error):
```json
{
  "error": "Error type",
  "message": "Detailed error message"
}
```
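On the client side, both response shapes can be handled in one place. The following is a minimal sketch, not part of the service itself: the `Offer` dataclass and `parse_extract_response` are hypothetical names, and treating `duration_months` and `is_paid` as optional is an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    """One extracted offer, mirroring the fields in the success response."""
    title: str
    description: str
    skills: List[str]
    duration_months: Optional[int]  # assumption: may be absent or null
    tags: List[str]
    is_paid: Optional[bool]         # assumption: may be absent or null


def parse_extract_response(body: str) -> List[Offer]:
    """Parse a JSON response body; raise ValueError for the error shape."""
    payload = json.loads(body)
    if "error" in payload:
        raise ValueError(f"{payload['error']}: {payload.get('message', '')}")
    return [
        Offer(
            title=item["title"],
            description=item["description"],
            skills=list(item.get("skills", [])),
            duration_months=item.get("duration_months"),
            tags=list(item.get("tags", [])),
            is_paid=item.get("is_paid"),
        )
        for item in payload["offers"]
    ]
```

Raising on the error shape keeps calling code from silently treating a failed extraction as an empty result.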

## 🔌 Laravel Integration

The main Laravel application acts as a proxy to avoid CORS issues. Update your Laravel `.env`:

```env
EXTRACTOR_URL=https://extractor.stagi-edu.com
EXTRACTOR_TIMEOUT=60
```

The Laravel `ExtractorController` already proxies requests:
- Laravel endpoint: `POST /api/v1/extractor/extract`
- Proxies to: `https://extractor.stagi-edu.com/extract_offers`

The frontend should call the Laravel endpoint, not the Flask service directly.

## 🛠️ Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DEEPSEEK_API_KEY` | DeepSeek API authentication key | **Required** |
| `FLASK_ENV` | Environment (development/production) | `production` |
| `FLASK_DEBUG` | Enable debug mode | `False` |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `5100` |
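As an illustration of how these variables and defaults could be read at startup, here is a minimal sketch. The `Settings` dataclass and `load_settings` function are hypothetical names, and the set of strings accepted as "true" for `FLASK_DEBUG` is an assumption; the actual application may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    deepseek_api_key: str
    flask_env: str
    debug: bool
    host: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables using the defaults in the table."""
    env = os.environ if env is None else env
    api_key = env.get("DEEPSEEK_API_KEY", "")
    if not api_key:
        # The table marks this variable as required, so fail early.
        raise RuntimeError("DEEPSEEK_API_KEY is required")
    return Settings(
        deepseek_api_key=api_key,
        flask_env=env.get("FLASK_ENV", "production"),
        # Assumption: "1", "true" and "yes" (any case) enable debug mode.
        debug=env.get("FLASK_DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "5100")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.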

### Gunicorn Configuration

- **Workers**: 4 (adjust based on CPU cores)
- **Timeout**: 300s (for long PDF processing)
- **Bind**: `127.0.0.1:5100` (behind Nginx)
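
Gunicorn reads its settings from a Python file, so the values above could be expressed as follows. This is a sketch only: the file name `gunicorn.conf.py` is an assumption, and the project may instead pass these options on the command line in `supervisor.conf`.

```python
# gunicorn.conf.py (hypothetical file name; adjust to the project's layout)

# Listen on loopback only; Nginx proxies public traffic to this address.
bind = "127.0.0.1:5100"

# Number of worker processes; adjust based on available CPU cores.
workers = 4

# Seconds before an unresponsive worker is killed; long PDF extractions need headroom.
timeout = 300
```

With such a file in place, the server could be started with `gunicorn -c gunicorn.conf.py app:app`, assuming the Flask object is named `app` inside `app.py`.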

### Nginx Configuration

- **Max upload**: 15MB
- **Timeouts**: 300s (for extraction)
- **SSL**: TLS 1.2/1.3
- **Security headers**: Enabled

## 📊 Monitoring

### Check service status:
```bash
sudo supervisorctl status offers-extractor
```

### View logs:
```bash
# Application logs
tail -f /var/log/offers-extractor/error.log
tail -f /var/log/offers-extractor/access.log

# Nginx logs
tail -f /var/log/nginx/extractor.stagi-edu.com.access.log
tail -f /var/log/nginx/extractor.stagi-edu.com.error.log
```

### Restart service:
```bash
sudo supervisorctl restart offers-extractor
```

## 🧪 Testing

### Test health endpoint:
```bash
curl https://extractor.stagi-edu.com/health
```

### Test extraction (with file):
```bash
curl -X POST https://extractor.stagi-edu.com/extract_offers \
  -F "pdf=@/path/to/test.pdf"
```
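
The same upload can be made from Python using only the standard library. The sketch below mirrors the curl command above; `build_multipart` and `extract_offers` are hypothetical helper names, and the 300-second default timeout is an assumption chosen to match the server-side timeouts.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_offers(url: str, pdf_path: str, timeout: float = 300.0) -> dict:
    """POST a PDF to the extraction endpoint and return the decoded JSON body."""
    with open(pdf_path, "rb") as handle:
        body, content_type = build_multipart(
            "pdf", os.path.basename(pdf_path), handle.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `extract_offers("https://extractor.stagi-edu.com/extract_offers", "/path/to/test.pdf")` would then return the parsed response dictionary.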

### Test via Laravel proxy:
```bash
curl -X POST http://your-laravel-app.com/api/v1/extractor/extract \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "pdf=@/path/to/test.pdf"
```

## 🔒 Security

- ✅ HTTPS enforced
- ✅ Security headers (X-Frame-Options, CSP, etc.)
- ✅ File type validation
- ✅ File size limits (15MB)
- ✅ Temporary file cleanup
- ✅ API key protection
- ✅ CORS configured (handled by Nginx in production)
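
The file type and size checks listed above could be implemented along these lines. This is a minimal sketch under stated assumptions: `validate_pdf_upload` is a hypothetical name, 15MB is interpreted as 15 × 1024 × 1024 bytes, and the type check looks at both the file extension and the `%PDF-` header bytes. The real service may validate differently.

```python
from typing import Tuple

MAX_UPLOAD_BYTES = 15 * 1024 * 1024  # assumption: 15MB means 15 MiB
PDF_MAGIC = b"%PDF-"                 # PDF files begin with this header


def validate_pdf_upload(filename: str, data: bytes) -> Tuple[bool, str]:
    """Return (ok, reason); reason is empty when the upload is acceptable."""
    if not filename.lower().endswith(".pdf"):
        return False, "Only PDF files are accepted"
    if len(data) > MAX_UPLOAD_BYTES:
        return False, "File exceeds the 15MB limit"
    # Strict check: rejects files with any bytes before the PDF header.
    if not data.startswith(PDF_MAGIC):
        return False, "File content is not a PDF"
    return True, ""
```

Checking the header bytes as well as the extension guards against non-PDF files that were simply renamed to `.pdf`.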

## 📝 Error Handling

The API returns appropriate HTTP status codes:

- `200`: Success
- `400`: Bad request (invalid file, missing fields)
- `413`: File too large
- `500`: Server error (API failure, extraction error)
- `502`: Bad gateway (upstream service unreachable)
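
One way to pair these status codes with the documented error body is a small helper like the one below. This is a sketch: `make_error` is a hypothetical name, and using the standard HTTP reason phrase as the `error` field is an assumption, since the actual values the service returns in that field are not documented here.

```python
from http import HTTPStatus
from typing import Dict, Tuple


def make_error(status: int, message: str) -> Tuple[Dict[str, str], int]:
    """Build an (error body, status code) pair in the documented JSON shape."""
    if status < 400:
        raise ValueError("make_error expects a 4xx or 5xx status code")
    # Assumption: the standard reason phrase stands in for the error type.
    phrase = HTTPStatus(status).phrase
    return {"error": phrase, "message": message}, status
```

A Flask view can return such a `(dict, status)` tuple directly, and Flask serializes the dictionary as JSON.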

## 🚦 Performance

- **Rate Limiting**: 1-second delay between consecutive DeepSeek API calls
- **Deduplication**: Automatic offer deduplication by description
- **Streaming**: Efficient PDF page-by-page processing
- **Cleanup**: Automatic temporary file removal
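
The rate limiting and deduplication described above can be sketched as follows. The names `deduplicate_offers` and `call_with_delay` are hypothetical, and normalizing descriptions (collapsing whitespace and lowercasing) before comparing them is an assumption; the service may compare descriptions exactly.

```python
import time
from typing import Callable, Dict, Iterable, List


def deduplicate_offers(offers: Iterable[Dict]) -> List[Dict]:
    """Keep the first offer seen for each distinct description."""
    seen = set()
    unique = []
    for offer in offers:
        # Assumption: whitespace and letter case do not distinguish offers.
        key = " ".join(str(offer.get("description", "")).split()).lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append(offer)
    return unique


def call_with_delay(
    pages: Iterable[str],
    extract: Callable[[str], List[Dict]],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Call `extract` once per page, pausing between consecutive calls."""
    results: List[Dict] = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first call
        results.extend(extract(page))
    return results
```

Here `extract` stands in for whatever function sends one page of text to DeepSeek. Injecting `sleep` lets tests run without actually waiting.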

## 📦 Dependencies

- **Flask**: Web framework
- **flask-cors**: CORS handling
- **PyMuPDF**: PDF processing
- **python-dotenv**: Environment configuration
- **requests**: HTTP client for DeepSeek API
- **gunicorn**: WSGI server
- **Werkzeug**: WSGI utilities

## 🔄 Updates

To update the service:

```bash
cd /var/www/offers-extractor
git pull
source venv/bin/activate
pip install -r requirements.txt
sudo supervisorctl restart offers-extractor
```

## 🐛 Troubleshooting

### Service won't start
```bash
# Check logs
sudo supervisorctl tail offers-extractor stderr

# Verify environment (run from the project directory)
cd /var/www/offers-extractor
source venv/bin/activate
python -c "from app import app; print('OK')"
```

### DeepSeek API errors
- Verify `DEEPSEEK_API_KEY` in `.env`
- Check API quota/limits
- Review error logs

### Upload fails
- Check Nginx `client_max_body_size`
- Verify file permissions
- Check disk space

## 📞 Support

For issues related to:
- **API functionality**: Check application logs
- **Nginx/proxy**: Check Nginx logs
- **DeepSeek API**: Check DeepSeek documentation
- **Laravel integration**: Check Laravel logs

## 📄 License

[Your License Here]

## 🤝 Contributing

[Your Contributing Guidelines Here]
