
Agentic Web Scraper & Crawler Framework

This module walks through building a full-stack, scalable web crawling and scraping framework powered by LLMs, async pipelines, and containerized actors. It covers dynamic scraping, orchestration, observability, and publishing public actors with marketplace integration.

Day 1-15: Local Scraping Basics

Topics Covered

  • Python HTTP Clients
    • requests, httpx for static HTML retrieval.
  • HTML Parsing Libraries
    • BeautifulSoup, lxml for DOM traversal and tag targeting.
  • Data Validation
    • pydantic for defining schemas and contract validation.

Hands-on Tasks

  • Build three micro-scrapers targeting static websites.
  • Create reusable utility functions for DOM parsing.

Deliverables

  • Poetry-managed repo with published PyPI package.
  • GitHub repository containing three functional micro-scrapers.

Day 16-30: Browser Automation

Topics Covered

  • Selenium
    • Grid setup, headless mode, handling dynamic pages.
  • Playwright
    • Headless mode, Python and Node.js clients, cross-browser support.
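
A minimal sketch of draining an infinite-scroll page with Playwright's sync Python API (assumes `playwright` and a Chromium build are installed; the target URL and `.item` selector are hypothetical). The stop condition is factored into a pure helper so it can be tested without a browser:

```python
# Infinite-scroll scraping sketch with Playwright (sync API).
def reached_bottom(previous_height: int, current_height: int) -> bool:
    """Stop scrolling once the page height stops growing (pure, testable)."""
    return current_height <= previous_height


def scrape_infinite_scroll(url: str, max_rounds: int = 20) -> list[str]:
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        previous_height = 0
        for _ in range(max_rounds):
            page.mouse.wheel(0, 10_000)        # scroll down one screenful+
            page.wait_for_timeout(1_000)       # let lazy-loaded items arrive
            current_height = page.evaluate("document.body.scrollHeight")
            if reached_bottom(previous_height, current_height):
                break
            previous_height = current_height
        texts = page.locator(".item").all_inner_texts()  # hypothetical selector
        browser.close()
        return texts
```

The `max_rounds` cap guards against pages that grow indefinitely, a detail that matters in the Playwright vs Selenium benchmark.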

Hands-on Tasks

  • Implement a scraper for infinite-scroll websites.
  • Benchmark and analyze Playwright vs Selenium performance.

Deliverables

  • Blog post titled "Choosing Playwright vs Selenium".
  • Demo script showcasing scraping of dynamic content.

Day 31-45: JavaScript/TS & Puppeteer

Topics Covered

  • Puppeteer for SPA scraping
    • DOM instrumentation and JavaScript execution.
  • Anti-Bot Evasion
    • Stealth plugins, browser fingerprinting mitigation.

Hands-on Tasks

  • Create an NPM package encapsulating Puppeteer anti-bot techniques.
  • Write GitHub Actions workflows for automated testing.

Deliverables

  • NPM package publication.
  • CI testing pipeline in GitHub Actions.

Day 46-60: Async & Distributed Crawling

Topics Covered

  • Async Crawling
    • aiohttp, async queues, back-pressure techniques.
  • Distributed Architecture
    • Sitemap parsing, robots.txt handling, depth-first/breadth-first strategies.
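
The queue and back-pressure mechanics can be sketched with the standard library alone; `fetch()` is stubbed where an `aiohttp` GET would go, and the bounded `asyncio.Queue` makes the producer block whenever the workers fall behind:

```python
# Bounded-queue async crawler sketch: back-pressure via asyncio.Queue(maxsize=...).
import asyncio


async def fetch(url: str) -> str:
    """Stand-in for an aiohttp request; yields control like real I/O would."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


async def crawl(urls: list[str], concurrency: int = 8) -> dict[str, str]:
    queue: asyncio.Queue[str] = asyncio.Queue(maxsize=concurrency * 2)
    results: dict[str, str] = {}

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                results[url] = await fetch(url)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    for url in urls:
        await queue.put(url)        # blocks when the queue is full: back-pressure
    await queue.join()              # wait until every URL is processed
    for w in workers:
        w.cancel()
    return results
```

A production version would add robots.txt checks (`urllib.robotparser`), per-domain rate limits, and JSON-lines export to S3.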

Hands-on Tasks

  • Build a scalable crawler capable of crawling 10,000 URLs.
  • Export results as JSON lines to S3.

Deliverables

  • Draw.io architecture diagram.
  • "One-command" crawler demo for large-scale extraction.

Day 61-75: Storage & Schema Design

Topics Covered

  • Data Storage Models
    • MongoDB vs Parquet for structured data.
  • Content Deduplication
    • Fingerprinting for delta updates.
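
A minimal sketch of fingerprint-based deduplication: hash normalized page text and re-store a document only when its fingerprint changes between crawls. Function names are illustrative:

```python
# Content-fingerprinting sketch for delta updates.
import hashlib


def fingerprint(content: str) -> str:
    """Stable hash of normalized content; whitespace churn won't trigger updates."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_update(url: str, content: str, seen: dict[str, str]) -> bool:
    """True if the page is new or its content changed since the last crawl."""
    fp = fingerprint(content)
    if seen.get(url) == fp:
        return False
    seen[url] = fp
    return True
```

In practice the `seen` map would live in the crawl store (MongoDB or a Parquet sidecar), so delta runs only write changed documents.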

Hands-on Tasks

  • Define a Pydantic-based data schema.
  • Benchmark MongoDB against Parquet for crawl-storage performance.

Deliverables

  • Schema documentation.
  • Performance benchmark report.

Day 76-90: Apify-Style Actor Packaging

Topics Covered

  • Actor Containerization
    • Docker multi-arch builds, secrets mounting.
  • Interface Design
    • Input/output handling for actors.
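
One possible shape for the actor interface (the `ProdigalActor` name comes from the hands-on tasks below; everything else here is an assumption): input arrives as JSON via an environment variable, mirroring how config and secrets are mounted into the container, and each output record is emitted as a JSON line:

```python
# Sketch of a Python actor base class with JSON-in / JSON-lines-out contract.
import json
import os
import sys
from abc import ABC, abstractmethod
from typing import Any


class ProdigalActor(ABC):
    def __init__(self) -> None:
        # Input is mounted into the container as an env var (assumed name).
        self.input: dict[str, Any] = json.loads(os.environ.get("ACTOR_INPUT", "{}"))
        self._out = sys.stdout

    @abstractmethod
    def run(self) -> None:
        """Actor body: read self.input, call self.push() once per record."""

    def push(self, record: dict[str, Any]) -> None:
        """Emit one result as a JSON line; downstream tooling tails this stream."""
        self._out.write(json.dumps(record) + "\n")
```

The Reddit sub-crawler sample actor would subclass this, with `run()` doing the crawling and `push()` streaming posts out.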

Hands-on Tasks

  • Build a ProdigalActor base class in Python and Node.js.
  • Implement a Reddit sub-crawler as a sample actor.

Deliverables

  • Dockerized actor framework.
  • Reddit sub-crawler published to internal hub.

Day 91-105: Scheduler & Orchestration

Topics Covered

  • Queue Systems
    • Redis Streams, Celery for task queuing and retries.
  • Job Scheduling
    • Prefect 3 for orchestration, cron schedules, and ad-hoc jobs.
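
Queuing and cron triggers are delegated to Redis Streams/Celery and Prefect 3 in the real system, but the retry policy itself is simple enough to sketch: exponential backoff with a cap. Helper names are illustrative:

```python
# Retry-with-backoff sketch: the failure-recovery core of the scheduler.
import time


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Seconds to wait before each retry: base * 2**attempt, capped at `cap`."""
    return [min(base * (2 ** attempt), cap) for attempt in range(max_retries)]


def run_with_retries(task, max_retries: int = 5, sleep=time.sleep):
    """Run task(); on failure, wait per backoff_delays and retry, then re-raise."""
    last_error = None
    for delay in backoff_delays(max_retries):
        try:
            return task()
        except Exception as exc:        # a real scheduler would narrow this
            last_error = exc
            sleep(delay)
    raise last_error
```

Injecting `sleep` keeps the policy testable; Celery's retry options and Prefect's retry settings implement the same curve declaratively.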

Hands-on Tasks

  • Build a scheduling service with failure recovery.
  • Integrate retry and cron support.

Deliverables

  • Functional scheduler with auto-resume.
  • Demo of cron-triggered jobs.

Day 106-120: LLM-Augmented Scraping

Topics Covered

  • DOM Understanding with LLMs
    • GPT-assisted XPath generation; content extraction with trafilatura and Tavily.
  • Accuracy Evaluation
    • Regex vs LLM-based extraction.
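
A sketch of the evaluation harness for this comparison: any extractor, regex-based or LLM-backed, returns a set of fields that is scored against hand-labeled gold data. The date-extraction baseline below is deliberately simple and purely illustrative:

```python
# Accuracy-evaluation sketch: score an extractor against gold labels.
import re


def regex_extract_dates(text: str) -> set[str]:
    """Regex baseline: ISO-style dates only. An LLM extractor would return
    the same shape (a set of strings), so both plug into the same scorer."""
    return set(re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text))


def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Standard precision/recall over extracted fields."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Running this scorer over the 20-site zero-shot corpus is what turns the regex-vs-LLM question into numbers for the comparative report.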

Hands-on Tasks

  • Conduct zero-shot extraction from 20 news sites.
  • Compare LLMs against traditional methods.

Deliverables

  • Comparative research PDF.
  • Notebook demonstrating DOM extraction.

Day 121-135: Marketplace & Public Hub

Topics Covered

  • UX for Actors
    • Next.js dashboard, Stripe integration for billing.
  • Discoverability
    • Public-facing marketplace.
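
To seed the API spec drafted in the hands-on tasks, the publish endpoint's request payload might look like the following; every field name here is a placeholder, not a final contract:

```python
# Hypothetical payload schema for an actor-publishing endpoint.
from dataclasses import asdict, dataclass, field


@dataclass
class PublishActorRequest:
    actor_name: str
    version: str
    docker_image: str
    price_usd_per_run: float = 0.0      # 0.0 = free actor
    tags: list[str] = field(default_factory=list)

    def validate(self) -> None:
        if not self.actor_name:
            raise ValueError("actor_name is required")
        if self.price_usd_per_run < 0:
            raise ValueError("price must be non-negative")
```

In the actual service this would become a FastAPI/pydantic model, with Stripe product IDs attached server-side rather than trusted from the client.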

Hands-on Tasks

  • Design marketplace prototype in Figma.
  • Draft API specs for publishing and billing.

Deliverables

  • Clickable Figma prototype.
  • Technical spec for marketplace API.

Day 136-150: Observability & Guardrails

Topics Covered

  • Monitoring
    • OpenTelemetry, Grafana, incident response.
  • Scraper Governance
    • Rate limiting, captcha telemetry.
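
The standard building block for the rate-limiting guardrail is a token bucket, which permits short bursts while enforcing a steady average request rate per target domain. A self-contained sketch with an injectable clock for testability:

```python
# Token-bucket rate limiter sketch (one bucket per target domain).
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Denied requests are exactly the events worth exporting as OpenTelemetry metrics, so the Grafana dashboards can alert on throttling spikes.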

Hands-on Tasks

  • Configure Grafana dashboards with alerts.
  • Draft incident management documentation.

Deliverables

  • Alerting dashboard.
  • Incident run-book (PDF).

Day 151-165: Advanced Agents

Topics Covered

  • Planner-Executor Agents
    • LangChain + AutoGen for autonomous workflows.
  • Evaluation Metrics
    • Latency, cost, coverage tracking.
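
The course builds this with LangChain + AutoGen; stripped of frameworks, the planner-executor pattern reduces to a short loop. `plan()` is stubbed with a fixed decomposition where an LLM call would normally go, and the returned dict mirrors the latency and coverage metrics listed above (cost tracking would hang off the LLM calls):

```python
# Planner-executor sketch: plan a goal into steps, dispatch each to a tool.
import time
from typing import Callable


def plan(goal: str) -> list[str]:
    """Stub planner; a real agent would ask an LLM to decompose the goal."""
    return [f"search: {goal}", f"scrape: {goal}", f"summarize: {goal}"]


def execute(goal: str, tools: dict[str, Callable[[str], str]]) -> dict:
    started = time.monotonic()
    steps = plan(goal)
    results = []
    for step in steps:
        tool_name, _, arg = step.partition(": ")
        results.append(tools[tool_name](arg))   # dispatch step to its tool
    return {
        "results": results,
        "latency_s": time.monotonic() - started,
        "coverage": len(results) / len(steps),  # fraction of plan completed
    }
```

The "Research-a-Topic" demo is this loop with real tools behind `search`, `scrape`, and `summarize`, plus error handling that lets coverage drop below 1.0.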

Hands-on Tasks

  • Build "Research-a-Topic" demo agent.
  • Evaluate performance across multiple tasks.

Deliverables

  • Multi-step autonomous scraping demo.
  • Evaluation report.

Day 166-180: Public Contribution Pipeline

Topics Covered

  • OSS Readiness
    • Contributor guides, CLA bots, semantic-release.
  • Quality Gates
    • Linting, unit testing, end-to-end CI checks.
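
A minimal quality-gate workflow, assuming a `.github/workflows/ci.yml` layout; the tool choices (`ruff`, `pytest`) are illustrative stand-ins for the lint and unit-test gates:

```yaml
# .github/workflows/ci.yml -- sketch of a lint + test quality gate.
name: ci
on: [push, pull_request]
jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff pytest
      - run: ruff check .        # lint gate
      - run: pytest              # unit-test gate
```

End-to-end checks and semantic-release would be added as further jobs gated on this one.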

Hands-on Tasks

  • Write and test full contributor documentation.
  • Implement GitHub workflows for quality assurance.

Deliverables

  • Markdown Contributor Guide.
  • Fully integrated CI with tests and release validation.

Tech Stack

  • Languages:
    • Python 3.12, TypeScript 5
  • Scraping & Crawling:
    • requests, httpx, BeautifulSoup, lxml, aiohttp, Playwright, Puppeteer, Selenium
  • Orchestration:
    • Redis Streams, Celery, Prefect 3
  • Actor System:
    • Docker, FastAPI, Node.js
  • Storage:
    • MongoDB, PostgreSQL, Parquet, AWS S3
  • LLMs & Agents:
    • LangChain, AutoGen, GPT APIs
  • Monitoring & OSS:
    • OpenTelemetry, Grafana, GitHub Actions