
Agentic Web Scraper & Crawler Framework

This module walks through building a full-stack, scalable web crawling and scraping framework powered by LLMs, async pipelines, and containerized actors. It covers dynamic scraping, orchestration, observability, and publishing public actors with marketplace integration.

Day 1-15: Local Scraping Basics

Topics Covered

  • Python HTTP Clients
    • requests, httpx for static HTML retrieval.
  • HTML Parsing Libraries
    • BeautifulSoup, lxml for DOM traversal and tag targeting.
  • Data Validation
    • pydantic for defining schemas and contract validation.

Hands-on Tasks

  • Build three micro-scrapers targeting static websites.
  • Create reusable utility functions for DOM parsing.

Deliverables

  • Poetry-managed repo with published PyPI package.
  • GitHub repository containing three functional micro-scrapers.

Day 16-30: Browser Automation

Topics Covered

  • Selenium
    • Grid setup, headless mode, handling dynamic pages.
  • Playwright
    • Headless mode, Python and Node.js clients, cross-browser support.
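
A minimal sketch of draining an infinite-scroll page with Playwright's sync Python API (assumes `playwright` and a Chromium build are installed; the target URL and `.item` selector are hypothetical). The stop condition is factored into a pure helper so it can be tested without a browser:

```python
# Infinite-scroll scraping sketch with Playwright (sync API).
def reached_bottom(previous_height: int, current_height: int) -> bool:
    """Stop scrolling once the page height stops growing (pure, testable)."""
    return current_height <= previous_height


def scrape_infinite_scroll(url: str, max_rounds: int = 20) -> list[str]:
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        previous_height = 0
        for _ in range(max_rounds):
            page.mouse.wheel(0, 10_000)        # scroll down one screenful+
            page.wait_for_timeout(1_000)       # let lazy-loaded items arrive
            current_height = page.evaluate("document.body.scrollHeight")
            if reached_bottom(previous_height, current_height):
                break
            previous_height = current_height
        texts = page.locator(".item").all_inner_texts()  # hypothetical selector
        browser.close()
        return texts
```

The `max_rounds` cap guards against pages that grow indefinitely, a detail that matters in the Playwright vs Selenium benchmark.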

Hands-on Tasks

  • Implement a scraper for infinite-scroll websites.
  • Benchmark and analyze Playwright vs Selenium performance.

Deliverables

  • Blog post titled "Choosing Playwright vs Selenium".
  • Demo script showcasing scraping of dynamic content.

Day 31-45: JavaScript/TS & Puppeteer

Topics Covered

  • Puppeteer for SPA scraping
    • DOM instrumentation and JavaScript execution.
  • Anti-Bot Evasion
    • Stealth plugins, browser fingerprinting mitigation.

Hands-on Tasks

  • Create an NPM package encapsulating Puppeteer anti-bot techniques.
  • Write GitHub Actions workflows for automated testing.

Deliverables

  • NPM package publication.
  • CI testing pipeline in GitHub Actions.

Day 46-60: Async & Distributed Crawling

Topics Covered

  • Async Crawling
    • aiohttp, async queues, back-pressure techniques.
  • Distributed Architecture
    • Sitemap parsing, robots.txt handling, depth-first/breadth-first strategies.
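
The queue and back-pressure mechanics can be sketched with the standard library alone; `fetch()` is stubbed where an `aiohttp` GET would go, and the bounded `asyncio.Queue` makes the producer block whenever the workers fall behind:

```python
# Bounded-queue async crawler sketch: back-pressure via asyncio.Queue(maxsize=...).
import asyncio


async def fetch(url: str) -> str:
    """Stand-in for an aiohttp request; yields control like real I/O would."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


async def crawl(urls: list[str], concurrency: int = 8) -> dict[str, str]:
    queue: asyncio.Queue[str] = asyncio.Queue(maxsize=concurrency * 2)
    results: dict[str, str] = {}

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                results[url] = await fetch(url)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    for url in urls:
        await queue.put(url)        # blocks when the queue is full: back-pressure
    await queue.join()              # wait until every URL is processed
    for w in workers:
        w.cancel()
    return results
```

A production version would add robots.txt checks (`urllib.robotparser`), per-domain rate limits, and JSON-lines export to S3.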

Hands-on Tasks

  • Build a scalable crawler capable of crawling 10,000 URLs.
  • Export results as JSON lines to S3.

Deliverables

  • Draw.io architecture diagram.
  • "One-command" crawler demo for large-scale extraction.

Day 61-75: Storage & Schema Design

Topics Covered

  • Data Storage Models
    • MongoDB vs Parquet for structured data.
  • Content Deduplication
    • Fingerprinting for delta updates.
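
A minimal sketch of fingerprint-based deduplication: hash normalized page text and re-store a document only when its fingerprint changes between crawls. Function names are illustrative:

```python
# Content-fingerprinting sketch for delta updates.
import hashlib


def fingerprint(content: str) -> str:
    """Stable hash of normalized content; whitespace churn won't trigger updates."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_update(url: str, content: str, seen: dict[str, str]) -> bool:
    """True if the page is new or its content changed since the last crawl."""
    fp = fingerprint(content)
    if seen.get(url) == fp:
        return False
    seen[url] = fp
    return True
```

In practice the `seen` map would live in the crawl store (MongoDB or a Parquet sidecar), so delta runs only write changed documents.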

Hands-on Tasks

  • Define a Pydantic-based data schema.
  • Benchmark MongoDB against Parquet for crawl-storage performance.

Deliverables

  • Schema documentation.
  • Performance benchmark report.

Day 76-90: Apify-Style Actor Packaging

Topics Covered

  • Actor Containerization
    • Docker multi-arch builds, secrets mounting.
  • Interface Design
    • Input/output handling for actors.
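
One possible shape for the actor interface (the `ProdigalActor` name comes from the hands-on tasks below; everything else here is an assumption): input arrives as JSON via an environment variable, mirroring how config and secrets are mounted into the container, and each output record is emitted as a JSON line:

```python
# Sketch of a Python actor base class with JSON-in / JSON-lines-out contract.
import json
import os
import sys
from abc import ABC, abstractmethod
from typing import Any


class ProdigalActor(ABC):
    def __init__(self) -> None:
        # Input is mounted into the container as an env var (assumed name).
        self.input: dict[str, Any] = json.loads(os.environ.get("ACTOR_INPUT", "{}"))
        self._out = sys.stdout

    @abstractmethod
    def run(self) -> None:
        """Actor body: read self.input, call self.push() once per record."""

    def push(self, record: dict[str, Any]) -> None:
        """Emit one result as a JSON line; downstream tooling tails this stream."""
        self._out.write(json.dumps(record) + "\n")
```

The Reddit sub-crawler sample actor would subclass this, with `run()` doing the crawling and `push()` streaming posts out.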

Hands-on Tasks

  • Build a ProdigalActor base class in Python and Node.js.
  • Implement a Reddit sub-crawler as a sample actor.

Deliverables

  • Dockerized actor framework.
  • Reddit sub-crawler published to internal hub.

Day 91-105: Scheduler & Orchestration

Topics Covered

  • Queue Systems
    • Redis Streams, Celery for task queuing and retries.
  • Job Scheduling
    • Prefect 3 for orchestration, cron schedules, and ad-hoc jobs.
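
Queuing and cron triggers are delegated to Redis Streams/Celery and Prefect 3 in the real system, but the retry policy itself is simple enough to sketch: exponential backoff with a cap. Helper names are illustrative:

```python
# Retry-with-backoff sketch: the failure-recovery core of the scheduler.
import time


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Seconds to wait before each retry: base * 2**attempt, capped at `cap`."""
    return [min(base * (2 ** attempt), cap) for attempt in range(max_retries)]


def run_with_retries(task, max_retries: int = 5, sleep=time.sleep):
    """Run task(); on failure, wait per backoff_delays and retry, then re-raise."""
    last_error = None
    for delay in backoff_delays(max_retries):
        try:
            return task()
        except Exception as exc:        # a real scheduler would narrow this
            last_error = exc
            sleep(delay)
    raise last_error
```

Injecting `sleep` keeps the policy testable; Celery's retry options and Prefect's retry settings implement the same curve declaratively.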

Hands-on Tasks

  • Build a scheduling service with failure recovery.
  • Integrate retry and cron support.

Deliverables

  • Functional scheduler with auto-resume.
  • Demo of cron-triggered jobs.

Day 106-120: LLM-Augmented Scraping

Topics Covered

  • DOM Understanding with LLMs
    • GPT-assisted XPath generation; content extraction with trafilatura and Tavily.
  • Accuracy Evaluation
    • Regex vs LLM-based extraction.
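
A sketch of the evaluation harness for this comparison: any extractor, regex-based or LLM-backed, returns a set of fields that is scored against hand-labeled gold data. The date-extraction baseline below is deliberately simple and purely illustrative:

```python
# Accuracy-evaluation sketch: score an extractor against gold labels.
import re


def regex_extract_dates(text: str) -> set[str]:
    """Regex baseline: ISO-style dates only. An LLM extractor would return
    the same shape (a set of strings), so both plug into the same scorer."""
    return set(re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text))


def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Standard precision/recall over extracted fields."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Running this scorer over the 20-site zero-shot corpus is what turns the regex-vs-LLM question into numbers for the comparative report.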

Hands-on Tasks

  • Conduct zero-shot extraction from 20 news sites.
  • Compare LLMs against traditional methods.

Deliverables

  • Comparative research PDF.
  • Notebook demonstrating DOM extraction.

Day 121-135: Marketplace & Public Hub

Topics Covered

  • UX for Actors
    • Next.js dashboard, Stripe integration for billing.
  • Discoverability
    • Public-facing marketplace.
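
To seed the API spec drafted in the hands-on tasks, the publish endpoint's request payload might look like the following; every field name here is a placeholder, not a final contract:

```python
# Hypothetical payload schema for an actor-publishing endpoint.
from dataclasses import asdict, dataclass, field


@dataclass
class PublishActorRequest:
    actor_name: str
    version: str
    docker_image: str
    price_usd_per_run: float = 0.0      # 0.0 = free actor
    tags: list[str] = field(default_factory=list)

    def validate(self) -> None:
        if not self.actor_name:
            raise ValueError("actor_name is required")
        if self.price_usd_per_run < 0:
            raise ValueError("price must be non-negative")
```

In the actual service this would become a FastAPI/pydantic model, with Stripe product IDs attached server-side rather than trusted from the client.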

Hands-on Tasks

  • Design marketplace prototype in Figma.
  • Draft API specs for publishing and billing.

Deliverables

  • Clickable Figma prototype.
  • Technical spec for marketplace API.

Day 136-150: Observability & Guardrails

Topics Covered

  • Monitoring
    • OpenTelemetry, Grafana, incident response.
  • Scraper Governance
    • Rate limiting, captcha telemetry.
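
The standard building block for the rate-limiting guardrail is a token bucket, which permits short bursts while enforcing a steady average request rate per target domain. A self-contained sketch with an injectable clock for testability:

```python
# Token-bucket rate limiter sketch (one bucket per target domain).
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Denied requests are exactly the events worth exporting as OpenTelemetry metrics, so the Grafana dashboards can alert on throttling spikes.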

Hands-on Tasks

  • Configure Grafana dashboards with alerts.
  • Draft incident management documentation.

Deliverables

  • Alerting dashboard.
  • Incident run-book (PDF).

Day 151-165: Advanced Agents

Topics Covered

  • Planner-Executor Agents
    • LangChain + AutoGen for autonomous workflows.
  • Evaluation Metrics
    • Latency, cost, coverage tracking.
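
The course builds this with LangChain + AutoGen; stripped of frameworks, the planner-executor pattern reduces to a short loop. `plan()` is stubbed with a fixed decomposition where an LLM call would normally go, and the returned dict mirrors the latency and coverage metrics listed above (cost tracking would hang off the LLM calls):

```python
# Planner-executor sketch: plan a goal into steps, dispatch each to a tool.
import time
from typing import Callable


def plan(goal: str) -> list[str]:
    """Stub planner; a real agent would ask an LLM to decompose the goal."""
    return [f"search: {goal}", f"scrape: {goal}", f"summarize: {goal}"]


def execute(goal: str, tools: dict[str, Callable[[str], str]]) -> dict:
    started = time.monotonic()
    steps = plan(goal)
    results = []
    for step in steps:
        tool_name, _, arg = step.partition(": ")
        results.append(tools[tool_name](arg))   # dispatch step to its tool
    return {
        "results": results,
        "latency_s": time.monotonic() - started,
        "coverage": len(results) / len(steps),  # fraction of plan completed
    }
```

The "Research-a-Topic" demo is this loop with real tools behind `search`, `scrape`, and `summarize`, plus error handling that lets coverage drop below 1.0.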

Hands-on Tasks

  • Build "Research-a-Topic" demo agent.
  • Evaluate performance across multiple tasks.

Deliverables

  • Multi-step autonomous scraping demo.
  • Evaluation report.

Day 166-180: Public Contribution Pipeline

Topics Covered

  • OSS Readiness
    • Contributor guides, CLA bots, semantic-release.
  • Quality Gates
    • Linting, unit testing, end-to-end CI checks.
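
A minimal quality-gate workflow, assuming a `.github/workflows/ci.yml` layout; the tool choices (`ruff`, `pytest`) are illustrative stand-ins for the lint and unit-test gates:

```yaml
# .github/workflows/ci.yml -- sketch of a lint + test quality gate.
name: ci
on: [push, pull_request]
jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff pytest
      - run: ruff check .        # lint gate
      - run: pytest              # unit-test gate
```

End-to-end checks and semantic-release would be added as further jobs gated on this one.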

Hands-on Tasks

  • Write and test full contributor documentation.
  • Implement GitHub workflows for quality assurance.

Deliverables

  • Markdown Contributor Guide.
  • Fully integrated CI with tests and release validation.

Tech Stack

  • Languages:
    • Python 3.12, TypeScript 5
  • Scraping & Crawling:
    • requests, httpx, BeautifulSoup, lxml, aiohttp, Playwright, Puppeteer, Selenium
  • Orchestration:
    • Redis Streams, Celery, Prefect 3
  • Actor System:
    • Docker, FastAPI, Node.js
  • Storage:
    • MongoDB, PostgreSQL, Parquet, AWS S3
  • LLMs & Agents:
    • LangChain, AutoGen, GPT APIs
  • Monitoring & OSS:
    • OpenTelemetry, Grafana, GitHub Actions