Agentic Web Scraper & Crawler Framework
This module builds a full-stack, scalable web crawling and scraping framework powered by LLMs, async pipelines, and containerized actors. It includes dynamic scraping, orchestration, observability, and public actor publishing with marketplace integration.
Day 1-15: Local Scraping Basics
Topics Covered
- Python HTTP Clients
- requests, httpx for static HTML retrieval.
- HTML Parsing Libraries
- BeautifulSoup, lxml for DOM traversal and tag targeting.
- Data Validation
- pydantic for defining schemas and contract validation.
Hands-on Tasks
- Build three micro-scrapers targeting static websites.
- Create reusable utility functions for DOM parsing.
Deliverables
- Poetry-managed repo with published PyPI package.
- GitHub repository containing three functional micro-scrapers.
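A micro-scraper for a static page typically parses the HTML, targets specific tags, and validates each row against a schema. The sketch below is a minimal illustration under stated assumptions, not a reference solution: it uses only the standard library (html.parser standing in for BeautifulSoup or lxml, and a dataclass with manual checks standing in for pydantic) so it runs without extra dependencies. The Article schema, its fields, and SAMPLE_HTML are hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Article:
    """Hypothetical schema for one scraped item."""
    title: str
    url: str

    def __post_init__(self) -> None:
        # Contract check: reject empty titles and non-HTTP links.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"unexpected URL: {self.url!r}")


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_articles(html: str) -> list:
    """Parse anchors from static HTML, keeping only rows that pass validation."""
    collector = LinkCollector()
    collector.feed(html)
    articles = []
    for href, text in collector.links:
        try:
            articles.append(Article(title=text, url=href))
        except ValueError:
            continue  # skip rows that break the schema contract
    return articles


SAMPLE_HTML = """
<ul>
  <li><a href="https://example.com/a">First post</a></li>
  <li><a href="/relative">Relative link</a></li>
  <li><a href="https://example.com/b"> </a></li>
</ul>
"""
```

In a real micro-scraper the HTML string would come from an HTTP client such as requests or httpx, and the relative link would probably be resolved with urllib.parse.urljoin rather than dropped.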
Day 16-30: Browser Automation
Topics Covered
- Selenium
- Grid setup, headless mode, handling dynamic pages.
- Playwright
- Headless mode, Python and Node.js clients, cross-browser support.
Hands-on Tasks
- Implement a scraper for infinite-scroll websites.
- Benchmark and analyze Playwright vs Selenium performance.
Deliverables
- Blog post titled "Choosing Playwright vs Selenium".
- Demo script showcasing scraping of dynamic content.
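One common way to handle infinite-scroll pages is to keep scrolling until the page height stops growing. The sketch below isolates that loop from any particular browser driver so it can be run and tested without a browser; the function name, the patience parameter, and FakePage are all hypothetical.

```python
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_down: Callable[[], None],
    max_rounds: int = 50,
    patience: int = 2,
) -> int:
    """Scroll until the page height stops growing; return the number of scrolls.

    `patience` is how many consecutive unchanged heights count as "done".
    `max_rounds` caps the loop so a truly endless feed cannot hang the scraper.
    """
    last_height = get_height()
    unchanged = 0
    rounds = 0
    while rounds < max_rounds and unchanged < patience:
        scroll_down()
        rounds += 1
        height = get_height()
        if height == last_height:
            unchanged += 1
        else:
            unchanged = 0
            last_height = height
    return rounds


class FakePage:
    """Stand-in for a browser page: grows by 1000px for the first 3 scrolls."""

    def __init__(self) -> None:
        self.height = 1000
        self.scrolls = 0

    def get_height(self) -> int:
        return self.height

    def scroll_down(self) -> None:
        self.scrolls += 1
        if self.scrolls <= 3:
            self.height += 1000
```

With Playwright's sync API, get_height could plausibly wrap page.evaluate("document.body.scrollHeight") and scroll_down could wrap page.mouse.wheel(0, 10000) followed by a short wait. With Selenium, driver.execute_script could play the same role.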
Day 31-45: JavaScript/TS & Puppeteer
Topics Covered
- Puppeteer for SPA scraping
- DOM instrumentation and JavaScript execution.
- Anti-Bot Evasion
- Stealth plugins, browser fingerprinting mitigation.
Hands-on Tasks
- Create an npm package encapsulating Puppeteer anti-bot techniques.
- Write GitHub Actions workflows for automated testing.
Deliverables
- npm package publication.
- CI testing pipeline in GitHub Actions.
Day 46-60: Async & Distributed Crawling
Topics Covered
- Async Crawling
- aiohttp, async queues, back-pressure techniques.
- Distributed Architecture
- Sitemap parsing, robots.txt handling, depth-first and breadth-first strategies.
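As a rough illustration of robots.txt handling combined with a breadth-first strategy, the sketch below uses the standard library's urllib.robotparser on an in-memory robots.txt and walks a small in-memory link graph up to a depth limit. The ROBOTS_TXT contents, the LINKS graph, and the "my-crawler" user agent are made up for the example; a real crawler would discover links by fetching pages.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical link graph: page URL -> URLs found on that page.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/private/x",
    ],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
}


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def bfs_order(start, links, parser, user_agent, max_depth):
    """Return URLs in breadth-first order, skipping disallowed pages."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if not parser.can_fetch(user_agent, url):
            continue  # respect Disallow rules
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

Swapping queue.popleft() for queue.pop() would turn the same loop into a depth-first traversal, which is one way to compare the two strategies on the same frontier code.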
Hands-on Tasks
- Create a scalable crawler that can crawl 10,000 URLs.
- Export results as JSON lines to S3.
Deliverables
- Draw.io architecture diagram.
- "One-command" crawler demo for large-scale extraction.
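Back-pressure with async queues can be sketched with asyncio alone: a bounded queue makes the producer wait whenever workers fall behind. The example below is a simplified illustration, not the crawler this section asks for. fake_fetch is a hypothetical stand-in for an aiohttp request, and the S3 upload is left out; the JSON Lines string is what would be written to the bucket.

```python
import asyncio
import json


async def fake_fetch(url: str) -> dict:
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return {"url": url, "status": 200}


async def crawl(urls, fetch=fake_fetch, workers: int = 4, queue_size: int = 8):
    """Fetch all URLs with a fixed worker pool and a bounded queue."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    results = []

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                results.append(await fetch(url))
            except Exception as exc:
                # Record the failure so one bad URL does not kill the worker.
                results.append({"url": url, "error": repr(exc)})
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for url in urls:
        # put() waits while the queue is full: this is the back-pressure.
        await queue.put(url)
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


def to_json_lines(records) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)
```

A call such as asyncio.run(crawl(urls)) drives the whole pipeline. The queue size and worker count are the two knobs that trade memory for throughput.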
Day 61-75: Storage & Schema Design
Topics Covered
- Data Storage Models
- MongoDB vs Parquet for structured data.
- Content Deduplication
- Fingerprinting for delta updates.
Hands-on Tasks
- Define a Pydantic-based data schema.
- Benchmark and compare MongoDB and Parquet for crawl performance.
Deliverables
- Schema documentation.
- Performance benchmark report.
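Fingerprinting for delta updates can be approximated by hashing a normalized copy of each page and comparing it with the hash stored from the previous crawl. The sketch below assumes whitespace and case differences should not count as changes, which is a design choice rather than a requirement; the function names and sample data are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_crawl(previous: dict, pages: dict):
    """Compare a new crawl against stored fingerprints.

    previous: url -> fingerprint from the last crawl
    pages:    url -> page text from the current crawl
    Returns (new_urls, changed_urls, updated_fingerprints).
    """
    new_urls, changed_urls = [], []
    updated = dict(previous)
    for url, text in pages.items():
        digest = fingerprint(text)
        if url not in previous:
            new_urls.append(url)
        elif previous[url] != digest:
            changed_urls.append(url)
        updated[url] = digest
    return new_urls, changed_urls, updated
```

Only the new and changed URLs would need to be written to storage on each run, which is where the delta saving comes from.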
Day 76-90: Apify-Style Actor Packaging
Topics Covered
- Actor Containerization
- Docker multi-arch builds, secrets mounting.
- Interface Design
- Input/output handling for actors.
Hands-on Tasks
- Build ProdigalActor base class in Python and Node.js.
- Implement a Reddit sub-crawler as a sample actor.
Deliverables
- Dockerized actor framework.
- Reddit sub-crawler published to internal hub.
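The source names a ProdigalActor base class but does not define its interface, so the following is only one possible shape for the Python side: JSON in, JSON out, with input validation before the actor-specific logic runs. The method names, the required_input attribute, and EchoActor are assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class ProdigalActor(ABC):
    """Hypothetical base class: subclasses implement run()."""

    # Keys that must be present in the input payload (assumed convention).
    required_input: tuple = ()

    def validate_input(self, payload: dict) -> None:
        missing = [key for key in self.required_input if key not in payload]
        if missing:
            raise ValueError(f"missing input fields: {missing}")

    @abstractmethod
    def run(self, payload: dict) -> list:
        """Do the actual work and return a list of result items."""

    def execute(self, raw_input: str) -> str:
        """Entry point a container could call: JSON string in, JSON string out."""
        payload: Any = json.loads(raw_input)
        self.validate_input(payload)
        items = self.run(payload)
        return json.dumps({"items": items, "count": len(items)})


class EchoActor(ProdigalActor):
    """Toy actor that echoes back the URLs it was given."""

    required_input = ("urls",)

    def run(self, payload: dict) -> list:
        return [{"url": url} for url in payload["urls"]]
```

Inside a Docker image, the entrypoint might read the input JSON from a file or environment variable and print the result of execute(), keeping the container contract independent of what each actor scrapes.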
Day 91-105: Scheduler & Orchestration
Topics Covered
- Queue Systems
- Redis Streams, Celery for task queuing and retries.
- Job Scheduling
- Prefect 3 for orchestration, CRON, and ad-hoc jobs.
Hands-on Tasks
- Build a scheduling service with failure recovery.
- Integrate retry and CRON support.
Deliverables
- Functional scheduler with auto-resume.
- Demo of CRON-triggered jobs.
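Retry support with failure recovery often comes down to re-running a job with exponentially growing delays. The sketch below shows that idea with the standard library only; it is not how Celery or Prefect implement retries, and the function name, parameters, and injected sleep hook are assumptions made so the example can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run job(), retrying on any exception with exponential backoff.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, and so on.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

A scheduler could wrap each queued task in a call like this and persist the attempt count, so that a restarted service resumes from the stored state instead of starting over.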
Day 106-120: LLM-Augmented Scraping
Topics Covered
- DOM Understanding with LLMs
- XPath generation via GPT; content extraction with trafilatura; web search with Tavily.
- Accuracy Evaluation
- Regex vs LLM-based extraction.
Hands-on Tasks
- Conduct zero-shot extraction from 20 news sites.
- Compare LLMs against traditional methods.
Deliverables
- Comparative research PDF.
- Notebook demonstrating DOM extraction.
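Comparing regex and LLM-based extraction needs a shared scoring function that treats every extractor as a plain callable. The sketch below scores a regex date extractor against a tiny labeled set; the samples and labels are invented for illustration, and no LLM call is included. An LLM-backed extractor with the same signature could be passed to the same accuracy function.

```python
import re
from typing import Callable, Optional

# Matches ISO-style dates such as 2024-05-01.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def regex_extract_date(text: str) -> Optional[str]:
    """Return the first ISO date in the text, or None if there is none."""
    match = DATE_RE.search(text)
    return match.group(1) if match else None


def accuracy(extractor: Callable[[str], Optional[str]], samples) -> float:
    """Fraction of samples where the extractor output equals the label."""
    correct = sum(1 for text, expected in samples if extractor(text) == expected)
    return correct / len(samples)


# Hypothetical labeled samples: (input text, expected ISO date or None).
SAMPLES = [
    ("Published 2024-05-01 by staff", "2024-05-01"),
    ("Updated on 2023-12-31.", "2023-12-31"),
    ("Posted May 3, 2024", "2024-05-03"),
    ("No date here", None),
]
```

The third sample shows the kind of case where a fixed pattern misses and a model might not, which is the trade-off the comparison is meant to quantify alongside cost and latency.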
Day 121-135: Marketplace & Public Hub
Topics Covered
- UX for Actors
- Next.js dashboard, Stripe integration for billing.
- Discoverability
- Public-facing marketplace.
Hands-on Tasks
- Design marketplace prototype in Figma.
- Draft API specs for publishing and billing.
Deliverables
- Clickable Figma prototype.
- Technical spec for marketplace API.
Day 136-150: Observability & Guardrails
Topics Covered
- Monitoring
- OpenTelemetry, Grafana, incident response.
- Scraper Governance
- Rate limiting, CAPTCHA telemetry.
Hands-on Tasks
- Configure Grafana dashboards with alerts.
- Draft incident management documentation.
Deliverables
- Alerting dashboard.
- Incident run-book (PDF).
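Rate limiting for scrapers is frequently implemented as a token bucket: requests spend tokens, and tokens refill at a fixed rate up to a burst capacity. The sketch below is one minimal version under that assumption. The class name and the injectable clock are choices made for the example so the behavior can be checked without real waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return whether the call may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Counting how often try_acquire returns False per target domain gives a simple throttling metric that could be exported alongside other telemetry.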
Day 151-165: Advanced Agents
Topics Covered
- Planner-Executor Agents
- LangChain + AutoGen for autonomous workflows.
- Evaluation Metrics
- Latency, cost, coverage tracking.
Hands-on Tasks
- Build "Research-a-Topic" demo agent.
- Evaluate performance across multiple tasks.
Deliverables
- Multi-step autonomous scraping demo.
- Evaluation report.
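The planner-executor pattern and the latency, cost, and coverage metrics can be illustrated without any agent framework. The sketch below does not use LangChain or AutoGen: the planner and executor are plain callables, and RunMetrics, run_agent, and the per-step cost figure are hypothetical. Coverage is defined here as the share of planned steps that finished without an error, which is one possible definition among several.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RunMetrics:
    steps_planned: int = 0
    steps_succeeded: int = 0
    latency_s: float = 0.0
    cost_usd: float = 0.0

    @property
    def coverage(self) -> float:
        """Share of planned steps that completed successfully."""
        if self.steps_planned == 0:
            return 0.0
        return self.steps_succeeded / self.steps_planned


def run_agent(
    goal: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str], Tuple[str, float]],
    clock: Callable[[], float] = time.perf_counter,
):
    """Plan once, then execute each step, tracking latency, cost, and coverage.

    The executor returns (output, cost_in_usd) for a step and may raise.
    """
    metrics = RunMetrics()
    results = []
    start = clock()
    steps = planner(goal)
    metrics.steps_planned = len(steps)
    for step in steps:
        try:
            output, cost = executor(step)
        except Exception:
            continue  # a failed step lowers coverage but does not stop the run
        results.append(output)
        metrics.steps_succeeded += 1
        metrics.cost_usd += cost
    metrics.latency_s = clock() - start
    return results, metrics
```

In a framework-based agent the planner would typically be an LLM call that produces the step list, and the executor would dispatch each step to a tool such as a search API or a scraper.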
Day 166-180: Public Contribution Pipeline
Topics Covered
- OSS Readiness
- Contributor guides, CLA bots, semantic-release.
- Quality Gates
- Linting, unit testing, end-to-end CI checks.
Hands-on Tasks
- Write and test full contributor documentation.
- Implement GitHub workflows for quality assurance.
Deliverables
- Markdown Contributor Guide.
- Fully integrated CI with tests and release validation.
Tech Stack
- Languages:
- Python 3.12, TypeScript 5
- Scraping & Crawling:
- requests, httpx, BeautifulSoup, lxml, aiohttp, Playwright, Puppeteer, Selenium
- Orchestration:
- Redis Streams, Celery, Prefect 3
- Actor System:
- Docker, FastAPI, Node.js
- Storage:
- MongoDB, PostgreSQL, Parquet, AWS S3
- LLMs & Agents:
- LangChain, AutoGen, GPT APIs
- Monitoring & OSS:
- OpenTelemetry, Grafana, GitHub Actions