ScrapeGraphAI: Scrape Any Website with Natural Language

Plus analyze GitHub repos with LangChain loaders

Jan 26, 2026

Grab your coffee. Here are this week’s highlights.

🤝 COLLABORATION

What Data Engineers Really Think About Airflow (5.8K Surveyed)

Astronomer analyzed 5.8K+ survey responses from data engineers on how they are using Airflow today, and the findings might surprise you.

You’ll learn:

How early adopters are using Airflow 3 features in production
Which teams are bringing AI into production, and what’s holding others back
Why 94% of respondents believe Airflow is beneficial to their career

🔗 Download the State of Airflow 2026 Report

📅 Today’s Picks

ScrapeGraphAI: Scrape Any Website with Natural Language

Problem

Traditional scraping with BeautifulSoup follows a familiar pattern: fetch HTML, inspect elements in DevTools, and write CSS selectors to extract your data.

But websites don’t stay static. When the HTML structure changes, your selectors break, and you’re back to rewriting code.
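To make the failure mode concrete, here is a minimal sketch of a selector breaking after a redesign. It uses the standard library's html.parser as a stand-in for BeautifulSoup so it runs without extra installs; the HTML snippets, the "price" class, and the extract_prices helper are all hypothetical.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect text inside <span class="price">, similar to the selector span.price."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Match only the exact tag and class the scraper was written against.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices


# Hypothetical markup before and after a site redesign.
old_html = '<div><span class="price">$19.99</span></div>'
new_html = '<div><p class="product-cost">$19.99</p></div>'

print(extract_prices(old_html))  # ['$19.99']
print(extract_prices(new_html))  # []
```

The price is still on the page after the redesign, but the hard-coded tag and class no longer match, so the scraper silently returns nothing.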

Solution

ScrapeGraphAI uses LLMs to extract data based on natural language descriptions. Describe what you want in plain English, and the LLM works out the extraction logic for you.

Key features:

Self-healing scrapers that adapt when websites are redesigned
Type-safe output with Pydantic schema validation
Built-in JavaScript rendering for React, Vue, and Angular sites
Multi-page scraping with SearchGraph for research tasks
Cloud or local models via OpenAI, Anthropic, or Ollama

Plus, ScrapeGraphAI is open source! Install it with “pip install scrapegraphai”.
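As a rough sketch of what this can look like in code, the example below pairs a plain-English prompt with a Pydantic schema using ScrapeGraphAI's SmartScraperGraph. The product fields, the prompt, the model name, and the helper functions build_graph_config and scrape_products are assumptions for illustration, not taken from the article. Third-party imports sit inside the function so the file loads even when the packages are not installed.

```python
def build_graph_config(api_key, model="openai/gpt-4o-mini"):
    """Return a config dict for a ScrapeGraphAI graph (model name is an assumption)."""
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": False,
        "headless": True,  # run the browser without a visible window
    }


def scrape_products(url, config):
    """Extract product names and prices from a page using a natural language prompt."""
    # Imported lazily: requires `pip install scrapegraphai` (which pulls in pydantic).
    from typing import List

    from pydantic import BaseModel
    from scrapegraphai.graphs import SmartScraperGraph

    # Hypothetical schema describing the shape of the data we want back.
    class Product(BaseModel):
        name: str
        price: str

    class Products(BaseModel):
        products: List[Product]

    graph = SmartScraperGraph(
        prompt="Extract the name and price of every product on the page",
        source=url,
        config=config,
        schema=Products,
    )
    return graph.run()
```

Calling scrape_products("https://example.com/shop", build_graph_config(api_key)) with a valid key and a real product page should return data shaped like the schema. Note that no selector appears anywhere; the prompt and schema carry the intent.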

📖 View Full Article | 🧪 Run code | ⭐ View GitHub

🔄 Worth Revisiting

Analyze GitHub Repositories with LangChain Document Loaders

Problem

Tired of digging through hundreds of GitHub issues with keyword search to find what you need?

Solution

With LangChain’s GitHubIssuesLoader, you can load repository issues into a vector store and query them with natural language instead of exact keywords.

You can ask questions like “What feature requests are related to video?” and get instant, relevant answers from your issue history.
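A minimal sketch of that workflow might look like the following. GitHubIssuesLoader comes from langchain_community; the choice of an in-memory vector store with OpenAI embeddings, and the helper names loader_options, build_issue_index, and ask, are assumptions for illustration rather than the article's exact setup.

```python
def loader_options(repo, access_token, state="all"):
    """Build keyword arguments for GitHubIssuesLoader (issues only, no pull requests)."""
    return {
        "repo": repo,  # "owner/name" form
        "access_token": access_token,
        "include_prs": False,
        "state": state,
    }


def build_issue_index(options):
    """Load issues and embed them into an in-memory vector store."""
    # Imported lazily: requires langchain-community, langchain-core, and langchain-openai.
    from langchain_community.document_loaders import GitHubIssuesLoader
    from langchain_core.vectorstores import InMemoryVectorStore
    from langchain_openai import OpenAIEmbeddings

    docs = GitHubIssuesLoader(**options).load()
    return InMemoryVectorStore.from_documents(docs, OpenAIEmbeddings())


def ask(index, question, k=3):
    """Return (title, url) pairs for the k issues most similar to the question."""
    hits = index.similarity_search(question, k=k)
    return [(doc.metadata.get("title"), doc.metadata.get("url")) for doc in hits]
```

With a GitHub token and an OpenAI key in place, ask(index, "What feature requests are related to video?") would return the closest matching issues by meaning rather than by exact keyword. Very long issue bodies may need to be split into chunks before embedding.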

📖 View Full Article | 🧪 Run code | ⭐ View GitHub

☕️ Weekly Finds

hf-mem [ML] - CLI to estimate inference memory requirements for Hugging Face models before downloading

fake2db [Testing] - Create custom test databases populated with fake data for SQLite, MySQL, PostgreSQL, and MongoDB

MiraTTS [LLM] - High-quality text-to-speech model fine-tuned from Spark-TTS with enhanced realism and stability

📚 Latest Deep Dives

From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI - Web scraping without selector maintenance. ScrapeGraphAI uses LLMs to extract data from any site using plain English prompts and Pydantic schemas.

Before You Go

🔍 Explore More on CodeCut

Tool Selector - Discover 70+ Python tools for AI and data science
Production Ready Data Science - A practical book for taking projects from prototype to production

💬 Rate Your Experience

How would you rate your newsletter experience? Share your feedback →

Rainbow Roxy

Feb 2

Regarding ScrapeGraphAI, this looks really promising, especially the self-healing scrapers. Given how tricky website structures can be, it almost sounds too good to be true. I'm curious how reliably the LLM's 'adaptation' mechanism truly handles major redesigns without significant oversight or constant reprompting.

