Website Content Crawler
Crawl one website and extract clean page content, then get a readable report plus raw JSON through UI, HTTP, or MCP.
Created by Chris Moen • Version 2 • 9 steps
Integrations
- apify
How it works
- Normalize website crawl input
- Validate website crawl input
- Start website content crawl
- Fail run when Apify fails
- Fetch crawl dataset items
- Summarize crawl results
- Persist full crawl results JSON
- Build buyer-facing crawl report
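The "Fail run when Apify fails" step can be pictured as a small gate between starting the crawl and fetching dataset items. The sketch below is an illustration only: it assumes the run is available as a dictionary with a `status` field, and it treats anything other than Apify's `SUCCEEDED` status as a failure. The names `CrawlFailed` and `ensure_run_succeeded` are hypothetical and are not taken from the app.

```python
class CrawlFailed(RuntimeError):
    """Raised when the upstream crawl did not finish successfully (hypothetical name)."""


def ensure_run_succeeded(run: dict) -> dict:
    """Return the run unchanged if it succeeded, otherwise stop the workflow.

    Assumption: the run object carries a "status" string, and only
    "SUCCEEDED" means the dataset is safe to read.
    """
    status = run.get("status")
    if status != "SUCCEEDED":
        raise CrawlFailed(f"crawl ended with status {status!r}")
    return run
```

Failing early like this keeps later steps, such as summarizing and persisting results, from running on a partial or empty dataset.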
Website Content Crawler
Crawl one website and extract clean page content, then get a readable Breyta report you can use in the UI or call from HTTP and MCP.
Who this is for
This app is for operators, researchers, founders, and AI builders who need a quick way to turn a site into structured crawl output for review, extraction, or downstream RAG work.
What happens after install
After install, you can run the crawler from:
- the Breyta run form
- the HTTP endpoint
- the MCP tool
Each run crawls one starting URL, follows pages under that site, and returns:
- a readable markdown summary
- a downloadable JSON file with the crawled rows
What you need to enter
- Start URL: the site or section you want to crawl, such as https://docs.example.com
- Max pages: optional crawl cap, default 25, maximum 25
No extra buyer API setup is required for the base experience on this app.
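As a rough illustration of how these two inputs might be normalized and validated before a crawl starts, here is a minimal sketch. The field names `start_url` and `max_pages` and both function names are assumptions made for this example; only the default and maximum of 25 come from the description above.

```python
from urllib.parse import urlparse

DEFAULT_MAX_PAGES = 25  # default stated in the input description
MAX_PAGES_CAP = 25      # maximum stated in the input description


def normalize_crawl_input(raw: dict) -> dict:
    """Trim the start URL and fill in the default page cap when it is missing."""
    start_url = str(raw.get("start_url", "")).strip()
    max_pages = raw.get("max_pages")
    if max_pages in (None, ""):
        max_pages = DEFAULT_MAX_PAGES
    return {"start_url": start_url, "max_pages": int(max_pages)}


def validate_crawl_input(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the input is usable."""
    errors = []
    parsed = urlparse(data["start_url"])
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("start_url must be an absolute http(s) URL")
    if not 1 <= data["max_pages"] <= MAX_PAGES_CAP:
        errors.append(f"max_pages must be between 1 and {MAX_PAGES_CAP}")
    return errors
```

For example, `normalize_crawl_input({"start_url": " https://docs.example.com "})` fills in the default cap, and `validate_crawl_input` then returns an empty list for it.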
What you get back
For each crawled page, Breyta returns a compact row with the following fields, when available:
- page title
- page URL
- crawl depth
- content type
- extracted content preview
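To make the row shape concrete, the sketch below models one crawled page and builds a short markdown list from a set of rows. The class name, the field names, and the summary layout are assumptions for illustration; the actual JSON keys and report format may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRow:
    """One crawled page. Field names are hypothetical, not the app's JSON keys."""
    title: Optional[str]
    url: str
    depth: Optional[int]
    content_type: Optional[str]
    preview: Optional[str]


def summarize_rows(rows: list[CrawlRow]) -> str:
    """Build a small markdown summary: a page count plus one link per page."""
    lines = [f"Pages crawled: {len(rows)}", ""]
    for row in rows:
        title = row.title or "(untitled)"  # fields may be missing for some pages
        lines.append(f"- [{title}]({row.url})")
    return "\n".join(lines)
```

Marking every field except the URL as optional reflects the "when available" wording: a page can be crawled without yielding a title or a content preview.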
Good uses
- doc set capture for AI workflows
- quick knowledge-base exports
- blog or help-center snapshots
- lightweight site audits before deeper processing
Limits and expectations
- Results depend on what the site exposes and what the upstream crawler can access at run time.
- Very large or heavily dynamic sites may need narrower start URLs or smaller test runs first.
- The page cap (maximum 25) limits how many pages are crawled per request.
Failure cases
You may see incomplete or failed runs if:
- the target site blocks or rate-limits the crawler
- the start URL is invalid or inaccessible
- the site relies on unsupported runtime behavior
- an upstream network issue interrupts the crawl
When that happens, first retry with a narrower section URL, or with the same URL and a smaller page cap.
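One way to apply the smaller-page-cap advice from a script is to halve the cap after each failed attempt. This is a sketch under assumptions: `run_crawl` is a hypothetical callable that you supply (for example, a wrapper around the HTTP endpoint) and that returns a list of rows on success or `None` on failure. The halving schedule is a choice made for this example, not documented app behavior.

```python
from typing import Callable, Optional


def retry_with_smaller_cap(
    run_crawl: Callable[[str, int], Optional[list]],
    start_url: str,
    max_pages: int = 25,
    min_pages: int = 1,
) -> Optional[tuple[int, list]]:
    """Try the crawl, halving the page cap after each failure.

    Returns (cap_used, rows) for the first successful attempt,
    or None if every attempt down to min_pages failed.
    """
    cap = max_pages
    while cap >= min_pages:
        rows = run_crawl(start_url, cap)
        if rows is not None:
            return cap, rows
        if cap == min_pages:
            break  # already at the smallest cap, give up
        cap = max(min_pages, cap // 2)
    return None
```

If every cap fails, the more likely fix is the other option above: point the start URL at a narrower section of the site.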
FAQ
What does the Website Content Crawler app do?
The Website Content Crawler scans a single website to extract clean, readable text. It removes unnecessary clutter and provides you with a summary report and raw JSON data for your records or further analysis.
How does the workflow extract data from a website?
This app connects to Apify to handle the crawling process reliably. It validates your input, runs the extraction, and then formats the results into a structured summary and a full dataset.
How can I use the extracted data in other tools?
You can access the extracted content directly through the Breyta UI, via a standard HTTP API, or as a Model Context Protocol (MCP) tool. This makes it easy to feed clean website data into your own applications or external AI models.
What do I need to set up to start crawling?
Since the app uses Apify for the heavy lifting, you'll need to provide your API credentials during the installation process. Once connected, you can configure the specific website URL you want to crawl and start the workflow immediately.