Building a structured dataset from the web is still a pipeline problem. You have to find a source, write or configure a scraper, design a schema, handle deduplication, schedule refreshes, and fix breakage when sites change. This process stays the same whether you do it once or a hundred times.
Bigset targets this workflow directly. It is an open‑source multi‑agent system released under AGPL‑3.0. You give it a natural‑language description of the data you need, and it returns a ready‑to‑export CSV or XLSX file built from live web data. No URLs, no selectors, no manual schema work.
The system works in two tiers. First, a schema‑inference agent reads your sentence and decides the column names, data types, and primary key before any web request is made. Second, an orchestrator agent uses TinyFish Search to discover which entities match your description. It then fans out sub‑agents, one per entity, each limited to six tool calls. Each sub‑agent fetches real pages, extracts the needed fields, and inserts a row. Primary‑key deduplication and source attribution happen automatically, and the dataset ID is kept in a JavaScript closure so the LLM cannot be tricked into writing to another dataset.
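To make the closure idea concrete, here is a minimal TypeScript sketch of how an insert tool could capture its dataset, deduplicate on the primary key, and record the source URL. Every name here (`Schema`, `Dataset`, `makeInsertTool`, `insertRow`) is hypothetical and not taken from Bigset's codebase, and the six‑call budget is simplified to count only insert calls rather than all tool calls a sub‑agent makes.

```typescript
// Hypothetical shapes for illustration only; not Bigset's actual types.
type Cell = string | number | null;
type Row = Record<string, Cell>;

interface Schema {
  columns: { name: string; type: "string" | "number" }[];
  primaryKey: string; // chosen by the schema-inference step before any fetch
}

interface Dataset {
  id: string;
  schema: Schema;
  rows: Map<string, Row & { source: string }>; // keyed by normalized primary key
}

interface InsertResult {
  inserted: boolean;
  reason?: string;
}

// The dataset is captured by the closure. The returned tool has no dataset ID
// parameter, so nothing the model passes in can redirect the write.
function makeInsertTool(dataset: Dataset, maxCalls = 6) {
  let callsUsed = 0; // simplified budget: counts insert calls only

  return function insertRow(row: Row, source: string): InsertResult {
    if (callsUsed >= maxCalls) {
      return { inserted: false, reason: "tool-call budget exhausted" };
    }
    callsUsed += 1;

    // Normalize the key so "Acme" and "acme " count as the same entity.
    const key = String(row[dataset.schema.primaryKey] ?? "").trim().toLowerCase();
    if (key === "") return { inserted: false, reason: "missing primary key" };
    if (dataset.rows.has(key)) return { inserted: false, reason: "duplicate primary key" };

    // Source attribution: store the URL the row was extracted from.
    dataset.rows.set(key, { ...row, source });
    return { inserted: true };
  };
}

// Usage: the second insert is rejected as a duplicate of the first.
const demo: Dataset = {
  id: "ds_demo",
  schema: {
    columns: [
      { name: "company", type: "string" },
      { name: "employees", type: "number" },
    ],
    primaryKey: "company",
  },
  rows: new Map(),
};
const insertRow = makeInsertTool(demo);
const first = insertRow({ company: "Acme", employees: 120 }, "https://example.com/acme");
const second = insertRow({ company: "acme ", employees: 125 }, "https://example.com/acme-2");
console.log(first.inserted, second.reason, demo.rows.size);
```

Because the tool signature is only `(row, source)`, a prompt‑injected instruction on a fetched page has no argument through which to name a different dataset.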
You can schedule refreshes from every 30 minutes to weekly, keeping the table current without manual reruns. Generation takes two to five minutes because the agents perform real web research. The full codebase is self‑hostable with Docker; you only need API keys for TinyFish, OpenRouter, and Clerk. Exports are CSV or XLSX today, with SQL query support and an agent‑native API planned.
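As a rough illustration of the refresh window, the sketch below computes the next run time and rejects intervals outside the 30‑minute‑to‑weekly range mentioned above. The function and constant names are assumptions made for this example; Bigset's actual scheduler may be configured quite differently.

```typescript
// Bounds taken from the text: 30 minutes minimum, one week maximum.
const MIN_INTERVAL_MS = 30 * 60 * 1000;
const MAX_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: returns when the next refresh is due,
// or throws if the requested interval is outside the supported range.
function nextRefreshAt(lastRun: Date, intervalMs: number): Date {
  if (
    !Number.isFinite(intervalMs) ||
    intervalMs < MIN_INTERVAL_MS ||
    intervalMs > MAX_INTERVAL_MS
  ) {
    throw new RangeError("refresh interval must be between 30 minutes and 7 days");
  }
  return new Date(lastRun.getTime() + intervalMs);
}

// Usage: a dataset last refreshed now, on a six-hour schedule.
const due = nextRefreshAt(new Date(), 6 * 60 * 60 * 1000);
console.log(due.toISOString());
```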
Bigset turns a vague data request into a structured, up‑to‑date table without hand‑written scrapers or manual schema design.
#AI #Product #DataEngineering #Automation #OpenSource #WebScraping