Fixing AgentTrove Trace Loading: Build a ShareGPT-Ready Dataset in Python

When working with large agent trace datasets, the main hurdles are extracting only the successful examples, keeping the process memory-efficient, and producing a clean JSONL file ready for supervised fine-tuning. A streaming approach addresses the first two: the dataset is read lazily and capped with itertools.islice, so only a bounded number of rows is ever scanned and no row needs to stay in memory after it has been processed. For each row we decide success in two steps. First we check the normalized result field for keywords such as “resolved”, “success”, “pass”, “passed” or “correct”. If none matches, we cast the reward value to a float and treat any value ≥ 1.0 as a win; a conversion error counts as a failure (False). This avoids crashes caused by missing or malformed reward entries.
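The two-step success test can be sketched roughly as follows. This is a minimal illustration, not the original script: the field names `result` and `reward` and the helper name `is_success` are assumptions, and "normalized" is taken here to mean stripped and lowercased with an exact keyword match.

```python
# Keywords taken from the description above.
SUCCESS_KEYWORDS = {"resolved", "success", "pass", "passed", "correct"}


def is_success(row):
    """Return True if a trace row looks successful.

    Assumes `row` is a dict-like object with optional "result" and
    "reward" fields (hypothetical names for this sketch).
    """
    # Step 1: normalize the result field and compare against the keywords.
    result = str(row.get("result") or "").strip().lower()
    if result in SUCCESS_KEYWORDS:
        return True

    # Step 2: fall back to the reward; missing or malformed values
    # (None, "n/a", ...) count as a failure instead of raising.
    try:
        return float(row.get("reward")) >= 1.0
    except (TypeError, ValueError):
        return False
```

An exact match is used rather than a substring test so that a result such as "unsuccessful" is not mistaken for "success"; a looser match may be appropriate if the result field contains free text.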

After a row passes the success test, we normalize its turn list, discard empty turns, and require at least two conversational exchanges before writing the entry. Each accepted trace is serialized as a single-line JSON object containing the conversations together with the original source and teacher fields, then appended to the output file. Two counters track how many rows have been scanned and how many have been kept, so the script stops automatically once the desired number of successful traces (e.g., 200) is reached.
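A rough sketch of the normalize-and-write loop is shown below. Several details are assumptions made for illustration: the function names, the `keep_row` predicate (any callable implementing the success test), the input keys (`conversations`, with turns using either `from`/`value` or `role`/`content`), the `max_scan` cap, and the reading of "at least two exchanges" as at least two non-empty turns.

```python
import json
from itertools import islice


def normalize_turns(turns):
    """Map turns to ShareGPT-style {"from", "value"} dicts, dropping empty ones."""
    cleaned = []
    for turn in turns or []:
        if not isinstance(turn, dict):
            continue
        speaker = turn.get("from") or turn.get("role") or ""
        text = str(turn.get("value") or turn.get("content") or "").strip()
        if text:  # discard empty turns
            cleaned.append({"from": str(speaker), "value": text})
    return cleaned


def export_traces(rows, out_path, keep_row, target=200, max_scan=5000, min_turns=2):
    """Append up to `target` accepted traces to a JSONL file.

    `rows` can be any iterable of dicts, such as a dataset opened in
    streaming mode. Returns the (scanned, kept) counters.
    """
    scanned = kept = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        for row in islice(rows, max_scan):  # lazy, bounded scan
            scanned += 1
            if not keep_row(row):
                continue
            turns = normalize_turns(row.get("conversations"))
            if len(turns) < min_turns:
                continue
            record = {
                "conversations": turns,
                "source": row.get("source"),
                "teacher": row.get("teacher"),
            }
            # One JSON object per line.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
            if kept >= target:  # stop once enough traces are kept
                break
    return scanned, kept
```

Because the file is opened in append mode, rerunning the export against the same path adds to the existing file; delete or rename it first if a fresh dataset is wanted.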

To explore the filtered set, we provide a lightweight search function that streams the same slice, optionally filters by source or keyword, and prints a truncated view of each matching trace. If no matches appear, it advises the user to increase the scan window. This pattern lets analysts iterate quickly on domain-specific subsets (e.g., nl2bash, swesmith, codeforces) without loading the full corpus, which makes it well suited to preparing domain-specific SFT mixes or to exploratory analysis before launching a full fine-tuning run with tools such as Axolotl or LLaMA-Factory.
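One way such a search helper might look is sketched here. The function name, its parameters (`max_scan`, `max_hits`, `preview_chars`) and the row layout (`source` plus ShareGPT-style `conversations`) are all assumptions for illustration.

```python
from itertools import islice


def search_traces(rows, source=None, keyword=None,
                  max_scan=2000, max_hits=5, preview_chars=200):
    """Print a truncated preview of matching traces and return the hit count.

    `rows` is any iterable of dicts; only the first `max_scan` rows
    are inspected, so the full corpus is never loaded.
    """
    hits = 0
    for row in islice(rows, max_scan):
        # Optional exact match on the source field.
        if source and row.get("source") != source:
            continue
        # Join all turn texts so the keyword can match anywhere in the trace.
        text = " ".join(
            str(turn.get("value", ""))
            for turn in row.get("conversations") or []
            if isinstance(turn, dict)
        )
        # Optional case-insensitive keyword match.
        if keyword and keyword.lower() not in text.lower():
            continue
        hits += 1
        print(f"[{row.get('source')}] {text[:preview_chars]}")
        if hits >= max_hits:
            break
    if hits == 0:
        print(f"No matches in the first {max_scan} rows; try a larger max_scan.")
    return hits
```

A call such as `search_traces(rows, source="nl2bash", keyword="grep")` would then preview up to five nl2bash traces that mention grep within the scan window.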

#AI #Product #MachineLearning #DataScience #LLM #FineTuning