A dev on GitHub told me my AI model's training data was a mess

I built a text generator and posted the code. Someone opened an issue saying my training data had tons of duplicate entries and junk text, which was making the output weird. They pointed to a specific JSON file where over 30% of the entries were repeats. I spent the next week cleaning it up: removing the duplicates and filtering out low-quality samples. The model's coherence improved noticeably after just one retraining cycle. Has anyone else had a simple data hygiene fix make a huge difference in output quality?
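
For anyone who wants to try the same kind of cleanup, here is a rough sketch of the two steps (dropping repeats and filtering junk). It is not the exact script from my repo. It assumes each JSON entry is an object with a "text" field, and the function names and thresholds (min_chars, min_alpha_ratio) are made up for illustration, so tune them for your own data.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def duplicate_rate(entries):
    """Fraction of entries that repeat an earlier one after normalization."""
    keys = [normalize(e.get("text", "")) for e in entries]
    return 1 - len(set(keys)) / len(keys) if keys else 0.0


def clean_entries(entries, min_chars=20, min_alpha_ratio=0.6):
    """Drop repeated entries and obvious junk.

    Assumption: each entry is a dict with a "text" field.
    The thresholds are illustrative defaults, not tuned values.
    """
    seen = set()
    kept = []
    for entry in entries:
        key = normalize(entry.get("text", ""))
        if len(key) < min_chars:
            continue  # too short to be a useful sample
        readable = sum(ch.isalpha() or ch.isspace() for ch in key)
        if readable / len(key) < min_alpha_ratio:
            continue  # mostly digits or symbols, likely junk
        if key in seen:
            continue  # repeat of an entry that was already kept
        seen.add(key)
        kept.append(entry)
    return kept
```

Load the file with json.load, check duplicate_rate(entries) first to see how bad things are, then pass the list through clean_entries and write the result to a new file so the original stays around for comparison.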

2 comments

christopher4932mo ago

Wow, that's interesting, but I actually had the opposite happen! My model got worse after I cleaned out duplicates too aggressively, so maybe a case like @wesleymiller's was a lucky break. Sometimes that "mess" adds needed variation.

wesleymiller 2mo ago

My last data cleanup improved my model's accuracy by 18 percent.