I’ve been trying to quantify how well various models perform on pretty basic data analysis - for a given query result, answer a few questions. Each query result has the same basic format: a section of aggregates, then a repeating time series for each specified breakdown. These aren’t really complex queries either; there are maybe a dozen unique values and I’m limiting them to a single aggregation.
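for concreteness, here’s roughly the shape I mean - the field and service names below are made up, but the structure (one block of aggregates, then a time series per breakdown value) is what the evals consume:

```python
# hypothetical query result shape - names are illustrative, not a real schema
query_result = {
    "aggregates": [
        {"service": "api", "count": 1204},
        {"service": "worker", "count": 377},
        {"service": "cron", "count": 52},
    ],
    "series": {
        # one time series per breakdown value: bucketed timestamps -> counts
        "api":    [{"ts": "2024-01-01T00:00:00Z", "count": 40},
                   {"ts": "2024-01-01T00:05:00Z", "count": 61}],
        "worker": [{"ts": "2024-01-01T00:00:00Z", "count": 12},
                   {"ts": "2024-01-01T00:05:00Z", "count": 9}],
    },
}
```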
what’s interesting is that pretty much everything tends to suck at these evaluations. There are outliers - gpt-5 (and even 5-mini) does very well across the board, for instance - but overall, at the high end you’re looking at roughly an 80% average success rate across all the questions, and usually more like 50-60%.
“well that’s garbage”, I might hear you saying, and intuitively you’d be right. what’s intriguing to me, though, is that when I ask the same models to do more complex tasks with similarly tight criteria (e.g., “why is this happening?”) after giving them some query tools, I see much higher success rates: nearly 100% for SOTA models, and even smaller models are steadily closing the gap.
why is this? how can a chain of dependent tasks, each with a ~20% failure rate, compound into nearly 100% success?
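to make the puzzle concrete: if you treat each step of an investigation as an independent task with the per-question success rates above, the naive expectation compounds downward fast.

```python
# naive compounding, assuming each dependent step is an independent check with success rate p
for p in (0.8, 0.6):
    for steps in (3, 5, 10):
        print(f"p={p}, steps={steps}: {p ** steps:.0%}")
# at p=0.8 and 5 steps you'd expect ~33%; at p=0.6, ~8% - nowhere near the ~100% agents actually hit
```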
my supposition is that there are a few specific things happening here, some of them very unintuitive.
1. models are often wrong, but they’re not off by _much_, especially when it comes to figures. for example, one question in the eval suite asks the model to calculate the sum of the top 3 aggregates. overall, this has about a 50% success rate. however, if you look at the incorrect results, they all cluster within ±100 of the true value (and over half of them are off by less than 5 in either direction). that isn’t great for many applications, sure, but it’s a pretty acceptable level of error when absolute precision doesn’t count - and in my case, it really doesn’t: outliers in system telemetry are usually a few orders of magnitude off the distribution. (there’s a sketch of tolerance-based grading after this list.)
2. models are very bad at absolute time, but good at relative time. many of the questions involve quantitative measurements of time series data - “when was the peak of this series”, etc. models fuck this up pretty reliably unless they have the exact timestamps available in the context (and even then, they’ll botch it sometimes), but they’re pretty good at saying “this happened before that” in a series, or identifying the number of outliers and peaks in the distribution. this is another one of those “pretty ok” compromises, albeit with some big caveats around tool design and query interfaces: in system telemetry you often don’t need to know exactly what time something happened, but you do need to know the order in which things happened, or at least the order in which they were observed - hence the ordering check in the sketch below.
3. agent context is different from zero-shot context. this is probably the single biggest explanation I can think of for the divergent results: in an agent loop, the model has significantly more context devoted to whatever investigation it’s doing, thanks to running multiple queries and so on. this seems to help quite a bit, probably because it’s “thinking” about more relevant material versus getting a simple “here’s some data, pick out some facts” prompt.
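here’s a rough sketch of grading with the first two observations in mind - the function names and tolerances are mine, not any standard: accept figures within a tolerance rather than demanding exact matches, and grade time questions on ordering rather than absolute timestamps.

```python
def grade_numeric(answer: float, truth: float, abs_tol: float = 5.0, rel_tol: float = 0.05) -> bool:
    """Pass if the model's figure is within an absolute or relative tolerance of the true value."""
    return abs(answer - truth) <= max(abs_tol, rel_tol * abs(truth))

def grade_ordering(answer_order: list[str], true_order: list[str]) -> bool:
    """Pass if the model got the relative order of events right, ignoring absolute timestamps."""
    return answer_order == true_order

# an answer of 1337 for a true sum of 1341 passes (off by 4);
# time questions are graded on sequence, not on the exact minute something happened
print(grade_numeric(1337, 1341))                                              # True
print(grade_ordering(["deploy", "error spike"], ["deploy", "error spike"]))   # True
```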
the takeaways I have from this whole experience? nothing too novel. don’t use LLMs for extracting structured data unless you can live with sampling error. leverage agent loops with good, efficient tools so the model can discover more contextual data for itself. return control to the operator frequently but passively, to reduce the chance of the agent going on side quests. a rough sketch of the kind of loop I mean is below.
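this is a minimal sketch, not a real implementation - `call_model`, `run_query`, and `show_operator` are hypothetical stand-ins for your model API, query interface, and UI surface. the shape is what matters: let the model build its own context by running queries, and surface intermediate findings to the operator each turn instead of only at the end.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str          # "query" or "answer"
    text: str = ""     # final answer text when kind == "answer"
    query: str = ""    # proposed query when kind == "query"

def call_model(context: list[str]) -> Step:
    raise NotImplementedError("call your model API here")

def run_query(query: str) -> str:
    raise NotImplementedError("execute the query against your telemetry store")

def show_operator(query: str, result: str) -> None:
    # passively surface progress so a human can redirect the investigation
    print(f"> {query}\n{result}")

def investigate(question: str, max_turns: int = 10) -> str:
    """Agent loop: the model accumulates query results as context before answering."""
    context = [f"question: {question}"]
    for _ in range(max_turns):
        step = call_model(context)
        if step.kind == "answer":
            return step.text
        result = run_query(step.query)
        context.append(f"query: {step.query}\nresult: {result}")
        show_operator(step.query, result)
    return "ran out of turns without a final answer"
```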