Really nice ideas Jeff. I don't want to take this too far from Chris's original topic, but some discussion of these things would still be relevant.
The first step -- I don't want to call it trivial or easy, because it's not, but there are at least a couple ways to do it, as a few of us have demonstrated now -- is operationalizing the measure of fitness. Part of this involves defining the thing itself, because it's not quite like "fitness" is something with a really obvious meaning we all agree on. So you decide on what you're trying to measure, exactly, then you find a way to measure it with numbers.
But the next step is the tough one. Because really, if you wanted to be useless, you could blow right past the first two bits by making them arbitrary. Let's say, I'm going to declare that "fitness" to me is how many chips you can eat in a handful, and it's measured by how many fingers you have on a hand. Look at that, we're all super fit. But that doesn't match our natural sense of the word, and would not be a useful metric, so we've accomplished nothing. Philosophers call this type of definition non-intuitive; scientists refer to studies that are internally valid ("correct") but not externally valid ("useful").
So the question becomes whether the definitions and metrics we define actually represent anything that matters to us. But we can't go too far here, because we still need to preserve the first bit (the measurability). To give an example equally ridiculous in the other direction, we might define fitness as "being in super good shape." Which is probably right. But it can't be quantified. We really want both here.
Jeff talks about trying to connect these pieces in the classical scientific way. I think this is awesome and basically badass but probably, in almost all cases, impossible. The whole neural net idea is a good example of both what's being asked and also why it's not really workable. As he also mentions, this whole concept is more or less what CFHQ regularly claims to be doing -- receiving from The Internet and The Affiliates a vast datastream of feedback regarding the efficacy of their workouts, and turning it somehow into useful info about the program. The quality of the data is obviously in question here, but its complexity is the bigger problem; how would we operationalize it in the above way? HQ has at least a theoretical answer for this, with their power=fitness theory, but I find that this fails on both fronts: in practice it's infeasible to actually apply, and in theory it doesn't truly fit my idea of fitness. So I have much respect (yes, really) for their attempt, but it doesn't succeed IMO.
(You mention the black box idea, which is a really killer way of getting around the whole issue of rigor and causality by skipping it completely. Rather than doing the science and trying to wade through the complexities, you can try stuff and see if you like what happens. The causality may be bullshit and the results WON'T apply to anybody but you. That's fine. A few more ideas on this here...)
In the system Joe and I put together, we tried to deal with this by leaning towards the operational side. We used some standard criteria to generate numbers, and then made every effort to homogenize them. We hoped the "relevance" side would come from our choice of tests, which were mostly things that are pretty widely recognized as indicative of certain sorts of physical skills, and also meaningful in their own right (for instance, even if it means nothing else, running fast is a good thing). But we weren't able to say much more than that.
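To make "homogenize" a little more concrete, here's a minimal sketch of one common way to put unlike tests on the same scale: convert each result to a z-score against a reference group, flip the sign for tests where lower is better, and average. To be clear, this is NOT necessarily what our system did; the z-score approach, the test names, and every number below are assumptions made up purely for illustration.

```python
from statistics import mean, pstdev

# Hypothetical reference results for each test. All numbers are invented
# for illustration and do not come from any real dataset.
REFERENCE = {
    "run_400m_s": [70.0, 80.0, 90.0],  # seconds, lower is better
    "pushups": [20.0, 40.0, 60.0],     # reps, higher is better
}
LOWER_IS_BETTER = {"run_400m_s"}


def z_score(test, value):
    """Express one result in standard deviations from the reference mean.

    The sign is flipped for timed tests so that a positive score always
    means "better than the reference group average".
    """
    ref = REFERENCE[test]
    z = (value - mean(ref)) / pstdev(ref)
    return -z if test in LOWER_IS_BETTER else z


def composite(results):
    """Average the per-test z-scores into a single number."""
    return mean(z_score(test, value) for test, value in results.items())


athlete = {"run_400m_s": 70.0, "pushups": 60.0}
print(composite(athlete))
```

Notice that this only handles the operational half. The number is comparable across tests, but whether it means anything still depends entirely on which tests you picked.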
Rubrics like Chris's -- or the CF North material, or Rip's strength tables -- are a different sort of thing. They lean toward the "external validity" side of the scale, by starting with the real world and trying to quantify it into numbers and rankings. For instance, Rip's tables are purportedly based on the actual lifting he's seen and gathered from real athletes moving through their training. Similarly, I'm sure, standards like yours are based on your looking around, at your own experience and that of the athletes around you, and saying, "okay, most of the people at X level can do about 50 pushups, so we'll call that the baseline for athletes at that level." This isn't the rigorous sort of data you get when you start with the numbers, but it's more pertinent. It lets people set goals for themselves, which is probably the most important application of any of this.
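For contrast, a rubric of this kind boils down to a lookup table plus a "what's next" question. The sketch below is hypothetical: only the 50-pushup figure echoes the example above, and the level names and other thresholds are placeholders I made up, not anything from Chris's standards, the CF North material, or Rip's tables.

```python
from bisect import bisect_right

# Hypothetical rubric: minimum pushups needed to reach each level.
# Only the 50 mirrors the example in the post; the rest are placeholders.
PUSHUP_RUBRIC = [
    (0, "unranked"),
    (25, "level 1"),
    (50, "level 2"),
    (75, "level 3"),
]


def level_for(reps):
    """Return the highest level whose threshold the athlete has met."""
    if reps < 0:
        raise ValueError("reps cannot be negative")
    thresholds = [threshold for threshold, _ in PUSHUP_RUBRIC]
    return PUSHUP_RUBRIC[bisect_right(thresholds, reps) - 1][1]


def next_goal(reps):
    """Return (threshold, level) for the next level up, or None at the top."""
    for threshold, name in PUSHUP_RUBRIC:
        if threshold > reps:
            return threshold, name
    return None


print(level_for(49), next_goal(49))
```

The useful part here is `next_goal`: the table doesn't claim to measure anything rigorously, but it always tells you what to aim for.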
All of this said, I suspect for the reasons I've given that a method like ours is best for "testing," ranking, or otherwise evaluating traits like athleticism in a rigorous way... and systems like Chris's are best for actually setting up training and recommending goals and paths to improvement. One is descriptive, one is prescriptive.
If this gets any further afield we might want to take it to a different thread, if Chris is more interested in nuts-and-bolts discussion of his stuff right now.