Skip straight to the graph below, which shows the scale of the problem, or just continue reading to get the full story.
<CG>’s current tournament promotion process is too slow and bulky to handle the task of normalizing all the games in its database. Moreover, I’m becoming more and more convinced that it is also error-prone. There are simply not enough checks built into the system, and because of the design, it is difficult to systematically check the data for consistency.
The latter clause would take us too far afield to justify inclusion in this post, but it’s a recurring theme that has been discussed before. The main gist is that tournaments can become too easily denormalized during maintenance, and <chessgames> himself has pointed out how dangerous the current updating process can be. And given the fact that tournaments are promoted before they are normalized, and the normalization is split between the admins and the editors, there is no guarantee that the original promoted tournament is correctly normalized.
Putting that aside for now, let’s examine the scale of the task of collecting tournament games and normalizing them, from a systems point of view.
At the moment, <CG> does this via a process that I’ll refer to as “tournament promotion”. This is a laborious task, as I can personally attest, which most biographers do “by hand”. Lest there be any doubt about my opinion on the matter, please be aware that I have more than a little experience with it. In fact, the current record holder for building the largest collection on <CG> is me. How do I know that I’m the record holder? Because I had to petition <CG> to increase the limit for the number of games when I built <Biel Interzonal (1993)>; see my forum post here:
This may not be noted in the intro that <Tab> will almost certainly write for the tournament, but I think it merits a footnote at least. Or maybe not.
Anyways, the collection I built ended up with 463 games, a record which still stands today as far as I know. Of course, if you think you have a larger collection than this one, please feel free to leave a comment.
For the reader unfamiliar with the process, I’ll provide a thumbnail sketch of how a collection is currently made, using just the <CG> tools provided.
Step (1) … Step (n)
OK, I lied, I’m not going to write all that out right now. Maybe I should try to do a Tom Sawyer, and try to get a fellow biographer to write out how *they* do it. (Look at how much fun it is explaining all the details so that somebody else can easily understand the process and do it too!) Me, I’m too traumatized by the memory of it to live through it all again. PCSD, i.e. Post-Collection-Stress-Disorder. Actually, it would be helpful if the process were documented.
Trust me, building a collection on <CG> can be a long, slow, tedious, and grueling process. Now, not everybody agrees with this assessment:
Maybe I feel that way because when I first became active on <CG> I was immediately attracted to studying the process, and so volunteered to do some very large tournaments that for some reason(!) hadn’t been done before. My first collection was a little fun (the first time usually is), and it was historically worthwhile. The <2nd Pan-American Open, Hollywood (1954)> tournament was a good start, being a manageable 35 games. This was the first, and the last, tournament I did by hand.
The next tournament I did rivaled some of the larger tournaments on <CG>, being the <Biel Interzonal (1985)>. It consisted of 153 games. At this point I had written a program to create an “HTML collection build page”. It found all the games between two players for a given year, going through the tournament participants pair-by-pair. Helpful, but it yielded collections that weren’t sorted by round. By the time I did my next tournament, the already mentioned <Biel Interzonal (1993)>, I was having a difficult time understanding how the other biographers were happy with such a system.
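The pair-by-pair lookup described above can be sketched roughly as follows. This is not the original program, just a minimal illustration under stated assumptions: the `Game` record, the function names `games_by_pairing` and `build_page`, the sample data, and the HTML layout are all hypothetical, and a real version would read from an actual game database rather than an in-memory list.

```python
from itertools import combinations
from typing import NamedTuple


class Game(NamedTuple):
    """Hypothetical game record; the fields are assumptions for illustration."""
    game_id: int
    white: str
    black: str
    year: int


def games_by_pairing(games, participants, year):
    """Group the games played in `year` by unordered pair of participants."""
    # One empty bucket per pairing, so pairings with no games still show up.
    by_pair = {
        frozenset(pair): []
        for pair in combinations(sorted(set(participants)), 2)
    }
    for game in games:
        pair = frozenset((game.white, game.black))
        if game.year == year and pair in by_pair:
            by_pair[pair].append(game.game_id)
    return by_pair


def build_page(by_pair):
    """Render the pairings as a bare HTML list (layout is illustrative only)."""
    items = []
    for pair in sorted(by_pair, key=sorted):
        names = " vs ".join(sorted(pair))
        ids = ", ".join(str(i) for i in by_pair[pair])
        items.append(f"<li>{names}: {ids or 'no games found'}</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Made-up sample data with placeholder names, not real games.
SAMPLE_GAMES = [
    Game(1, "Player A", "Player B", 1985),
    Game(2, "Player B", "Player C", 1985),
    Game(3, "Player A", "Player B", 1984),  # wrong year, skipped
    Game(4, "Player C", "Player D", 1985),  # Player D is not a participant
]

pairs = games_by_pairing(
    SAMPLE_GAMES, ["Player A", "Player B", "Player C"], 1985
)
page = build_page(pairs)
print(page)
```

Note that the output is grouped by pairing, not by round, which is exactly the limitation mentioned above: nothing in a player-pair lookup knows in which round a game was played.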
What made it especially unpleasant was the fact that I knew a ready-made, normalized tournament already existed; one that was just sitting out there on the net, a mouse-click away from a download. It was only the uploading to <CG> that was “a-missing”. The entire series of WCC tournaments, down to the interzonals, was readily available on the <Carolus> and/or <Mark Weeks> websites. It seemed to me that we were constantly re-inventing the wheel.
Sometimes one has to do such work (e.g. when transcribing and loading a tournament that nobody has normalized with rounds/dates before, such as the <Altona (1869)> tournament that I did). But not always. When the work is required, I’m happy to pull my weight and do my share. Still, there’s a limit…
We haven’t discussed the large amount of work that gets done in writing a good, or even not so good, introduction to the tournament. The acknowledged main writer presently on <CG> is <Tabanus>. His write-ups have won awards on the site, and he’ll include details far beyond what I would bother to research. In a word, they’re excellent, especially in terms of content.
But I view the task of building a tournament as divided into two components, each orthogonal to the other: collecting and normalizing the games, and writing a historical introduction. For my purposes, the extreme research isn’t necessary for what I consider to be the main priority: to bring some semblance of order to the chaos. By that I mean the condition of the PGN headers in the full assemblage of <CG> games. At the moment there is an unfortunate conflation of the two goals, and each tournament which gets promoted must have normalized games as well as an in-depth historical write-up. There is an influential contingent of biographers who (strongly) feel this is the best way, and so tournament normalization is currently only done via tournament promotion, which is only done via a voting process. And that typically is only done after the tournament is elaborately written up.
But this wasn’t always the case. Many, many of the promoted tournaments were done in a less formal fashion. This was necessary to boot-strap the tournament collection. Originally, many tournaments had only a cursory write-up, and were put together with the main thrust of just having the games collected and organized (often the round numbers were present, many times the round dates were known as well). Sometimes, the original authors of these informal introductions are unfairly criticized for not adhering to the current rigorous standard of scholarship. This is of course terribly unfair, since the game collections were put together for their own personal purposes, and were never meant for publication (or detailed scrutiny). These collections, and the work they represent, were essentially commandeered.
The problem, then, is that the entire ensemble of tournaments within the “Tournament Index” on <CG> hasn’t been vetted equally. There hasn’t been the time or inclination to do so for many of the tournaments, at least not in a systematic fashion. After all, there is much other work to be done to incorporate new tournaments with the current standard of rigor.
Let’s move on to the main thesis of this post – how much work is there to do, and how much has already been done?
That’s why I tried to elaborate on the work involved in building just one tournament, with the detour as to why so many other tournaments have been promoted without detailed historical research. Simply put, the amount of work is vast, and the number of biographers currently pursuing this work is few.
And so, finally we arrive at the moment of truth.
Suppose <CG> were to aspire to having a database comparable to some of the leading chess databases, like <ChessBase>, <NIC>, or even <MillBase> (the open-source, freely downloadable, SCID-compatible db that I use)?
How much work would be involved?
To even quantitatively gauge the amount of work involves a little bit of head-scratching. I came up with this metric, which is to compare the number of tournaments processed (i.e. promoted) on <CG>, versus the number of tournaments found by SCID for <MillBase>. The latter is … (work in progress, will update via edit later)
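As a rough illustration of the metric, here is a minimal sketch of how the comparison could be computed once both tallies exist. Everything in it is an assumption: the function names `coverage_by_year` and `overall_coverage` are made up, the per-year breakdown is just one possible way to slice the counts, and the numbers are placeholders chosen only to make the arithmetic easy to follow. They are not the actual <CG> or <MillBase> counts.

```python
def coverage_by_year(promoted, reference):
    """Return {year: promoted / reference} for each year in `reference`.

    `promoted`  maps year -> tournaments already promoted (assumed input).
    `reference` maps year -> tournaments found in the reference database.
    """
    coverage = {}
    for year, ref_count in sorted(reference.items()):
        if ref_count == 0:
            continue  # ratio is undefined when the reference has nothing
        coverage[year] = promoted.get(year, 0) / ref_count
    return coverage


def overall_coverage(promoted, reference):
    """Fraction of all reference tournaments that have been promoted."""
    total_reference = sum(reference.values())
    total_promoted = sum(promoted.get(year, 0) for year in reference)
    return total_promoted / total_reference if total_reference else 0.0


# Placeholder counts for illustration only, NOT real data.
promoted_counts = {1960: 2, 1961: 1}
reference_counts = {1960: 8, 1961: 4, 1962: 5}

per_year = coverage_by_year(promoted_counts, reference_counts)
overall = overall_coverage(promoted_counts, reference_counts)

for year, fraction in per_year.items():
    print(f"{year}: {fraction:.0%} of reference tournaments promoted")
print(f"overall: {overall:.1%}")
```

A year that appears in the reference database but not among the promoted tournaments simply shows up with zero coverage, which is the gap the metric is meant to expose.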
And here is the continuation of the graph for the years 1960-1980.