I now have the entire <Tournament Index> properly indexed with <CG> IDs for the tournament (tid), game (gid), and White and Black players (pid). This is tremendous progress, allowing me to examine many aspects of the database’s integrity and consistency.
But useful though that is, it requires me to load the PGN into my Python program, since most chess database software knows nothing about <CG> IDs. That means the games must be properly normalized before a program like SCID (or Chessbase) can categorize and group them correctly.
Let’s load my May 2015 snapshot into SCID for some exploratory investigation. SCID has a useful feature known as the <Tournament Finder>, which analyzes the games to pull out the tournaments. A giant PGN file will saturate the listing (e.g. a maximum of 1000 can be displayed). The entire <CG TI> consists of a minimum of ~1850 tournaments. That is, assuming the tournaments are properly normalized. If they are unnormalized, SCID will find fictitious tournaments.
Remember, programs are actually fairly stupid, and can only do what they are told (by the programmer – who can be either smart or dumb, depending!). In this case, the output is only as good as the input.
I’m interested in <USSR ch — Leningrad (1956)>, so let’s look for that tournament. SCID’s <Tournament Finder> has a filter that allows us to make “cuts” to select tournaments. Perhaps the simplest cut to make is on year; let’s look for tournaments from the year 1956 only:
Here is a picture of an almost idyllic world. There are twelve tournaments listed. It would be “The Good” if there were only eleven. Look closely, and you’ll see that the <57th US Open> is unnormalized, and is listed as two tournaments because of two different Site tags. Of course we recognize that <Oklahoma City, OK> and <Oklahoma City, OK USA> are the same. But not poor dumb old Uncle SCID.
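To see why the two Site tags split one event in two, here is a rough Python sketch that bins games by their Event, Site, and year tags. This is an assumption about how such grouping might work, not SCID’s actual algorithm, and the function names (`read_headers`, `tournament_counts`) and the three sample games are made up for illustration.

```python
import re
from collections import Counter

# Matches a PGN tag pair such as: [Site "Oklahoma City, OK"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def read_headers(pgn_text):
    """Yield one dict of tag pairs per game found in a PGN string."""
    headers = {}
    for line in pgn_text.splitlines():
        match = TAG_RE.match(line.strip())
        if not match:
            continue  # movetext or blank line
        name, value = match.groups()
        # An Event tag after tags were already collected starts a new game.
        if name == "Event" and headers:
            yield headers
            headers = {}
        headers[name] = value
    if headers:
        yield headers


def tournament_counts(pgn_text):
    """Count games per (Event, Site, year) bin (hypothetical grouping key)."""
    counts = Counter()
    for tags in read_headers(pgn_text):
        year = tags.get("Date", "????")[:4]
        counts[(tags.get("Event", "?"), tags.get("Site", "?"), year)] += 1
    return counts


# Made-up sample: same event, two spellings of the Site tag.
SAMPLE = '''[Event "57th US Open"]
[Site "Oklahoma City, OK"]
[Date "1956.??.??"]
[White "Player A"]
[Black "Player B"]

1. e4 e5 *

[Event "57th US Open"]
[Site "Oklahoma City, OK"]
[Date "1956.??.??"]
[White "Player C"]
[Black "Player D"]

1. d4 d5 *

[Event "57th US Open"]
[Site "Oklahoma City, OK USA"]
[Date "1956.??.??"]
[White "Player A"]
[Black "Player C"]

1. c4 e5 *
'''

counts = tournament_counts(SAMPLE)
for key, n in sorted(counts.items()):
    print(key, n)
```

The three games land in two bins, because the key is compared as an exact string. Normalizing the Site tag before grouping would merge them into one.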
In the forensic investigation of the <USSR ch (1956)>, though, it appears OK, as there is only a single tournament listed. But one must beware of yet another potential problem: playoff games that aren’t properly distinguished, and so fall into the same bin as normal tournament games.
The best treatment of a playoff is to append the word “playoff” to the normal tournament name. Then two tournaments will show up. Do we have this situation with the <USSR ch (1956)>? There were 18 players, which suggests a round-robin (RR) maximum of 18*(18-1)/2 = 9*17 = 153 games. Instead we find 158 games, five more than a single round robin allows. So yes, we do have this problem. Hence, 1956 must be considered “The Bad”.
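The same sanity check is easy to script. This is a minimal sketch, assuming a single-cycle round robin; the helper name `round_robin_games` and its `cycles` parameter are my own illustration, and the only inputs are the 18 players and 158 games quoted above.

```python
def round_robin_games(players, cycles=1):
    """Maximum number of games when every player meets every other player."""
    return cycles * players * (players - 1) // 2


expected = round_robin_games(18)  # single round robin, 18 players
found = 158                       # games found under the tournament
surplus = found - expected

print(expected)  # 153
print(surplus)   # 5
```

A positive surplus is only a flag that extra games share the tournament’s bin. It doesn’t say which games they are, so the games themselves still have to be inspected.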
But it could be worse. Just take a gander at 1957, which might just merit the label “The Ugly”:
It just so happens that the <Smyslov–Botvinnik WCC (1957)> took place during that year. And <CG>’s WCCs are notoriously unnormalized.