About the Minecraft Geologic Survey dataset

by Leonard Richardson

(The latest version of this document is at http://mcgs.crummy.com/201407/README.html. You can download the MCGS from the Internet Archive.)

So, you've downloaded the Minecraft Geologic Survey dataset. Or, more prudently, you're reading this document before downloading the dataset. In that case, you should know that the dataset is twelve gigabytes in size and decompresses to about fifty gigabytes. It contains nearly two million files and directories, and will probably take an hour or two just to unzip. Before downloading a bunch of data you might not need, please download the MCGS core sample for a single Minecraft world, take a look at it using this document as a guide, and see if your idea is workable.

This document describes the first MCGS sample, "201407", taken during July and August 2014 and including samples of worlds downloaded between March and June 2014. Here's what's inside the tarball:

The survey world

This is the primary result of the Minecraft Geologic Survey: a Minecraft world that is significantly smaller than the original world, but which contains many of the points of interest. There are about 175,000 of these survey worlds; one for each original world I was able to survey.

At (0,0) of the survey world we have the spawn chunk of the original world. It's quite possible for the spawn chunk to show up again, elsewhere in the survey. I put the spawn chunk in its own column because this is the first thing a player sees and it's the place the mapmaker is most likely to have tried to make interesting.

If the world does not have a spawn chunk set, I locate a Player entity and use their current location as the spawn chunk.

The survey of the Nether begins at (32, 0) and includes 1% of the Nether chunks (minimum 5, maximum 100), with the more interesting chunks near the origin. The first chunk in the survey is located at (32, 0), the second chunk is at (32, 32), and so on. (There is always a one-chunk gap between survey chunks; this keeps them visually distinct when you load up the survey world in Minecraft.)

If the original world has no Nether, this column of chunks will be empty in the survey world. You can see what this looks like by looking at the screenshot. This world doesn't have a Nether. If it did, you would see a row of chunks in between the spawn chunk and the row of Overworld chunks.

The survey of the Overworld begins at (64, 0) and includes 1% of the Overworld chunks, with more interesting chunks near the origin. The first Overworld chunk is located at (64, 0), the second at (64, 32), and so on.

If you look at the screenshot you'll see that most of the Overworld chunks in this map are nearly identical. Sometimes this is due to shortcomings of the MCGS algorithm for deciding which chunks are "interesting", but sometimes it's just due to a homogeneous map.

The survey of the End begins at (96, 0) and includes 1% of the End chunks, with more interesting chunks near the origin. The first End chunk is located at (96, 0), the second at (96, 32), and so on.

Most maps don't feature an End, and End chunks are seldom very interesting.

Note: The survey world will be in Minecraft 1.8's Anvil format, even if the original map was in a different format.

A word about "interestingness"

The current version of the survey script measures the interestingness of a chunk in terms of entities: mobs, paintings, chests, spawners, signs, books, command blocks, and so on. It goes through every chunk in the world and scores it with the number of entities found in that chunk.

I chose the entity count because it's basically free to calculate: I had to go through the entities anyway to get the text of signs and books. But of course a chunk doesn't need to contain entities to be interesting: consider a sculpture made of wool. And even if I was really good at figuring out which chunks were interesting, a survey that only contained interesting chunks wouldn't be representative. Mediocre chunks deserve to be represented in the dataset as well.

So the sample starts out very biased towards interesting chunks, but the bias decreases as more chunks are sampled. The chunk at (64, 0) is pretty much guaranteed to be the most entity-rich chunk in the Overworld. Overworld chunk #10, at (64, 288) probably has one or two entities, but probably not a whole lot. Overworld chunk #100, at (64, 3168), is chosen more or less at random from across the entire world, and is very likely to be normal Minecraft generated terrain.
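
The layout described above comes down to simple arithmetic. Here is a minimal sketch of it, not code from the survey script: the column offsets (32, 64, 96) and the one-chunk gap come from this document, while the function name and the 1-based chunk numbering are assumptions made for illustration.

```python
# Block x-coordinate of each dimension's column in the survey world,
# as described in this document.
COLUMN_X = {"nether": 32, "overworld": 64, "end": 96}

# A chunk is 16 blocks wide and there is a one-chunk gap between survey
# chunks, so consecutive survey chunks start 32 blocks apart.
STRIDE = 32


def survey_chunk_origin(dimension, n):
    """Return the (x, z) block coordinates where the n-th survey chunk starts.

    n is 1-based (an assumption): n=1 is the first chunk in the column.
    """
    if n < 1:
        raise ValueError("survey chunks are numbered from 1")
    return (COLUMN_X[dimension], STRIDE * (n - 1))


print(survey_chunk_origin("overworld", 10))   # (64, 288)
print(survey_chunk_origin("overworld", 100))  # (64, 3168)
```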

The manifest: manifest.json

Every survey directory contains a file called manifest.json which talks about the MCGS sample world, and about the original world used to generate the sample. Despite its filename, manifest.json isn't strictly a JSON file. It's a JSON stream, a file with one JSON object per line. Each object represents a single point of interest within the sample or the original world. Here are some of the objects you might see.
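
Because the file is a JSON stream, passing the whole thing to a JSON parser in one go will fail; it has to be parsed line by line. A minimal sketch in Python follows. The helper name iter_json_stream is an assumption for illustration, and the sample data is two objects in the same shape as the examples in this document, written one per line as they would appear in the file.

```python
import io
import json


def iter_json_stream(lines):
    """Yield one parsed object per non-blank line of a JSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an open manifest.json: one JSON object per line.
sample = io.StringIO(
    '{"id": "Sign", "lines": ["1.", "No cheating", "creative mods", "etc."], '
    '"dimension": 0, "coords": [-745, 36, 3201]}\n'
    '{"id": "Control", "command": "/give @a 307", '
    '"dimension": 0, "coords": [-750, 35, 3195]}\n'
)

objects = list(iter_json_stream(sample))
print([obj["id"] for obj in objects])  # ['Sign', 'Control']
```

With a real file you would pass an open file handle instead of the StringIO object. The examples below are spread over several lines for readability; in the file itself each object sits on a single line.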

Signs: Sign

{
 "id": "Sign",
 "lines": ["1.", "No cheating", "creative mods", "etc."],
 "dimension": 0, 
 "coords": [-745, 36, 3201]
}

This is a sign. Its coordinates within the world are recorded, as is the text on the sign.

Command blocks: Control

{"id": "Control",
 "command": "/give @a 307",
 "dimension": 0, 
 "coords": [-750, 35, 3195]
}

This is a command block. Its coordinates within the world are recorded, as is the command it executes.

Books: 386 and 387

{"id": 387,
 "title": "For Cheaters!",
 "author": "telafiesta",
 "dimension": 0,
 "coords": [-282, 73, 2840],
 "contained_in": "Chest",
 "pages": ["Check the Table in the libary, second floor can't miss it."]
}

This is a book. The value of contained_in is an addendum to coords which explains where exactly the book is in that block: "Chest", "ItemFrame", "Player" (in a player's inventory), or "Ground" (lying on the ground).

An item with id=386 is also a book. (386 is the item ID for "Book and Quill", 387 is the item ID for "Written Book".)

Maps: 358

{"id": 358,
 "map_id": 1,
 "path": "data/map_1.dat",
 "dimension": 0,
 "coords": [-505, 71, 354],
 "contained_in": "Chest"
}

This is an in-game map (358 is the item ID for "Map"). The survey directory includes copies of all these maps, and the path field is the path to the map file within the survey directory. The maps are in the normal Minecraft map format—all I did was copy them from the original world directory into the survey directory.

contained_in works the same way as for books.

Chunks: [chunk]

{"id": "[chunk]",
 "dimension": 0,
 "source_chunk": [25, 91],
 "destination_chunk": [0, 0]
}

This is a chunk that was copied from the original world to the sample world. source_chunk and destination_chunk are chunk-based coordinates, so multiply by 16 to get block offsets. In this example, chunk (25, 91) in the overworld was copied to chunk (0,0) in the sample world. That is, the chunk that occupies the space between (400,1456) and (415, 1471) in the original world, occupies the space between (0,0) and (15, 15) in the survey world.
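
The conversion between chunk coordinates and block coordinates can be sketched as follows. The function names are illustrative assumptions; the only fact relied on is that a chunk is 16 by 16 blocks. Floor division is used for the reverse direction so that negative coordinates land in the right chunk.

```python
CHUNK_SIZE = 16  # a chunk covers 16 x 16 blocks in the horizontal plane


def chunk_block_range(chunk):
    """Return ((min_x, min_z), (max_x, max_z)) in block coordinates for a chunk."""
    cx, cz = chunk
    start = (cx * CHUNK_SIZE, cz * CHUNK_SIZE)
    end = (start[0] + CHUNK_SIZE - 1, start[1] + CHUNK_SIZE - 1)
    return start, end


def block_to_chunk(x, z):
    """Return the chunk coordinates containing block column (x, z)."""
    # Floor division rounds toward negative infinity, which is what we
    # want for blocks with negative coordinates.
    return (x // CHUNK_SIZE, z // CHUNK_SIZE)


print(chunk_block_range((25, 91)))  # ((400, 1456), (415, 1471))
print(chunk_block_range((0, 0)))    # ((0, 0), (15, 15))
print(block_to_chunk(400, 1456))    # (25, 91)
```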

You can tell that this is the original world's spawn chunk, because the spawn chunk is always copied to (0,0).

Here's another example.

{"id": "[chunk]", 
 "pool": "all",
 "dimension": 0,
 "source_chunk": [-49, 214],
 "destination_chunk": [2, 4]
}

Chunk (-49, 214) of the original world (that is, the chunk starting at (-784, 3424)) has been copied to chunk (2, 4) of the survey world. The copy of this chunk in the survey world starts at (32, 64). This is part of the MCGS sample of the world's Overworld, and since it's relatively near the origin, it's probably more on the "interesting" side of the sample.

I don't remember what pool does, and it always seems to be "all", so don't worry about it, I guess.

[core-sample]

{"id": "[core-sample]",
 "target_dir": "sampled/201407/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena",
 "source_dir": "unzipped/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena"
}

This is an entry for the MCGS core sample as a whole. target_dir is the directory where you found this manifest.json, and source_dir is a directory that was once on my computer, used as a temporary place to unzip 1v1 Arena.rar, but which is long gone. You probably don't need this entry; fingerprints.json is a lot more useful.

The index: fingerprints.json

Now I'd like to introduce you to a very useful little file called fingerprints.json. This file includes basic information for every Minecraft world in the dataset. If you run the program to create a Reef world, the first thing that program will do is load fingerprints.json to get a sense of what it has to work with. So let's take a look.

Like manifest.json, fingerprints.json isn't strictly a JSON file. It's a JSON stream, a file with one JSON object per line. Each object represents a single Minecraft world and its MCGS sample.

Here's the first object from the stream in fingerprints.json. I'll go over it in detail in a bit, but first I want to show you the whole thing.

{
  "_id": "201407/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena",
  "published": "2014-05-10 21:54",
  "popularity": 5,
  "thread": {
    "_id": "http://planetminecraft.com/project/1v1-arena-2891632/",
    "host": "planetminecraft.com"
  },
  "commands": [
    "/say Read Rules then click the button",
    "/give @a 307",
    ...
  ],
  "permalink": "http://planetminecraft.com/project/1v1-arena-2891632/",
  "title": "1v1 Arena",
  "signs": [
    [
      "1.",
      "No cheating",
      "creative mods",
      "etc."
    ],
    ...
  ],
  "books": [],
  "interactionCount": 0,
  "spawn_dimension": 0,
  "creator": "touchdown1545",
  "classification": "PVP",
  "spawn": {
    "block_counts": {
      "87": 12,
      "51": 12,
      ...
    },
    "surface_map": [
      [
        17,
        18,
        18,
        18,
        24,
        24,
        24,
        11,
        11,
        11,
        24,
        24,
        0,
        0,
        0,
        0
      ],
      ...
    ],
    "height_map": [
      [
        61,
        60,
        60,
        60,
        35,
        35,
        35,
        34,
        34,
        34,
        35,
        35,
        0,
        0,
        0,
        0
      ],
      ...
    ],
    "entities": {},
    "median_height": 35,
    "tile_entities": {}
  },
  "tags": [
    "PvP"
  ],
  "diamonds": 1,
  "memes": [],
  "modded": false,
  "description": "Whats up guys.  I got my new 1v1 Arena ready...",
  "views": 48,
  "downloads": 8,
  "overworld": {
    "block_counts": {
      "174": 1440,
      "110": 144,
      ...
    },
    "surface_map": [ ... ],
    "height_map": [ ... ],
    "entities": {},
    "median_height": 61,
    "tile_entities": {}
  }
}

Now let's go over it again, but this time I'll explain what everything does.

Info about the world as a whole

_id

  "_id": "201407/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena",

This is a unique ID for the sample. It also serves as a pointer to the sample on disk, relative to wherever you extracted the MCGS tarball.
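
As a sketch of how the _id can be turned into a location on disk: the helper name manifest_path and the root directory name "mcgs" are assumptions for illustration; substitute wherever you extracted the tarball.

```python
from pathlib import Path


def manifest_path(mcgs_root, sample_id):
    """Build the path to a sample's manifest.json from its _id."""
    # The _id uses "/" as its separator. Splitting it lets pathlib apply
    # the native separator of whatever platform this runs on.
    return Path(mcgs_root).joinpath(*sample_id.split("/"), "manifest.json")


sample_id = (
    "201407/planetminecraft.com/maps/201406/downloads/"
    "www.mediafire.com/000/1v1 Arena.rar/1v1 Arena"
)
path = manifest_path("mcgs", sample_id)  # "mcgs" is a hypothetical extraction directory
print(path.name)         # manifest.json
print(path.parent.name)  # 1v1 Arena
```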

What do all these path portions mean? I'm glad you asked.

thread

  "thread": {
    "_id": "http://planetminecraft.com/project/1v1-arena-2891632/",
    "host": "planetminecraft.com",
    "title": "1v1 Arena"
  },

This serves as an identifier for the Minecraft Forum thread (or equivalent on another site). The _id should always be a URL you can use to get to a webpage describing the world. The title is the title of the thread (for the Minecraft forum) or the title of the world on the website (other sites). This may be different from the name of the world directory! (See below for details.)

Note that several worlds may come from the same thread.

permalink

  "permalink": "http://planetminecraft.com/project/1v1-arena-2891632/",

I'm pretty sure this will always be the same as ['thread']['_id'], but maybe not? Better use this just to be safe.

published

  "published": "2014-05-10 21:54",

This is my best guess as to when the world was published. For Minecraft Forum worlds, it's when the forum thread was created.

title

  "title": "1v1 Arena",

This is the title of the Minecraft world. It's the name of the original world directory, and it may be different from the ['thread']['title']. With Planet Minecraft the thread title is usually better than the world title; with the Minecraft forum the world title is usually better.

creator

  "creator": "touchdown1545",

This is the account name used by the mapmaker on the website where I found out about the world.

popularity

  "popularity": 5,

This is a very rough measure of the world's popularity, derived from a combination of diamonds, interactionCount, downloads, and views. Newer worlds will always have lower popularity scores because they haven't had time to get popular. This number is meaningless on its own, but comparing the popularity of two worlds will tell you, roughly, which one is more popular.

interactionCount, downloads, views, diamonds

  "interactionCount": 0,
  "downloads": 8,
  "views": 48,
  "diamonds": 1,

These are the proxies used to calculate popularity.

Not all sites have all these features, which is one reason why popularity is so unreliable.

classification

  "classification": "PVP",

This is a rough classification of the world into one of several genres, based on title, tags, and the content of in-game signs and books.

The most popular classifications are "Structure" (34,390 worlds), "Adventure challenge" (30,349 worlds), and "Parkour" (17,657 worlds).
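
A tally like this can be reproduced from the parsed fingerprints.json objects with a counter. The sketch below is not the code that produced the numbers above. The three records are made up for illustration and carry only the field being counted, and the "unclassified" fallback label is likewise an assumption.

```python
from collections import Counter


def classification_counts(worlds):
    """Count worlds per classification; worlds is an iterable of parsed objects."""
    return Counter(world.get("classification", "unclassified") for world in worlds)


# Hypothetical records, not taken from the dataset.
worlds = [
    {"classification": "PVP"},
    {"classification": "Structure"},
    {"classification": "Structure"},
]

print(classification_counts(worlds).most_common(1))  # [('Structure', 2)]
```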

description

  "description": "Whats up guys.  I got my new 1v1 Arena ready...",

The mapmaker's description of the world.

tags

  "tags": [
    "PvP"
  ],

Each string in this list is a tag applied to this world by the mapmaker to help people find it. Planet Minecraft has a standard list of categories, but the Minecraft forum lets you make up your own tags, and other sites may not have tags at all.

memes

  "memes": [],

A list of common Minecraft memes employed in this world. This proved not to be very interesting. The memes:

modded

  "modded": false,

My guess as to whether or not this world is designed for a modded version of Minecraft.

spawn_dimension

  "spawn_dimension": 0,

The number of the dimension the 'spawn' chunk is taken from. -1 is usually the Nether, 0 is usually the Overworld, and 1 is usually the End. I say 'usually' because sometimes a world will have something weird like -1 as the Overworld and 0 as the Nether. I don't know why. Of course, a modded world may have any number of additional dimensions.

commands

  "commands": [
    "/say Read Rules then click the button",
    "/give @a 307",
    ...
  ]

Each line in this list is a command found in a command block during the course of the survey. If you want more details, they're in the manifest.json file for this particular world.

signs

  "signs": [
    [
      "1.",
      "No cheating",
      "creative mods",
      "etc."
    ],
    ...
  ],

Each item in this list is a 4-item list containing the text of a sign found during the course of the survey. If you want more details, they're in the manifest.json file for this particular world.

books

  "books": [],

Each item in this list is a JSON object describing a book found during the course of the survey. Since there are no books in this world, let's take a look at a world that does have books.

 [
  {"author": "cilindrin",
   "pages": ["oops I forgot to tell you the name of this level right?..."],
   "title":"Opps"
  },
  ...
 ]

If you want more details on a book, they're in the manifest.json file for this particular world.

Chunk fingerprints

An object will contain up to four fingerprint objects, named "spawn", "overworld", "nether", and "end".

As I mentioned earlier, the MCGS survey software determines how "interesting" a chunk is by counting entities.

Let's take a look at one of these fingerprints.

block_counts

    "block_counts": {
      "87": 12,
      "51": 12,
      ...
    },

This is a JSON object mapping Minecraft block IDs to the number of times that block shows up in this chunk. This example says that block #87 (Netherrack) shows up 12 times, and block #51 (Fire) also shows up 12 times (presumably on top of the netherrack).
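
Note that JSON object keys are always strings, so the block IDs arrive as "87" rather than 87. A small sketch of ranking the blocks in a fingerprint follows; the function name is an assumption, and the input is limited to the two entries shown above, since the rest of the object is elided.

```python
def most_common_blocks(block_counts, n=3):
    """Return up to n (block_id, count) pairs, most frequent first."""
    pairs = ((int(block_id), count) for block_id, count in block_counts.items())
    # Sort by count, descending, then by block ID so ties have a stable order.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))[:n]


block_counts = {"87": 12, "51": 12}
print(most_common_blocks(block_counts))  # [(51, 12), (87, 12)]
```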

surface_map

    "surface_map": [
      [
        17,
        18,
        18,
        18,
        24,
        24,
        24,
        11,
        11,
        11,
        24,
        24,
        0,
        0,
        0,
        0
      ],
      ...
    ],

This is a list of 16 lists of 16 block IDs, showing which block type is the topmost block for each (x,z) coordinate in the chunk. In this example, we have a row with oak wood (17) on top, then oak leaves (18), oak leaves, oak leaves, sandstone (24), sandstone, and so on.
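
One simple thing to do with a surface_map is to tally which blocks make up the surface. The sketch below uses only the single row shown above, since the other fifteen rows are elided; the function name is an assumption.

```python
from collections import Counter


def surface_block_counts(surface_map):
    """Count how often each block ID appears as the topmost block."""
    return Counter(block_id for row in surface_map for block_id in row)


first_row = [17, 18, 18, 18, 24, 24, 24, 11, 11, 11, 24, 24, 0, 0, 0, 0]
counts = surface_block_counts([first_row])
print(counts[24])  # 5
print(counts[18])  # 3
```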

If unrecognized block IDs show up in this list, the world is flagged as modded.

height_map

    "height_map": [
      [
        61,
        60,
        60,
        60,
        35,
        35,
        35,
        34,
        34,
        34,
        35,
        35,
        0,
        0,
        0,
        0
      ],
    ...
    ],

This is a list of 16 lists of 16 y-coordinates. Each is the y-coordinate of the topmost non-transparent block at a given (x,z) coordinate. In the first row, the first few values are up at y=60 or 61 (this is the oak tree; remember that the corresponding part of surface_map showed the block IDs for oak wood and oak leaves), but when we get to the sandstone we quickly drop to y=35.

This is copied from Minecraft's internal light array, which is why glass blocks are not counted.

median_height

    "median_height": 35,

This is just the median of all the values from height_map.
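
If you want to recompute it, a sketch follows. A full height_map has 256 values, an even number, so there are two middle values. This sketch uses statistics.median_low, which returns the lower of the two and therefore always yields one of the observed heights. That tie-breaking choice is an assumption; this document does not say how the survey script handled it. Only the one row shown above is used as input.

```python
import statistics


def median_height(height_map):
    """Return the (low) median of every y-coordinate in a height map."""
    values = [y for row in height_map for y in row]
    return statistics.median_low(values)


first_row = [61, 60, 60, 60, 35, 35, 35, 34, 34, 34, 35, 35, 0, 0, 0, 0]
print(median_height([first_row]))  # 35
```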

entities and tile_entities

    "entities": {},
    "tile_entities": {}

That's not very interesting. Let's take a look at a different world that has some stuff in here.

    "entities": {"Painting":8,"Squid":1},
    "tile_entities": {"Sign":5,"Chest":2,"MobSpawner":1}

These are objects that count how many entities of a given type are in the chunk. This chunk has eight paintings, a squid, five signs, two chests, and a spawner. That's a pretty interesting chunk!
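
Since interestingness is measured by counting entities, a rough score can be recomputed from these two objects. The sketch below simply adds every count together. Treating entities and tile entities with equal weight is an assumption, and the function name is illustrative rather than taken from the survey script.

```python
def entity_score(fingerprint):
    """Sum the counts of all entities and tile entities in a fingerprint."""
    entities = fingerprint.get("entities", {})
    tile_entities = fingerprint.get("tile_entities", {})
    return sum(entities.values()) + sum(tile_entities.values())


fingerprint = {
    "entities": {"Painting": 8, "Squid": 1},
    "tile_entities": {"Sign": 5, "Chest": 2, "MobSpawner": 1},
}
print(entity_score(fingerprint))  # 17
```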

The presence of unrecognized entities will get a world flagged as modded.

The log: links-with-sample-info.json

You probably don't need this file, but I'll describe it just in case. This is the record of where on the Internet a given archive file comes from, how I downloaded it, where it can be found in the full Minecraft Archive Project dataset, and which Minecraft Geologic Survey worlds were derived from it. Here's an example:

{
  "_id": "http://mediafire.com/?brft0d8mv7r4c60",
  "hostname": "mediafire.com",
  "type": "download",
  "data_dump_path": "downloaded/planetminecraft.com/maps/201406",

  "archive": [
    {
      "accessed": "2014-06-26 21:55:28",
      "status": 509
    },
    {
      "size": 5942464,
      "filename": "Electric_Cave-1.zip",
      "path": "201406/downloads/mediafire.com/372/Electric_Cave-1.zip",
      "media_type": "application/zip",
      "accessed": "2014-06-26 02:51:21",
      "status": 200
    }
  ],
  "mcgs_paths": [
    "201407/planetminecraft.com/maps/201406/downloads/mediafire.com/372/Electric_Cave-1.zip/Electric_Cave"
  ]
}

Again, let's take it from the top, with commentary.

{
  "_id": "http://mediafire.com/?brft0d8mv7r4c60",
  "hostname": "mediafire.com",
  "type": "download",
  "data_dump_path": "downloaded/planetminecraft.com/maps/201406",

The _id is always the URL of the original file, which I got from the Minecraft forum thread, the Planet Minecraft page, or the equivalent for one of the other sites. It's usually a file hosted on mediafire.com or dropbox.com.

The hostname is the hostname of that URL, which I split out and used for statistical purposes to see which sites hosted the most files.

In this file, type will always be "download". I kept track of other kinds of links (screenshots, YouTube videos, wiki links, etc), but the world download links were the only ones that made it into the MCGS, because the MCGS only covers worlds.

data_dump_path is the path to the directory I was using when I downloaded this file. You won't find this path in the MCGS tarball—it's part of the two-terabyte Minecraft Archive Project dataset—but the MCGS paths are based on these paths.

  "archive": [
    {
      "accessed": "2014-06-26 21:55:28",
      "status": 509
    },
    {
      "size": 5942464,
      "filename": "Electric_Cave-1.zip",
      "path": "201406/downloads/mediafire.com/372/Electric_Cave-1.zip",
      "media_type": "application/zip",
      "accessed": "2014-06-26 02:51:21",
      "status": 200
    }
  ],

archive is a list of the times I tried to download the file. Sometimes something goes wrong and I try again later. In this case, my script successfully downloaded the file on June 26 at 2:51 AM. For some reason (this was a messy process) my script tried to download this file again at 9:55 PM, and got a 509 error, which didn't matter because the script had already downloaded the file the first time.
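
The entries in archive are not necessarily in chronological order, as this example shows. A sketch of sorting them and picking out the successful attempts follows. The function names are assumptions, and the timestamp format is inferred from the example above.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # inferred from the example entries


def attempts_in_order(archive):
    """Return the download attempts sorted by their 'accessed' timestamp."""
    return sorted(
        archive,
        key=lambda attempt: datetime.strptime(attempt["accessed"], TIMESTAMP_FORMAT),
    )


def successful_attempts(archive):
    """Return only the attempts that ended with HTTP status 200."""
    return [attempt for attempt in archive if attempt.get("status") == 200]


archive = [
    {"accessed": "2014-06-26 21:55:28", "status": 509},
    {
        "size": 5942464,
        "filename": "Electric_Cave-1.zip",
        "path": "201406/downloads/mediafire.com/372/Electric_Cave-1.zip",
        "media_type": "application/zip",
        "accessed": "2014-06-26 02:51:21",
        "status": 200,
    },
]

print([attempt["status"] for attempt in attempts_in_order(archive)])  # [200, 509]
print(successful_attempts(archive)[0]["filename"])  # Electric_Cave-1.zip
```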

URLs that just gave me a 404 error aren't included in this dataset. This is only for URLs where the script was eventually able to download an archive and at least try to run the MCGS survey code on the archive.

Note that the "201406" in the path is the same as the "201406" in the data_dump_path. If you want to join those two paths together (there's no reason to do this unless you have the Minecraft Archive Project data), you'll need to strip one of them.
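
A sketch of that join, dropping the duplicated component when the end of data_dump_path matches the start of path. The function name is an assumption, and the result is only meaningful if you have the Minecraft Archive Project data.

```python
def full_archive_path(data_dump_path, path):
    """Join data_dump_path and path, removing the overlapping component."""
    dump_parts = data_dump_path.rstrip("/").split("/")
    path_parts = path.lstrip("/").split("/")
    # If the dump path ends with the component the archive path starts
    # with (for example "201406"), keep only one copy of it.
    if dump_parts[-1] == path_parts[0]:
        path_parts = path_parts[1:]
    return "/".join(dump_parts + path_parts)


print(full_archive_path(
    "downloaded/planetminecraft.com/maps/201406",
    "201406/downloads/mediafire.com/372/Electric_Cave-1.zip",
))
# downloaded/planetminecraft.com/maps/201406/downloads/mediafire.com/372/Electric_Cave-1.zip
```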

  "mcgs_paths": [
    "201407/planetminecraft.com/maps/201406/downloads/mediafire.com/372/Electric_Cave-1.zip/Electric_Cave"
  ]

Most of the entries in this stream will have one or more paths in mcgs_paths. These are the paths to the MCGS sample world directories I covered earlier. Most of the time there will only be one path in here, because most of the time an archive file contains only one Minecraft world. If an archive contains more than one world (let's say it's a collection, or it contains an "easy" and a "hard" version of the same challenge), each world will be processed by the survey script, and you'll get multiple entries in mcgs_paths.

If the MCGS script runs into trouble when processing a world, there will also be a field called mcgs_exceptions, which will explain the problem. In this case, MCGS couldn't unzip the archive file:

" downloaded/minecraftforum.net/maps/201404/downloads/2012/1/949535/Survival_1.0.zip was not unzipped properly!"

In this case, MCGS unzipped the archive file but there was a problem converting the pre-Anvil world to Anvil format:

"Command '['java', '-jar', 'AnvilConverter.jar', u'unzipped/minecraftforum.net/maps/201404/downloads/2011/2/184310/World4.zip', u'World4']' returned non-zero exit status 1"

In this case, MCGS was able to load the world but the chunk data was corrupt:

"Chunk (22, 46) had an error: IOError('Unknown compress format: 82',)"

Yes, there's no limit to the number of things that can go wrong.

Conclusion

I realize that this is a lot of data, but there's no limit to what you can do with it. The projects I've done so far—the Reef worlds, Minecraft Signs, and minecraft_ebooks—just scratch the surface. At this point you know a lot more about Minecraft's data structures than I did when I started this project, and I've done a lot of the work for you in consolidating the data and sampling it down to a reasonable size. So go to it and have fun!

I recommend using pymclevel to load and manipulate the survey worlds; that's what I used to create them. If you're already familiar with Minecraft modding, you'll probably have better luck using something written in Java.

Finally, I'd like to thank the 71,094 mapmakers whose worlds show up in the Minecraft Geologic Survey. You are the ones who made this possible.