About the Minecraft Geologic Survey dataset
by Leonard Richardson
(The latest version of this document is at http://mcgs.crummy.com/201407/README.html. You can download the MCGS from the Internet Archive.)
So, you've downloaded the Minecraft Geologic Survey dataset. Or, more prudently, you're reading this document before downloading the dataset. In that case, you should know that the dataset is twelve gigabytes in size and decompresses to about fifty gigabytes. It contains nearly two million files and directories, and will probably take an hour or two just to unzip. Before downloading a bunch of data you might not need, please download the MCGS core sample for a single Minecraft world, take a look at it using this document as a guide, and see if your idea is workable.
This document describes the first MCGS sample, "201407", taken during July and August 2014 and including samples of worlds downloaded between March and June 2014. Here's what's inside the tarball:
- MCGS survey data for 175,345 Minecraft worlds from 162,123 archive files created by 71,094 people. (Another 7155 worlds weren't surveyed due to errors in the data.) This data is kept in the normal Minecraft world format—one directory per survey. Learn more about the survey worlds.
- Each survey world directory contains a file called manifest.json which gives detailed information about the original Minecraft world. This includes the coordinates in the original world of every chunk selected for the sample, as well as the text and locations of every sign, book, and command block in the original world. Learn more about manifest.json.
- The fingerprints.json file contains semi-detailed information about every world in the survey, all in one file. If you want bang for your buck, this is the single best file in the MCGS dataset. Learn more about fingerprints.json.
- The links-with-sample-info.json file is a log of each Minecraft world's journey from being stored on a file-sharing site like Mediafire, to being downloaded to my computer and becoming part of the Minecraft Archive Project, to being surveyed and sampled down to become part of the Minecraft Geologic Survey dataset. Learn more about links-with-sample-info.json.
The survey world
This is the primary result of the Minecraft Geologic Survey: a Minecraft world that is significantly smaller than the original world, but which contains many of the points of interest. There are about 175,000 of these survey worlds; one for each original world I was able to survey.
At (0,0) of the survey world we have the spawn chunk of the original world. It's quite possible for the spawn chunk to show up again, elsewhere in the survey. I put the spawn chunk in its own column because this is the first thing a player sees and it's the place the mapmaker is most likely to have tried to make interesting.
If the world does not have a spawn chunk set, I locate a Player entity and use their current location as the spawn chunk.
The survey of the Nether begins at (32, 0) and includes 1% of the Nether chunks (minimum 5, maximum 100), with the more interesting chunks near the origin. The first chunk in the survey is located at (32, 0), the second chunk is at (32, 32), and so on. (There is always a one-chunk gap between survey chunks; this keeps them visually distinct when you load up the survey world in Minecraft.)
If the original world has no Nether, this column of chunks will be empty in the survey world. You can see what this looks like by looking at the screenshot. This world doesn't have a Nether. If it did, you would see a row of chunks in between the spawn chunk and the row of Overworld chunks.
The survey of the Overworld begins at (64, 0) and includes 1% of the Overworld chunks, with more interesting chunks near the origin. The first Overworld chunk is located at (64, 0), the second at (64, 32), and so on.
If you look at the screenshot you'll see that most of the Overworld chunks in this map are nearly identical. Sometimes this is due to shortcomings of the MCGS algorithm for deciding which chunks are "interesting", but sometimes it's just due to a homogeneous map.
The survey of the End begins at (96, 0) and includes 1% of the End chunks, with more interesting chunks near the origin. The first End chunk is located at (96, 0), the second at (96, 32), and so on.
Most maps don't feature an End, and End chunks are seldom very interesting.
Note: The survey world will be in Minecraft 1.8's Anvil format, even if the original map was in a different format.
A word about "interestingness"
The current version of the survey script measures the interestingness of a chunk in terms of entities: mobs, paintings, chests, spawners, signs, books, command blocks, and so on. It goes through every chunk in the world and scores it with the number of entities found in that chunk.
I chose the entity count because it's basically free to calculate: I had to go through the entities anyway to get the text of signs and books. But of course a chunk doesn't need to contain entities to be interesting: consider a sculpture made of wool. And even if I was really good at figuring out which chunks were interesting, a survey that only contained interesting chunks wouldn't be representative. Mediocre chunks deserve to be represented in the dataset as well.
So the sample starts out very biased towards interesting chunks,
but the bias decreases as more chunks are sampled. The chunk at (64,
0) is pretty much guaranteed to be the most entity-rich chunk in the
Overworld. Overworld chunk #10, at (64, 288) probably has one or two
entities, but probably not a whole lot. Overworld chunk #100, at (64,
3168), is chosen more or less at random from across the entire world,
and is very likely to be normal Minecraft generated terrain.
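The positions quoted above follow from the layout of the survey world: each dimension gets its own column, and chunks within a column are 32 blocks apart (one 16-block chunk plus a one-chunk gap). Here is a minimal Python sketch of that arithmetic. It is an illustration of the coordinates described in this document, not the survey script itself, and the names COLUMN_X, SPACING, and survey_position are made up for the example.

```python
# Block x-coordinate of each column in the survey world, as described above.
COLUMN_X = {"spawn": 0, "nether": 32, "overworld": 64, "end": 96}

# A chunk is 16 blocks wide and there is a one-chunk gap between survey
# chunks, so consecutive chunks in a column are 32 blocks apart.
SPACING = 32


def survey_position(column, n):
    """Return the (x, z) block coordinates of the n-th chunk (1-based) in a column."""
    if n < 1:
        raise ValueError("chunk numbers start at 1")
    return (COLUMN_X[column], SPACING * (n - 1))


print(survey_position("overworld", 1))    # (64, 0)
print(survey_position("overworld", 10))   # (64, 288)
print(survey_position("overworld", 100))  # (64, 3168)
```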
The manifest: manifest.json
Every survey directory contains a file called manifest.json which talks about the MCGS sample world, and about the original world used to generate the sample. Despite its filename, manifest.json isn't strictly a JSON file. It's a JSON stream, a file with one JSON object per line. Each object represents a single point of interest within the sample or the original world. Here are some of the objects you might see.
Signs: Sign
{ "id": "Sign", "lines": ["1.", "No cheating", "creative mods", "etc."], "dimension": 0, "coords": [-745, 36, 3201] }
This is a sign. Its coordinates within the world are recorded, as is the text on the sign.
Command blocks: Control
{"id": "Control", "command": "/give @a 307", "dimension": 0, "coords": [-750, 35, 3195] }
This is a command block. Its coordinates within the world are recorded, as is the command it executes.
Books: 386 and 387
{"id": 387, "title": "For Cheaters!", "author": "telafiesta", "dimension": 0, "coords": [-282, 73, 2840], "contained_in": "Chest", "pages": ["Check the Table in the libary, second floor can't miss it."] }
This is a book. The value of contained_in is an addendum to coords which explains where exactly the book is in that block: "Chest", "ItemFrame", "Player" (in a player's inventory), or "Ground" (lying on the ground).
An item with id=386 is also a book. (386 is the item ID for "Book and Quill", 387 is the item ID for "Written Book".)
Maps: 358
{"id": 358, "map_id": 1, "path": "data/map_1.dat", "dimension": 0, "coords": [-505, 71, 354], "contained_in": "Chest" }
This is an in-game map (358 is the item ID for "Map"). The survey
directory includes copies of all these maps, and the path
field is the path to the map file within the survey directory. The
maps are in the normal Minecraft map format—all I did was copy
them from the original world directory into the survey directory.
contained_in works the same way as for books.
Chunks: [chunk]
{"id": "[chunk]", "dimension": 0, "source_chunk": [25, 91], "destination_chunk": [0, 0] }
This is a chunk that was copied from the original world to the sample world. source_chunk and destination_chunk are chunk-based coordinates, so multiply by 16 to get block offsets. In this example, chunk (25, 91) in the overworld was copied to chunk (0,0) in the sample world. That is, the chunk that occupies the space between (400,1456) and (415, 1471) in the original world, occupies the space between (0,0) and (15, 15) in the survey world.
You can tell that this is the original world's spawn chunk, because the spawn chunk is always copied to (0,0).
Here's another example.
{"id": "[chunk]", "pool": "all", "dimension": 0, "source_chunk": [-49, 214], "destination_chunk": [2, 4] }
Chunk (-49, 214) of the original world (that is, the chunk starting at (-784, 3424)) has been copied to chunk (2, 4) of the survey world. The copy of this chunk in the survey world starts at (32, 64). This is part of the MCGS sample of the world's Overworld, and since it's relatively near the origin, it's probably more on the "interesting" side of the sample.
I don't remember what pool does, and it always seems to be "all", so don't worry about it, I guess.
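The chunk-to-block conversion used in these examples is just a multiplication by 16. A small Python sketch, with a hypothetical helper name (chunk_bounds) chosen for illustration:

```python
CHUNK_SIZE = 16  # a Minecraft chunk is 16 blocks wide in x and z


def chunk_bounds(chunk):
    """Return ((min_x, min_z), (max_x, max_z)) in block coordinates."""
    cx, cz = chunk
    min_x, min_z = cx * CHUNK_SIZE, cz * CHUNK_SIZE
    return (min_x, min_z), (min_x + CHUNK_SIZE - 1, min_z + CHUNK_SIZE - 1)


# The first [chunk] entry shown above.
entry = {"id": "[chunk]", "dimension": 0,
         "source_chunk": [25, 91], "destination_chunk": [0, 0]}

print(chunk_bounds(entry["source_chunk"]))       # ((400, 1456), (415, 1471))
print(chunk_bounds(entry["destination_chunk"]))  # ((0, 0), (15, 15))
print(chunk_bounds([-49, 214])[0])               # (-784, 3424)
```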
[core-sample]
{"id": "[core-sample]", "target_dir": "sampled/201407/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena", "source_dir": "unzipped/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena" }
This is an entry for the MCGS core sample as a whole. target_dir is the directory you found this manifest in, and source_dir is a directory that was once on my computer, used as a temporary place to unzip 1v1 Arena.rar, but which is long gone. You probably don't need this entry; fingerprints-ready.json is a lot more useful.
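Because manifest.json is a JSON stream rather than a single JSON document, json.load() on the whole file will fail; each line has to be parsed separately. The following is a minimal Python sketch of one way to do that. The helper name read_json_stream is made up for this example, and the in-memory sample stands in for a real manifest file.

```python
import io
import json


def read_json_stream(lines):
    """Yield one parsed object per non-blank line of a JSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two objects taken from the examples above, standing in for a real file.
sample = io.StringIO(
    '{"id": "Sign", "lines": ["1.", "No cheating", "creative mods", "etc."], '
    '"dimension": 0, "coords": [-745, 36, 3201]}\n'
    '{"id": "Control", "command": "/give @a 307", '
    '"dimension": 0, "coords": [-750, 35, 3195]}\n'
)

objects = list(read_json_stream(sample))

# Note that "id" is a string for signs and command blocks but an integer
# item ID for books (386, 387) and maps (358).
sign_texts = [" ".join(o["lines"]) for o in objects if o["id"] == "Sign"]
commands = [o["command"] for o in objects if o["id"] == "Control"]

print(sign_texts)  # ['1. No cheating creative mods etc.']
print(commands)    # ['/give @a 307']
```

With a real survey directory you would pass an open file object instead of the StringIO. The file's text encoding is not stated in this document, so treat any encoding argument you pass to open() as an assumption.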
The index: fingerprints.json
Now I'd like to introduce you to a very useful little file called fingerprints.json. This file includes basic information for every Minecraft world in the dataset. If you run the program to create a Reef world, the first thing that program will do is load fingerprints.json to get a sense of what it has to work with. So let's take a look.
Like manifest.json, fingerprints.json isn't strictly a JSON file. It's a JSON stream, a file with one JSON object per line. Each object represents a single Minecraft world and its MCGS sample.
Here's the first object from the stream in fingerprints.json. I'll go over it in detail in a bit, but first I want to show you the whole thing.
{ "_id": "201407/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena", "published": "2014-05-10 21:54", "popularity": 5, "thread": { "_id": "http://planetminecraft.com/project/1v1-arena-2891632/", "host": "planetminecraft.com" }, "commands": [ "/say Read Rules then click the button", "/give @a 307", ... ], "permalink": "http://planetminecraft.com/project/1v1-arena-2891632/", "title": "1v1 Arena", "signs": [ [ "1.", "No cheating", "creative mods", "etc." ], ... ], "books": [], "interactionCount": 0, "spawn_dimension": 0, "creator": "touchdown1545", "classification": "PVP", "spawn": { "block_counts": { "87": 12, "51": 12, ... }, "surface_map": [ [ 17, 18, 18, 18, 24, 24, 24, 11, 11, 11, 24, 24, 0, 0, 0, 0 ], ... ], "height_map": [ [ 61, 60, 60, 60, 35, 35, 35, 34, 34, 34, 35, 35, 0, 0, 0, 0 ], ... ], "entities": {}, "median_height": 35, "tile_entities": {} }, "tags": [ "PvP" ], "diamonds": 1, "memes": [], "modded": false, "description": "Whats up guys. I got my new 1v1 Arena ready...", "views": 48, "downloads": 8, "overworld": { "block_counts": { "174": 1440, "110": 144, ... }, "surface_map": [ ... ], "height_map": [ ... ], "entities": {}, "median_height": 61, "tile_entities": {} } }
Now let's go over it again, but this time I'll explain what everything does.
Info about the world as a whole
_id
"_id": "201407/planetminecraft.com/maps/201406/downloads/www.mediafire.com/000/1v1 Arena.rar/1v1 Arena",
This is a unique ID for the sample. It also serves as a pointer to the sample on disk, relative to wherever you extracted the MCGS tarball.
What do all these path portions mean? I'm glad you asked.
- "201407" is the date of the MCGS sample: July 2014.
- "planetminecraft.com" means that I found out about this world from Planet Minecraft.
- "maps" means that this comes from the Planet Minecraft maps section, not somewhere else like the mods section.
- "www.mediafire.com" means that I downloaded the world from MediaFire.
- "000" is meaningless; it's just a subdirectory I created to avoid having 100,000 files in the same directory.
- "1v1 Arena.rar" is the filename of the original archive file served from MediaFire.
- "1v1 Arena" is the name of the Minecraft world inside "1v1 Arena.rar". Sometimes there will be several worlds inside a single archive file; each world will get its own directory and its own entry in this file.
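Since the _id doubles as a relative path, it can be split into its components and joined to the extraction directory. This Python sketch only names the components explained in the list above (the first three and the last one); the extraction root "/data/mcgs" and the helper names are hypothetical.

```python
import posixpath


def describe_id(world_id):
    """Pick out the path portions of an _id that are explained above."""
    parts = world_id.split("/")
    return {
        "sample": parts[0],   # date of the MCGS sample, e.g. "201407"
        "site": parts[1],     # where the world was found
        "section": parts[2],  # e.g. "maps"
        "world": parts[-1],   # name of the world directory
    }


def manifest_path(extract_root, world_id):
    """Path to a world's manifest.json, given where the tarball was extracted."""
    return posixpath.join(extract_root, world_id, "manifest.json")


world_id = ("201407/planetminecraft.com/maps/201406/downloads/"
            "www.mediafire.com/000/1v1 Arena.rar/1v1 Arena")

print(describe_id(world_id)["world"])  # 1v1 Arena
# "/data/mcgs" is a made-up extraction directory.
print(manifest_path("/data/mcgs", world_id))
```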
thread
"thread": { "_id": "http://planetminecraft.com/project/1v1-arena-2891632/", "host": "planetminecraft.com", "title": "1v1 Arena" },
This serves as an identifier for the Minecraft Forum thread (or equivalent on another site). The _id should always be a URL you can use to get to a webpage describing the world. The title is the title of the thread (for the Minecraft forum) or the title of the world on the website (other sites). This may be different from the name of the world directory! (See below for details.)
Note that several worlds may come from the same thread.
permalink
"permalink": "http://planetminecraft.com/project/1v1-arena-2891632/",
I'm pretty sure this will always be the same as ['thread']['_id'], but maybe not? Better use this just to be safe.
published
"published": "2014-05-10 21:54",
This is my best guess as to when the world was published. For Minecraft Forum worlds, it's when the forum thread was created.
title
"title": "1v1 Arena",
This is the title of the Minecraft world. It's the name of the original world directory, and it may be different from the ['thread']['title']. With Planet Minecraft the thread title is usually better than the world title; with the Minecraft forum the world title is usually better.
creator
"creator": "touchdown1545",
This is the account name used by the mapmaker on the website where I found out about the world.
popularity
"popularity": 5,
This is a very rough measure of the world's popularity, derived from a combination of diamonds, interactionCount, downloads, and views. Newer worlds will always have lower popularity scores because they haven't had time to get popular. This number is meaningless on its own, but comparing the popularity of two worlds will tell you, roughly, which one is more popular.
interactionCount, downloads, views, diamonds
"interactionCount": 0, "downloads": 8, "views": 48, "diamonds": 1,
These are the proxies used to calculate popularity:
- interactionCount is the number of comments I was able to find on the world's original posting. For Planet Minecraft threads, this is the number of comments in the entire thread.
- downloads is the number of downloads.
- views is the number of views to the thread.
- diamonds is a Planet Minecraft feature similar to "liking".
Not all sites have all these features, which is one reason why popularity is so unreliable.
classification
"classification": "PVP",
This is a rough classification of the world into one of several genres, based on title, tags, and the content of in-game signs and books.
- "Adventure challenge" (A challenge world played in adventure mode, e.g. no breaking blocks.)
- "Art" (Usually pixel art.)
- "CTM" (Complete the Monument, a subgenre of "Survival challenge".)
- "Complex Structure" (A subgenre of "Structure", probably containing more than one building.)
- "Creative build"
- "Environment" (Custom terrain.)
- "Flat" (A boring world used as a base for other, more interesting worlds. Made obsolete by terrain generation customization options in recent versions of Minecraft.)
- "Gadget" (Usually involving redstone.)
- "PVP"
- "Parkour" (A subgenre of "Adventure challenge".)
- "Skyblock" (A subgenre of "Survival challenge".)
- "Structure" (Usually a single building.)
- "Survival Island" (A subgenre of "Survival challenge".)
- "Survival challenge" (A challenge world played in survival mode.)
- "Unknown challenge" (A challenge world, but I couldn't figure out whether it's played in survival or adventure mode.)
- "Unknown" (There wasn't enough information for me to automatically classify the world.)
The most popular classifications are "Structure" (34,390 worlds), "Adventure challenge" (30,349 worlds), and "Parkour" (17,657 worlds).
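Counts like these can be reproduced by tallying the classification field across the fingerprints.json stream. Here is a minimal Python sketch of that tally. The three sample records are made up, and treating a missing classification as "Unknown" is an assumption of this example, not something the dataset guarantees.

```python
import json
from collections import Counter


def classification_counts(lines):
    """Count worlds per classification in a JSON stream of fingerprints."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        world = json.loads(line)
        # Assumption: a record without the field is counted as "Unknown".
        counts[world.get("classification", "Unknown")] += 1
    return counts


# Three made-up records standing in for lines of fingerprints.json.
sample_lines = [
    '{"_id": "a", "classification": "PVP"}',
    '{"_id": "b", "classification": "Structure"}',
    '{"_id": "c", "classification": "Structure"}',
]

counts = classification_counts(sample_lines)
print(counts.most_common(1))  # [('Structure', 2)]
```

To run it on the real file, pass an open file object for fingerprints.json in place of sample_lines; reading line by line avoids loading the whole file into memory.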
description
"description": "Whats up guys. I got my new 1v1 Arena ready...",
The mapmaker's description of the world.
tags
"tags": [ "PvP" ],
Each string in this list is a tag applied to this world by the mapmaker to help people find it. Planet Minecraft has a standard list of categories, but the Minecraft forum lets you make up your own tags, and other sites may not have tags at all.
memes
"memes": [],
A list of common Minecraft memes employed in this world. This proved not to be very interesting. The memes:
- "Herobrine"
- "Notch"
- "Slenderman"
modded
"modded": false,
My guess as to whether or not this world is designed for a modded version of Minecraft.
spawn_dimension
"spawn_dimension": 0,
The number of the dimension the 'spawn' chunk is taken from. -1 is usually the Nether, 0 is usually the Overworld, and 1 is usually the End. I say 'usually' because sometimes a world will have something weird like -1 as the Overworld and 0 as the Nether. I don't know why. Of course, a modded world may have any number of additional dimensions.
commands
"commands": [ "/say Read Rules then click the button", "/give @a 307", ... ]
Each line in this list is a command found in a command block during the course of the survey. If you want more details, they're in the manifest.json file for this particular world.
signs
"signs": [ [ "1.", "No cheating", "creative mods", "etc." ], ... ],
Each item in this list is a 4-item list containing the text of a sign found during the course of the survey. If you want more details, they're in the manifest.json file for this particular world.
books
"books": [],
Each item in this list is a JSON object describing a book found during the course of the survey. Since there are no books in this world, let's take a look at a world that does have books.
[ {"author": "cilindrin", "pages": ["oops I forgot to tell you the name of this level right?..."], "title":"Opps" }, ... ]
If you want more details on a book, they're in the manifest.json file for this particular world.
Chunk fingerprints
An object will contain up to four fingerprint objects, named "spawn", "overworld", "nether", and "end".
- "spawn" is the fingerprint for the spawn chunk (or nearest equivalent), located at (0,0) in the survey world.
- "nether" is the fingerprint for the most interesting Nether chunk, located at (32,0) in the survey world. If there is no Nether in the world, this fingerprint is omitted.
- "overworld" is the fingerprint for the most interesting Overworld chunk, located at (64,0) in the survey world. If there is no Overworld in the world, this fingerprint is omitted.
- "end" is the fingerprint for the most interesting End chunk, located at (96,0) in the survey world. If there is no End in the world, this fingerprint is omitted.
As I mentioned earlier, the MCGS survey software determines how "interesting" a chunk is by counting entities.
Let's take a look at one of these fingerprints.
block_counts
"block_counts": { "87": 12, "51": 12, ... },
This is a JSON object mapping Minecraft block IDs to the number of times that block shows up in this chunk. This example says that block #87 (Netherrack) shows up 12 times, and block #51 (Fire) also shows up 12 times (presumably on top of the netherrack).
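One practical detail: JSON object keys are always strings, so the block IDs in block_counts arrive as "87" rather than 87 and need converting before you look them up. A short Python sketch follows; BLOCK_NAMES covers only the IDs named in this document, and the sandstone count in the sample is invented for illustration.

```python
# Names for the handful of block IDs mentioned in this document.
BLOCK_NAMES = {
    17: "Oak Wood",
    18: "Oak Leaves",
    24: "Sandstone",
    51: "Fire",
    87: "Netherrack",
}


def named_block_counts(block_counts):
    """Turn {"87": 12} style counts into (name, count) pairs, most common first."""
    pairs = []
    for block_id, count in block_counts.items():
        # Fall back to a generic label for IDs this sketch doesn't know.
        name = BLOCK_NAMES.get(int(block_id), "block #%s" % block_id)
        pairs.append((name, count))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# "87" and "51" come from the example above; the "24" entry is made up.
print(named_block_counts({"87": 12, "51": 12, "24": 3}))
```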
surface_map
"surface_map": [ [ 17, 18, 18, 18, 24, 24, 24, 11, 11, 11, 24, 24, 0, 0, 0, 0 ], ... ],
This is a list of 16 lists of 16 block IDs, showing which block type is the topmost block for each (x,z) coordinate in the chunk. In this example, we have a row with oak wood (17) on top, then oak leaves (18), oak leaves, oak leaves, sandstone (24), sandstone, and so on.
If unrecognized block IDs show up in this list, the world is flagged as modded.
height_map
"height_map": [ [ 61, 60, 60, 60, 35, 35, 35, 34, 34, 34, 35, 35, 0, 0, 0, 0 ], ... ],
This is a list of 16 lists of 16 y-coordinates. Each is the y-coordinate of the topmost non-transparent block at a given (x,z) coordinate. In the first row, the first few values (the oak tree; remember that the corresponding part of surface_map showed the block IDs for oak wood and oak leaves) are up at y=60, but when we get to the sandstone we quickly drop to y=35.
This is copied from Minecraft's internal light array, which is why glass blocks are not counted.
median_height
"median_height": 35,
This is just the median of all the values from height_map.
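If you want to recompute this value yourself, flatten the 16 rows and take the median. This Python sketch is one way to do it. The document doesn't say how ties between the two middle values were broken, so the use of statistics.median_low (which always returns a value actually present in the data) is an assumption, and the example uses only the single height_map row shown above rather than a full chunk.

```python
import statistics


def median_height(height_map):
    """Median of every y-coordinate in a height_map (a list of rows)."""
    values = [y for row in height_map for y in row]
    # Assumption: pick the lower of the two middle values for even-sized input.
    return statistics.median_low(values)


# The one row of the spawn chunk's height_map shown above.
row = [61, 60, 60, 60, 35, 35, 35, 34, 34, 34, 35, 35, 0, 0, 0, 0]
print(median_height([row]))  # 35
```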
entities and tile_entities
"entities": {}, "tile_entities": {}
That's not very interesting. Let's take a look at a different world that has some stuff in here.
"entities": {"Painting":8,"Squid":1}, "tile_entities": {"Sign":5,"Chest":2,"MobSpawner":1}
These are objects that count how many entities of a given type are in the chunk. This chunk has eight paintings, a squid, five signs, two chests, and a spawner. That's a pretty interesting chunk!
The presence of unrecognized entities will get a world flagged as modded.
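These two objects are enough to approximate the "interestingness" score described earlier. The Python sketch below simply adds up every count in both objects. That is an assumption about how the totals combine, not a copy of the survey script's scoring, and entity_score is a name made up for this example.

```python
def entity_score(fingerprint):
    """Total number of entities and tile entities recorded in a fingerprint."""
    entities = fingerprint.get("entities", {})
    tile_entities = fingerprint.get("tile_entities", {})
    return sum(entities.values()) + sum(tile_entities.values())


busy = {
    "entities": {"Painting": 8, "Squid": 1},
    "tile_entities": {"Sign": 5, "Chest": 2, "MobSpawner": 1},
}
empty = {"entities": {}, "tile_entities": {}}

print(entity_score(busy))   # 17
print(entity_score(empty))  # 0
```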
The log: links-with-sample-info.json
You probably don't need this file, but I'll describe it just in case. This is the record of where on the Internet a given archive file comes from, how I downloaded it, where it can be found in the full Minecraft Archive Project dataset, and which Minecraft Geologic Survey worlds were derived from it. Here's an example:
{ "_id": "http://mediafire.com/?brft0d8mv7r4c60", "hostname": "mediafire.com", "type": "download", "data_dump_path": "downloaded/planetminecraft.com/maps/201406", "archive": [ { "accessed": "2014-06-26 21:55:28", "status": 509 }, { "size": 5942464, "filename": "Electric_Cave-1.zip", "path": "201406/downloads/mediafire.com/372/Electric_Cave-1.zip", "media_type": "application/zip", "accessed": "2014-06-26 02:51:21", "status": 200 } ], "mcgs_paths": [ "201407/planetminecraft.com/maps/201406/downloads/mediafire.com/372/Electric_Cave-1.zip/Electric_Cave" ] }
Again, let's take it from the top, with commentary.
{ "_id": "http://mediafire.com/?brft0d8mv7r4c60", "hostname": "mediafire.com", "type": "download", "data_dump_path": "downloaded/planetminecraft.com/maps/201406",
The _id is always the URL of the original file, which I got from the Minecraft forum thread, the Planet Minecraft page, or the equivalent for one of the other sites. It's usually a file hosted on mediafire.com or dropbox.com.
The hostname is the hostname of that URL, which I split out and used for statistical purposes to see which sites hosted the most files.
In this file, type will always be "download". I kept track of other kinds of links (screenshots, YouTube videos, wiki links, etc.), but the world download links were the only ones that made it into the MCGS, because the MCGS only covers worlds.
data_dump_path is the path to the directory I was using when I downloaded this file. You won't find this path in the MCGS tarball (it's part of the two-terabyte Minecraft Archive Project dataset), but the MCGS paths are based on these paths.
"archive": [ { "accessed": "2014-06-26 21:55:28", "status": 509 }, { "size": 5942464, "filename": "Electric_Cave-1.zip", "path": "201406/downloads/mediafire.com/372/Electric_Cave-1.zip", "media_type": "application/zip", "accessed": "2014-06-26 02:51:21", "status": 200 } ],
archive is a list of times I tried to download the file. Sometimes something goes wrong and I try again later. In this case, my script successfully downloaded the file on June 26 at 2:51 AM. For some reason (this was a messy process) my script tried to download this file again at 9:55 PM, and got a 509 error, which is just as well because the script was able to download it the first time.
URLs that just gave me a 404 error aren't included in this dataset. This is only for URLs where the script was eventually able to download an archive and at least try to run the MCGS survey code on the archive.
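To find the attempt that actually produced a file, filter the archive list on its status field. A minimal Python sketch, using the example entry above; treating status 200 as the only marker of success is an assumption of this example.

```python
def successful_downloads(link):
    """Return the download attempts that came back with HTTP status 200."""
    return [attempt for attempt in link.get("archive", [])
            if attempt.get("status") == 200]


link = {
    "_id": "http://mediafire.com/?brft0d8mv7r4c60",
    "archive": [
        {"accessed": "2014-06-26 21:55:28", "status": 509},
        {"size": 5942464, "filename": "Electric_Cave-1.zip",
         "path": "201406/downloads/mediafire.com/372/Electric_Cave-1.zip",
         "media_type": "application/zip",
         "accessed": "2014-06-26 02:51:21", "status": 200},
    ],
}

for attempt in successful_downloads(link):
    print(attempt["filename"], attempt["accessed"])
```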
Note that the "201406" in the path is the same as the "201406" in the data_dump_path. If you want to join those two paths together (there's no reason to do this unless you have the Minecraft Archive Project data), you'll need to strip one of them.
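Here is one way that join could look in Python: drop the first component of path when it repeats the last component of data_dump_path, then join the rest. The function name full_archive_path is made up, and the result only points at a real file if you have the Minecraft Archive Project data.

```python
import posixpath


def full_archive_path(data_dump_path, path):
    """Join data_dump_path and path, dropping the duplicated date directory."""
    dump_parts = data_dump_path.rstrip("/").split("/")
    path_parts = path.split("/")
    # Strip the leading "201406"-style component if it repeats the dump's last one.
    if path_parts and path_parts[0] == dump_parts[-1]:
        path_parts = path_parts[1:]
    return posixpath.join(*(dump_parts + path_parts))


print(full_archive_path(
    "downloaded/planetminecraft.com/maps/201406",
    "201406/downloads/mediafire.com/372/Electric_Cave-1.zip",
))
# downloaded/planetminecraft.com/maps/201406/downloads/mediafire.com/372/Electric_Cave-1.zip
```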
"mcgs_paths": [ "201407/planetminecraft.com/maps/201406/downloads/mediafire.com/372/Electric_Cave-1.zip/Electric_Cave" ]
Most of the entries in this stream will have one or more paths in mcgs_paths. These are the paths to the MCGS sample world directories I covered earlier. Most of the time there will only be one path in here, because most of the time an archive file contains only one Minecraft world. If an archive contains more than one world (let's say it's a collection, or it contains an "easy" and a "hard" version of the same challenge), each world will be processed by the survey script, and you'll get multiple entries in mcgs_paths.
If the MCGS script runs into trouble when processing a world, there will also be a field called mcgs_exceptions, which will explain the problem. In this case, MCGS couldn't unzip the archive file:
" downloaded/minecraftforum.net/maps/201404/downloads/2012/1/949535/Survival_1.0.zip was not unzipped properly!"
In this case, MCGS unzipped the archive file but there was a problem converting the pre-Anvil world to Anvil format:
"Command '['java', '-jar', 'AnvilConverter.jar', u'unzipped/minecraftforum.net/maps/201404/downloads/2011/2/184310/World4.zip', u'World4']' returned non-zero exit status 1"
In this case, MCGS was able to load the world but the chunk data was corrupt:
"Chunk (22, 46) had an error: IOError('Unknown compress format: 82',)"
Yes, there's no limit to the number of things that can go wrong.
Conclusion
I realize that this is a lot of data, but there's no limit to what you can do with it. The projects I've done so far—the Reef worlds, Minecraft Signs, and minecraft_ebooks—just scratch the surface. At this point you know a lot more about Minecraft's data structures than I did when I started this project, and I've done a lot of the work for you in consolidating the data and sampling it down to a reasonable size. So go to it and have fun!
I recommend using pymclevel to load and manipulate the survey worlds; that's what I used to create them. If you're already familiar with Minecraft modding, you'll probably have better luck using something written in Java.
Finally, I'd like to thank the 71,094 mapmakers whose worlds show up in the Minecraft Geologic Survey. You are the ones who made this possible.