Five Years on Scratch: A New Youth Programming Dataset

Scientific Data, a Nature Publishing Group journal, recently published research by two Department faculty members: Assistant Professor Mako Hill and Affiliate Assistant Professor Andrés Monroy-Hernández, who is also a Researcher at FUSE Labs, Microsoft Research.

Scientific Data’s publication of a “Data Descriptor” marks the end of nearly three years of work by Dr. Hill and Dr. Monroy-Hernández to build, document, and release a massive dataset of public information collected during a five-year study of youth programming and social interaction in the Scratch online community.

Scratch is a programming language created by the Lifelong Kindergarten group at the MIT Media Lab. The Scratch online community, which Dr. Monroy-Hernández created in March 2007, is a public, freely accessible website for Scratch users, where millions of young people have learned to program by creating and remixing animations, games, and other multimedia elements into interactive programs, referred to as ‘projects.’ Users of the online community can view and learn from previously shared projects, communicate with and ask questions of their peers, and find potential collaborators.

Dr. Hill and Dr. Monroy-Hernández believe “this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.” The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more, including metadata on user behavior and the full source code for every project.

“Our primary goal in the release of these data is to make it easier for researchers to study how young people learn, create, communicate, and interact in informal learning environments,” the professors explain in the article, “especially around computer programming.” By providing wider access to the complete dataset, Dr. Hill and Dr. Monroy-Hernández hope to encourage other researchers “to develop and test theories in diverse fields of study including communication, the learning sciences, computer science, the social sciences, and digital humanities research.”

Although the dataset contains only data that the Scratch website posts publicly, the researchers worked closely with both the Scratch team and the ethics review board at MIT to develop a protocol for releasing the data, one that attempts to balance the scientific benefits of wider access against the potential risks to human subjects. As a result, the researchers will grant access only to vetted researchers who agree to the data sharing agreement and plan to use the data solely for research purposes.

The full “Data Descriptor” publication, which includes more information on Methods, Core Tables, and Samples, is available in Scientific Data.