Thank you all for a wonderful semester. Here is a summary, in chronological order, of our recorded lectures. You can also view the entire playlist on YouTube.
Marti Hearst, the course instructor at UC Berkeley, introduces the main concepts for the course, and Gilad Mishne (@gilad) of Twitter describes his goals for the course and provides an introduction to Twitter. (slides for lecture1a) (slides for lecture1b).
Twitter Philosophy and Software Architecture
Othman Laraki (@othman), Twitter’s Vice President for Growth, International and Revenue, on Growing a Human-Scale Service, and Raffi Krikorian (@raffi), the Director of Twitter’s Platform Services group, on the Twitter Software Ecosystem. View the slides for Othman’s and Raffi’s talks.
Introduction to Hadoop
Bill Graham (@billgraham), who is active in the Hadoop community and a Pig contributor, gave a very clear and detailed intro to Hadoop and outlined how it is used at Twitter. His slides can be found here.
Introduction to Apache Pig
Coding to the Twitter API
Detecting Twitter Trends
If you’d like to know how Twitter computes its Trending Topics, Kostas Tsioutsiouliklis (@kostas) shared some of the secrets with the class. He also talked about MinHash algorithms. See his lecture notes.
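For readers who have not seen MinHash before, here is a minimal Python sketch of the general technique. It is not taken from Kostas's lecture and is not how Twitter implements it. The function names, the use of SHA-1 with a seed prefix as the hash family, and the word-level tokens are all assumptions made for illustration. The idea is that the fraction of positions where two MinHash signatures agree approximates the Jaccard similarity of the underlying sets.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """Hypothetical hash family: SHA-1 over 'seed:token', truncated to 64 bits."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=256):
    """Return one minimum hash value per seeded hash function."""
    unique = set(tokens)
    if not unique:
        raise ValueError("need at least one token")
    return [min(_seeded_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up example text, split into words.
    first = "twitter trends are computed in real time".split()
    second = "twitter trends are updated in real time".split()
    print("exact:   ", exact_jaccard(first, second))
    print("estimate:", estimate_jaccard(minhash_signature(first),
                                        minhash_signature(second)))
```

Using more hash functions tightens the estimate at the cost of longer signatures. The appeal is that signatures have a fixed size, so very large sets can be compared cheaply.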
Real-Time Twitter Search
Brian Larson (@larsonite), the tech lead for search and relevance at Twitter, gives a detailed technical talk about how real-time search works at Twitter.
Splunk’s Software Architecture and GUI for Analyzing Twitter Data
Stephen Sorkin of Splunk described an alternative software architecture for processing large data. Splunk also has a sophisticated GUI for analyzing Twitter and other data sources in real time; be sure to watch the last 15 minutes of the video to see the demo. Stephen’s slides: pdf
Twitter’s Social Network
Big Learning with Graphs
Joey Gonzalez, a recent PhD from CMU and a postdoc at UC Berkeley, is working on GraphLab, the hot technology for processing huge graphs quickly. There is a new version called GraphChi (for chihuahua) that you can run on your personal computer, so you don’t even need access to EC2 to run it going forward. Slides here.
Security at Twitter and Elsewhere
Kurt Thomas is a former Twitter engineer and a current PhD student at UC Berkeley who studies how the criminal underground conspires to make money via unintended uses of computer systems. He talked about fraud detection for Twitter and other online systems. See his lecture notes.
Information Diffusion on Twitter
Stan Nikolov (@snikolov) of the Twitter Search and Relevance team walked through one particular theoretical model of information diffusion that tries to predict, based on a network’s structure, under what conditions an idea stops spreading. The slides in his Lecture Notes let you see the Pig scripts in detail, and you can see the video simulations that Stan created on his blog.
Introduction to Scalding
On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin (@posco) presented a lecture that he and Argyris Zymnis (@argyris) put together. See the lecture notes for more details.
Spark: Making Big Data Analytics Interactive and Real-Time
Spark is the hot next thing for Hadoop / MapReduce, and Matei Zaharia (@matei_zaharia), a PhD student in UC Berkeley’s AMP Lab, described how it works and what’s coming next. The key idea is to make analysis of big data interactive and able to respond in real time. Next up in the research agenda is streaming data and blending real time and batch processing. Matei also gave a live demo. (slides here)
These lecture notes simply summarized the course at a high level. Thanks for the great semester!