Thank you all for a wonderful semester. Here is a summary, in chronological order, of our recorded lectures. You can also view the entire playlist on YouTube.
Course Introduction
Marti Hearst, the course instructor at UC Berkeley, introduces the main concepts for the course, and Gilad Mishne (@gilad) of Twitter describes his goals for the course and provides an introduction to Twitter. (slides for lecture1a) (slides for lecture1b).
http://www.youtube.com/watch?feature=player_embedded&v=N2yW6QzRq80
Twitter Philosophy and Software Architecture
Othman Laraki (@othman), Twitter’s Vice President for Growth, International and Revenue, on Growing a Human-Scale Service, and Raffi Krikorian (@raffi), the Director of Twitter’s Platform Services group, on the Twitter Software Ecosystem. View the slides for Othman’s and Raffi’s talks.
http://www.youtube.com/watch?feature=player_embedded&v=8egzxhgTV5Q
Introduction to Hadoop
Bill Graham (@billgraham), who is active in the Hadoop community and a Pig contributor, gave a very clear and detailed intro to Hadoop and outlined how it is used at Twitter. His slides can be found here.
http://www.youtube.com/watch?feature=player_embedded&v=t3fEGhE-HYA
Introduction to Apache Pig
Jon Coveney (@jco) gives an in-depth tutorial on Apache Pig, including how it interacts with Hadoop. The log analysis group at Twitter uses Pig extensively. Jon’s slides can be found here: (pdf)
http://www.youtube.com/watch?feature=player_embedded&v=UiIjEzW3br8
Coding to the Twitter API
Rion Snow (@rion) gave an introduction to the Twitter API, including the RESTful API and the streaming API for both Java and Python. See all the slides (no video).
Detecting Twitter Trends
If you’d like to know how Twitter computes its Trending Topics, Kostas Tsioutsiouliklis (@kostas) shared some of the secrets with the class. He also talked about MinHash algorithms. See his lecture notes.
http://www.youtube.com/watch?feature=player_embedded&v=duHxpSTmwW0
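For readers unfamiliar with MinHash, here is a minimal sketch of the general technique in Python. It is not Twitter's implementation and is not taken from the lecture notes; the function names, the choice of BLAKE2b as the hash, and the default of 128 hash functions are all assumptions made for illustration. The idea is that the fraction of positions where two signatures agree approximates the Jaccard similarity of the underlying sets.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """Hypothetical helper: a 64-bit hash of `token`, varied by `seed`."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """Return one minimum hash value per seeded hash function."""
    unique = set(tokens)
    if not unique:
        raise ValueError("need at least one token")
    return [min(_seeded_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up example data: the words of two short messages.
    first = "the quick brown fox jumps over the lazy dog".split()
    second = "the quick brown cat jumps over the sleepy dog".split()
    print("estimate:", estimate_jaccard(minhash_signature(first), minhash_signature(second)))
    print("exact:   ", exact_jaccard(first, second))
```

Using more hash functions tightens the estimate at the cost of a longer signature. Real systems usually pair the signatures with an indexing scheme so that similar items can be found without comparing every pair.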
Real-Time Twitter Search
Brian Larson (@larsonite), the tech lead for search and relevance at Twitter, gives a detailed technical talk about how real-time search works at Twitter.
http://www.youtube.com/watch?feature=player_embedded&v=PAYOjnLICo4
Splunk’s Software Architecture and GUI for Analyzing Twitter Data
Stephen Sorkin of Splunk described an alternative software architecture for processing large data. Splunk also has a sophisticated GUI for analyzing Twitter and other data sources in real time; be sure to watch the last 15 minutes of the video to see the demo. Stephen’s slides: pdf
http://www.youtube.com/watch?feature=player_embedded&v=LPYw5p8J7i0
Twitter’s Social Network
In this lecture, Aneesh Sharma (@aneeshs) explains weak ties, triadic closure, and personalized PageRank, and how they all relate to the Twitter social graph. Slides here.
http://www.youtube.com/watch?feature=player_embedded&v=_eWDR3P8Lhw
Big Learning with Graphs
Joey Gonzalez, a recent PhD from CMU and a postdoc at UC Berkeley, is working on GraphLab, the hot technology for processing huge graphs quickly. There is a new version called GraphChi (for chihuahua) that you can run on your personal computer, so you don’t even need access to EC2 to run it going forward. Slides here.
http://www.youtube.com/watch?feature=player_embedded&v=E1LwqtBdPYs
Twitter Recommendations
Alpa Jain (@alpa), who works on monetization algorithms at Twitter, described SVD and other recommendation algorithms used at Twitter. Alpa’s slides are here: pdf
http://www.youtube.com/watch?feature=player_embedded&v=NSscbT7JwxY
Security at Twitter and Elsewhere
Kurt Thomas is a former Twitter engineer and a current PhD student at UC Berkeley who studies how the criminal underground conspires to make money via unintended uses of computer systems. He talked about fraud detection for Twitter and other online systems. See his lecture notes.
http://www.youtube.com/watch?feature=player_embedded&v=fU2gcB_toNw
Information Diffusion on Twitter
Stan Nikolov (@snikolov) of the Twitter Search and Relevance team walked through one particular theoretical model of information diffusion, which tries to predict, based on a network’s structure, the conditions under which an idea stops spreading. The slides in his Lecture Notes let you see the Pig scripts in detail, and you can see the video simulations that Stan created on his blog.
http://www.youtube.com/watch?feature=player_embedded&v=lbCmFZpMNxA
Introduction to Scalding
On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin (@posco) presented a lecture that he and Argyris Zymnis (@argyris) put together. See the lecture notes for more details.
http://www.youtube.com/watch?feature=player_embedded&v=4wNwOGBkTGQ
Spark: Making Big Data Analytics Interactive and Real-Time
Spark is the hot next thing for Hadoop / MapReduce, and Matei Zaharia (@matei_zaharia), a PhD student in UC Berkeley’s AMP Lab, described how it works and what’s coming next. The key idea is to make analysis of big data interactive and able to respond in real time. Next up in the research agenda is streaming data and blending real time and batch processing. Matei also gave a live demo. (slides here)
http://www.youtube.com/watch?feature=player_embedded&v=rpXxsp1vSEs
Course Wrapup
These lecture notes simply summarized the course at a high level. Thanks for the great semester!