ALFA Group Projects in Fall 2017 / Spring 2018
Learning Defenses in Computer Networks: Adversarial Machine Learning Approaches
Computer systems are easy to attack if considered in a static scenario. The adversary has the advantage in time to study the system, find its vulnerabilities and choose the place to attack. To counter that, one can use the concept of moving target defense (MTD): making the system dynamic, and consequently more difficult for an attacker to exploit, since the attacker then also has to deal with a great deal of uncertainty, just as defenders do.
This project aims to use the concept of adversarial neural networks to model the dynamics between the defender and the attacker. The project will involve applying machine learning to investigate how to secure Peer-to-Peer networks against autonomous and adaptive adversaries. It is ideal for students planning on taking 6.857 and/or 6.858.
Trajectories Like Mine: Machine Learning for Healthcare
The machine learning problem of “trajectories like mine” is to efficiently find patients with physiological waveforms similar to a reference waveform. Once a similarity set is found, it can be exploited for future or diagnostic extrapolations to the patient of reference without model-based learning.
One ML approach for retrieving “trajectories like mine” is locality sensitive hashing. We are interested in scalable and practical implementations of LSH extensions for prediction problems in EEG, ECG or arterial blood pressure (ABP).
Data Science for Online Education
More and more material is available for learning online, e.g. on Massive Open Online Course platforms. While students learn online, their activities can be tracked and later used for research, or even fed through machine learning models that help guide instruction and improve learning.
ALFA is conducting Data Science and Machine Learning on such activity data. Join us in developing new technology to help learners learn better and instructors teach better. We are working on feature engineering, predictive modeling and transfer learning. We plan to dive into an "Intro to Java" course to motivate our efforts.
How Safe Are We: Metrics in Cybersecurity
To assess the resilience of a computer system against various threats, we need metrics that measure certain properties of the system. Moreover, metrics can be defined not only for the defender but also for assessing how successful the attacker is.
A number of metrics and categories already exist in cybersecurity. The goal of this project is to examine the existing metrics and propose new ones that better capture certain behaviors of a system. Particular emphasis is given to metrics that can detect zero-day attacks. In developing new cybersecurity metrics we will draw inspiration from, for example, metrics used in biology.
MEng and UAP Projects
MOOC Data Science: Exploiting Learner Activity Data for the Science of Learning
Project Question: How can the activity data of a single or thousands of online learners be organized, analyzed and then exploited to inform Learning Science?
Each time a learner interacts with an e-learning system it is possible to capture a record of their engagement. Data comprising mouse clicks, video controls, problem responses, programming, collaborations and discussions, i.e. learner activity data, then becomes available to Learning Science. The challenge is to provide technology and develop new approaches that transform this fundamentally different set of observations of how learning proceeds into actionable knowledge. Our project goal is to provide insights into online Learning Science: the science of how e-learners learn and how online courses, modules and units can be effectively taught. For example, student stopout is significantly higher in online courses than in traditional classroom settings. How can stopout be accurately predicted? How can prediction support more personalized interventions and potentially decrease stopout rates?
The MOOC platform logs most learner behaviors, e.g. clicks. For large MOOCs this means a significant amount of data. To make this data available to Learning Science teams, we are developing an open source data science pipeline as part of our MOOC Learner Project (MLP). The components of MLP are currently:
- MOOC Learner Curated: unifies and curates the captured data from an edX MOOC platform into a schema oriented around learners and different types of activities, populating a course database in MySQL.
- MOOC Learner Quantified: forms complex descriptions, i.e. features, of student learning behavior and course properties from a course database. Kicks off the process of understanding a particular course and its students.
- MOOC Learner Modeled: creates models of MOOC learner behavior using machine learning, from a dataset derived using the MOOC Learner Quantified software. Facilitates prediction and explanatory modeling.
- MOOC Learner Visualized: visualizes MOOC learner data, features and models.
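As a rough illustration of the kind of output the Quantified step produces, the sketch below turns a handful of click events into per-learner weekly activity counts and a simple stopout week. The event tuples, the function names and the definition of stopout as the last week with any activity are all assumptions made for this example; they are not the MLP schema or API.

```python
from collections import defaultdict
from datetime import date

# Hypothetical click events: (learner_id, event_date, event_type).
# The real curated schema is richer; this is only a stand-in.
EVENTS = [
    ("u1", date(2017, 9, 4), "video_play"),
    ("u1", date(2017, 9, 5), "problem_check"),
    ("u1", date(2017, 9, 19), "problem_check"),
    ("u2", date(2017, 9, 6), "video_play"),
    ("u2", date(2017, 9, 7), "forum_post"),
]
COURSE_START = date(2017, 9, 4)  # assumed first day of the course


def week_index(day, start=COURSE_START):
    """Return the zero-based course week that a calendar day falls in."""
    return (day - start).days // 7


def weekly_counts(events):
    """Count events per learner, per week, per event type."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for learner, day, kind in events:
        counts[learner][week_index(day)][kind] += 1
    return counts


def stopout_week(counts, learner):
    """Assumed definition: the last week in which the learner was active."""
    return max(counts[learner])


features = weekly_counts(EVENTS)
```

Counts like these could serve as inputs to a stopout prediction model; which features are actually predictive is one of the questions the project investigates.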
Available data: 7+ MITx courses, constituting 20-120 GB of "raw" clickstream data, plus text data, e.g. forums.
Research Questions
- Before and After: Measuring the Impact of Instructional Design Changes: What is the impact of changes in course design, e.g. updated videos? We have two sessions of the same course where the video content and structure (i.e. length) were changed in between. By comparing the learner activity in the before session to the after session, what can we learn about the effectiveness of the changes? An ALFA team member has recently developed code that curates video watching behavior for this course. A new team member is sought to analyze the data.
- Confirming the Doer Effect: Can we observationally confirm the doer effect in Introduction to Programming MOOCs? "The "doer effect" is an association between the number of online interactive practice activities students' do and their learning outcomes that is not only statistically reliable but has much higher positive effects than other learning resources, such as watching videos or reading text. Such an association suggests a causal interpretation — more doing yields better learning — which requires randomized experimentation to most rigorously confirm. But such experiments are expensive, and any single experiment in a particular course context does not provide rigorous evidence that the causal link will generalize to other course content." Koedinger, MacLaughlan, Jia, Bier, 2016 These learning scientists suggest that "analytics of increasingly available online learning data sets can complement experimental efforts by facilitating more widespread evaluation of the generalizability of claims about what learning methods produce better student learning outcomes."
- Sustaining Open Source Software: How can we develop and maintain a usable open source software project for data science with MOOCs? See http://mooclearnerproject.csail.mit.edu
- Learning design patterns: Can the patterns that instructors use to teach a concept, given a platform's options, be conceptually extended and implemented with semantic tagging for specific analysis? This means that, before data collection, content would be instrumented for activity collection that, upon post-hoc analysis, will help identify the impact of the pattern usage. This effectively enables a learning scientist, with knowledge of the content and the teaching design decision process, to tell the data scientists what to collect. This work is in collaboration with colleagues from the HK University School of Education who are developing a learning design pattern language. A new student team member is sought to work on semantic tagging and data interpretation for a particular pattern, course and module.
Machine Learning and Cyber Security in Peer-to-Peer Networks
Denial of Service (DoS) cyberattacks continue to increase and cause numerous disruptions in both industry and politics. With more and more critical information moving through networks, it is important to keep these networks available. A Peer-to-Peer network can be utilized against DoS attacks due to its decentralized nature, but the separation between the physical and logical layers is still challenging. The project will involve applying machine learning to investigate how to secure Peer-to-Peer networks against autonomous and adaptive adversaries. It is ideal for students planning on taking or who have taken 6.857 and/or 6.858.
Coding the Tax Code: Regulation to Formalism
AI techniques exist that translate case law into software and that support intelligent reasoning and argumentation around it. This project focuses instead on the regulatory form of law, e.g. tax law. It will involve developing an automatic parsing system for translating regulations into a formalism. It is part of the larger STEALTH project. This project will appeal to students interested in programming languages and/or natural language text understanding and representation techniques. See the NYTimes article related to this project: Computer Scientists Wield Artificial Intelligence to Battle Tax Evasion.
Learning Defenses in Computer Systems: Adversarial Neural Networks Approach
Computer systems are easy to attack if considered in a static scenario. The adversary has the advantage in time to study the system, find its vulnerabilities and choose the place to attack. To counter that, one can use the concept of moving target defense (MTD): making the system dynamic, and consequently more difficult for an attacker to exploit, since the attacker then also has to deal with a great deal of uncertainty, just as defenders do. This project aims to use the concept of adversarial neural networks to model the dynamics between the defender and the attacker. Both the defender and the attacker would be represented by a neural network that learns how to perform the task of defense or attack, respectively.
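The intuition behind moving target defense can be checked with a toy simulation, sketched below. It is not the neural network model this project proposes: the attacker here simply sweeps through configurations in a fixed random order, and all names and parameters (`n_configs`, `budget`, `period`) are assumptions made for illustration.

```python
import random


def attack_succeeds(n_configs, budget, period, rng):
    """Simulate one attack; return True if the attacker hits the target.

    period=None models a static defender. Otherwise the defender re-draws
    its configuration every `period` time steps (moving target defense).
    """
    target = rng.randrange(n_configs)
    sweep = list(range(n_configs))
    rng.shuffle(sweep)  # the attacker probes configurations in this order
    for step in range(budget):
        if period is not None and step > 0 and step % period == 0:
            target = rng.randrange(n_configs)  # the defender moves
        if sweep[step % n_configs] == target:
            return True
    return False


def success_rate(n_configs, budget, period, trials=2000, seed=0):
    """Estimate the attacker's success probability over many trials."""
    rng = random.Random(seed)
    hits = sum(attack_succeeds(n_configs, budget, period, rng)
               for _ in range(trials))
    return hits / trials


static_rate = success_rate(n_configs=20, budget=20, period=None)
moving_rate = success_rate(n_configs=20, budget=20, period=1)
```

With a probing budget equal to the number of configurations, the static defender is always found. When the defender re-draws its configuration at every step, each probe succeeds with probability 1/20, so the attacker's overall chance is about 1 - (1 - 1/20)^20, or roughly 0.64.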
Trajectories Like Mine: Machine Learning in Healthcare
The machine learning problem of “trajectories like mine” is to efficiently find patients with physiological waveforms similar to a reference waveform. Once a similarity set is found, it can be exploited for future or diagnostic extrapolations to the patient of reference without model-based learning. One ML approach for retrieving “trajectories like mine” is locality sensitive hashing. We are interested in practical implementations of LSH extensions for prediction problems in EEG, ECG or arterial blood pressure (ABP). Can different hashing families be combined? Are there different hashing families for different types of predictions or data?
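To make the retrieval idea concrete, here is a minimal sketch of one well-known LSH family, random hyperplane hashing, which groups vectors that point in similar directions. It assumes waveforms have already been cut into fixed-length numeric segments; the function names and the choice of family are assumptions for illustration, and which family suits EEG, ECG or ABP data is exactly the open question posed above.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes for sign-based hashing."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(waveform, hyperplanes):
    """Hash a fixed-length waveform to a tuple of sign bits."""
    return tuple(
        int(sum(w * h for w, h in zip(waveform, plane)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(waveforms, hyperplanes):
    """Group patient ids by hash key so a lookup scans only one bucket."""
    index = defaultdict(list)
    for patient_id, waveform in waveforms.items():
        index[lsh_key(waveform, hyperplanes)].append(patient_id)
    return index


def candidates(reference, index, hyperplanes):
    """Return the patients whose waveform shares the reference's bucket."""
    return index.get(lsh_key(reference, hyperplanes), [])
```

A lookup hashes the reference waveform once and inspects a single bucket instead of comparing against every patient. One common way to combine families is to concatenate the keys they produce, which makes buckets more selective at the cost of more missed neighbors.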
See below for some of the projects we offered in the past, if you'd like to familiarize yourself with our interests.
Past Projects Table of Contents
- Data Science (RA for MEng) (Filled - Spring 2017)
- Machine Learning and CyberSecurity in Software Defined Networks (Filled - Spring 2017)
- Coding the Tax Code: Regulations to Formalism (Filled - Spring 2017)
- Coding the Tax Code: Software Verification (Filled - Spring 2017)
- Coevolutionary Algorithm Design (Filled - Fall 2016)
- Network Model Simulation for Cybersecurity (Filled - Fall 2016)
- MOOC Data Schema Population (Filled - Fall 2016)
- MOOC Behavioral Variable Engineering (Filled - Fall 2016)
- MOOC Student Modeling (Filled - Fall 2016)
- STEALTH: Tax Law as Non-Monotonic Logic (Filled - Fall 2016)
- Machine Learning for Candidate Filtering (Filled - Fall 2015)
- STEM (Filled - Fall 2015)
- STEALTH (Filled - Fall 2015)
- Fast BigData Learning with GPUs (Filled - 2013)
- Development of a 'blended' (classroom and online) course in China on Evolutionary Processes, Systems and Computation (Filled - 2014)
- Mining a MOOC's activity data: 6.002X explored (Filled - 2012)
- Super-UROP Projects 2013-2014
Data Science (RA for MEng)
(Filled - Spring 2017)
Data science is positioned at the intersection of large scale data collection, analytics and machine learning. Domain experts want actionable insights from their data, but extracting predictive models and other insights requires an extensive software workflow. We are trying to make that workflow flexible, fast and intelligent. This project will tackle the problem of working the student behavioral data archived from a MOOC through an open source data science pipeline. Experience in Python is essential; experience with SQL and machine learning is helpful.
Machine Learning and CyberSecurity in Software Defined Networks
(Filled - Spring 2017)
The flexibility of Software Defined Networks has resulted in increasing growth and adoption. However, alarming vulnerabilities to hacking have recently been revealed. The project will involve applying machine learning to replicate how topology poisoning can be used for eavesdropping and to detect the fake links that the attacker introduces into the network. It is ideal for students planning on taking or who have taken 6.857 and/or 6.858. We need help with the modeling, elucidating the hacking schemes, and figuring out defenses.
Coding the Tax Code: Regulations to Formalism
(Filled - Spring 2017)
AI techniques exist that translate case law into software and that support intelligent reasoning and argumentation around it. This project focuses instead on the regulatory form of law. It will involve developing an automatic parsing system for translating regulations into a formalism. It is part of the larger STEALTH project (NY Times coverage). This project will appeal to students interested in programming languages and/or natural language text understanding techniques.
Coding the Tax Code: Software Verification
(Filled - Spring 2017)
To date, programs that constitute AI-based law are difficult to check, i.e. there is no formal process that verifies the law is correctly transcribed. This project will involve developing a verification process. We're trying to develop a method to verify the program that results from translating an intermediary legal formalism (the specification) into a program. This activity is part of the larger STEALTH project (NY Times coverage). This project will appeal to students interested in the topics of two advanced courses: 6.820 and/or 6.887.
Coevolutionary Algorithm Design
(Filled - Fall 2016)
Implementation of a co-evolutionary algorithm to support the RIVALS and MACE systems. These systems support the study of adversarial dynamics in cyber-security. They model attackers and defenders in cyberspace paying particular attention to arms races and deception.
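For readers unfamiliar with the technique, the sketch below shows the general shape of a competitive coevolutionary loop on a deliberately tiny problem. It is not code from RIVALS or MACE: the representation (each individual is a single configuration index), the fitness functions and every parameter are assumptions chosen to keep the example short.

```python
import random


def coevolve(n_configs=8, pop_size=20, generations=30, mut_rate=0.1, seed=0):
    """Competitively coevolve attacker and defender populations.

    An attacker scores when it matches a defender's configuration;
    a defender scores when it avoids an attacker's guess.
    """
    rng = random.Random(seed)
    attackers = [rng.randrange(n_configs) for _ in range(pop_size)]
    defenders = [rng.randrange(n_configs) for _ in range(pop_size)]

    def next_generation(population, fitness):
        # Binary tournament selection followed by random-reset mutation.
        children = []
        for _ in range(len(population)):
            a, b = rng.choice(population), rng.choice(population)
            child = a if fitness(a) >= fitness(b) else b
            if rng.random() < mut_rate:
                child = rng.randrange(n_configs)
            children.append(child)
        return children

    for _ in range(generations):
        # Fitness is measured against a snapshot of the opposing population.
        def attacker_fitness(a, ds=tuple(defenders)):
            return sum(a == d for d in ds)

        def defender_fitness(d, atk=tuple(attackers)):
            return sum(d != a for a in atk)

        attackers, defenders = (
            next_generation(attackers, attacker_fitness),
            next_generation(defenders, defender_fitness),
        )
    return attackers, defenders
```

Because each side's fitness depends on the other population, there is no fixed objective; the populations chase each other, which is the arms race behavior such systems are built to study.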
Network Model Simulation for Cybersecurity
(Filled - Fall 2016)
Model peer-to-peer and Software Defined Network configurations and traffic to support the configurability of the RIVALS and MACE systems. These systems support the study of adversarial dynamics in cyber-security. They model attackers and defenders in cyberspace, paying particular attention to arms races and deception.
MOOC Data Schema Population
(Filled - Fall 2016)
Contribute to and support the release of open-source software that transforms the raw learner behavioral data captured during MOOC learning into a relational database.
MOOC Behavioral Variable Engineering
(Filled - Fall 2016)
Contribute to and support the release of open-source software that transforms direct MOOC learner data stored within a relational database into module-based variables that describe learner behavior at a practical abstraction.
MOOC Student Modeling
(Filled - Fall 2016)
Contribute to and support the release of open-source software that models and predicts student behavior from module-based variables.
STEALTH: Tax Law as Non-Monotonic Logic
(Filled - Fall 2016)
Translate specific parts of the tax code into non-monotonic logic to evaluate the logic's capacity to express the code and support inferential use.
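What makes a logic non-monotonic is that adding a fact can withdraw an earlier conclusion, which is how rules with exceptions behave. The sketch below shows this with a single default rule. The rule and the fact names are simplified and illustrative; they are not a transcription of any tax provision or of the project's formalism.

```python
def is_taxable(facts):
    """Default rule: income is taxable unless a known exception applies.

    `facts` is a set of strings. The rule and the exception names are
    illustrative assumptions, not taken from the tax code.
    """
    exceptions = {"gift", "municipal_bond_interest"}
    return "income" in facts and not (facts & exceptions)


base = {"income"}
extended = base | {"gift"}  # strictly more information than `base`
```

With `base`, the default fires and the income is classed as taxable. With `extended`, the same rule no longer reaches that conclusion, even though nothing was removed from the set of facts. In a monotonic logic, conclusions drawn from `base` would have to remain valid for every superset of it.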
Machine Learning for Candidate Filtering
(Filled - Fall 2015)
How do you make sure that you do not miss any candidates with potential when you filter applicants for an advertised position, i.e. can you create a system that covers for you when you have a bad day? When selecting candidates from a large number of applications there is always a worry that you missed a candidate with great potential. In classification terms, you produced a false negative. For example, when there are multiple applications to an education program, how do you make sure that you do not miss students with strong potential? This project involves identifying which features to observe when selecting candidates from a large number of applications and building a machine learning system that produces a model for decision support. This system will be deployed to flag candidates with great potential.
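The concern about false negatives can be made measurable. The sketch below assumes a model that assigns each application a score and a reviewer label marking truly strong candidates; it then picks the strictest score cutoff that keeps the false negative rate within a chosen budget. The scores, labels and function names are synthetic and hypothetical, made up for this example.

```python
def false_negative_rate(scores, labels, threshold):
    """Fraction of truly strong candidates that the threshold filters out."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    missed = sum(s < threshold for s in positives)
    return missed / len(positives)


def highest_safe_threshold(scores, labels, max_fnr):
    """Pick the strictest threshold whose false negative rate is in budget."""
    for threshold in sorted(set(scores), reverse=True):
        if false_negative_rate(scores, labels, threshold) <= max_fnr:
            return threshold
    return min(scores)


# Synthetic model scores and reviewer labels (1 = strong potential).
SCORES = [0.95, 0.80, 0.65, 0.60, 0.40, 0.30, 0.20]
LABELS = [1, 1, 0, 1, 0, 1, 0]
```

Lowering the budget pushes the cutoff down: more applications must be read by a person, but fewer strong candidates are lost. That trade-off is what a decision support model would expose to the reviewer.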
STEM
(Filled - Fall 2015)
How do you make computer science teaching compelling, accessible and possible to use at a large scale? In the ALFA group we are developing Educational Technology to help improve the content, delivery and reach of computer science education, e.g. reducing passive learning by interacting with algorithms in the classroom and then stepping through them online. If you think you can create material that is aesthetically pleasing, is intuitive to use, contains engaging exercises and/or enlarges capacity and automation, then ALFA is for you.
STEALTH
(Filled - Fall 2015)
Can we hack the legislation "whack-a-mole"? Every time a regulation is introduced, a new loophole is found and exploited. In the STEALTH project (http://stealth.csail.mit.edu) at the ALFA group we are developing Artificial Intelligence approaches to anticipate fraud in regulations, e.g. learning which transactions in US partnership taxation can be non-compliant and how to audit them. In the STEALTH project you will specify, implement, test and improve legislation before releasing it, in order to catch unwanted features and identify what actions might be indicative of suspicious behavior.
Development of a 'blended' (classroom and online) course in China on Evolutionary Processes, Systems and Computation
(Filled - 2014)
How do you create a novel and interesting course in Computer Science with a scalable global delivery?
In collaboration with Shantou University in China we are developing a "blended" (classroom and online) course on Evolutionary Processes, Systems and Computation. The first week will be taught in China, then 8 weeks are taught online and finally the last week is taught in China. The course aims are:
- To extend the students' understanding of evolutionary processes, systems and computation.
- To expose the students to new or non-conventional ways of learning, given their experience.
The longer term aim is to expand and scale the course to a full MOOC. In order to further the MOOC educational experience we want to consider what data to collect in the course in order to do analytics.
The tasks in the project involve:
- Setting up software for running the blended course
- Developing software for the projects in the course, e.g. the Mobile distributed Interactive Evolutionary Computation project of co-evolving strategies
- Assisting delivery of blended course to Chinese students
- Designing experiments and collection of data regarding blended learning
You will be working with a team of Post Docs, PhDs, MEngs and UROPs. Knowledge of Mandarin is a bonus. Status: full; might open during IAP and Spring 2014.
Fast BigData Learning with GPUs
(Filled - 2013)
Recent advances in programming tools allow graphics processing units (GPUs) to be programmed at a high level of abstraction, leveraging the great computational power of graphics cards for general-purpose computation. These massively parallel architectures (more than a thousand cores) are being used in a variety of fields such as bioinformatics, computational finance and data science, among many other scientific applications. In the field of Machine Learning, GPUs can significantly reduce the time required to learn from large-scale datasets. In particular, Evolutionary Computation approaches are great candidates for GPU parallelization because the individuals in a population can be evaluated independently of one another. We are seeking an undergraduate, UAP student, or senior who is interested in programming GPUs and implementing scalable and fast machine learning approaches for BigData scenarios, with special focus on (but not limited to) evolutionary computation.
Mining a MOOC's activity data: 6.002X explored
(Filled - 2012)
We are building a variety of machine learning algorithms for mining data generated while delivering educational content to hundreds and thousands of students all over the world. A very fundamental question that people in education are attempting to answer is: "What worked?" Answering this question requires us to analyze data in novel ways, for example by building models of students and balancing for confounding factors. We are looking for a talented UROP or MEng student to work with a Research Scientist and a group of scientists and fellows at the MIT edX team. This project has potentially transformative effects on next-generation education systems. Read about edX here and here.
Juniors, Seniors, MEng
Background: Course 6 courses in software and machine learning knowledge (6.034 and 6.867)
Super-UROP Projects 2013-2014
ALFA Group listed 8 super-UROP projects on the EECS department super-UROP site. Five were accepted after being discussed, refined and submitted as proposals. Another has become a UROP project.
Considering Biological Factors to Improve Genetic Programming in Detecting SNP Interactions, Chau Vu.
Improving the Speed and Performance of FlexGP Using Core-sets, Elisa Castener
Knowledge Discovery and Prediction from Blood Pressure Data, Harrison Hunter
Knowledge Discovery from Data Arising from Massive Open Online Courses, Matt Susskind
Scaling Dynamic Bayesian Networks on Volunteer Computer: Course Quality for MITx, Nico Rakover
Our 2012-2013 Super UROP projects are introduced here.