The goal of the "Human Data Interaction" project is to develop methods at the intersection of data science, machine learning, and large-scale interactive systems, to answer a rather simple question:
Why does it take a long time to process, analyze and derive insights from the data?
The answer to this question lies in developing technology and a new cadre of methodologies based on a detailed understanding of how humans (scientists, researchers, analysts, sales folks, marketing folks, and every one of us) interact with data to analyze it, interpret it, and derive insights from it. Having ventured into multiple domains and applications, we observe that the process of generating insights from data can be organized into the following steps: organize, preprocess, understand, learn models, generate insights, and disseminate. The goals of our projects listed below can be summarized by five foundational concepts that we propose in order to smooth this interaction:
MLBlocks: A language that exposes the many choices in the machine learning endeavor.
Feature Factory: A collaborative, interactive, crowdsourced feature discovery solution.
Deep Mining: Tuning the machine learning pipeline.
Visionbase: A database-like interface for computer vision pipelines.
The Data Science Machine: The ultimate automated data science machine.
Kalyan Veeramachaneni (Lead)
Alex Wang
Edwin Zhang
Max Kanter
Bryan Collazo
Kiarash Adl (Feature Factory)
We are developing the basic building blocks required for learning a model, called MLBlocks. These include multiple parametrized ways to represent data, and ways to define and apply constructs that compare and contrast an entity in the data (a student, patient, car, etc.) against itself or against others, leading up to forming variables, selecting models, and ultimately building and analyzing predictive models. Each step in this process is riddled with choices and parameters that can be tuned to improve predictive model performance. Structuring the process this way allows us to make a variety of predictions without needing to revisit the raw data. See MLBlocks for MOOC data science explained here.
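To make the idea of parametrized, composable steps concrete, here is a minimal Python sketch. It is not the MLBlocks language or API: the `Block` class, the `represent` and `compare` functions, their parameters, and the toy records are all hypothetical, chosen only to illustrate how each step can expose its choices and how every combination of choices can be enumerated.

```python
from dataclasses import dataclass, field
from itertools import product
from statistics import mean
from typing import Any, Callable, Dict, List

# Hypothetical raw data: (entity_id, week, value) records, e.g. weekly student activity.
RAW = [
    ("s1", 1, 2.0), ("s1", 2, 4.0), ("s1", 3, 6.0),
    ("s2", 1, 5.0), ("s2", 2, 5.0), ("s2", 3, 2.0),
]

@dataclass
class Block:
    """One step of the process plus the choices it exposes (illustrative only)."""
    name: str
    func: Callable[..., Any]
    choices: Dict[str, List[Any]] = field(default_factory=dict)

def represent(records, last_n_weeks):
    """Represent each entity as its values over the most recent weeks."""
    series = {}
    for entity, _week, value in sorted(records, key=lambda r: r[1]):
        series.setdefault(entity, []).append(value)
    return {e: v[-last_n_weeks:] for e, v in series.items()}

def compare(series, against):
    """Compare each entity's latest value against itself or against others."""
    latest = {e: v[-1] for e, v in series.items()}
    if against == "self":
        # Deviation from the entity's own average over the window.
        return {e: latest[e] - mean(v) for e, v in series.items()}
    # Deviation from the average latest value across all entities.
    peer_mean = mean(list(latest.values()))
    return {e: latest[e] - peer_mean for e in series}

PIPELINE = [
    Block("represent", represent, {"last_n_weeks": [2, 3]}),
    Block("compare", compare, {"against": ["self", "others"]}),
]

def run(pipeline, data, settings):
    """Apply each block in order using the chosen parameter values."""
    for block in pipeline:
        data = block.func(data, **settings[block.name])
    return data

def enumerate_settings(pipeline):
    """Yield every combination of choices across all blocks."""
    per_block = []
    for block in pipeline:
        keys = list(block.choices)
        combos = [dict(zip(keys, vals)) for vals in product(*block.choices.values())]
        per_block.append([(block.name, c) for c in combos])
    for combo in product(*per_block):
        yield dict(combo)

# One candidate variable per combination of choices.
ALL_FEATURES = [(s, run(PIPELINE, RAW, s)) for s in enumerate_settings(PIPELINE)]
```

Each entry of `ALL_FEATURES` pairs one setting with the variable it produces for every entity. A tuning procedure could then score these candidates with a downstream model instead of enumerating them exhaustively as done here.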
Read more about MLBlocks here and here.