The Human Data Interaction Project
The goal of the "Human Data Interaction" project is to develop methods at the intersection of data science, machine learning, and large-scale interactive systems, to answer a rather simple question:
Why does it take so long to process, analyze, and derive insights from data?
The answer to this question lies in developing technology and a new cadre of methodologies based on an intricate understanding of how humans (scientists, researchers, analysts, sales and marketing folks, and every one of us) interact with data to analyze, interpret, and derive insights from it. As we have ventured into multiple domains and applications, we observe that the processes involved in generating insights from data can be organized into six steps: organize, pre-process, understand, learn models, generate insights, and disseminate. The goals of our projects listed below can be summarized with five foundational concepts that we propose in order to smooth this interaction:
- Organize and expose: One should organize the entire data science endeavor for different data types (and combinations thereof): signals, images, linked (relational) data, and text. When possible, we should expose every parameter for which a choice was made in pre-processing, feature extraction, data representation, or modeling. Developing a language that standardizes these processes and exposes all the parameters is paramount for building next-generation frameworks.
- Encapsulate: When possible, one should encapsulate the process at the right abstractions, so that proper interfaces can be provided both to meta machine learning approaches (see Self Learn) and to humans who wish to choose among parameters.
- Interactive interfaces: Proper interfaces to standardized routines are critical. Developing in-memory or distributed processing engines, accompanied by distributed storage of intermediate reusable data structures, while providing very simple interactive interfaces to control the computation, is essential.
- Increase participation: When needed, one has to pay attention to where the process is limited by human intuition, and scale it via collaborative and interactive platforms.
- Self Learn: Once the entire process is organized, it is natural to ask whether we could learn from past experiences of going from raw data to insights and optimize the next data science endeavor, so that valuable compute and human resources are saved.
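The first two concepts above, organizing steps and exposing every parameter behind a uniform interface, can be sketched in a few lines. This is an illustrative sketch only (the class and parameter names are hypothetical, not the project's actual API): each encapsulated step publishes its tunable choices, so a human or a meta-learning tuner can enumerate the full search space of the pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One encapsulated stage of a data science pipeline."""
    name: str
    params: dict = field(default_factory=dict)

    def exposed_params(self):
        # Every choice made inside this step is visible from outside,
        # namespaced by the step's name.
        return {f"{self.name}.{k}": v for k, v in self.params.items()}

@dataclass
class Pipeline:
    steps: list

    def exposed_params(self):
        # Aggregate parameters across all steps: the full space that a
        # human analyst or a self-learning tuner can operate over.
        out = {}
        for step in self.steps:
            out.update(step.exposed_params())
        return out

# Hypothetical three-step endeavor with its choices made explicit.
pipeline = Pipeline([
    Step("preprocess", {"impute": "median", "scale": True}),
    Step("features", {"window": 30, "aggregate": "mean"}),
    Step("model", {"kind": "random_forest", "n_trees": 200}),
])

print(pipeline.exposed_params())
```

Because every parameter is addressable by name, swapping `model.n_trees` or `features.window` requires no change to the pipeline's structure, which is what makes both interactive tuning and automated search possible.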
MLBlocks: A language that exposes the many choices in the machine learning endeavor.
Feature Factory: A collaborative, interactive, crowdsourced feature discovery solution.
Deep Mining: Tuning the machine learning pipeline.
Visionbase: A database-like interface for computer vision pipelines.
The Data Science Machine: The ultimate automated data science machine.
Machine Learning Blocks (MLBlocks)
We are developing the basic building blocks required for learning a model, called MLBlocks. These include multiple parametrized ways to represent data; defining and applying constructs that compare and contrast an entity in the data (student, patient, car, etc.) against itself or others; forming variables; selecting models; and ultimately building and analyzing predictive models. Each step in this process is riddled with choices and parameters that can be tuned to increase predictive model performance. Structuring this process allows us to make a variety of predictions without the need to revisit the raw data. See MLBlocks for MOOC data science explained here.
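The block structure described above can be sketched as an ordered composition of parametrized stages. This is a minimal, hypothetical illustration (the function names and the trivial threshold "model" are invented for the example, not MLBlocks itself): raw data flows through representation, variable formation, and modeling blocks, and each block's parameters can be revisited later without touching the raw data pipeline again.

```python
class Block:
    """A single parametrized stage in the pipeline."""
    def __init__(self, fn, **params):
        self.fn = fn
        self.params = params  # exposed, tunable choices for this block

    def run(self, data):
        return self.fn(data, **self.params)

def represent(signal, window):
    # Data representation: chunk a raw signal into fixed-size windows.
    return [signal[i:i + window] for i in range(0, len(signal), window)]

def form_variables(windows, aggregate):
    # Variable formation: summarize each window into a single value.
    agg = {"mean": lambda w: sum(w) / len(w), "max": max}[aggregate]
    return [agg(w) for w in windows]

def threshold_model(variables, cutoff):
    # A deliberately trivial "model": predict 1 when the variable
    # exceeds the cutoff, 0 otherwise.
    return [int(v > cutoff) for v in variables]

pipeline = [
    Block(represent, window=3),
    Block(form_variables, aggregate="mean"),
    Block(threshold_model, cutoff=2.0),
]

data = [1, 2, 3, 4, 5, 6]
out = data
for block in pipeline:
    out = block.run(out)
print(out)  # → [0, 1]: one prediction per window
```

Tuning then amounts to changing `window`, `aggregate`, or `cutoff` and re-running the composition, which is the sense in which structuring the process lets new predictions be made without revisiting the raw data.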