Rising alongside the relatively new technology of big data is the new job title data scientist. While not tied exclusively to big data projects, the data scientist role does complement them because of the increased breadth and depth of data being examined, as compared to traditional roles.
So what does a data scientist do?
A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization.
Over the years, working as a data modeler and database architect, I have noticed that there are a couple rules that should be followed during data modeling and development. Here I describe some tips in the hope that they might help you. I have listed the tips in the order that they occur during the project lifecycle rather than listing them by importance or by how common they are.
The Protocols and Structures for Inference (PSI) project has developed an architecture for presenting machine learning algorithms, their inputs (data) and outputs (predictors) as resource-oriented RESTful web services in order to make machine learning technology accessible to a broader range of people than just machine learning researchers.
Data Science Studio (DSS) from Dataiku is a complete Data Science software tool for developers and analysts, which significantly shortens the time-consuming load-clean-train-test-deploy cycles of building predictive applications. A community edition and a free trial available.
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques.