Link: http://www.etc-architect.com/?p=339
From ETC-Architect
Many people working in big data architecture lack a background in data science. In my observation, the most common mistakes in big data projects are therefore not in the technology, but in the following areas:
- Everything builds on one master model. The problem is that when you need accuracy you need to deploy many models, instead of working with queries against one model. So when you use the data for predictions you should always apply several algorithms to it, such as random forests or other approaches that are popular in Kaggle competitions. So before you start with anything else, get yourself a mix of models, because blending models is, apart from data quality, the one thing that really increases accuracy. With data quality you first need to find out where to start, and running many models will give you that insight as well.
- Data is not tested. Projects often test code, infrastructure, business processes and a lot more, but not the data. Especially when you use the same data for different analyses, test it for coherence and noise.
- Data is not smoothed. Anyone who has ever done some work in statistics knows about smoothing, for example locally weighted scatterplot smoothing (LOWESS). Data architecture should not forget about these basics of data science.
- Start with a computer. This is probably the most common mistake made by people coming from information technology. Plot the data first instead. Even in genomics, with terabytes of data, it was only the scientists who made a manual plot (such as a Bland-Altman plot) before they started who made any real discoveries.
- Data mix is unknown in size. In virtually all environments we have a mix of data sources, often with the same data residing in multiple places. The one thing you need to know and describe is the real sample size, that is, how many distinct records remain once the duplicates are counted only once. This matters especially when analysing customer data.
- Trusting your data. Unless you understand the main confounders, do not trust the data. It is very hard to explain why a certain group had success with their diet, as there are virtually no randomised dieters. The same is true for customer satisfaction, which reaches record levels when the weather turns out to be just perfect.
- Start with a solution. When doing data architecture, stick to the problems and keep them alive. Solutions will often come and go, but as long as you have a clear definition of the problem you can always get back to it.
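
To make the point about a mix of models a little more concrete, here is a minimal sketch of blending in Python. The prediction values are made up for illustration, and the names `blend` and `mean_abs_error` are my own, not taken from any library. It only shows the mechanics: averaging helps when the models make errors in different directions, and it helps much less when they all make the same mistakes.

```python
def blend(predictions, weights=None):
    """Weighted average of several models' predictions, sample by sample."""
    if weights is None:
        weights = [1.0] * len(predictions)  # plain average by default
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)  # one row = all models' guesses for one sample
    ]


def mean_abs_error(pred, truth):
    """Average absolute distance between predictions and true values."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


# Hypothetical toy data: two models whose errors partly cancel out.
truth = [1.0, 2.0, 3.0]
model_a = [1.4, 2.3, 2.9]
model_b = [0.8, 1.9, 3.4]
blended = blend([model_a, model_b])

for name, pred in [("a", model_a), ("b", model_b), ("blend", blended)]:
    print(name, round(mean_abs_error(pred, truth), 3))
```

In a real project the inputs would be predictions on held-out data from genuinely different algorithms, and the weights would be chosen on a validation set rather than fixed by hand.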
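
For the smoothing point, the following is a simplified sketch of locally weighted scatterplot smoothing using only the Python standard library. It fits a weighted straight line around every point with a tricube kernel, and it leaves out the robustness iterations of the full method. The function names and the `frac` parameter (the share of points used for each local fit) are assumptions for this sketch; in practice you would more likely use an existing statistics package.

```python
def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess(xs, ys, frac=0.5):
    """Return a smoothed y value for every x using local weighted line fits."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equally long sequences with at least 3 points")
    k = min(n, max(3, round(frac * n)))  # neighbours considered per local fit
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        ws = [tricube((x - x0) / h) for x in xs]
        sw = sum(ws)
        # Weighted least-squares line through the neighbourhood of x0.
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed
```

A larger `frac` gives a smoother curve but can flatten real structure, so it is worth trying a few values and looking at the result.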
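
The Bland-Altman plot mentioned above needs very little computation, which is why it can be done by hand. As a rough sketch (the function name `bland_altman` and its return layout are my own choice), these are the numbers behind the plot: for every pair of measurements the mean goes on the x axis and the difference on the y axis, with horizontal lines at the mean difference and at the usual limits of agreement, 1.96 standard deviations on either side of it.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Compute the points and reference lines of a Bland-Altman plot.

    a and b are paired measurements of the same items by two methods.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired measurements")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # x axis of the plot
    diffs = [x - y for x, y in zip(a, b)]        # y axis of the plot
    bias = mean(diffs)                           # average disagreement
    sd = stdev(diffs)                            # sample standard deviation
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits
```

Plotting `diffs` against `means` on paper, and marking the three horizontal lines, already shows whether two data sources disagree systematically or only for large values.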
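
Finally, a small sketch for the real sample size. It assumes, purely for illustration, that every customer record is a dictionary with an `email` field and that a normalised e-mail address is good enough as an identity key. Real customer data usually needs a more careful matching rule, and the source names below are hypothetical.

```python
def normalise_email(email):
    """Trim whitespace and lower-case so trivial variants count as one person."""
    return email.strip().lower()


def real_sample_size(sources):
    """Return (raw record count, distinct customer count) across all sources.

    sources maps a source name to a list of records with an 'email' field.
    """
    raw = sum(len(records) for records in sources.values())
    unique = {
        normalise_email(record["email"])
        for records in sources.values()
        for record in records
    }
    return raw, len(unique)


# Hypothetical example: the same customers appear in two systems.
sources = {
    "crm": [{"email": "ann@example.com"}, {"email": "bob@example.com"},
            {"email": "cid@example.com"}],
    "webshop": [{"email": "Ann@Example.com "}, {"email": "bob@example.com"},
                {"email": "dee@example.com"}],
}
raw, distinct = real_sample_size(sources)
print(f"{raw} records, {distinct} distinct customers")
```

The gap between the two numbers is the part of the data that would otherwise be counted twice in any analysis.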