In this age of information, there is a huge amount of data all around us, but making use of this data and converting it into meaningful information has always been a challenge. Almost all large organizations maintain their data in offline or online databases. However, the rapid growth of complex Big Data has pushed traditional databases beyond what they were designed to handle. The ways of processing data have changed, and so have the ways of mining it. So what exactly is data mining? Data mining is the process of digging out useful and interesting knowledge from the large amounts of data that we already hold or keep receiving in our databases.
Why do we need to mine data?
- The volume of data being generated and captured keeps growing
- Data storage is getting cheaper
- The rewards of data mining are huge, because competition compels companies to sell more to the same customers
Data Mining: SEMMA
SEMMA (Sample, Explore, Modify, Model, Assess) is an approach to implementing data mining applications. It involves the following phases:
- Sample – The full volume of data is usually too large to handle comfortably. Extracting a sample that is large enough to be representative makes the data quicker to work with. Data samples can be classified into three groups, depending on the purpose of their use: Training, Validation, and Test.
- Explore – In this phase, the user searches for unexpected trends or anomalies to gain a better understanding of the data set. The data is explored both visually and numerically for trends and groupings.
- Modify – This is where the user creates, selects, and transforms the variables that will be used to build the model.
- Model – This is the stage at which a combination of variables that reliably predicts a desired outcome is found. Modelling techniques include neural networks, decision trees, logistic models, statistical models such as time series, memory-based reasoning, etc.
- Assess – In this final phase, the user evaluates the usefulness and reliability of the discoveries made in the process of data mining.
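The five phases above can be sketched in a few lines of code. The example below is only an illustration, written in Python with the standard library: the synthetic data, the 70/15/15 split, the helper names `split_sample`, `scale` and `accuracy`, and the one-variable threshold rule are all assumptions made for the sketch, not part of SEMMA itself.

```python
import random
import statistics

def split_sample(rows, train=0.70, validation=0.15, seed=42):
    """Sample: shuffle the rows and partition them into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Synthetic data (an assumption for illustration): one numeric variable x
# and a label that is True when x exceeds 50.
rng = random.Random(0)
data = [(x, x > 50) for x in (rng.uniform(0, 100) for _ in range(200))]

# Sample
train_set, valid_set, test_set = split_sample(data)

# Explore: a numeric summary of the training variable
xs = [x for x, _ in train_set]
print("mean:", round(statistics.mean(xs), 1), "stdev:", round(statistics.stdev(xs), 1))

# Modify: rescale x to roughly [0, 1] using the training minimum and maximum
lo, hi = min(xs), max(xs)

def scale(x):
    return (x - lo) / (hi - lo)

# Model: a one-variable rule whose threshold is the midpoint between the class means
positives = [scale(x) for x, label in train_set if label]
negatives = [scale(x) for x, label in train_set if not label]
threshold = (statistics.mean(positives) + statistics.mean(negatives)) / 2

def accuracy(rows):
    """Assess: share of rows where the rule 'scaled x >= threshold' matches the label."""
    hits = sum((scale(x) >= threshold) == label for x, label in rows)
    return hits / len(rows)

# Assess: check the rule on data it has not seen
print("validation accuracy:", accuracy(valid_set))
print("test accuracy:", accuracy(test_set))
```

The scaling limits and the threshold are computed from the training set only, so the validation and test sets remain unseen data and the Assess step gives an honest estimate.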
What is R?
R is a free software environment for statistical computing and graphics that provides a wide variety of statistical and graphical techniques. It offers comprehensive collections of packages for the different tasks involved in data mining.
Rattle (R Analytical Tool To Learn Easily) is a graphical user interface for data mining with R. Its tab-oriented interface is similar to Microsoft Office’s ribbon interface. With Rattle, it becomes very easy to get started with data mining. It performs the following tasks:
- Presents statistical and visual summaries of data
- Transforms data into forms that are ready to be modeled
- Builds both supervised and unsupervised models from data
- Provides a graphical presentation of model performance
- Scores new data sets
Got a question for us? Mention it in the comments section and we will get back to you.