Published on Nov 27,2014

‘Tweet’ the word!

Social media has evolved over the years and created new ways to interact and exchange information on the internet. Thanks to these websites, what was impossible 20 years ago now takes a fraction of a second.

One of the most popular phenomena of recent times is Twitter. Launched in 2006, it is a simple platform for interacting and tweeting about relevant matters. Its trending topics have drawn a wide range of people who are curious to see what the platform is all about.

Twitter has over 140 million users, who generate around 340 million tweets each day. It has also succeeded in drawing the attention of political, commercial, research and other establishments by making its data stream available to the public.

Data Warehouse and OLAP

Data warehouse and OLAP tools are used in business intelligence applications and beyond to support decision-making processes. The technology originated in the early 90s as a response to the problem of giving all the key people in an enterprise access to whatever level of information they need for decision making. Data warehousing is a specialization of database technology for integrating, accumulating and analyzing data from various sources. It employs a multidimensional data model, which structures data into cubes containing measures of interest characterized by descriptive properties drawn from a set of dimensions.

OLAP tools provide the means to query and analyze the warehouse information and to produce online statistical summaries at different levels of detail. Data mining has also become an integral part of any mature data warehouse system. It enables the automatic discovery of correlations and causal relationships within the data and thus enriches the original data set with additional characteristics.
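To make "summaries at different levels of detail" concrete, here is a minimal sketch of a roll-up over a tiny cube. The fact rows, the dimension names (`day`, `country`, `lang`) and the `tweets` measure are invented for illustration and do not come from any real Twitter data set.

```python
from collections import defaultdict

# Hypothetical fact rows: each row is one cube cell with a "tweets" measure.
FACTS = [
    {"day": "2014-11-26", "country": "DE", "lang": "de", "tweets": 120},
    {"day": "2014-11-26", "country": "DE", "lang": "en", "tweets": 30},
    {"day": "2014-11-26", "country": "US", "lang": "en", "tweets": 400},
    {"day": "2014-11-27", "country": "DE", "lang": "de", "tweets": 90},
    {"day": "2014-11-27", "country": "US", "lang": "en", "tweets": 510},
]


def roll_up(facts, dimensions, measure="tweets"):
    """Sum the measure, grouping by the given dimensions only.

    Every dimension that is not listed is aggregated away, which is
    what an OLAP roll-up does when moving to a coarser level of detail.
    """
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_day_country = roll_up(FACTS, ["day", "country"])  # finer level
by_country = roll_up(FACTS, ["country"])             # coarser level
grand_total = roll_up(FACTS, [])                     # single cell

print(by_country)
print(grand_total)
```

A real OLAP engine precomputes or indexes such aggregates; the sketch only shows the grouping logic behind them.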

Data warehousing technology has established itself as the leading solution for large-scale data management and analysis. The first study of Twitter was published in 2010 and focused on Twitter's topological characteristics and its power as a medium for sharing information. The Twitter API framework, launched in 2009, inspired thousands of application development projects, including a number of research initiatives. Various Twitter-related research works confirm the potential for discovery from its data.

The Architecture

A data warehouse system is structured into multiple layers to optimize performance and to minimize the load on the data sources. The architecture comprises up to five basic layers, from the data sources to the analysts' front-end tools. The data-source layer is represented by the available Twitter APIs for data streaming and may include additional external sources, such as geographical databases, taxonomies, event detection and language recognition systems, for enriching the metadata and the contents of the streamed tweet records.

The ETL (Extract, Transform, Load) layer takes care of capturing the original data stream, bringing it into a format compliant with the target database and feeding the transformed data set into the data warehouse. The data set delivered by the Twitter Streaming API is semi-structured and uses JavaScript Object Notation (JSON) as its output format.
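The three ETL steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline of any particular system: the sample lines stand in for a live stream, the target is an in-memory SQLite table with a made-up schema, and the field names (`id`, `created_at`, `text`, `lang`, `user`) and the timestamp format are assumed to follow the classic tweet JSON layout.

```python
import json
import sqlite3
from datetime import datetime

# Stand-in for the live stream: one JSON document per line (sample data).
RAW_STREAM = [
    '{"id": 1, "created_at": "Wed Nov 26 10:15:00 +0000 2014", '
    '"text": "hello", "lang": "en", "user": {"id": 10, "screen_name": "alice"}}',
    '{"id": 2, "created_at": "Wed Nov 26 10:16:30 +0000 2014", '
    '"text": "hallo", "lang": "de", "user": {"id": 11, "screen_name": "bob"}}',
    "not json",  # malformed line, skipped during extraction
]


def extract(lines):
    """Extract: parse each line, silently dropping malformed records."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(tweet):
    """Transform: pick the needed fields and normalize the timestamp."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return (
        tweet["id"],
        tweet["user"]["id"],
        created.isoformat(),
        tweet.get("lang"),
        tweet["text"],
    )


def load(conn, rows):
    """Load: write the transformed rows into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweet ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, "
        "created_at TEXT, lang TEXT, text TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO tweet VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, (transform(t) for t in extract(RAW_STREAM)))
loaded = conn.execute(
    "SELECT id, user_id, created_at, lang FROM tweet ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the layering described above: the extraction can be swapped for a real stream reader without touching the transformation or the load.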

Each tweet is streamed as a JSON object containing 67 data fields with a high degree of heterogeneity. A tweet record encompasses the tweeted message itself along with detailed metadata on the user's profile and geographic location. The roughly 10% sample of the total public stream provided by the Streaming API amounts to more than one million tweets per hour, which is a heavy load even for a high-performing data warehouse system.
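One common way to cope with nested, heterogeneous JSON before loading it into relational tables is to flatten each record into dotted column names. The sketch below assumes a simplified tweet with only a handful of fields; the field names and values are illustrative and the record is far smaller than a real 67-field tweet.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict whose keys are dotted paths.

    Lists and scalar values (including None) are kept as they are.
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Simplified, made-up tweet record with nested user and location metadata.
tweet = {
    "id": 1,
    "text": "hello",
    "user": {"id": 10, "screen_name": "alice", "location": None},
    "coordinates": {"type": "Point", "coordinates": [13.4, 52.5]},
}

row = flatten(tweet)
print(sorted(row))
```

Fields that are missing or null in some tweets simply yield absent or `None` columns, which the ETL layer can then map to nullable attributes in the warehouse.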

To understand what type of knowledge can be discovered from this data, it is important to investigate the data model. It encompasses users, their tweets and the relationships between and within those two classes. Users can be friends or followers of other users, can be tagged in tweets, and can author tweets or re-tweet other users' messages. The third component is the timeline, which describes the evolution, or the ordering, of user and tweet objects. Using the terminology of the Twitter Developer Documentation, the data model consists of the following three object classes:

1) Status Objects (tweets) consist of the text, the author and their metadata.

2) User Objects capture various user characteristics (nickname, avatar, etc.).

3) Timelines provide an accumulated view on the user's activity, such as the tweets authored by or mentioning (tagging) a particular user, status updates, follower and friendship relationships, re-tweets, etc.
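The three object classes can be modeled roughly as follows. This is a minimal sketch: the class and attribute names are chosen for illustration and are much smaller than the real Twitter objects, and a timeline is treated here as a derived, time-ordered view rather than a stored object.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class User:
    """User object: profile characteristics plus relationships."""
    user_id: int
    screen_name: str
    followers: List[int] = field(default_factory=list)  # ids of followers
    friends: List[int] = field(default_factory=list)    # ids of followed users


@dataclass
class Status:
    """Status object (tweet): text, author and metadata."""
    status_id: int
    author_id: int
    text: str
    created_at: str  # ISO 8601 string, so plain sorting is chronological
    mentions: List[int] = field(default_factory=list)  # ids of tagged users
    retweet_of: Optional[int] = None  # id of the original status, if any


def timeline(user_id, statuses):
    """Statuses authored by or mentioning the user, oldest first."""
    relevant = [
        s for s in statuses
        if s.author_id == user_id or user_id in s.mentions
    ]
    return sorted(relevant, key=lambda s: s.created_at)


alice = User(10, "alice")
statuses = [
    Status(3, 11, "RT hello", "2014-11-26T10:20:00", retweet_of=1),
    Status(1, 10, "hello", "2014-11-26T10:15:00"),
    Status(2, 11, "@alice hi", "2014-11-26T10:16:00", mentions=[10]),
]
alice_timeline = timeline(alice.user_id, statuses)
print([s.status_id for s in alice_timeline])
```

In a warehouse, users and statuses would typically map to dimension and fact tables, while timelines would be computed by queries over them.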


About Author
edureka