Wednesday, February 18, 2015

Data and Data Warehouse: Scope, Challenges, Future

Structured Data

Structured data refers to information with a high degree of organization that can be stored easily in a relational database. Because every value sits in a known place, search operations and search engine algorithms can be applied simply and effectively. Structured data has the advantage of being easily entered, stored, queried and analyzed, often using the Structured Query Language (SQL).
In other words, data that resides in a fixed field within a record or file is called structured data. Implementing data warehousing for this type of data requires identifying the business process for which the data will be stored, and then how the data will be stored, processed and analyzed. We need to define the tables and fields in which data will be stored, the data types (numeric, currency, date) and any limitations on values. The relationships between fields and tables are also defined, which makes it easy to transform data into information.
Apart from tabular data in databases, data in spreadsheets is also considered structured, as it is arranged in rows and columns and can be easily scanned for information. An interesting case is XML files. The data in these files is not fixed in location or stored in databases, but it is still counted as structured (more precisely, semi-structured), because the data is tagged and can be accurately identified.
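As a rough illustration of defining tables, fields, data types and value limitations, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns and the sample rows are hypothetical and chosen only for this example.

```python
import sqlite3

def build_sales_db():
    """Create an in-memory database with a hypothetical 'sales' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sales (
            sale_id   INTEGER PRIMARY KEY,
            sale_date TEXT NOT NULL,                  -- date stored as ISO text
            region    TEXT NOT NULL,
            amount    REAL NOT NULL CHECK (amount >= 0) -- limitation on values
        )
        """
    )
    rows = [
        (1, "2015-01-10", "North", 120.0),
        (2, "2015-01-11", "South", 80.5),
        (3, "2015-02-02", "North", 60.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def insert_sale(conn, row):
    """Try to insert one row; return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

def total_by_region(conn):
    """Because fields are fixed and typed, a plain SQL query answers the question."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cur.fetchall())

conn = build_sales_db()
print(total_by_region(conn))
# A negative amount violates the CHECK constraint and is rejected.
print(insert_sale(conn, (4, "2015-02-03", "East", -5.0)))
```

The CHECK clause is one way to express a limitation on values; a production warehouse would normally enforce such rules during loading as well.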
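To see why tagged data can be accurately identified even without fixed positions, consider this small sketch using Python's standard xml.etree.ElementTree module. The customers document and its tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the second customer has no <city> tag.
XML_TEXT = """
<customers>
  <customer id="1"><name>Asha</name><city>Pune</city></customer>
  <customer id="2"><name>Ben</name></customer>
</customers>
"""

def customers_to_rows(xml_text):
    """Turn tagged XML records into row-like dictionaries."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for node in root.findall("customer"):
        rows.append({
            "id": int(node.get("id")),
            "name": node.findtext("name"),
            "city": node.findtext("city"),  # None when the tag is absent
        })
    return rows

for row in customers_to_rows(XML_TEXT):
    print(row)
```

Values are located by tag name rather than by position, which is why a missing field does not shift the others. It is also why such data is usually called semi-structured: records need not all share the same fields.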

Unstructured Data

Unstructured data is information that does not have a defined data model and cannot be readily organized in a relational database. Examples include text, audio files, videos, emails, pictures, web pages, and many business documents. The data generated through social media also falls under this category. These data sources and files may have an internal structure, but it can still be challenging to store such information on a row-column basis, and thus it is classified as unstructured data.
The term ‘Big Data’ is often associated with unstructured data, as it refers to extremely large data sets that require special tools to analyze. However, big data can include both structured and unstructured information.
http://saphanatutorial.com/wp-content/uploads/2014/01/Structured-and-Unstructured-Data.jpg
Structured data contributes approximately 10%-20% of an organization's data. It can be broadly classified into the following data types:
Operational Data: This data supports the ongoing operations of an organization. It can cover areas such as sales, service, manufacturing, billing, orders and accounts. It can be captured in Online Transaction Processing (OLTP) systems as well as in Online Analytical Processing (OLAP) systems, and consists mostly of numeric values used for analysis and organizational decision support.
Master Data: Master data refers to persistent, non-transactional data that can be used across multiple functional groups and defines key organizational entities. Master data may include data about customers, products, employees, inventory, suppliers, and sites.
Historical Data: This can be any past information about a company’s entities that is stored and can be used to help forecast the company's future; for example, data on employees, operations, sales, customers, historical prices, revenue growth and earnings growth.
Data warehousing is readily compatible with structured data and is used extensively to store, transform and analyze this data as per business requirements, producing results that enhance the decision making capability of an organization. As mentioned earlier, spreadsheets and XML files also fall under the structured data type (semi-structured data, to be precise). Here we find a limitation of current data warehousing techniques: XML file data cannot be stored directly in organized data models, and specific tools and methodologies are needed to apply analytics to such data.

Analyzing Unstructured data

One main difference between structured and unstructured data is that the former is geared to machine processing, whereas the latter is meant for human interaction. Experts estimate that almost 80%-90% of data in organizations is unstructured, and that it is expected to grow to 40 zettabytes (1 zettabyte = 10^21 bytes) by 2020. It can be classified into several types based on data source, and each type may require different functional support and data mining techniques.
http://www.robertprimmer.com/_Media/image_med_med.png
Figure 2: Structured vs. unstructured data volume over the years
What’s striking is that unstructured data has a growth curve substantially steeper than that of structured data. If we look at the 2014 projections, we see that the combined forecast is to ship 80 EB of storage.
Static: Scanned documents, faxes, PDF files, X-rays and other content that is captured and managed but not subsequently modified, although it may be annotated and/or redacted if needed.
Dynamic: Authored or other content that may be created, edited, reviewed and approved by multiple people or groups. Life cycles are associated with this type of unstructured data. Document types may include policies, procedures, white papers and other office documents.
E-mail, Instant Messaging: This includes e-mail and instant messaging logs. Typically, this data must be archived and, in some cases, treated as a business record.
Specialized Content: Web data is an example of unstructured data that needs specialized access control, content entry, rendering and other functions to manage a web site.
Social Media Content: This is yet another very broad class of unstructured data, and it can include all the types mentioned above. Generally it consists of text, video, audio, gaming graphics, location data and so on, and it contains more rich media than plain text. It is the biggest source of data, and a lot of effort is being invested globally to mine and analyze it. Given the volume associated with it, it is often referred to as ‘Big Data’.

The huge volume associated with unstructured data makes storage difficult, but it also makes it essential for organizations to find ways to glean information from this data. However, data warehousing has not yet reached the level of sophistication needed to store and analyze unstructured data. Industry has therefore turned to other technological solutions to better manage it. Techniques such as data mining, Natural Language Processing (NLP), text analytics, and noisy-text analytics provide different methods to find patterns in, or otherwise interpret, this information. Apart from these, companies have various tools to manage this data: big data software such as Hadoop, business intelligence software (which handles structured as well as unstructured data), document management systems, and search and indexing tools.
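As a toy illustration of text analytics, the sketch below counts the most frequent terms in a piece of free text using only the Python standard library. The stopword list and the sample message are assumptions made for this example; real NLP pipelines are considerably more involved.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "is", "a", "and", "to", "was", "my", "but"}

def top_terms(text, n=3):
    """Return the n most frequent non-stopword terms as (term, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

message = "The delivery was late. Late delivery again, and the refund is pending."
print(top_terms(message, 2))
```

Even this simple count turns free text into a small structured record (term, frequency) that could be loaded into a table and analyzed alongside other data.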

Data Warehouse Limitations

The ultimate goal of a data warehouse system is to store historical information about a company’s transactions and make this information easy to comprehend, so that it can be used for important business decisions. However, in some business setups the need to store and operate on historical data may be limited, and the end user may not have a strong interest in older processing data. In such scenarios, the cost and complexity associated with data warehousing may not bring much value to the business.
Data flexibility remains a big hindrance, as a data warehouse tends to hold static data with only a specific number of ‘drill downs’ to a specific solution. Data warehouses are usually subject to ad hoc queries and are quite difficult to tune for processing speed and query speed. The queries are limited by the initial modelling, i.e. data relations are usually fixed when the aggregation level is defined and assembled.
It might be a tough fact to grasp, but data warehousing may not be the ultimate solution to every business issue, and it has certain limitations when it comes to analyzing data. These limitations are summarized in the points below.
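The effect of fixing the aggregation level up front can be sketched in a few lines of Python. The detail rows and field names below are hypothetical; the point is only that once data is stored at the region level, a question about products can no longer be answered from it.

```python
from collections import defaultdict

def aggregate(rows, keys):
    """Sum 'amount' over rows, grouped by the given key fields."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)

# Hypothetical detail-level source data.
detail = [
    {"region": "North", "product": "A", "amount": 120.0},
    {"region": "South", "product": "A", "amount": 80.5},
    {"region": "North", "product": "B", "amount": 60.0},
]

# Suppose the warehouse keeps only this region-level summary.
by_region = aggregate(detail, ["region"])
print(by_region)

# Drilling down by product requires the original detail rows;
# the region-level summary alone no longer contains that information.
by_region_product = aggregate(detail, ["region", "product"])
print(by_region_product)
```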
  • Data warehousing is complex to implement and needs multiple tools, and the extraction, transformation and loading (ETL) process may sometimes take significant time and effort, which can decrease the value of the results produced.
  • There is no automated way to get reports or dashboards; making data presentable involves significant effort.
  • Data security always remains a concern, as it depends on the cloud vendor (if cloud services are used) and on the integrity of the individuals involved in the analysis.
  • Source systems feeding the data warehouse may have hidden problems (for example, incomplete information entered), and these may remain undetected for a significant time.
  • In the process of homogenizing data from several data sources, significant value in the data might be lost.
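One way to surface hidden source problems such as incomplete entries before they reach the warehouse is a simple completeness check during loading. The following is a minimal sketch; the required field names and sample records are assumptions for illustration only.

```python
# Hypothetical list of fields every source record is expected to carry.
REQUIRED_FIELDS = ("order_id", "customer", "amount")

def find_incomplete(records):
    """Return (index, missing_fields) for every record lacking required data."""
    problems = []
    for i, rec in enumerate(records):
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        if missing:
            problems.append((i, missing))
    return problems

source_records = [
    {"order_id": 1, "customer": "Asha", "amount": 120.0},
    {"order_id": 2, "customer": "", "amount": 80.5},
    {"order_id": 3, "amount": 0},
]
print(find_incomplete(source_records))
```

Note that an amount of 0 is treated as present; only absent, None or empty values are flagged.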
Unstructured data is pervasive and has so many variations that it is hard to classify. Similar data may have different characteristics in different places. It comes in large volumes, and the same type of data may be referred to using different terminology by different people. This poses the biggest obstacle to bringing unstructured data into a data warehouse, because analytical processing requires a rationalization of terminology; without it, the analyst cannot recognize when different texts describe the same thing.
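A very simple form of terminology rationalization is to map synonyms to one canonical term before analysis. The sketch below assumes a small hand-made synonym table; in practice such mappings are far larger and often maintained by domain experts.

```python
import re

# Hypothetical synonym table: each term maps to its canonical form.
CANONICAL = {
    "client": "customer",
    "buyer": "customer",
    "customer": "customer",
    "invoice": "bill",
    "bill": "bill",
}

def rationalize(text):
    """Lowercase the text, split it into words, and replace known synonyms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [CANONICAL.get(t, t) for t in tokens]

print(rationalize("The client paid the invoice"))
print(rationalize("Buyer paid bill"))
```

After this step, both sentences mention "customer" and "bill", so an analyst or a query can recognize that they describe the same kind of event.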
Business intelligence (BI) and data warehousing suppliers have been adding support for unstructured data management to their tool sets, and some IT organizations have built their own platforms for converting unstructured data into structured records, for example, through knowledge management systems. But that can be a time-consuming and expensive process.

Future of Data Warehouse

Even with its limitations, the Enterprise Data Warehouse will continue to have its place in analytics, but as times change and new technologies enter the industry, its architecture and development process will need to be upgraded to match current market requirements. Big data technologies such as MapReduce and Hadoop will not replace the data warehouse; instead, both technologies will run in parallel. Financial analysis and other applications associated with the data warehouse will still be important, and the data warehouse itself will be a source of some of the data used in big data projects and will probably receive and store data from the analysis done in such projects. A few of the changes that can be anticipated in the data warehouse field in the coming years are summed up below:
  • Real-time analytics: Enterprises will build “operational data warehouses” to combine data from multiple sources in real time and go beyond dashboards and reports to actually use their data in day-to-day operations.
  • Cloud computing, which exists and is being used now, will become a requirement.
  • Architectural changes to be able to process some segment of unstructured data.
  • A big data store with the ability to analyze huge volumes of data in near real time, and some mechanism to match the metadata of the big data analysis to the requirements of the data warehouse.

Reference:

http://www.bisoftwareinsight.com/future-of-data-warehousing/
