Friday, December 9, 2011

IT Has 26 Words for Data Mining

As data proliferate, so do words for handling them
By Paul McFedries
December 2011

Illustration: Brian Stauffer

Intelligence about baseball had become equated in the public mind with the ability to recite arcane baseball stats. What [baseball statistician Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible. — Michael Lewis in Moneyball (2003)

Organizations of all sizes are sitting on mountains of data; what they really need are knowledge engineers who can excavate nuggets of valuable information from that data. Earlier this year (in "The Coming Data Deluge," IEEE Spectrum, February 2011), I mentioned the concept of data mining, which uses sophisticated software and database tools to extract nonobvious patterns, correlations, and useful information from large and complex data sets.

Data mining begins with data preprocessing: the gathering of the raw data, which is stored in a data warehouse or data mart. It continues with data cleansing, which removes unrelated or unnecessary data (called dirty data or noise) and looks for missing information.
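As a rough illustration of the cleansing step described above, here is a minimal Python sketch. The records, field names, and the `cleanse` helper are all invented for this example; real cleansing pipelines depend heavily on the data at hand.

```python
from typing import Optional

# Hypothetical raw records; the field names are assumptions for illustration.
RAW_RECORDS = [
    {"customer": "Ann", "item": "diapers", "price": "12.50"},
    {"customer": "Ann", "item": "diapers", "price": "12.50"},  # exact duplicate
    {"customer": "Bob", "item": "beer", "price": ""},          # missing price
    {"customer": "", "item": "beer", "price": "8.00"},         # missing customer
    {"customer": "Cy", "item": "beer", "price": "n/a"},        # unparseable price
]


def parse_price(text: str) -> Optional[float]:
    """Return the price as a float, or None if the text is not a number."""
    try:
        return float(text)
    except ValueError:
        return None


def cleanse(records):
    """Split records into (clean, rejected); rejected rows carry a reason."""
    clean, rejected, seen = [], [], set()
    for rec in records:
        key = (rec["customer"], rec["item"], rec["price"])
        if key in seen:
            rejected.append((rec, "duplicate"))
            continue
        seen.add(key)
        price = parse_price(rec["price"])
        if not rec["customer"] or price is None:
            rejected.append((rec, "missing or invalid field"))
            continue
        # Keep the row, with the price converted to a number.
        clean.append({**rec, "price": price})
    return clean, rejected


clean_rows, rejected_rows = cleanse(RAW_RECORDS)
print(len(clean_rows), "clean,", len(rejected_rows), "rejected")
```

Keeping the rejected rows alongside a reason, rather than silently dropping them, makes it possible to check later whether the "noise" was really noise.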

As the quote from Michael Lewis suggests, the point of data mining is knowledge discovery—the extraction of nonobvious or surprising information hidden in a data set. In data-mining circles, it's axiomatic that the less obvious the knowledge extracted, the more valuable that knowledge is to the organization. Nonobvious patterns represent new opportunities, be it for research, productivity, marketing, or whatever. This is best illustrated by the legendary diapers and beer connection, where data miners allegedly noticed that retail sales of diapers and beer would often spike in tandem. Why? Because new dads asked to pick up diapers on the way home from work would also pick up beer. When retailers stocked the products next to one another, sales of both were said to skyrocket.

Another term for finding previously unseen connections in a data set, especially when there are more than two variables, is pattern mining, and the quarried patterns are called association rules.
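To make the idea of an association rule concrete, here is a small Python sketch that scores the rule "diapers -> beer" using the three measures commonly quoted for such rules: support, confidence, and lift. The baskets are invented for illustration and are not data from the story above.

```python
# Hypothetical shopping baskets, made up for this example.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for antecedent -> consequent.

    Assumes both sides appear in at least one basket, so that neither
    division is by zero.
    """
    s_both = support(antecedent | consequent, baskets)
    s_ante = support(antecedent, baskets)
    s_cons = support(consequent, baskets)
    confidence = s_both / s_ante   # share of antecedent baskets that also hold the consequent
    lift = confidence / s_cons     # values above 1 suggest a positive association
    return s_both, confidence, lift


sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, BASKETS)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items turn up together more often than they would if purchases were independent, which is the kind of nonobvious pattern the paragraphs above describe.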

Many data sets consist of large amounts of text, such as e-mail, so data-mining projects typically use textual analysis to dredge up connections within that data, a process known as text mining. Another promising avenue is audio mining (also called audio indexing), which is the process of extracting and indexing the words in an audio file and then using that index as data to be otherwise mined. It will come as no surprise that engineers have also come up with ingenious methods for indexing other types of media, including image mining and video mining. If the data set consists of geographical information, it is called spatial (or geospatial) mining. In this increasingly social world, researchers are turning to crowd mining, where they try to unearth useful knowledge from large databases of social information. On a more general level, Web mining refers to the harvesting of useful patterns from data sets of Web content, Web usage (such as server logs), and Web structure (such as hyperlinks).

If a data set is just too large to probe efficiently, data miners can often get away with sampling portions of it, a technique variously known as data dredging, data fishing, or data snooping.
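One way to sample a data set that is too large to hold or scan repeatedly is reservoir sampling, sketched below in Python. This is only one possible technique, chosen here for illustration; the function name and the fixed seed are assumptions of the example.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Pick k items uniformly at random from a stream of unknown length.

    Uses the classic reservoir approach: keep the first k items, then
    replace a random slot with decreasing probability as more items arrive.
    The seed is fixed only so that the example is repeatable.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Draw 5 records from a stream of 100,000 without storing the stream.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

The stream is read once and only k items are kept in memory at any time, which is what makes the approach practical for very large data sets.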

Data mining sounds innocent enough on the surface, but privacy advocates warn that it can be used for nonbenign purposes. When Internet service providers and companies such as Google hoard massive data sets that detail the online activities of hundreds of millions of people, automated data mining methods can analyze that data to look for patterns of suspicious activity. As computer scientist Jonathan Zittrain has pointed out, "When governments begin to suspect people because of where they were at a certain time, it can get very worrying."

Whether it's a boon or a bane, informative or intrusive, you've seen here that the field of data mining is a rich source of new words and phrases. As I see it, my job here at IEEE Spectrum is to sift through the raw material of articles, papers, blogs, and books to uncover new lexical gems and then present them to you in this column. Call it word mining.

Tuesday, December 6, 2011

Stanford University: Natural Language Processing course

The following online course on Natural Language Processing from Stanford University (sorry for advertising a non-DIT course!) may be useful for some dissertations:

There is a range of other courses listed at the bottom that sound really interesting if you have time, such as Probabilistic Graphical Models, Machine Learning and Information Theory.
Course Description
The course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that is crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive Bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.
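For anyone wondering what n-gram language modeling looks like in practice, here is a minimal Python sketch of a bigram model with unsmoothed maximum-likelihood estimates. It is not taken from the course; the toy corpus and function names are made up for illustration.

```python
from collections import Counter

# Toy corpus, invented for this example.
CORPUS = ["the cat sat", "the cat ran", "the dog sat"]


def train_bigrams(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams


def bigram_prob(prev, word, contexts, bigrams):
    """Estimate P(word | prev) as count(prev, word) / count(prev), no smoothing."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


contexts, bigrams = train_bigrams(CORPUS)
p_cat = bigram_prob("the", "cat", contexts, bigrams)
print(f"P(cat | the) = {p_cat:.3f}")
```

Without smoothing, any word pair that never appears in the corpus gets probability zero, which is one reason real language models add smoothing on top of raw counts.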

Tuesday, November 15, 2011

Useful Dataset sites

Dataset Lists

Dublin Dashboard

KDnuggets: Datasets for Data Mining

Google: Dataset Directory

StatLib---Datasets Archive

U.S. Census Bureau

Datasets: 2010 UK Election Results

Computer Vision Papers Datasets

Dataset Analytics Vocabulary

Datasets - DNA Analytics CGH

Datasets collected by bitly

MCFC Analytics

Truthy: Information Diffusion in Online Social Networks

Datasets from Competitions


Berlin Brain-Computer Interface

Netflix Prize


Santa Fe Time Series Competition Data Set B

Time Series Forecasting Grand Competition for Computational Intelligence

PAN 2012 - Uncovering Plagiarism, Authorship and Social Software Misuse

Brian Mac Namee recommends:

UC Irvine Machine Learning Repository

Central Statistics Office Ireland, also check out the 2011 census data

InfoChimps: Find data for apps & analytics

Data in Gapminder World

Welcome to the London Datastore


U.S. Music Preferences Data

Opinion Mining, Sentiment Analysis, and Opinion Spam Detection

A Big List From Mahout

What Is Data in Literary Studies?

Brendan Spillane recommends:

IMDB - Alternative Interfaces

Export your data from social media as Excel or CSV

Garrett Duffy recommends:

LOGD Dataset Catalog

Alan Cooke recommends:



Colman McMahon recommends:

30 Places to Find Open Data on the Web

Finding Data on the Internet


Forbes: Special Report - Data Driven

Monday, November 14, 2011

Top 10 Business Intelligence, Analytics and CPM Stories

For Data Analytics

Please go to the following link:

register for this site, and look at the article on the homepage entitled "Top 10 Business Intelligence, Analytics and CPM Stories of 2010"


10. Mega-vendors boss the BI market – but their power isn’t absolute
In its 2010 BI Magic Quadrant report, consulting firm Gartner Inc. said that the usual mega-vendor suspects – IBM, Microsoft, Oracle and SAP – continue to dominate the BI software market. But customer satisfaction levels were down for some of them, according to Gartner – a situation that SAP, for one, tried to remedy via expanded support for BusinessObjects users. Gartner found the same kind of forces at play among CPM vendors: A separate Magic Quadrant report split that market into CPM integrators and innovators, and Gartner later said that Software as a Service (SaaS) and pure-play CPM vendors topped the customer satisfaction rankings in a survey of vendor-supplied reference users. That’s all the more reason to make sure you buy the right BI, analytics and CPM tools for your organization.

9. Companies still eyeing BI consolidation/standardization
More than half of the respondents to a survey on BI priorities and challenges last March reported that their organizations were using multiple BI tools. With so much BI software in place, many organizations continue to eye BI tool consolidation as a way to clean up their systems and reduce costs. Florida State University is one example: A multi-year BI standardization project ended up saving the school about $350,000 in software license and maintenance fees as well as support costs, according to CIO Michael Barrett. But companies would be smart not to jump on the BI consolidation bandwagon without giving it a lot of thought first, cautioned Baseline Consulting’s Jill Dyche: “Shelfware is one thing, but I’d strongly advise you not to rip any type of valuable reporting or analytics capability out of the hands of an earnest and well-meaning business user.”

8. CPM’s horizons broaden, but usage remains relatively low
While most businesses are sold on using BI tools, CPM software is a completely different ballgame. Despite the technology’s potential benefits, CPM adoption levels remain low, according to analysts and our survey. But among the organizations that are using CPM tools, a growing number are looking to the software for help with more than just forecasting, budgeting and planning. In addition, on-demand CPM software is helping small and medium-sized businesses overcome barriers to adopting the technology.

7. Vendors push mobile BI – is anyone listening?
With almost everyone (and their pets) owning smartphones and more and more people buying iPads, BI vendors increasingly are pushing mobile BI software for use in accessing reports and executive dashboards on mobile devices. But mobile BI doesn’t appear to be a major priority for a lot of companies at this point. For example, only about 30% of the respondents to a survey conducted by consultant Howard Dresner said they were actively using mobile BI tools, and there was an almost even split on whether mobile BI is an important technology. The real value of mobile BI, according to Jill Dyche, “lies in the field, or in the stores, or on the manufacturing floor” – as a tool for end users who “need information on demand” in order to do their jobs.

6. Social media analytics enters the picture
As more and more people use social networking sites such as Facebook and Twitter, companies increasingly are turning to those sites to engage their customers and track what people think of – and are saying about – their products and services. And BI and analytics vendors are offering tools designed to help businesses mine and make sense of social media data. Last spring, for example, SAS unveiled a social media analytics suite for use in analyzing blog posts, tweets and Facebook status updates. But some analysts and BI professionals have questions about the functionality and maturity of the social media analytics software that’s currently available. For now, experienced users said, the key to social media analytics success for organizations that are pursuing the technology lies in commitment, experimentation – and patience.

5. A heavy layer of fog obscures visibility of agile BI
Agile business intelligence emerged as a much-discussed concept during 2010, but there’s still a lot of confusion about what agile BI really is. At a TDWI conference in August, some attendees thought it referred to applying agile development principles to their BI environments, others thought it meant the ability of BI to help an organization become more adaptable, and still more thought it was just another buzzword. Wayne Eckerson, then research director at TDWI and now head of research for TechTarget’s Enterprise Applications Media Group, thinks agile BI includes elements of all three of those viewpoints. It’s more of a mentality aimed at making businesses “go as fast as possible” than a specific methodology, Eckerson said. On the other hand, Dyche’s take is that “many companies are attracted to agile [BI] approaches because they don’t have the organizational discipline to instill solid BI development processes.” Ouch!

4. SaaS BI steps into the limelight
SaaS BI software has been around for years, but the cloud-based technology – which holds out the promise of faster deployments and reduced hardware and system management requirements – finally began stepping out of the BI shadows during 2010. SAP helped generate some of the buzz by releasing a new SaaS BI product suite that it claimed could bring “BI to the rest of us.” And companies such as Genband Inc. and Wine Management Systems are actively using SaaS BI and reporting tools to streamline the process of building reports for business users or to enable customers to create their own customized reports. Still, cloud-based BI might not be right for everyone; it’s important to know about and prepare for the challenges and obstacles that come with using SaaS BI software.

3. Pervasive BI, expanded use of tools become bigger BI priorities
A growing number of companies say that they’re looking to make their BI systems more pervasive by giving more business users access to the technology. In addition, more and more organizations are working to broaden their use of BI tools beyond basic reporting and data analysis. But efforts to expand the BI process can take years to complete because of data quality problems and other challenges. As mentioned above, SaaS and on-demand BI tools could help enable pervasive BI deployments; technologies such as data visualization, social media analytics and unstructured data analysis are also seen as having potential for spreading BI to more business users. But regardless of how businesses reach the pervasive peak, training end users is the key to successful pervasive BI projects, according to Rick Sherman, founder of consulting firm Athena IT Solutions. His reasoning is simple: People won’t use the technology if they don’t know how or why they should be using it.

2. Still the king: Excel continues its BI tool supremacy
Go to any BI or data warehousing conference and you’ll likely hear about the evils and data management disasters that come with all of the Excel-based “spreadmarts” that business users refuse to let go of. In fact, you might think that Excel is akin to the bubonic plague – and for a lot of businesses with poor spreadsheet management practices, you might be right. But according to Gartner analysts and attendees at the firm’s annual Business Intelligence Summit, it’s time for IT and BI managers to wave the white flag on using Excel for BI purposes. Their advice: Make your peace with spreadsheets and focus on developing processes for properly using Excel in BI projects. That was music to Microsoft’s ears, of course. Hoping to further capitalize on Excel’s continuing BI popularity, the software vendor released a PowerPivot for Excel add-in that lets end users integrate nearly unlimited amounts of data into their spreadsheets for analysis – although it also added a SharePoint version with management capabilities designed to help ease the collective minds of IT groups.

1. Interest in predictive analytics heats up
A relatively small number of the organizations that responded to the 2010 survey were using predictive analytics tools – just 16%. But 48% said that they planned to add predictive analytics software within the next 12 months, giving it the top spot on the analytics technology adoption list. Industry analysts also see predictive analytics as the next big battleground for BI vendors, which increasingly are developing or acquiring predictive analytics technology with the goal of incorporating it into their core platforms. In October, for example, IBM announced a new version of its Cognos BI software with predictive analytics capabilities built in. Thus far, many of the early adopters of predictive analytics are focusing not on wider market and economic trends but on individual customer analysis in an effort to understand what specific customers are likely to buy so that marketing campaigns and up-sell offers can be tailored to them.

Sunday, November 13, 2011


Sabermetrics is something I've become very interested in since I was over in America for a while. Sabermetrics is the analysis of baseball through statistics that measure in-game activity.

Since there is such a lot of data available, I've written loads of code to explore what factors are the key criteria to determine if a team will win or lose -- I have had very little success, but here's something interesting to think about:

2010 became known as the "Year of the Pitcher", as opposed to previous years when it was the batter who determined the outcome of the game. Commentators have suggested that rigorous testing and penalties for performance-enhancing drug use may be a factor in this. Runs per game fell to their lowest level in 18 years, and the strikeout rate was higher than it had been in half a century.

Thursday, November 10, 2011

The Literary Digest Survey

The Literary Digest is almost certainly best-remembered today for the circumstances surrounding its demise. It conducted a "straw poll" regarding the likely outcome of the 1936 presidential election. The poll showed that the Republican governor of Kansas, Alf Landon, would likely be the overwhelming winner. This seemed possible to some, as the Republicans had fared well in Maine, where the congressional and gubernatorial elections were then held in September - as opposed to the rest of the nation, where these elections were held in November along with the presidential election, like today. This seemed especially likely in light of the conventional wisdom, "As Maine goes, so goes the nation", a truism coined because Maine was regarded as a "bellwether" state which usually supported the winning candidate's party.