Big Data is everywhere. As data scientist Hilary Mason likes to say: “Big Data usually refers to a dataset that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet.” Exactly. To inspire you in your own data-wrangling quest, we've put together some of the more amusing and interesting cases: ones where the latest tools were used to produce noteworthy results.

Procter & Gamble

Take a look at the image above. That’s Procter & Gamble’s Business Sphere, a Big Data situation room in the company's Cincinnati HQ. A Big Data analyst drives the large screens that display data visualizations on sales, market share, ad spending and the like. Everyone in the meeting sees the same information, based on billions of daily transactions of P&G products. P&G isn’t after new data types; it still wants to share and analyze point-of-sale, inventory, ad spending, and shipment data. What’s new is the higher frequency and speed at which P&G gets that data, and the finer granularity with which it can act on this information. The company has about two-thirds of the real-time data it needs, and looks to Big Data to help determine the reasons behind sales dips or price changes.


Let's move on from soap to planes, trains and automobiles. Airlines have begun harnessing the immediacy and accessibility of Twitter to determine when customers are frustrated (or satisfied) by their experience, and to take appropriate action. Jeffrey Breen of Cambridge Aviation Research put together a Twitter data-mining platform, using various R-based tools, to perform sentiment analysis. JetBlue and Southwest had the most positive scores.

People aren’t the only things being analyzed in transit. FedEx is collaborating with General Electric and Columbia University to track where its electric trucks are charged in the New York City area. By measuring electric load and miles driven, they can determine the viability of electric delivery trucks as a possible mainstay of the fleet, and how much of a charge each vehicle requires to perform its next day's worth of deliveries.

We know that many carmakers have put some fairly sophisticated electronics into their vehicles. Indeed, some current models come with as many ports as your average desktop PC. But Ford is going one step further and aggregating sensing and app usage data from its four million cars on the road using its Sync technology. Drivers willing to share how many miles they’ve traveled and other vehicular usage data with State Farm and Ford can obtain discounts between 10 and 40 percent off their insurance premiums. And when it comes to trains, a number of transit systems around the world now provide real-time maps that show you where each train and bus will arrive, as the city of Helsinki, Finland, does.
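For readers curious about what tweet-level sentiment scoring can look like, here is a minimal Python sketch of one simple approach: count the positive words in a tweet, subtract the negative ones, and average the results per airline. This is not a description of Breen's actual platform, which used R-based tools. The word lists, airline names, sample tweets, and function names below are all made up for illustration.

```python
import re

# Tiny illustrative word lists (assumed); a real analysis would load a full opinion lexicon.
POSITIVE_WORDS = {"great", "love", "friendly", "comfortable", "thanks"}
NEGATIVE_WORDS = {"delayed", "lost", "rude", "cancelled", "worst"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words) for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def average_score_by_airline(tweets):
    """Average the per-tweet scores for each airline.

    `tweets` is an iterable of (airline, text) pairs.
    """
    totals = {}
    counts = {}
    for airline, text in tweets:
        totals[airline] = totals.get(airline, 0) + sentiment_score(text)
        counts[airline] = counts.get(airline, 0) + 1
    return {airline: totals[airline] / counts[airline] for airline in totals}


# Made-up sample tweets about hypothetical airlines, for illustration only.
sample_tweets = [
    ("AirlineA", "Love the friendly crew, thanks!"),
    ("AirlineA", "Flight delayed again"),
    ("AirlineB", "Worst trip ever, rude staff and lost bags"),
]
print(average_score_by_airline(sample_tweets))
```

A word-counting score like this is crude (it misses sarcasm and negation, for example), but it is cheap enough to run over a large stream of tweets and compare carriers against one another.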


Big Data has also reached the world of ovens both large and small. The R language is being used by one steel mill to figure out at what temperatures to set its furnaces for particular steel products. One team claims “the new temperature prediction model will allow for the optimization of process stability, throughput and material quality in the steel plant, especially in ladle treatment.” Oh, and save energy, too. Even the smallest ovens can benefit from data analysis. The supply and maintenance of hospital autoclaves, used for sterilizing instruments and rigged with sensors and cellular Internet connections, can be optimized through the use of data, without anyone ever having to actually look at the machines themselves. Data about uptime, the need for repairs, and detergent levels is collected via a smartphone app, which hospital employees can use to dispatch repair technicians.


Jeff Jonas is a data scientist who now works for IBM. One of his earlier jobs involved designing casino security systems in Las Vegas. He worked for the surveillance intelligence groups of several casinos, where he assisted in the automation of various manual processes, adding facial recognition software that was key to slowing down the MIT card counting group. “We built [another] system to immediately identify risk in real time so they could get these people out of the casino quickly,” he said. This software is still offered by IBM as its InfoSphere Identity Insight event processing and identity tracking technology. Jonas spends a lot of time thinking about what constitutes identity and how to manage that information: “If someone has three phone numbers, no big deal. On the other hand, if someone has five different dates of birth, that just doesn't seem quite right, does it? That would be confusing.” Why is this important? “Well, if you are looking to analytics to make important decisions,” he added, “wouldn't you want to know during the decision-making process if there was related confusion... before [any] action is taken?”
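Jonas's distinction between harmless and suspicious variation can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not how IBM's product works: it assumes records have already been linked to a person through a hypothetical `person_id` field, tolerates any number of phone numbers, and flags anyone with more than one distinct date of birth. The field names, threshold, and sample records are all invented.

```python
from collections import defaultdict


def find_identity_conflicts(records, max_birth_dates=1):
    """Return {person_id: sorted birth dates} for people with too many distinct dates.

    Multiple phone numbers are deliberately ignored; only conflicting
    dates of birth are treated as a warning sign.
    """
    birth_dates = defaultdict(set)
    for record in records:
        birth_dates[record["person_id"]].add(record["date_of_birth"])
    return {
        person_id: sorted(dates)
        for person_id, dates in birth_dates.items()
        if len(dates) > max_birth_dates
    }


# Invented records: p1 has several phones but one birth date; p2 has two birth dates.
sample_records = [
    {"person_id": "p1", "phone": "555-0100", "date_of_birth": "1980-03-14"},
    {"person_id": "p1", "phone": "555-0101", "date_of_birth": "1980-03-14"},
    {"person_id": "p1", "phone": "555-0102", "date_of_birth": "1980-03-14"},
    {"person_id": "p2", "phone": "555-0200", "date_of_birth": "1970-01-01"},
    {"person_id": "p2", "phone": "555-0200", "date_of_birth": "1972-05-09"},
]
conflicts = find_identity_conflicts(sample_records)
print(conflicts)
```

Surfacing the conflict as data, rather than silently picking one date, is what lets a downstream decision-maker see the confusion before acting.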

The News

If you are looking for large content repositories, you probably can't get much larger than the article archive of the Associated Press. Earlier this year, the AP partnered with MarkLogic to create a content analysis tool for searching the millions of articles in its archives, with an eye toward building custom products for customers. The MarkLogic partnership wasn’t the AP’s first plunge into the data game. The organization initially tried to implement a more traditional relational-database structure, only to run into problems. With the MarkLogic tool in place, however, it can return precise results in seconds or minutes instead of days or weeks. This quicker response time has already transformed the AP’s B2B product offerings, and helps it search unstructured content in near real time.


Still think Big Data is a lot of bull? Well, not according to the USDA. With eight million Holstein dairy cows in the United States, there is exactly one bull that has been scientifically calculated to be the very best in the land. He goes by the name of Badger-Bluff Fanny Freddie, and he has 346 daughters on the books already. USDA research geneticists reviewed pedigree records, delving into bovine attributes such as milk production and fat and protein content, in order to optimize the breed and find the top bull. To give you an idea of how this industry has changed thanks to analytics: in 1942 the average dairy cow produced less than 5,000 pounds of milk in its lifetime; now the average cow produces over 21,000 pounds of milk.

Big Data: Making Things Better

And finally, some words of wisdom from Mingsheng Hong, Chief Data Scientist at Hadapt. Speaking at a recent Boston-area meet-up, he asserted that the data scientist is the new product manager: “Data scientists are taking a data-driven position to make the product better,” he said. How true.