Can Data Mining Catch Terrorists?

May 24, 2006
New theories on how to look at data point to ways to catch terrorists, but it's no easy task

When Gen. Michael Hayden faced Congress last week for a pre-confirmation grilling as President Bush's nominee to lead the Central Intelligence Agency, he started by calling intelligence gathering "the football in American political discourse" since the 9/11 terrorist attacks. Then, when pressed about a megadatabase of phone records of U.S. citizens allegedly compiled under his watch at the National Security Agency, Hayden punted.

The nominee declined to discuss the sensitive issue in open session or otherwise address the wide--and sometimes wild--speculation about how much phone data the feds have collected and what they're doing with it. The issue got new life in a May 11 story in USA Today, which reported that AT&T, BellSouth, and Verizon had turned over phone call records of tens of millions of Americans to populate the NSA database. The purpose of the data collection, according to USA Today, is to identify potential terrorist activity. But privacy advocates teed off on how such a database might be misused.

The technology certainly exists to assemble a massive phone-records database, but it's not clear whether the NSA has the volume of data it would need to get a complete picture of terrorist activity or the data mining algorithms necessary to tell the difference between calls among friends and those among terrorists. Businesses and government agencies routinely mine multiterabyte databases to create meaning out of minutiae. But the stealth nature of the terrorism business would make connecting the dots infinitely harder.

Indeed, it's unclear just what data the NSA has in hand. BellSouth and Verizon both denied sharing phone records in bulk with the intelligence agency; AT&T was sketchy about its participation; and Bush was noncommittal on whether such a database even exists.

Here's what we do know: The NSA is a sophisticated user of database technology--Larry Ellison has long said the NSA is one of Oracle's earliest customers. We also know that government agencies are intensely interested in data mining. A 2004 survey by the Government Accountability Office found that federal agencies were engaged in or planning 199 data mining projects, including 122 involving personal data. A database of phone records wouldn't be hard to create; the data exists.

We also know that terrorists make phone calls. After the 2001 attacks, the government determined that the 19 terrorists had made 206 international calls from the United States, according to press reports. A logical step for data analysts would be to search through phone records to see if there are other networks of people whose calls followed similar patterns.

Social Connections

In data mining, the practice of looking for underlying connections between people is called social network analysis. Phone data is useful because it helps expose relationships and associations among different groups. With social network analysis, contacts are commonly laid out graphically to illustrate connections and find patterns. At the simplest level, this could be shown as links similar to the spokes of a wheel leading to one source, indicating that a person holds a leadership position within a terrorist cell. Looking deeper, it could uncover relationships, such as two suspected terrorists linked only through a third, unknown person.
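The two ideas in that paragraph, a hub with many spokes and two people linked only through a third, can be sketched in a few lines of Python. This is a minimal illustration, not a description of any agency's method: the call records, the single-letter labels, and the function names are all hypothetical, and real call data would be far larger and noisier.

```python
from collections import defaultdict

# Hypothetical call records as (caller, callee) pairs. Labels are made up.
calls = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "X"), ("X", "F"),
]

def build_graph(records):
    """Build an undirected contact graph: person -> set of people they exchanged calls with."""
    graph = defaultdict(set)
    for caller, callee in records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return dict(graph)

def degree(graph):
    """Number of distinct contacts per person (the 'spokes' on each wheel)."""
    return {person: len(contacts) for person, contacts in graph.items()}

def hub(graph):
    """Person with the most distinct contacts; ties go to the alphabetically first label."""
    deg = degree(graph)
    return max(sorted(deg), key=deg.get)

def intermediaries(graph, a, b):
    """People who connect a and b when the two never called each other directly."""
    contacts_a = graph.get(a, set())
    contacts_b = graph.get(b, set())
    if b in contacts_a:
        return set()  # directly linked, so no go-between is needed
    return (contacts_a & contacts_b) - {a, b}

graph = build_graph(calls)
print(hub(graph))                       # the person at the center of the wheel
print(intermediaries(graph, "B", "F"))  # third parties linking B and F
```

Counting distinct contacts is only the simplest measure of prominence; it says who is well connected, not why.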

Valdis Krebs, founder of social network analysis company OrgNet.com, conducted his own analysis of the 9/11 terrorists by collecting information from press reports such as who called whom, the addresses shared by the terrorists and their known associates, and information that they used the same frequent flier number. Krebs found that more links led to the group's leader, Mohamed Atta, than to any other terrorist.

Social network analysis could seek to create a "map" that shows characteristics unique among terrorist networks. When a credit card is used to make a very small purchase followed not much later by an attempt to make a large one, banks recognize that pattern as a characteristic of fraud. The challenge for the NSA and other intelligence agencies is to find enough unique characteristics to differentiate terrorist social networks from those of nonterrorists. "There are no clear-cut definitions for what defines terrorist behaviors," says Chris Westphal, CEO of Visual Analytics, which does data analysis for intelligence and law enforcement agencies. Financial transactions, wire transfers, border crossings, public records, travel records, and phone records might all be used to inform the process.
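The credit card pattern mentioned above is the kind of rule that is easy to write down once the defining characteristics are known, which is exactly what is missing for terrorist networks. As a rough sketch only: the dollar thresholds, the one-hour window, and the function name below are assumptions chosen for illustration, not figures any bank has published.

```python
from datetime import datetime, timedelta

def flag_small_then_large(transactions, small_max=5.00, large_min=500.00,
                          window=timedelta(hours=1)):
    """Return (small_time, large_time) pairs where a very small purchase is
    followed within `window` by a large one on the same card.

    `transactions` is a list of (timestamp, amount) tuples in any order.
    All thresholds are illustrative assumptions.
    """
    ordered = sorted(transactions)
    flagged = []
    for i, (t_small, amt_small) in enumerate(ordered):
        if amt_small > small_max:
            continue  # not a "test" purchase
        for t_large, amt_large in ordered[i + 1:]:
            if t_large - t_small > window:
                break  # later purchases fall outside the window
            if amt_large >= large_min:
                flagged.append((t_small, t_large))
                break
    return flagged

# Hypothetical card history: one suspicious pair, one harmless pair.
history = [
    (datetime(2006, 5, 1, 12, 0), 1.50),
    (datetime(2006, 5, 1, 12, 20), 900.00),
    (datetime(2006, 5, 2, 9, 0), 2.00),
    (datetime(2006, 5, 2, 15, 0), 700.00),
]
print(flag_small_then_large(history))
```

The rule works for fraud because the pattern is well understood; no comparably crisp signature has been stated for terrorist calling behavior.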

It's unknown to what extent the government may be using social network analysis, but the feds are clearly using data mining to fight terrorism. Fourteen of the 199 data mining projects revealed by the GAO in 2004 involved analyzing intelligence and detecting terrorist activities. Eight of those programs involved private-sector data, including personal information provided by data aggregators.

One data mining effort within the Defense Department, called Pathfinder, involves analyzing government and private-sector databases, including rapidly comparing and searching multiple large databases for anti-terrorism intelligence. The FBI's Foreign Terrorist Tracking Task Force culls data from the Department of Homeland Security, the FBI, and public data sources to prevent foreign terrorists from entering the country. Other tools for counterterrorism include technology from Autonomy that searches Word documents across various intelligence agency databases; Verity's K2 Enterprise, which mines data from the intelligence community and through Internet searches; and Insight's Smart Discovery, which looks into and categorizes data in unstructured text.

No matter how you slice it, what the government is doing can't be easy. There are many data sources, and the integrity of data is different for each. "The analyst is faced with trying to find something of importance across all these sources potentially containing billions of records," Westphal says.

In the aftermath of the NSA brouhaha, Congress wants to find out if the mining of phone call data can accurately identify patterns and transactions and develop predictive models--without invading the privacy of innocent citizens, says Sen. John Sununu, R-N.H. "The key is bringing in oversight, asking tough questions, making sure the appropriate information is provided," Sununu said in an interview with InformationWeek.

Privacy Matters

On the issue of privacy, the NSA might learn something from the business world. Retailers, for example, mine data on customer interactions and purchase histories to determine promotions or in-store placement, all without invading customer privacy.

Mining For Terrorists

The government for years has analyzed data for patterns and relationships that could point to terrorists:

The Defense Department's Special Operations Command conducted social network analysis on al-Qaida as part of its Able Danger program before the 9/11 attack.

As of 2004, there were 14 nonsecret active or planned data mining projects for intelligence gathering and counterterrorism efforts across 52 federal agencies.

Intelligence agencies, including the NSA and CIA, have contracts with numerous data mining vendors, including Cognos, IBM, and Teradata.

The Defense Advanced Research Projects Agency's Terrorism Information Awareness program, a project to mine vast amounts of personal data to identify terrorists, lost congressional funding in 2003 after public outcry over privacy concerns.

Photo by Jason Reed/Reuters

The NSA declined to comment for this story. If published reports are correct, however, its database would consist partially or entirely of call records. These records include outgoing and incoming phone numbers, time stamps, and other information, such as whether the call had been forwarded, but not names.

USA Today reported that AT&T, BellSouth, and Verizon gave the government access to call data records starting in late 2001. The ambitious goal, according to an unnamed source quoted by the paper, is to put "every call ever made" in the United States into the database. Verizon and BellSouth last week said they weren't involved, though Verizon didn't specify whether MCI, which it acquired last year, ever participated in such activity.

AT&T, which neither confirmed nor denied the report, handles about a third of the calls made in the United States and operates some 49.4 million phone lines. AT&T manages a database called Hawkeye that contained 312 terabytes of uncompressed data as of September, representing 1.88 trillion call records. That comes out to 166 bytes per call record.

Say the number of calls made by AT&T customers averages about 10 per phone line a day. If the NSA has access to five years of AT&T calls, its alleged database would contain about 150 terabytes of call records. Compare that with the largest commercial databases. As of late last year, Wal-Mart stored about 583 terabytes of data in a massively parallel, 1,000-processor NCR Teradata data warehouse, and it was adding a billion records a day.
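The back-of-the-envelope math behind those figures can be reproduced directly. The sketch below uses only the numbers quoted in the article and assumes decimal terabytes (10^12 bytes) and 365-day years; the variable names are our own.

```python
TB = 1e12  # decimal terabyte, in bytes (an assumption about the units used)

# Hawkeye figures quoted for AT&T
hawkeye_bytes = 312 * TB
hawkeye_records = 1.88e12
bytes_per_record = hawkeye_bytes / hawkeye_records  # roughly 166

# Estimate of a five-year AT&T call archive
phone_lines = 49.4e6
calls_per_line_per_day = 10  # the article's working assumption
years = 5
days_per_year = 365

total_calls = phone_lines * calls_per_line_per_day * days_per_year * years
total_terabytes = total_calls * round(bytes_per_record) / TB

print(f"{bytes_per_record:.0f} bytes per call record")
print(f"{total_calls:.3e} calls over {years} years")
print(f"about {total_terabytes:.0f} terabytes")
```

The result is sensitive to the calls-per-line guess: doubling it to 20 a day doubles the estimate to about 300 terabytes.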

Heavy-Duty Management

Any database software the NSA might be using would need vast amounts of storage and heavy-duty data management capabilities. Surveys by Winter Corp., a database consulting firm, have found that the largest databases are tripling in size every two years. "It's got to be able to load huge volumes of data rapidly and in a highly parallel way, and to search data in a highly parallel and efficient way," says company president Richard Winter.
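To put the growth rate in perspective, tripling every two years works out to roughly 73 percent growth a year. The short sketch below shows the compounding; the 150-terabyte starting point is borrowed from the estimate earlier in this story purely for illustration, and the function name is hypothetical.

```python
def projected_size(initial_tb, years, factor=3.0, period_years=2.0):
    """Size after `years` if a database multiplies by `factor` every `period_years`."""
    return initial_tb * factor ** (years / period_years)

annual_growth = 3.0 ** 0.5 - 1  # equivalent yearly growth rate, about 0.73

# Illustrative starting size only; not a reported figure for any real system.
for year in (0, 2, 4, 6):
    print(year, round(projected_size(150, year)), "TB")
```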

A handful of commercial relational databases--from IBM, Oracle, Sybase, and Teradata--might be able to handle a vast volume of phone records, or the NSA could build such a database itself. AT&T, for example, has contracts with Teradata and IBM, but the carrier's big Daytona database was developed internally.

More powerful servers, falling storage prices, and new search and data mining techniques are all working in the NSA's favor. "Ten years ago, you couldn't have accomplished the same thing," Winter says. "It would have been too expensive to put all the information online, and we didn't have the systems capable of searching and mining at high speed."

Still, it's questionable how successful the NSA could be mining data on just some of the calls made within the United States. More than 1,000 wireless carriers, Internet service providers, rural phone companies, voice-over-IP service providers, and long distance companies handle phone calls. For a complete picture, the NSA would need to draw in much of that data, and the more data, the bigger the task. "The history of the intelligence community is information glut," says Mark Pollitt, a former FBI agent and an adjunct professor at Johns Hopkins' School of Professional Studies in Business and Education. "We're good at collecting stuff, but how do you figure out if any of it is any good? This is perhaps the toughest issue with regard to counterterrorism."

with Larry Greenemeier and Elena Malykhina

Photo by Jim Watson/AFP