The details remain murky about the NSA's foreign surveillance programs targeting communications and Internet records collected by American companies. So, too, are the terms associated with some of the stories. What, exactly, is "metadata"? What does an "algorithm" do? We've tried to explain a few of these terms below:
The term "data" is the plural of "datum," a single piece of raw factual information. Metadata is data about data: information that describes other data.
Americans encounter metadata of all types daily. Consider music files on a computer, for example. The song in this scenario would be the data. The details about the song (the album and artist name, the length of the song, the year it was recorded) would be the metadata. Another example is a digital picture downloaded from your camera to a computer. The pixels that collectively create the image are the data. The date and time the photo was taken is the metadata.
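For readers who like to tinker, the distinction can be made concrete with a toy sketch in Python. The song fields below are invented for illustration and don't correspond to any real file format:

```python
# A toy illustration of the data/metadata distinction for a music file.
# The audio bytes are the data; the descriptive fields are the metadata.
song = {
    "data": b"\x00\x01\x02",  # raw audio bytes (abbreviated)
    "metadata": {
        "artist": "Example Artist",   # all values are hypothetical
        "album": "Example Album",
        "length_seconds": 215,
        "year": 2013,
    },
}

# The metadata describes the recording without containing any of it.
for field, value in song["metadata"].items():
    print(f"{field}: {value}")
```

Notice that you can learn a lot about the song (who made it, when, how long it runs) without ever touching the audio itself. That is the gap between metadata and content.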
In the NSA context, metadata has been mentioned in relation to a massive government database of telephone calls collected by the agency from American telecommunications companies. Much about the program, intended as a national security tool, is secret. But The Guardian and others have reported that the agency collects phone numbers, call locations and other descriptive information, not the voice content (the data) of the calls themselves.
Today's computers are cheaper and much more powerful than ever before. We can use them to store and analyze data sets so large that traditional methods for interacting with them have become obsolete.
In their book, Big Data: A Revolution That Will Transform How We Live, Work, and Think, authors Viktor Mayer-Schonberger and Kenneth Cukier describe systems that can process "exabytes" of data. A full-length feature film in digital form, they write, can be compressed into a 1-gigabyte file. An exabyte is 1 billion of those movies. Handling data at that scale was far more expensive years ago, for companies and for the government alike.
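The arithmetic behind that comparison is simple, assuming the decimal storage units the book uses:

```python
# One compressed feature film ≈ 1 gigabyte; an exabyte holds a billion of them.
gigabyte = 10**9   # bytes (decimal units)
exabyte = 10**18   # bytes

movies_per_exabyte = exabyte // gigabyte
print(movies_per_exabyte)  # → 1000000000
```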
In addition to sheer size, the term refers to an understanding among scientists that big data can be messy, and that that's OK. An example in the book involves user-generated tags on the photo-sharing service Flickr. It's impossible to get millions of users to conform to a strict, structured set of descriptive tags for their images. But the users' nonstandardized tags (think "sun," "sunny" or "sunshine") are still useful for searching and finding photos.
There's also a growing acceptance, the authors argue, that big data users can focus on correlation, not causation, and still extract value from large data sets. Take Amazon, for example. The company realized early that computers were better than humans at spotting associations among products, and it uses the results to suggest purchases to its customers. The company doesn't need to know why people who like Hemingway also buy Fitzgerald. It doesn't matter. The purchases just correlate.
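As a rough illustration of that correlation-only approach, here is a toy Python sketch that counts which authors get bought together. The purchase histories are invented, and real recommendation systems are far more elaborate; the point is only that no "why" is needed:

```python
# Count how often pairs of items appear together in purchase histories,
# then recommend whatever co-occurs most with a given item.
from collections import Counter
from itertools import combinations

purchases = [
    {"Hemingway", "Fitzgerald"},
    {"Hemingway", "Fitzgerald", "Faulkner"},
    {"Hemingway", "Steinbeck"},
]

pair_counts = Counter()
for basket in purchases:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Items bought alongside "Hemingway", most frequently co-purchased first.
related = Counter()
for (a, b), n in pair_counts.items():
    if a == "Hemingway":
        related[b] += n
    elif b == "Hemingway":
        related[a] += n

print(related.most_common())
# → [('Fitzgerald', 2), ('Faulkner', 1), ('Steinbeck', 1)]
```

The code never asks why Hemingway readers buy Fitzgerald; it only counts that they do.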
An algorithm is a set of instructions for solving a problem, step by step. It's sort of like a recipe from a cookbook, or a knitting pattern, except written for a computer. For example, when you're using a spreadsheet and you ask it to sort a list of numbers, the computer follows an algorithm to put the numbers in the correct order.
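Here's what such a recipe can look like, written out as a short Python sketch of one classic sorting method, insertion sort. Real spreadsheets use faster techniques, but the step-by-step character is the same:

```python
# Insertion sort: a step-by-step recipe for putting numbers in order,
# much like sorting a hand of playing cards one card at a time.
def insertion_sort(numbers):
    result = list(numbers)  # work on a copy, leave the input untouched
    for i in range(1, len(result)):
        value = result[i]
        j = i - 1
        # Shift larger values one slot right until value's place is found.
        while j >= 0 and result[j] > value:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = value
    return result

print(insertion_sort([42, 7, 19, 3]))  # → [3, 7, 19, 42]
```

Each line is an unambiguous instruction; the computer simply follows them in order, every time.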
A company like Google uses complex algorithms to understand your search history and the search terms used by others to refine search results. In theory, the NSA could identify a foreign phone number associated with a terrorism suspect and use an algorithm to identify related numbers, especially those called most often or at particular times.
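A very rough sketch of that related-numbers idea, with invented phone numbers and simplified call records (we don't know how the agency's actual systems work):

```python
# Given hypothetical call-metadata records (caller, callee), count which
# numbers a target contacts, or is contacted by, most often.
from collections import Counter

calls = [
    ("+1-555-0100", "+1-555-0111"),
    ("+1-555-0100", "+1-555-0111"),
    ("+1-555-0100", "+1-555-0122"),
    ("+1-555-0133", "+1-555-0100"),
]

def most_contacted(target, records):
    """Return numbers linked to target, most frequent first."""
    counter = Counter()
    for caller, callee in records:
        if caller == target:
            counter[callee] += 1
        elif callee == target:
            counter[caller] += 1
    return counter.most_common()

print(most_contacted("+1-555-0100", calls))
# → [('+1-555-0111', 2), ('+1-555-0122', 1), ('+1-555-0133', 1)]
```

Note that this works entirely on metadata: who called whom, and how often, with no audio content at all.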
A server is a computer (not that different from a desktop or laptop computer, really) that you access over a network. The computer that keeps your email is a server, as are the computers that run this website. Servers, often stacked together in large computing rooms or data farms, make the Internet run. The NSA has many, many servers.
The term has been important in recent days because the first stories about the Internet surveillance program known as Prism suggested that the agency was tapping directly into major companies' servers. The Guardian reported that the program "allows the agency to directly and unilaterally seize the communications off the companies' servers." The Washington Post also described the NSA as "tapping directly into the central servers." Those initial reports have since been clarified after the companies denied giving the government unfettered access to their users' data.
The New York Times reported that, "instead of adding a back door to their servers, the companies were essentially asked to erect a locked mailbox and give the government the key." Facebook, The Times reported, only turns over information after its lawyers review a legal request by the government under the Foreign Intelligence Surveillance Act.
Data mining refers to asking questions of large data sets to reveal previously unknown facts, patterns and trends. The NSA might use these methods to probe a foreign target's activities and gather intelligence. We don't know precisely what the agency does, but one could imagine a system that monitors spikes in communications, for example, or an increase in communications in a particular country about a certain subject.
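One could sketch the spike-monitoring idea in a few lines of Python. The daily message counts below are invented, and the threshold is a simple statistical rule of thumb; real systems would be far more sophisticated:

```python
# Flag days whose communication volume sits far above the average,
# using mean and standard deviation as a crude "spike" test.
from statistics import mean, stdev

daily_counts = [100, 98, 104, 97, 101, 250, 99]  # hypothetical messages/day

def find_spikes(counts, threshold=2.0):
    """Return indices of days more than `threshold` std devs above the mean."""
    mu, sigma = mean(counts), stdev(counts)
    return [i for i, c in enumerate(counts) if c > mu + threshold * sigma]

print(find_spikes(daily_counts))  # → [5] (the 250-message day)
```

The pattern (a sudden jump against a quiet baseline) is exactly the kind of previously unknown fact data mining is meant to surface.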
It's important to note that the agency isn't authorized to randomly mine Americans' call metadata, even if it stores the information. The agency can, however, use the data to track cellphones connected with foreign terrorism cells, as former NSA chief Michael Hayden told NPR this week. It can also mine social media and other Web activity by foreigners.
What terms have we missed? Let us know in the comments.
Matt Stiles is data editor on NPR's News Applications team. Follow him on Twitter at @stiles.