ANY list of the leading novelists of the 19th century, writing in English, would almost surely include Charles Dickens, Thomas Hardy, Herman Melville, Nathaniel Hawthorne and Mark Twain.
But they do not appear at the top of a list of the most influential writers of their time. Instead, a recent study has found, Jane Austen, author of "Pride and Prejudice," and Sir Walter Scott, the creator of "Ivanhoe," had the greatest effect on other authors, in terms of writing style and themes.
These two were "the literary equivalent of Homo erectus, or, if you prefer, Adam and Eve," Matthew L. Jockers wrote in research published last year. He based his conclusion on an analysis of 3,592 works published from 1780 to 1900. It was a lot of digging, and a computer did it.
The study, which involved statistical parsing and aggregation of thousands of novels, made other striking observations. For example, Austen's works cluster tightly together in style and theme, while those of George Eliot (a k a Mary Ann Evans) range more broadly, and more closely resemble the patterns of male writers. Using similar criteria, Harriet Beecher Stowe was 20 years ahead of her time, said Mr. Jockers, whose research will soon be published in a book, "Macroanalysis: Digital Methods and Literary History" (University of Illinois Press).
These findings are hardly the last word. At this stage, this kind of digital analysis is mostly an intriguing sign that Big Data technology is steadily pushing beyond the Internet industry and scientific research into seemingly foreign fields like the social sciences and the humanities. The new tools of discovery provide a fresh look at culture, much as the microscope gave us a closer look at the subtleties of life and the telescope opened the way to faraway galaxies.
"Traditionally, literary history was done by studying a relative handful of texts," says Mr. Jockers, an assistant professor of English and a researcher at the Center for Digital Research in the Humanities at the University of Nebraska. "What this technology does is let you see the big picture -- the context in which a writer worked -- on a scale we've never seen before."
Mr. Jockers, 48, personifies the digital advance in the humanities. He received a Ph.D. in English literature from Southern Illinois University, but was also fascinated by computing and became a self-taught programmer. Before he moved to the University of Nebraska last year, he spent more than a decade at Stanford, where he was a founder of the Stanford Literary Lab, which is dedicated to the digital exploration of books.
Today, Mr. Jockers describes the tools of his trade in terms familiar to an Internet software engineer -- algorithms that use machine learning and network analysis techniques. His mathematical models are tailored to identify word patterns and thematic elements in written text. The number and strength of links among novels determine influence, much the way Google ranks Web sites.
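The comparison to Web-site ranking can be made concrete with a small sketch. The code below is not Mr. Jockers's model, which the article does not detail; it is a generic PageRank-style power iteration, and the novel names, the link data and the function name influence_scores are invented for illustration. A link from one novel to another is assumed to mean that the first resembles the second, earlier work in style or theme.

```python
def influence_scores(links, damping=0.85, iterations=100):
    """PageRank-style scores for a toy network of novels.

    links maps a novel to the earlier novels it is assumed to resemble.
    This is an illustrative stand-in, not the study's actual method.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with influence spread evenly across all novels.
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this novel's score to the works it resembles.
                share = damping * scores[node] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A novel with no outgoing links spreads its score evenly.
                share = damping * scores[node] / n
                for target in nodes:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical data: four later novels linked to the works they resemble.
toy_links = {
    "Novel B": ["Novel A"],
    "Novel C": ["Novel A"],
    "Novel D": ["Novel A", "Novel B"],
    "Novel E": ["Novel A"],
}
scores = influence_scores(toy_links)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network, "Novel A" ranks first because the most links, and the strongest ones, point to it, which is the sense in which the number and strength of links determine influence.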
It is this ability to collect, measure and analyze data for meaningful insights that is the promise of Big Data technology. In the humanities and social sciences, the flood of new data comes from many sources including books scanned into digital form, Web sites, blog posts and social network communications.
Data-centric specialties are growing fast, giving rise to a new vocabulary. In political science, this quantitative analysis is called political methodology. In history, there is cliometrics, which applies econometrics to history. In literature, stylometry is the study of an author's writing style, and these days it leans heavily on computing and statistical analysis. Culturomics is the umbrella term used to describe rigorous quantitative inquiries in the social sciences and humanities.
"Some call it computer science and some call it statistics, but the essence is that these algorithmic methods are increasingly part of every discipline now," says Gary King, director of the Institute for Quantitative Social Science at Harvard.
Cultural data analysts often adapt biological analogies to describe their work. Mr. Jockers, for example, called his research presentation "Computing and Visualizing the 19th-Century Literary Genome."
Such biological metaphors seem apt, because much of the research is a quantitative examination of words. Just as genes are the fundamental building blocks of biology, words are the raw material of ideas.
"What is critical and distinctive to human evolution is ideas, and how they evolve," says Jean-Baptiste Michel, a postdoctoral fellow at Harvard.
Mr. Michel and another researcher, Erez Lieberman Aiden, led a project to mine the virtual book depository known as Google Books and to track the use of words over time, compare related words and even graph them.
Google cooperated and built the software for making graphs open to the public. The initial version of Google's cultural exploration site began at the end of 2010, based on more than five million books, dating from 1500. By now, Google has scanned 20 million books, and the site is used 50 times a minute. For example, type in "women" in comparison to "men," and you see that for centuries the number of references to men dwarfed those for women. The crossover came in 1985, with women ahead ever since.
In work published in Science magazine in 2011, Mr. Michel and the research team tapped the Google Books data to find how quickly the past fades from books. For instance, references to "1880," which peaked in that year, fell to half by 1912, a lag of 32 years. By contrast, "1973" declined to half its peak by 1983, only 10 years later. "We are forgetting our past faster with each passing year," the authors wrote.
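The "fell to half" measure is simple arithmetic: 1912 minus 1880 is 32 years, and 1983 minus 1973 is 10. A minimal sketch of how such a lag could be read off a frequency series follows. The numbers in toy_series are invented and shaped to reproduce the 32-year figure; they are not Google Books data, and the function name years_to_half_peak is a hypothetical helper.

```python
def years_to_half_peak(freq_by_year):
    """Return (peak_year, lag), where lag is the number of years after the
    peak until the frequency first drops to half the peak value or below.
    lag is None if the series never falls that far."""
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half_peak = freq_by_year[peak_year] / 2.0
    for year in sorted(freq_by_year):
        if year > peak_year and freq_by_year[year] <= half_peak:
            return peak_year, year - peak_year
    return peak_year, None


# Invented relative frequencies for references to "1880" (illustration only).
toy_series = {1870: 20.0, 1880: 100.0, 1890: 80.0, 1900: 65.0,
              1912: 50.0, 1920: 40.0}
peak_year, lag = years_to_half_peak(toy_series)
```

A shorter lag for a later year, as with "1973," is what the authors describe as forgetting the past faster.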
JON KLEINBERG, a computer scientist at Cornell, and a group of researchers approached collective memory from a very different perspective.
Their work, published last year, focused on what makes spoken lines in movies memorable. Sentences that endure in the public mind are evolutionary success stories, Mr. Kleinberg says, comparing "the fitness of language and the fitness of organisms."
As a yardstick, the researchers used the "memorable quotes" selected from the popular Internet Movie Database, or IMDb, and the number of times that a particular movie line appears on the Web. Then they compared the memorable lines to the complete scripts of the movies in which they appeared -- about 1,000 movies.
To train their statistical algorithms on common sentence structure, word order and most widely used words, they fed their computers a huge archive of articles from news wires. The memorable lines consisted of surprising words embedded in sentences of ordinary structure. "We can think of memorable quotes as consisting of unusual word choices built on a scaffolding of common part-of-speech patterns," their study said.
Consider the line "You had me at hello," from the movie "Jerry Maguire." It is, Mr. Kleinberg notes, basically the same sequence of parts of speech as the quotidian "I met him in Boston." Or consider this line from "Apocalypse Now": "I love the smell of napalm in the morning." Only one word separates that utterance from this: "I love the smell of coffee in the morning."
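The comparison above can be sketched in a few lines. The researchers trained statistical models on news-wire text; the code below does not reproduce that. It assumes a tiny hand-made tag table, TOY_TAGS, covering only these example sentences, and the helper names pos_pattern and differing_words are hypothetical.

```python
# Hand-assigned part-of-speech tags for the example sentences only.
# A real system would use a trained tagger; this table is an assumption.
TOY_TAGS = {
    "you": "PRON", "i": "PRON", "me": "PRON", "him": "PRON",
    "had": "VERB", "met": "VERB", "love": "VERB",
    "at": "PREP", "in": "PREP", "of": "PREP",
    "the": "DET",
    "hello": "NOUN", "boston": "NOUN", "smell": "NOUN",
    "napalm": "NOUN", "coffee": "NOUN", "morning": "NOUN",
}


def tokenize(sentence):
    """Lowercase the sentence and strip simple punctuation from each word."""
    return [word.strip(".,!?\"").lower() for word in sentence.split()]


def pos_pattern(sentence):
    """Return the sequence of part-of-speech tags for a sentence."""
    return [TOY_TAGS[word] for word in tokenize(sentence)]


def differing_words(first, second):
    """Return the word pairs that differ, position by position."""
    pairs = zip(tokenize(first), tokenize(second))
    return [(a, b) for a, b in pairs if a != b]


same_scaffold = (pos_pattern("You had me at hello.")
                 == pos_pattern("I met him in Boston."))
swapped = differing_words(
    "I love the smell of napalm in the morning.",
    "I love the smell of coffee in the morning.",
)
```

Here same_scaffold is True, since both sentences share one part-of-speech pattern, and swapped holds the single pair ("napalm", "coffee"), the one surprising word on an ordinary scaffold.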
This kind of analysis can be used for all kinds of communications, including advertising. Indeed, Mr. Kleinberg's group also looked at ad slogans. Statistically, the ones most similar to memorable movie quotes included "Quality never goes out of style," for Levi's jeans, and "Come to Marlboro Country," for Marlboro cigarettes.
But the algorithmic methods aren't a foolproof guide to real-world success. One ad slogan that didn't fit well within the statistical parameters for memorable lines was the Energizer batteries catchphrase, "It keeps going and going and going."
Quantitative tools in the humanities and the social sciences, as in other fields, are most powerful when they are controlled by an intelligent human. Experts with deep knowledge of a subject are needed to ask the right questions and to recognize the shortcomings of statistical models.
"You'll always need both," says Mr. Jockers, the literary quant. "But we're at a moment now when there is much greater acceptance of these methods than in the past. There will come a time when this kind of analysis is just part of the tool kit in the humanities, as in every other discipline."
This article originally appeared in The New York Times.