A couple of decades later they spoke of atoms and communism. By the 1990s, buzzwords prevailed.
Jon Kleinberg, a professor of computer science at Cornell University, Ithaca, N.Y., has developed a method for a computer to find the topics that dominate a discussion at a particular time by scanning large collections of documents for sudden, rapid bursts of words. Among other tests of the method, he scanned presidential State of the Union addresses from 1790 to the present and created a list of words that eerily reflects historical trends. The technique, he suggests, could have many "data mining" applications, including searching the Web or studying trends in society as reflected in Web pages.
Kleinberg will emphasize the Web applications of his searching technique in a talk, "Web Structure and the Design of Search Algorithms," at the annual meeting of the American Association for the Advancement of Science (AAAS) in Denver on Feb. 18. He is taking part in a symposium on "Modeling the Internet and the World Wide Web."
Kleinberg says he got the idea of searching over time while trying to deal with his own flood of incoming e-mail. He reasoned that when an important topic comes up for discussion, keywords related to the topic will show a sudden increase in frequency. A search for these words that suddenly appear more often might, he theorized, provide ways to categorize messages.
He devised a search algorithm that looks for "burstiness," measuring not just the number of times words appear, but the rate of increase in those numbers over time. Programs based on his algorithm can scan text that varies with time and flag the most "bursty" words. "The method is motivated by probability models used to analyze the behavior of communication networks, where burstiness occurs in the traffic due to congestion and hot spots," he explains.
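As a rough illustration of the idea, the sketch below labels each time step for a single word as either "baseline" or "burst" using a two-state probability model. It is a simplified sketch under stated assumptions, not Kleinberg's implementation: the function name `burst_states`, the `scale` and `gamma` parameters, and the choice of a binomial fit cost are all illustrative choices that this release does not specify.

```python
import math

def burst_states(hits, totals, scale=2.0, gamma=1.0):
    """Label each time step 0 (baseline) or 1 (burst) for one word.

    hits[t]   = number of documents at step t that contain the word
    totals[t] = number of documents at step t
    scale and gamma are hypothetical tuning knobs, not values from the article.
    """
    n = len(hits)
    # Overall rate of the word, clamped so the logarithms below stay finite.
    base = min(max(sum(hits) / sum(totals), 1e-9), 0.9999)
    rates = (base, min(0.9999, base * scale))  # baseline rate, elevated rate
    enter_cost = gamma * math.log(n)           # penalty for switching into a burst

    def fit_cost(state, r, d):
        # Negative log-likelihood of r hits in d documents at this state's rate
        # (the binomial coefficient is the same for both states, so it is omitted).
        p = rates[state]
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    cost = [0.0, float("inf")]  # the sequence starts in the baseline state
    back = []
    for r, d in zip(hits, totals):
        new_cost, pointers = [], []
        for state in (0, 1):
            # Moving from baseline to burst costs enter_cost; other moves are free.
            options = [
                cost[prev] + (enter_cost if (prev == 0 and state == 1) else 0.0)
                for prev in (0, 1)
            ]
            prev = 0 if options[0] <= options[1] else 1
            new_cost.append(options[prev] + fit_cost(state, r, d))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Walk the back-pointers from the cheapest final state to recover the labels.
    state = 0 if cost[0] <= cost[1] else 1
    path = []
    for pointers in reversed(back):
        path.append(state)
        state = pointers[state]
    path.reverse()
    return path
```

For example, `burst_states([1, 1, 1, 9, 9, 1, 1, 1], [10] * 8)` marks only the two steps where the word jumps from one document in ten to nine in ten. The entry penalty is what keeps a single slightly elevated count from being flagged: a burst has to be sustained or sharp enough to pay for the switch.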
In his own e-mail -- largely from other computer scientists -- he quickly found keywords relating to hot topics. In mail from students he found bursts in the word "prelim" shortly before each midterm exam. Later, he tried the same technique on the texts of State of the Union addresses, all of which are available on the Web, from Washington in 1790 through George W. Bush in 2002. From these speeches he produced a long list of words (see attached table) that summarizes American politics from early revolutionary fervor up to the age of the modern speechwriter.
While we already know about these trends in American history, Kleinberg points out, a computer doesn't, and it has found these ideas just by scanning raw text. So such a technique should work just as well on historical records in obscure situations where we have no idea what the important terms or keywords are. It might even be used to screen e-mail "chatter" by terrorists. Sociologists, Kleinberg adds, may find it interesting to look for trends in personal Web logs popularly known as "blogs."
For searching the Web, Kleinberg suggests, such a technique could help zero in on what a searcher wants by recognizing the time context of such material as news stories. For instance, he says, a person searching for the word "sniper" today is likely to be looking for information about the recent attacks around the nation's capital -- but the same search nearly four decades ago might have come from someone interested in the Kennedy assassination.
In his AAAS talk Kleinberg also explores other Web-searching techniques. A few years ago, he suggested that a way to find the most useful Web sites on a particular subject would be to look at the way they are linked to one another. Sites that are "linked to" by many others are probably "authorities." Sites that link to many others are likely to be "hubs." The most authoritative sites on a topic would be the ones that are linked to most often by the most active hubs, he reasoned. A variation on this idea is used by Google, and a more formal version is being used in a new search engine called Teoma (http://www.teoma.com).
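The mutual reinforcement between hubs and authorities can be sketched as a short iterative computation. This is a minimal sketch under stated assumptions: the function name `hubs_and_authorities`, the fixed number of rounds and the toy link data are illustrative, and it makes no claim to match what Google or Teoma actually run.

```python
import math

def hubs_and_authorities(links, rounds=50):
    """Score pages as hubs and authorities from a dict {page: pages it links to}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    def normalize(scores):
        # Rescale so the scores do not grow without bound between rounds.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {p: v / norm for p, v in scores.items()}

    for _ in range(rounds):
        # A page's authority is the total hub score of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        auth = normalize(auth)
        # A page's hub score is the total authority of the pages it links to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, ())) for p in pages})
    return hub, auth

# Hypothetical toy Web: three link pages pointing at two content pages.
toy_links = {"h1": {"a", "b"}, "h2": {"a", "b"}, "h3": {"a"}}
hub_scores, auth_scores = hubs_and_authorities(toy_links)
```

In this toy graph page "a" ends up with the highest authority score because all three linking pages point to it, and "h1" outranks "h3" as a hub because it points to both content pages.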
Kleinberg and others have found that despite its anarchy, there is a great deal of "self-organization" on the Web. In a variation on the "six degrees of separation" idea, Kleinberg says, almost every site on the Web can be reached from almost any other through a series of steps. The structure seems to be a bit like the Milky Way galaxy, with a very dense "core" of heavily interconnected sites surrounded by less dense regions. Nodes outside the core are divided into three categories: "upstream" nodes that link to the core but cannot be reached from it; "downstream" nodes that can be reached from the core but don't link back to it; and isolated "tendrils" that are not linked directly to the core at all.
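The core, upstream and downstream categories can be computed from reachability alone. The sketch below assumes a page already known to sit in the core (the hypothetical `seed` argument) and a toy link dictionary; `bow_tie` and `reachable` are illustrative names, and everything that is neither reachable from the core nor able to reach it is lumped into a single "other" bucket that would include the tendrils.

```python
from collections import deque

def reachable(start, edges):
    """Return every node that can be reached from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links, seed):
    """Split pages into core, upstream, downstream and other, given a core page."""
    nodes = set(links)
    reverse = {}
    for src, targets in links.items():
        for t in targets:
            nodes.add(t)
            reverse.setdefault(t, set()).add(src)
    forward = reachable(seed, links)      # pages the seed can reach
    backward = reachable(seed, reverse)   # pages that can reach the seed
    core = forward & backward             # mutually reachable with the seed
    return {
        "core": core,
        "upstream": backward - core,      # link into the core, not reachable from it
        "downstream": forward - core,     # reachable from the core, no path back
        "other": nodes - forward - backward,
    }

# Hypothetical toy Web with a two-page core.
toy = {"u": ["a"], "a": ["b"], "b": ["a", "d"], "d": [], "x": ["y"]}
parts = bow_tie(toy, seed="a")
```

Here "a" and "b" form the core, "u" is upstream, "d" is downstream, and "x" and "y" fall outside the structure entirely.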
Within this structure there are many "communities" of sites representing common interests that are extensively linked to one another. So, Kleinberg suggests, searches might be done by following along the link paths from one site to another, as well as just scanning an index of everything.
"Deeper analysis, exposing the structure of communities embedded in the Web, raises the prospect of bringing together individuals with common interests and lowering barriers to communication," Kleinberg concludes.
Related World Wide Web sites: The following site provides additional information on this news release.
Jon Kleinberg's page, with links to papers: http://www.cs.cornell.edu/home/kleinber/
The 150 term bursts of highest weight in Presidential State of the Union Addresses, 1790-2002
Word
Interval of burst
gentlemen
1790 - 1800
militia
1801 - 1816
british
1809 - 1814
enemy
1812 - 1814
savages
1812 - 1819
spain
1818 - 1821
likewise
1818 - 1824
chambers
1833 - 1835
french
bank
1833 - 1836
france
1834 - 1835
texas
1843 - 1846
annexation
1844 - 1846
mexican
1845 - 1847
her
1846 - 1847
mexico
steamers
1847 - 1849
oregon
1847 - 1852
california
1848 - 1852
kansas
1856 - 1858
slavery
1857 - 1860
whilst
slaves
1859 - 1863
rebellion
1861 - 1871
emancipation
1862 - 1864
paper
1867 - 1868
coinage
1877 - 1886
silver
1884 - 1885
1889 - 1891
spanish
1897 - 1898
cuba
1897 - 1899
puerto
1898 - 1901
reserves
1901 - 1904
forest
1901 - 1905
forests
1907 - 1908
interstate
marketing
1919 - 1929
tile
1922 - 1928
ought
1925 - 1926
veterans
1925 - 1931
relief
1929 - 1935
depression
1930 - 1937
recovery
banks
1931 - 1934
democracy
1937 - 1941
wartime
1941 - 1947
production
1942 - 1943
fighting
1942 - 1945
japanese
war
peacetime
1945 - 1947
program
1946 - 1948
wage
1946 - 1949
housing
1946 - 1950
atomic
1947 - 1959
collective
1947 - 1961
aggression
1949 - 1955
defense
1951 - 1952
free
1951 - 1953
soviet
korea
1951 - 1954
communist
1951 - 1958
1954 - 1956
alliance
1961 - 1966
1961 - 1967
poverty
1963 - 1969
propose
1965 - 1968
tonight
1965 - 1969
billion
1966 - 1969
vietnam
1966 - 1973
america
1970 - 1972
goal
1970 - 1974
inflation
1971 - 1980
energy
1974 - 1978
oil
1974 - 1981
significant
ensure
1974 - 1988
nuclear
1975 - 1981
strategic
percent
1975 - 1984
major
1977 - 1983
we've
1978 - 1980
commitment
1978 - 1981
sector
1978 - 1986
nation's
1979 - 1981
1979 - 1983
1980 - 1981
1980's
1980 - 1982
initiatives
1980 - 1985
afghanistan
1980 - 1988
1981 - 1982
programs
1981 - 1983
women
1981 - 1984
chamber
1982 -
that's
we're
deficits
1982 - 1988
america's
1982 - 1992
spending
1982 - 1995
it's
1982 - 1996
there's
we'll
1982 - 1998
they're
1982 - 1999
can't
1983 -
child
i'm
1983 - 1998
1984 -
tell
1984 - 1995
freedom
1985 - 1991
don't
1986 -
1986 - 1991
let's
1987 -
get
1987 - 1995
kids
let
businesses
1990 -
got
parents
something
1990 - 1997
cuts
1991 -
families
crime
1991 - 1996
cut
jobs
1991 - 1998
hard
1991 - 1999
know
children
1992 -
thank
health
1992 - 1994
want
1992 - 1995
you
americans
1994 -
medicare
school
welfare
1994 - 1997
bipartisan
1995 -
college
communities
working
1995 - 1996
1996 -
challenge
schools
teachers
21st
1997 -
ask
century
help
1998 -
1998 - 1999
E-Mail: deb27@cornell.edu