28 October 2009

A challenge to the computer folks

Something I'd like to be able to do is to track the citation history backwards from a given paper.  But I want a couple of things that it looks like typical bibliographic sources don't do.  As matters of computer or library science, I don't think they're terribly difficult.  I've seen things done which strike me as much more complex.

Let's start with some paper, call it paper A.  It cites, say, 15 papers (papers B, second generation).  Each of those cites another, say, 15, which at least temporarily means a list of 225 papers (C, third generation).  Easy to get the list of papers cited by paper A (the 15 papers B1..B15), but significant manual effort, it seems, to get the collected list of papers C1..C225.  One thing I would like, however, and which seems completely unsupported, is a count of how many times each paper shows up in this tree.  Some of the papers in the second generation probably cite others in the second generation.  And it's a near certainty that many of the third generation papers are cited by several of the second, and probably a good number of the third generation cite each other.  This is pretty much just a simple social network kind of analysis -- some papers have lots of friends, and some not so much.  I'd like to see which papers are highly connected, and which aren't, working within the group established by the papers cited by a paper of my interest (which in practice won't be one of my own) and the lines of reference descent from there.
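To make the counting idea concrete, here is a minimal sketch in Python. It assumes the reference lists are already available as a dictionary mapping each paper to the papers it cites, which is exactly the data that is hard to get in practice. The names papers_within, citation_counts and toy_citations, and the toy data itself, are made up for illustration.

```python
from collections import Counter


def papers_within(citations, root, generations):
    """Collect the root plus every paper reachable in at most `generations` citation steps."""
    seen = {root}
    frontier = [root]
    for _ in range(generations):
        next_frontier = []
        for paper in frontier:
            for cited in citations.get(paper, []):
                if cited not in seen:
                    seen.add(cited)
                    next_frontier.append(cited)
        frontier = next_frontier
    return seen


def citation_counts(citations, root, generations):
    """Count, for each paper in the group, how many papers in the group cite it."""
    group = papers_within(citations, root, generations)
    counts = Counter()
    for paper in group:
        # set() so a paper listed twice in one reference list is counted once
        for cited in set(citations.get(paper, [])):
            if cited in group:
                counts[cited] += 1
    return counts


# Hypothetical toy data: paper A cites B1..B3, which cite C1..C3 and each other.
toy_citations = {
    "A": ["B1", "B2", "B3"],
    "B1": ["C1", "C2", "B2"],
    "B2": ["C1", "C3"],
    "B3": ["C1", "C2"],
    "C2": ["C1"],
}

counts = citation_counts(toy_citations, "A", 2)
for paper in sorted(counts):
    print(paper, counts[paper])
```

In this toy group, C1 is the well-connected paper: it is cited by all three second generation papers and by one third generation paper. The counting is the easy part; the sketch says nothing about where a complete citations dictionary would come from.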


The second sort of thing I'd like to see is for the chart to be continued through enough generations that sources like Newton's Principia start appearing on the list.  I'm curious how many generations, in terms of citation history, modern work is removed from some of the landmark sources.  Unfortunately, it seems that the bibliographic databases I have access to die out in the mid 1980s, which is a long way from where I want to get to.

16 comments:

Anonymous said...

You could probably reuse a lot of the work done to produce an Erdős number calculator. You should seek out and talk to the folks who put those together and see if they're willing to share code.

Alastair said...

Obviously you are a bit stuck at present with the online libraries only going back to 1980, but even if everything is digitised you will get stuck again in the 19th century, before the time of proper citations and lists of references.

However, I can give you some help getting back to Newton. Find a reference to Fourier's 1827/1824 article translated by W. here. Fourier cites Horace de Saussure whom I have translated here. The reference to Newton is in the letter to the Journal de Paris.

Interestingly, Saussure uses Newton to back his claim that the heliothermometer works by absorbing heat, not by reflections or back-radiation which is the current idea!

andrewt said...

Google scholar finds plenty of recent papers citing Newton's books.
So I suspect you probably have a citation chain of length 4 to Principia and possibly of length 3 where by length 3 I mean:

Grumbine cites A which cites B which cites Principia.

carrot eater said...

Yeah, that's exhausting the capabilities of Web of Science.

It'd be a good project though.

It's weird that I was thinking about something similar the other day: important papers get cited thousands of times, but at some point, that work becomes so ingrained in the fundamentals you don't bother citing it anymore. You were thinking Newton; I was thinking Darwin.

The number of times such authors are cited will be huge, but not nearly huge enough to reflect their impact. So tracing out the entire tree, as you suggest, would give a better gauge of the overall impact.

Steinn said...

ADS should be able to do this relatively trivially, and their database is complete back to the 17th century, with some stuff going back over 2000 years.

Robert Grumbine said...

Alastair, thanks

andrewt:
Oh, sure, no problem to find citations -- from all over -- to any given paper. My challenge is more subtle. Find a citation chain (and preferably all of them) between any given paper and any target paper. Given an arbitrary paper, how can it be (and can it be) linked back to the Principia? Starting with the Principia and then building forward would work, if there were tools to support that approach too.
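As a rough illustration of "find all the chains", here is a minimal Python sketch. It assumes the reference lists have already been loaded into a dictionary mapping each paper to the papers it cites. The function name citation_chains, the max_links cutoff and the toy data are all hypothetical. A cutoff is needed because the number of chains can grow very quickly with depth.

```python
def citation_chains(citations, start, target, max_links):
    """Yield every chain start -> ... -> target that uses at most max_links citations."""

    def extend(chain):
        last = chain[-1]
        if last == target:
            yield list(chain)
            return
        if len(chain) - 1 >= max_links:
            return  # chain is already as long as allowed
        for cited in citations.get(last, []):
            if cited not in chain:  # guard against citation loops
                chain.append(cited)
                yield from extend(chain)
                chain.pop()

    yield from extend([start])


# Hypothetical toy data, not real reference lists.
toy_citations = {
    "Grumbine": ["A", "X"],
    "A": ["B"],
    "X": ["B", "Y"],
    "B": ["Principia"],
    "Y": [],
}

for chain in citation_chains(toy_citations, "Grumbine", "Principia", 3):
    print(" cites ".join(chain))
```

With the toy data this finds two chains of three links each, one through A and one through X. Run against a real database, exhaustive enumeration like this would only be practical with a small max_links.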

carrot:
I tend to think about the converse matter. There are papers and people who were actually much more influential than their 'fame' would suggest. This sometimes shows up in citation patterns.

Steinn:
ADS (http://adsabs.harvard.edu/abstract_service.html) doesn't have quite what you seem to think. Their listing of papers is indeed extensive, so it's easy to find all papers written by, say, Alar Toomre. But, as it notes when you follow up, they don't have lists of papers cited by all papers. For Toomre, none of his 10 most recent papers have a reference list yet. 1983 is the most recent one that does. Selecting that as my starting point, I immediately get the message:
"References for 1983IAUS..100..177T from the ADS Database
The Reference database in the ADS is NOT complete. Please keep this in mind when using the ADS Reference lists."

For this paper that does have a reference list (13 papers) I can select all the papers he cited and ask for all papers that they cite as well (only 1 doesn't have a reference list).

Ah, it does put in the number of citations to the paper from the selected group of papers. A little hidden (and why give an integer to 3 decimal places?). So it's a start -- if the citation lists were all present. They're not, and the probability declines as you go back in time. But that incompleteness makes it unsuitable for going from paper A (say Toomre's 1983 WARP paper) back to the Principia.

Still, to the extent it's more or less complete for the more recent papers, it's a definite step forward on my first challenge.

Anonymous said...

I've been wishing for the ability to graphically show the links back from a paper for a while as well.

In the social sciences there's a tendency for 'obliteration by incorporation' whereby the greatest ideas seem to become common without reference to their originator. (Hence very few named laws in a field with fewer laws anyhow).

I work mainly in the field of business academia and it would be great to be able to plot reference 'chains' to see how ideas evolved and if there are new ideas instead of (as one colleague advises) only going back a decade or so.

Finally, there was a paper a few years back called 'Read before you cite' which analysed copied typos in references and concluded that a vast number of references were never actually read or were at least 'scraped' from other papers.

I look forward to seeing if anyone comes up with a response.

quasarpulse said...

Well, the second problem is a fairly computationally-simple problem in graph theory - specifically, this one:
http://en.wikipedia.org/wiki/Shortest_path_problem

The first is a matter of counting a degree of a vertex in a subgraph, which I believe is also a relatively simple matter.

There really aren't any technical limitations preventing it from happening - the bulk of the work would be data entry. Somebody's going to have to do all that data entry, though.
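For what it's worth, the shortest-chain part can be sketched in a few lines of Python with a breadth-first search. This assumes the data entry is already done and the reference lists sit in a dictionary mapping each paper to the papers it cites. The function name shortest_citation_chain and the toy data are made up for illustration.

```python
from collections import deque


def shortest_citation_chain(citations, start, target):
    """Return one shortest chain of citations from start to target, or None if there is none."""
    parent = {start: None}  # also serves as the set of papers already visited
    queue = deque([start])
    while queue:
        paper = queue.popleft()
        if paper == target:
            # Walk the parent links back to the start, then reverse.
            chain = []
            while paper is not None:
                chain.append(paper)
                paper = parent[paper]
            return chain[::-1]
        for cited in citations.get(paper, []):
            if cited not in parent:
                parent[cited] = paper
                queue.append(cited)
    return None


# Hypothetical toy data, not real reference lists.
toy_citations = {
    "Paper A": ["B1", "B2"],
    "B1": ["C1"],
    "B2": ["Principia"],
    "C1": ["Principia"],
}

chain = shortest_citation_chain(toy_citations, "Paper A", "Principia")
print(chain, "links:", len(chain) - 1)
```

Breadth-first search is enough here because every citation counts as one step, so no edge weights are involved. Each paper and each reference is looked at once at most, which supports the point that the search itself is cheap compared with assembling the data.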

dagon said...

Something better than this
http://wokinfo.com/products_tools/multidisciplinary/webofscience/citmap/

dagon said...

I can do this if there exists a way to download a file containing paper titles and their references.

If not, the only other option I can think of is to scrape the data from ADS live for each query, which will be slow.

James Annan said...

Re "read before you cite", I think it's a bit silly because regardless of whether I have read the ref (which I almost always do) I often copy the ref from somewhere (including google scholar, which is not infallible) rather than deducing it from the paper itself and typing it in by hand. The paper may not even always have the full details that I want, although maybe I'm being old-fashioned in still trying to use page numbers when the e-version is the main record. In the good old days, the hard copy was generally hidden somewhere in a heap on my desk, *especially* if I had read it!

Robert Grumbine said...

anon:
In physical science, we moved away from calling things 'laws' some time ago. So I guess it's more that you're in a newer field. In physical oceanography, also a young field, I can't offhand think of anything we call a 'law'.

Your point, though, about obliteration by incorporation, is interesting. Probably also true in my field(s). Not exactly obliteration, but things become assumed 'common knowledge' awfully fast. Too much so for someone like me who likes to track back to the origins of ideas.

quasarpulse:
The challenge seems to be in the database side, and dealing with its scale (apparently), rather than the logic of it. As you note, the fundamental problem is a well-known one, with some reasonably good solutions.

dagon:
The Web of Science two-generation limit was one of the examples I had in mind as not doing what I wanted. Also, it doesn't show how many times any given paper shows up. What I'm looking for is more along the lines of the Facebook 'friendwheel', where all papers would be on the cloud, and their inter-linkages (and link counts) would be shown.

James:
We're all certainly capable of introducing our own errors in typing a reference, so I'm not too concerned about the rule there. But I do wish people would read the sources they cite. Read Ekman's 1905 paper and compare that with what he's usually cited as having done. Sometime between then and, say, 1980, it became normal to cite him as if he'd only looked at the case of an infinitely deep ocean, infinitely steady winds, with absolutely constant friction. He did that, but he also did several other, more realistic cases.

EliRabett said...

What you want exists for patents
http://www.delphion.com/help/citelink_help


and for web of science
http://cm.isiknowledge.com/support/help/h_citation_map.html

or you could google citation mapping.

Anonymous said...

James A. - Read before you debunk

Wow you critiqued and debunked the paper before you read it, let alone cited it. How great is that?!?

BTW it was written before Google Scholar and, of course, they did try to allow for the obvious flaws, and as they said:

"Surely, in the pre-internet era it took almost equal effort to copy a reference as to type in one’s own based on the original, thus providing little incentive to copy if someone has indeed read, or at the very least has procured access to the original. Moreover, if someone accesses the original by tracing it from the reference list of a paper with a misprint, then with a high likelihood, the misprint has been identified and will not be propagated."

From:
Read before you cite!
M.V. Simkin and V.P. Roychowdhury
Department of Electrical Engineering, University of California, Los Angeles, CA 90095-1594
Abstract. We report a method of estimating what percentage of people who cited a paper had actually read it. The method is based on a stochastic modeling of the citation process that explains empirical studies of misprint distributions in citations (which we show follows a Zipf law).

Our estimate is only about 20% of citers read the original.

P. Lewis said...

A little off topic, but since it has been raised and commented upon...

I remember reading the New Scientist article "Scientists exposed as sloppy reporters" about Simkin and Roychowdhury's "Read before you cite!" arXiv note. Simkin and Roychowdhury followed this up a year or so later with "Stochastic modeling of citation slips", later published in the journal Scientometrics.

These papers piqued my interest at the time because they didn’t seem to be a surprising set of findings, given my experience. I could relate that experience, but since it is largely anecdotal it doesn’t materially add to the Simkin and Roychowdhury papers and so is omitted.

In closing, I’ll just add this thought and a rather rhetorical question. I know that by and large one doesn't reference a reference (except where so-and-so is "as cited by X and Y" because it has not been possible to obtain the original reference), but technically, since the reference list is part of the body of work (and might also be subject to copyright), this copying (errors included) could be regarded as a plagiaristic practice (and breach of copyright I suppose). If the same situation arose where some main body text had been copied from somewhere else (errors included) without attribution, then people would say that that's probably a tell-tale sign of plagiarism. So, why not with a reference list?

James Annan said...

Anon,

Jump to conclusions much?

Where did you get the impression that I commented on the paper before reading it? It is easily available on the web, as the most rudimentary google search will show.

I don't think the date of the paper invalidates my suggestion that references will often be copied from something to hand rather than necessarily drawn from the original paper, even if the writer has the original paper. That may be *more* likely back in the days when the original paper actually is paper, hidden in a stack in a filing cabinet somewhere...

Plus, I've been sharing bibtex files for years.