"Just when I thought I was out, they pull me back in!" With a sly grin that I'd soon come to recognize, Paul Ginsparg quoted Michael Corleone from The Godfather. Ginsparg, a physics professor at Cornell University and a certified MacArthur genius, may have little in common with Al Pacino's mafia don, but both are united by the feeling that they were denied a graceful exit from what they've built.
Nearly 35 years ago, Ginsparg created arXiv, a digital repository where researchers could share their latest findings, before those findings had been systematically reviewed or verified. Visit arXiv.org today (it's pronounced like "archive") and you'll still see its old-school Web 1.0 design, featuring a red banner and the seal of Cornell University, the platform's institutional home. But arXiv's unassuming facade belies the tectonic reconfiguration it set off in the scientific community. If arXiv were to stop functioning, scientists from every corner of the planet would suffer an immediate and profound disruption. "Everybody in math and physics uses it," Scott Aaronson, a computer scientist at the University of Texas at Austin, told me. "I scan it every night."
Every industry has certain problems universally acknowledged as broken: insurance in health care, licensing in music, standardized testing in education, tipping in the restaurant business. In academia, it's publishing. Academic publishing is dominated by for-profit giants like Elsevier and Springer. Calling their practice a form of thuggery isn't so much an insult as an economic observation. Imagine if a book publisher demanded that authors write books for free and, instead of employing in-house editors, relied on other authors to edit those books, also for free. And not only that: The final product was then sold at prohibitively expensive prices to ordinary readers, and institutions were forced to pay exorbitant fees for access.
The "free editing" academic publishers facilitate is called peer review, the process by which fellow researchers vet new findings. This can take months, even a year. But with arXiv, scientists could post their papers (known, at this unvetted stage, as preprints) for instant and free access to everyone. One of arXiv's great achievements was "showing that you could divorce the actual transmission of your results from the process of refereeing," said Paul Fendley, an early arXiv moderator and now a physicist at All Souls College, Oxford. During crises like the Covid pandemic, time-sensitive breakthroughs were disseminated quickly (particularly by bioRxiv and medRxiv, platforms inspired by arXiv), potentially saving, by one study's estimate, millions of lives.
While arXiv submissions aren't peer-reviewed, they are moderated by experts in each field, who volunteer their time to ensure that submissions meet basic academic standards and follow arXiv's guidelines: original research only, no falsified data, sufficiently neutral language. Submissions also undergo automated checks for baseline quality control. Without these, pseudoscientific papers and amateur work would flood the platform.
In 2021, the journal Nature declared arXiv one of the "10 computer codes that transformed science," praising its role in fostering scientific collaboration. (The article is behind a paywall; unlock it for $199 a year.) By a recent count, arXiv hosts more than 2.6 million papers, receives 20,000 new submissions each month, and has 5 million monthly active users. Many of the most significant discoveries of the 21st century have first appeared on the platform. The "transformers" paper that launched the modern AI boom? Uploaded to arXiv. Same with the solution to the Poincaré conjecture, one of the seven Millennium Prize problems, famous for their difficulty and $1 million rewards. Just because a paper is posted on arXiv doesn't mean it won't appear in a prestigious journal someday, but it's often where research makes its debut and stays openly available. The transformers paper is still routinely accessed via arXiv.
For scientists, imagining a world without arXiv is like the rest of us imagining one without public libraries or GPS. But a look at its inner workings reveals that it isn't a frictionless utopia of open-access knowledge. Over the years, arXiv's permanence has been threatened by everything from bureaucratic strife to outdated code to even, once, a spy scandal. In the words of Ginsparg, who usually redirects interview requests to an FAQ document (on arXiv, no less) and tried to talk me out of visiting him in person, arXiv is "a child I sent off to college but who keeps coming back to camp out in my living room, behaving badly."
Ginsparg and I met over the course of several days last spring in Ithaca, New York, home of Cornell University. I'll admit, I was apprehensive ahead of our time together. Geoffrey West, a former supervisor of Ginsparg's at Los Alamos National Laboratory, once described him as "quite a character" who is "infamous in the community" for being "quite difficult." He also said he was "extremely funny" and a "great guy." In our early email exchanges, Ginsparg told me, upfront, that stories about arXiv never impress him: "So many articles, so few insights," he wrote.
At 69 years old, Ginsparg has the lean build of a retired triathlete, his knees etched with scars collected over a lifetime of hiking, mountain climbing, and cycling. (He still leads hikes on occasion, leaving younger scientists struggling to keep up.) His attire was always relaxed, as though he'd just stepped off the Camino de Santiago, making my already casual clothes seem overdressy. Much of our time together was spent cycling the town's rolling hills, and the maximum speed on the ebike I rented could not keep up with his efficient pedaling.
Invited one afternoon to Ginsparg's office in Cornell's physics building, I discovered it to be not "messy," per se, because that suggests it could be cleaned. Instead, the objects in the room seemed inert, long since resigned to their fate: unopened boxes from the 1990s, piles of Physics Today magazines, an inexplicable CRT monitor, a tossed-aside invitation to the Obama White House. New items were occasionally added to the heap. I spotted a copy of Stephen Wolfram's recent book, The Second Law, with a note from Wolfram that read, "Since you can't find it on arXiv :)" The only thing that seemed actively in use was the blackboard, dense with symbols and equations related to quantum measurement theory, sprawling with bra-ket notation.
As he showed me around the building and his usual haunts, Ginsparg was gregarious, not letting a single detail slip by: the nesting patterns of local red-tailed hawks, the comings and goings of the dining staff, the plans for a new building going up behind his office. He was playful, even prankish. Midway through telling me about a podcast he was listening to, Ginsparg suddenly stopped and said, "I like your hair color, by the way, it works for you" (my hair is dyed ash gray, if anyone cares) before seamlessly transitioning to a story about a hard drive that had failed him.
The drive, which he had sent for recovery, contained a language model, Ginsparg's latest intellectual fascination. Among his litany of peeves is that, because arXiv has seen a surge in submissions in recent times, especially in the AI category, the number of low-quality papers has followed a similar curve, and arXiv has nowhere near enough volunteers to vet them all. Hence his fussing with the drive, part of a quest to catch subpar submissions with what he calls "the holy grail crackpot filter." And Ginsparg thinks, as he often has in arXiv's three-decade history, that the quality would not be up to snuff if he doesn't do it himself.
Long before arXiv became critical infrastructure for scientific research, it was a collection of shell scripts running on Ginsparg's NeXT machine. In June 1991, Ginsparg, then a researcher at Los Alamos National Laboratory, attended a conference in Colorado, where a fateful encounter took place.
First came a remark from Joanne Cohn, a friend of Ginsparg's and a postdoc at the Institute for Advanced Study in Princeton, who maintained a mailing list for physics preprints. At the time, there was no centralized way to access these preprints. Unless researchers were on certain mailing lists (which were predicated on their affiliations with prestigious institutions) or knew exactly whom to contact via email, they had to wait months to read new work in published journals.
Then came an offhand comment from a physicist worried about his computer's storage filling up with emailed articles while he was traveling.
Ginsparg, who had been coding since high school, asked Cohn if she'd considered automating the distribution process. She hadn't and told him to go ahead and do it himself. "My recollection is that the next day he'd come up with the scripts and seemed pretty happy about having done it so quickly," Cohn told me. "It's hard to communicate how different it was at the time. Paul had really seen ahead."
Hearing tales from and about Ginsparg, you can't help but see him as a sort of Forrest Gump figure of the internet age, who found himself at crucial junctures and crossed paths with revolutionary figures. As an undergrad at Harvard, he was classmates with Bill Gates and Steve Ballmer; his older brother was a graduate student at Stanford studying with Terry Winograd, an AI pioneer. The brothers both had email addresses and access to Arpanet, the precursor to the internet, at a time when few others did.
After earning his PhD in theoretical physics at Cornell, Ginsparg began teaching at Harvard. A career there wasn't to be: He wasn't granted tenure (Harvard is infamous for this) and started looking for a job elsewhere. That's when Ginsparg was recruited to Los Alamos, where he was free to do research on theoretical high-energy physics full-time, without other responsibilities. Plus, New Mexico was perfect for his active lifestyle.
When arXiv started, it wasn't a website but an automated email server (and within a few months also an FTP server). Then Ginsparg heard about something called the "World Wide Web." Initially skeptical ("I can't really pay attention to every single fad"), he became intrigued when the Mosaic browser was released in 1993. Soon after, Ginsparg built a web interface for arXiv, which over time became its primary mode of access. He also occasionally consulted with a programmer at the European Organization for Nuclear Research (CERN) named Tim Berners-Lee (now Sir Tim "Inventor of the World Wide Web" Berners-Lee), whom Ginsparg fondly credits with grilling excellent swordfish at his home in the French countryside.
In 1994, with a National Science Foundation grant, Ginsparg hired two people to transform arXiv's shell scripts into more reliable Perl code. They were both technically gifted, perhaps too gifted to stay for long. One of them, Mark Doyle, later joined the American Physical Society and became its chief information officer. The other, Rob Hartill, was working simultaneously on a project to collect entertainment data: the Internet Movie Database. (After IMDb, Hartill went on to do notable work at the Apache Software Foundation.)
Before arXiv was called arXiv, it was accessed under the hostname xxx.lanl.gov ("xxx" didn't have the explicit connotations it does today, Ginsparg emphasized). During a car ride, he and his wife brainstormed nicer-sounding names. Archive? Already taken. Maybe they could sub in the Greek equivalent of X, chi (pronounced like "kai"). "She wrote it down and crossed out the e to make it more symmetric around the X," Ginsparg said. "So arXiv it was." At this point, there wasn't much formal structure. The number of developers typically stayed at one or two, and much of the moderation was managed by Ginsparg's friends, acquaintances, and colleagues.
Early on, Ginsparg expected to receive on the order of 100 submissions to arXiv a year. It turned out to be closer to 100 a month, and growing. "Day one, something happened, day two something happened, day three, Ed Witten posted a paper," as Ginsparg once put it. "That was when the entire community joined." Edward Witten is a revered string theorist and, quite possibly, the smartest person alive. "The arXiv enabled much more rapid worldwide communication among physicists," Witten wrote to me in an email. Over time, disciplines such as mathematics and computer science were added, and Ginsparg began to appreciate the significance of this new electronic medium. Plus, he said, "it was fun."
As the usage grew, arXiv faced challenges similar to those of other large software systems, particularly in scaling and moderation. There were slowdowns to deal with, like the time arXiv was hit by too much traffic from "stanford.edu." The culprits? Sergey Brin and Larry Page, who were then busy indexing the web for what would eventually become Google. Years later, when Ginsparg visited Google HQ, both Brin and Page personally apologized to him for the incident.
The biggest mystery is not why arXiv succeeded. Rather, it's how it wasn't killed by vested interests intent on protecting traditional academic publishing. Perhaps this was due to a decision Ginsparg made early on: Upon submission, users signed a clause that gave arXiv nonexclusive license to distribute the work in perpetuity, even in the event of future publication elsewhere. The strategic move ensured that no major publishers, known for their typically aggressive actions to maintain feudal control, would ever seriously attempt to shut it down.
But even as arXiv's influence grew, higher-ups at Los Alamos never particularly championed the project, which was becoming, one could argue, more influential than the lab itself. (This was, of course, long past the heyday of Oppenheimer depicted in Christopher Nolan's middling 2023 docudrama.) Those early years at Los Alamos were "dreamlike and heavenly," Ginsparg emphasized, the best job he ever had. But in 1999, a fellow physicist at the lab, Wen Ho Lee, was accused of leaking classified information to China. Lee, a Taiwanese American, was later cleared of wrongdoing, and the case was widely criticized for racial profiling. At the time, the scandal led to internal upheaval. There were travel restrictions to prevent leaks, and even discussions about subjecting employees to lie detector tests. "It just got glummer and glummer," Ginsparg said. It didn't help that a performance review that year labeled him "a strictly average performer" with "no particular computer skills contributing to lab programs." Also, his daughter had just been born, and there weren't schools nearby. He was ready to leave.
Ginsparg stops short of saying he "brought" arXiv with him, but the fact is, he ended up back at his alma mater, Cornell (tenured, this time), and so did arXiv. He vowed to be free of the project within "five years maximum." After all, his main job wasn't supposed to be running arXiv; it was teaching and doing research. At the university, arXiv found a home within the library. "They disseminate material to academics," Ginsparg said, "so that seemed like a natural fit."
A natural fit it was not. Under the hood, arXiv was a complex software platform that required technical expertise far beyond what was typically available in a university library. The logic for the submission process alone involved a vast number of potential scenarios and edge cases, making the code convoluted. Ginsparg and other early arXiv members I spoke to felt that the library failed to grasp arXiv's significance and treated it more like an afterthought.
On the library's side, some people thought Ginsparg was too hands-on. Others said he wasn't patient enough. A "good lower-level manager," according to someone long involved with arXiv, "but his sense of management didn't scale." For most of the 2000s, arXiv couldn't hold on to more than a few developers.
There are two paths for pioneers of computing. One is a life of board seats, keynote speeches, and lucrative consulting gigs. The other is the path of the practitioner who remains hands-on, still writing and reviewing code. It's clear where Ginsparg stands, and how anathema the other path is to him. As he put it to me, "Larry Summers spending one day a week consulting for some hedge fund: it's just unseemly."
But overstaying one's welcome also risks unseemliness. By the mid-2000s, as the web matured, arXiv (in the words of its current program director, Stephanie Orphan) got "bigger than all of us." A creationist physicist sued it for rejecting papers on creationist cosmology. Various other mini-scandals arose, including a plagiarism one, and some users complained that the moderators (volunteers who are experts in their respective fields) held too much power. In 2009, Philip Gibbs, an independent physicist, even created viXra (arXiv spelled backward), a more or less unregulated Wild West where papers on quantum-physico-homeopathy can find their readership, for anyone eager to learn why pi is a lie.
Then there was the problem of managing arXiv's massive code base. Although Ginsparg was a capable programmer, he wasn't a software professional adhering to industry norms like maintainability and testing. Much like constructing a building without proper structural supports or routine safety checks, his methods allowed for quick initial progress but later caused delays and complications. Unrepentant, Ginsparg often went behind the library's back to check the code for errors. The staff saw this as an affront, accusing him of micromanaging and sowing distrust.
In 2011, arXiv's 20th anniversary, Ginsparg thought he was ready to move on, writing what was intended as a farewell note, an article titled "ArXiv at 20," in Nature: "For me, the repository was supposed to be a three-hour tour, not a life sentence. ArXiv was originally conceived to be fully automated, so as not to scuttle my research career. But daily administrative activities associated with running it can consume hours of every weekday, year-round without holiday."
Ginsparg would stay on the advisory board, but daily operations would be handed over to the staff at the Cornell University Library.
It never happened, and as time went on, some accused Ginsparg of "backseat driving." One person said he was holding certain code "hostage" by refusing to share it with other employees or on GitHub. Ginsparg was frustrated because he couldn't understand why implementing features that used to take him a day now took weeks. I challenged him on this, asking if there was any documentation for developers to onboard the new code base. Ginsparg responded, "I learned Fortran in the 1960s, and real programmers didn't document," which nearly sent me, a coder, into cardiac arrest.
Technical problems were compounded by administrative ones. In 2019, Cornell transferred arXiv to the school's Computing and Information Science division, only to have it change hands again after a few months. Then a new director with a background in, of all things, for-profit academic publishing took over; she lasted a year and a half. "There was disruption," said an arXiv employee. "It was not a good period."
But finally, relief: In 2022, the Simons Foundation committed funding that allowed arXiv to go on a hiring spree. Ramin Zabih, a Cornell professor who had been a long-time champion, joined as the faculty director. Under the new governance structure, arXiv's migration to the cloud and a refactoring of the code base to Python finally took off.
One Saturday morning, I met Ginsparg at his home. He was carefully inspecting his son's bike, which I was borrowing for a three-hour ride we had planned to Mount Pleasant. As Ginsparg shared the route with me, he teasingly, but persistently, expressed doubts about my ability to keep up. I was tempted to mention that, in high school, I'd cycled solo across Japan, but I refrained and silently savored the moment when, on the final uphill later that day, he said, "I might've oversold this to you."
Over the months I spoke with Ginsparg, my main challenge was interrupting him, as a simple question would often launch him into an extended monolog. It was only near the end of the bike ride that I managed to tell him how I found him tenacious and stubborn, and that if someone more meek had been in charge, arXiv might not have survived. I was startled by his response.
"You know, one person's tenacity is another person's terrorism," he said.
"What do you mean?" I asked.
"I've heard that the staff occasionally felt terrorized," he said.
"By you?" I replied, though a more truthful response would've been "No shit." Ginsparg apparently didn't hear the question and started talking about something else.
Beyond the drama (if not terrorism) of its day-to-day operations, arXiv still faces many challenges. The linguist Emily Bender has accused it of being a "cancer" for the way it promotes "junk science" and "fast scholarship." Sometimes it does seem too fast: In 2023, a much-hyped paper claiming to have cracked room-temperature superconductivity turned out to be thoroughly wrong. (But equally fast was exactly that debunking, proof of arXiv working as intended.) Then there are opposite cases, where arXiv "censors" (so say critics) perfectly good findings, such as when physicist Jorge Hirsch, of h-index fame, had his paper withdrawn for "inflammatory content" and "unprofessional language."
How does Ginsparg feel about all this? Well, he's not the type to wax poetic about having a mission, promoting an ideology, or being a pioneer of "open science." He cares about those things, I think, but he's reluctant to frame his work in grandiose ways.
At one point, I asked if he ever really wants to be liberated from arXiv. "You know, I have to be completely honest, there are various aspects of this that remain incredibly entertaining," Ginsparg said. "I have the perfect platform for testing ideas and playing with them." Though he no longer tinkers with the production code that runs arXiv, he is still hard at work on his holy grail for filtering out bogus submissions. It's a project that keeps him involved, keeps him active. Perhaps, with newer language models, he'll figure it out. "It's like that Al Pacino quote: They keep bringing me back," he said. A familiar smile spread across Ginsparg's face. "But Al Pacino also developed a real taste for killing people."