ARTICLE
Definitions and terminology

Open Science infrastructure is one of the four pillars of Open Science in the UNESCO Recommendation on Open Science (2021)
Open science infrastructure is a form of knowledge infrastructure that makes it possible to create, publish and maintain open scientific outputs such as publications, data or software.
The UNESCO Recommendation on Open Science, approved in November 2021, defines open science infrastructures as “shared research infrastructures that are needed to support open science and serve the needs of different communities”[footnote “UNESCO Recommendation on Open Science, 2021, CL/4363”]. The SPARC report on European Open Science Infrastructure includes the following activities within the range of open science infrastructures: “We define Open Access & Open Science Infrastructure as sets of services, protocols, standards and software contributing to the research lifecycle – from collaboration and experimentation through data collection and storage, data organization, data analysis and computation, authorship, submission, review and annotation, copyediting, publishing, archiving, citation, discovery and more”[footnote “Ficarra et al. 2020, p. 7”].
Infrastructure
The use of the term “infrastructure” is an explicit reference to the physical infrastructures and networks such as power grids, road networks or telecommunications that made it possible to run complex economic and social systems after the industrial revolution: “The term infrastructure has been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and similar public works that are required for an industrial economy to function (…) If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy”[footnote “Atkins 2003, p. 5”]. The concept of infrastructure was notably extended in 1996 to forms of computer-mediated knowledge production by Susan Leigh Star and Karen Ruhleder, through an empirical observation of an early form of open science infrastructure, the Worm Community System.[footnote “Star & Ruhleder 1996”] This definition remained influential over the next two decades in science and technology studies[footnote “Karasti et al. I 2016, p. 4”] and has affected the policy debate over the building of scientific infrastructure since the early 2000s.[footnote “Atkins 2003, p. 5”]
Open science infrastructures have specific properties that distinguish them from other forms of open science projects or initiatives:
Open science infrastructures are not simply a technical product but embed a set of tools, institutions and social norms.[footnote “Fecher et al. 2021, p. 500”][footnote “Edwards et al. 2006, p. 6”] Consequently, infrastructures are not always visible as they can be largely hidden under the routine of normal activities.[footnote “Moore 2019, p. 121: “infrastructures are not easily divisible, recognisable or compartmentalised””][footnote “Okune et al. 2018, p. 3”] The resilience and tacitness of infrastructures make it especially difficult to identify the real contributions and “labour cost” of open science work, as it remains “invisible in the university system”.[footnote “Moore 2019, p. 143”] This also makes it difficult to allocate funding effectively, as critical infrastructure may remain undetected by funding bodies.[footnote “Neylon 2018, p. 1”]
Open science infrastructures are durable and resilient. They are expected to run on a long-term basis, and multiple research programs rely on them.[footnote “Atkins 2003, p. 5”][footnote “Fecher et al. 2021, p. 500”] To some extent, infrastructures are successful when they are forgotten and become an integral part of routine research activities: “Infrastructure at its best is invisible. We tend to only notice it when it fails.”[footnote “Neylon et al. 2015”]
Open science infrastructures can be shared and used by different actors and communities. They must be sufficiently consistent to remain coordinated, yet they have to welcome a diverse array of local uses: “an infrastructure occurs when the tension between local and global is resolved”.[footnote “Star & Ruhleder 1996”] Prior agreement among all stakeholders on the scope and the governance of the infrastructure is a critical step.[footnote “Bos et al. 2007, p. 667”]
Openness and the commons
Open science infrastructures are open, which differentiates them from other scientific and knowledge infrastructures and, more specifically, from subscription-based commercial infrastructures. Openness is both a core value and a directing principle that affects the aims, the governance and the management of the infrastructure. Open science infrastructures face issues similar to those met by other open institutions such as open data repositories or large-scale collaborative projects such as Wikipedia: “When we study contemporary knowledge infrastructures we find values of openness often embedded there, but translating the values of openness into the design of infrastructures and the practices of infrastructuring is a complex and contingent process”.[footnote “Karasti et al. IV 2016, p. 5”]
The conceptual definition of open science infrastructures has been largely influenced by the analysis of Elinor Ostrom on the commons and more specifically on the knowledge commons. In accordance with Ostrom, Cameron Neylon underlines that open infrastructures are not only characterized by the management of a pool of common resources but also by the elaboration of common governance and norms.[footnote “Neylon 2018, p. 7”] The economic theory of the commons makes it possible to expand beyond the limited scope of scholarly associations toward large-scale community-led initiatives: “Ostrom’s work (…) provides a template (…) to make the transition from a local club to a community-wide infrastructure.”[footnote “Neylon 2018, p. 7-8”] Open science infrastructures tend to favor a not-for-profit, publicly funded model with strong involvement from scientific communities, which dissociates them from privately owned closed infrastructures: “open infrastructures are often scholar-led and run by non-profit organisations, making them mission-driven instead of profit-driven.”[footnote “Kraker 2021, p. 2”] This status aims to ensure the autonomy of the infrastructures and prevent their incorporation into commercial infrastructure.[footnote “Future of scholarly publishing 2019”] It has wide-ranging implications for the way the organization is managed: “the differences between commercial services and non-profit services permeated almost every aspect of their responses to their environment”[footnote “Fecher et al. 2021, p. 505”].
Open science infrastructures are not only a more specific subset of scientific infrastructures and cyberinfrastructures but may also include actors that would not fall into this definition. “Open access publication platforms” such as Scielo, OpenEdition or the Open Library of Humanities are considered an integral part of open science infrastructures in the UNESCO definition[footnote “UNESCO Recommendation on Open Science, 2021, CL/4363”] and in several literature reviews[footnote “lewis 2020, p. 6”] and policy reports[footnote “Ficarra et al. 2020, p. 8”], whereas they were usually considered as separate entities in the policy debate on cyberinfrastructure and e-infrastructures.[footnote “Dacos 2013”] In the 2010 report of the European Commission on e-infrastructure, scientific publishing platforms are “not e-Infrastructures but closely related to it”.[footnote “Role of e-Infrastructure 2010, p. 222”]
Open science infrastructures may also incorporate additional values and ethical principles. Samuel Moore has theorized a form of care-full scholarly commons that does not exist yet but would incorporate latent forms of open science infrastructure and communities: “In addition to sharing resources with other projects, commoning also requires commoners to adopt an outwardly-focused, generous attitude to other commons projects, redirecting their labour away from proprietary.”[footnote “Moore 2019, p. 183”] In 2018, Okune et al. introduced a similar concept of “inclusive knowledge infrastructures” that “deliberately allow for multiple forms of participation amongst a diverse set of actors (…) and seek to redress power relations within a given context.”[footnote “Okune et al. 2018, p. 3”]
Principles for open science infrastructures
In 2015, the Principles for Open Scholarly Infrastructure laid out an influential prescriptive definition of open science infrastructures. Subsequent definitions and terminologies of open science infrastructures have been largely elaborated on this basis.[footnote “Ross-Hellauer et al. 2020, p. 13”][footnote “Ficarra et al. 2020, p. 7”][footnote “SPARC 2020”] The text has also influenced the definition of open science infrastructure retained by UNESCO in November 2021[footnote “(https://en.unesco.org/sites/default/files/comments_osr_partner_open_science_mooc_document.pdf Open Science MOOC Response to UNESCO Draft Open Science Recommendations), December 30, 2020”].
The Principles attempt to hybridize the framework of infrastructure studies with the analysis of the commons initiated by Elinor Ostrom. They develop a series of recommendations in three areas critical to the success of open infrastructures:
Governance: the governance of the infrastructure should be open and accountable to the scientific communities it aims to serve. Specific measures should ensure that the management of the organization is transparent and diverse.[footnote “Neylon et al. 2015”]
Sustainability: the core activities of the organization should be covered by recurring funds. Short-term subventions should be limited to short-term projects. While the organization could charge for services, this should not extend to the data, which should remain “a community property”.[footnote “Neylon et al. 2015”]
Insurance: the technical infrastructure and the output of the organization are open. This ensures that the infrastructure can be recreated if necessary (in the jargon of open source, it becomes “forkable”).[footnote “Neylon et al. 2015”]
The text ends by mentioning several potential consequences of the principles. The authors advocate for a responsible centralization that embodies a different model from that of large commercial web platforms like Google and Facebook while still maintaining the important benefits of centralized infrastructures: “we will be able to build accountable and trusted organisations that manage this centralization responsibly”.[footnote “Neylon et al. 2015”] Existing examples of large open infrastructures include ORCID, the Wikimedia Foundation or CERN.
A more critical reception has focused on the underlying political philosophy of the Principles.[footnote “Moore 2019”][footnote “Okune et al. 2018”] While the scientific community is a key part of the governance of open science infrastructure, Samuel Moore underlines that it is never precisely defined, which raises potential issues of under-representation of minority groups:
[this] raises questions over who is the community that gets to govern and exclude, and what gives them the right to decide the conditions. These questions are especially relevant for understandings of the commons that are all-encompassing or operate on a large scale, which tend to favour more powerful stakeholders, wealthy disciplines and countries in the Global North. Such commons treat subjects in a political vacuum rather than embedded in a particular situation and entangled in a number of different relationships and projects with asymmetrical power structures.[footnote “Moore 2019, p. 173”]
History
Early developments (1950–1990)

The Sputnik launch triggered one of the first major debates on scientific infrastructure
Scientific projects have been among the earliest use cases for digital infrastructure. The theorization of scientific knowledge infrastructure even predates the development of computing technologies. The knowledge networks envisioned by Paul Otlet or Vannevar Bush already incorporated numerous features of online scientific infrastructures.[footnote “Borgman 2007, p. 40”]
After the Second World War, the United States faced a “periodical crisis”: existing journals could not keep up with the rapidly increasing scientific output[footnote “Wouters 1999, p. 61”]. The issue became politically relevant after the successful launch of Sputnik: “The Sputnik crisis turned the librarians’ problem of bibliographic control into a national information crisis.”[footnote “Wouters 1999, p. 62”] The emerging computing technologies were immediately considered as a potential solution to make a larger amount of scientific output readable and searchable. Access to foreign-language publications was also a key issue that was expected to be solved by machine translation: in the 1950s, a significant share of scientific publications was not available in English, especially those coming from the Soviet bloc.
Influential members of the National Science Foundation like Joshua Lederberg advocated for the creation of a “centralized information system”, SCITEL, that would at first coexist with printed journals and gradually replace them altogether on account of its efficiency[footnote “Wouters 1999, p. 60”]. In the plan laid out by Lederberg to Eugene Garfield in November 1961, the deposit would index as many as 1,000,000 scientific articles per year. Beyond full-text searching, the infrastructure would also ensure the indexing of citations and other metadata, as well as the automated translation of foreign-language articles[footnote “Wouters 1999, p. 64”].
Although it anticipated key features of online scientific platforms, the SCITEL plan was technically unrealistic at the time. The first working prototype of an online retrieval system, developed in 1963 by Doug Engelbart and Charles Bourne at the Stanford Research Institute, was heavily constrained by memory issues: no more than 10,000 words of a few documents could be indexed[footnote “Bourne & Hahn 2003, p. 16”].

The indexation process of citations in MEDLARS, an early scientific infrastructure for publications in medicine
Instead of a general-purpose publishing platform, the early scientific computing infrastructures focused on specific research areas, such as MEDLINE for medicine, NASA/RECON for space engineering or OCLC Worldcat for library search: “most of the earliest online retrieval system provided access to a bibliographic database and the rest used a file containing another sort of information—encyclopedia articles, inventory data, or chemical compounds.”[footnote “Bourne & Hahn 2003, p. 12”] This early development of scientific computing affected a large variety of disciplines and communities, including the social sciences: “The 1960s and 1970s saw the establishment of over a dozen services and professional associations to coordinate quantitative data collection”.[footnote “Shankar et al. 2016, p. 63”] Yet these infrastructures were mostly invisible to researchers, as most of the searching was done by professional librarians. Not only were the search operating systems complicated to use, but searches had to be performed very efficiently given the prohibitive cost of long-distance telecommunication[footnote “Regazzi 2015, p. 128”]. To become technically feasible, scientific infrastructures could not be open and became fundamentally hidden from their end users:
The designers of the first online systems had presumed that searching would be done by end users; that assumption undergirded system design. MEDLINE was intended to be used by medical researchers and clinicians, NASA/RECON was designed for aerospace engineers and scientists. For many reasons, however, most users through the seventies were librarians and trained intermediaries working on behalf of end users. In fact, some professional searchers worried that even allowing eager end users to get at the terminals was a bad idea.[footnote “Bourne & Hahn 2003, p. 397”]
The development of digital infrastructure for scientific publication was largely undertaken by private companies. In 1963, Eugene Garfield created the Institute for Scientific Information, which aimed to transform the projects initially envisioned with Lederberg into a profitable business. The Science Citation Index relied on the computational processing of citation data. It had a massive and lasting influence on the structuring of global scientific publication in the last decades of the 20th century, as its most important metric, the Journal Impact Factor, “ultimately came to provide the metric tool needed to structure a competitive market among journals”[footnote “Future of scholarly publishing 2019, p. 15”]. Garfield also successfully launched Current Contents, a periodic compilation of scientific abstracts that acted as a simplified commercial version of the central deposit envisioned within SCITEL. Rather than being replaced by a centralized information system, leading scientific publishers were able to develop their own information infrastructure, which ultimately reinforced their business position. By the end of the 1960s, the Dutch publisher Elsevier and the German publisher Springer had started to computerize their internal data, as well as the management of journal reviews[footnote “Andriesse 2008, p. 189”].
Until the advent of the web, the landscape of scientific infrastructures remained fragmented.[footnote “Campbell-Kelly & Garcia-Swartz 2013”] Projects and communities relied on their own unconnected networks at a national or institutional level: “the Internet was nearly invisible in Europe because people there were pursuing a separate set of network protocols”.[footnote “Berners-Lee & Fischetti 2008, p. 17”] The birthplace of the World Wide Web, CERN, had its own version of the Internet, CERN-Net, and also supported its own protocol for e-mail exchange.[footnote “Berners-Lee & Fischetti 2008, p. 18”] The European Space Agency used its own iteration of the RECON system also used by NASA engineers (ESRO/RECON).[footnote “Bourne & Hahn 2003, p. 304”] These insulated scientific infrastructures could hardly be connected before the advent of the web. Communication between scientific infrastructures was challenging not only across space, but also across time. Whenever a communication protocol was no longer maintained, the data and knowledge it disseminated were likely to disappear as well: “the relationship between historical research and computing has been durably affected by aborted projects, data loss and unrecoverable formats”.[footnote “Dacos 2013”]
The Web Revolution (1990–1995)
The World Wide Web was originally framed as an open scientific infrastructure. The project was inspired by ENQUIRE, an information management software commissioned from Tim Berners-Lee by CERN for the specific needs of high energy physics. The structure of ENQUIRE was closer to an internal web of data: it connected “nodes” that “could refer to a person, a software module, etc. and that could be interlined with various relations such as made, include, describes and so forth”[footnote “Hogan 2014, p. 20”]. While it “facilitated some random linkage between information”, ENQUIRE was not able to “facilitate the collaboration that was desired for in the international high-energy physics research community”[footnote “Bygrave & Bing 2009, p. 30”]. Like any significant scientific computing infrastructure before the 1990s, the development of ENQUIRE was ultimately impeded by the lack of interoperability and the complexity of managing network communications: “although Enquire provided a way to link documents and databases, and hypertext provided a common format in which to display them, there was still the problem of getting different computers with different operating systems to communicate with each other”.[footnote “Berners-Lee & Fischetti 2008, p. 17”]
Sharing of data and data documentation was a major focus in the initial communication of the World Wide Web when the project was first unveiled in August 1991: “The WWW project was started to allow high energy physicists to share data, news, and documentation. We are very interested in spreading the web to other areas, and having gateway servers for other data”[footnote “Tim Berners-Lee, “Qualifiers on Hypertext Links”, mail sent on August 6, 1991 to alt.hypertext”].
The web rapidly superseded pre-existing online infrastructures, even when they included more advanced computing features. From 1991 to 1994, users of the Worm Community System, a major biology database on worms, switched to the Web and Gopher. While the Web did not include many advanced functions for data retrieval and collaboration, it was easily accessible. Conversely, the Worm Community System could only be browsed on specific terminals shared across scientific institutions: “To take on board the custom-designed, powerful WCS (with its convenient interface) is to suffer inconvenience at the intersection of work habits, computer use, and lab resources (…) The World-Wide Web, on the other hand, can be accessed from a broad variety of terminals and connections, and Internet computer support is readily available at most academic institutions and through relatively inexpensive commercial services.”[footnote “Star & Ruhleder 1996, p. 131”]
The Web and similar protocols developed at the time had a similar impact on scientific publications. Early forms of open access publishing were not developed by large-scale institutional infrastructures but through small initiatives. Universal access, regardless of the operating system, made it possible to maintain and share community-driven electronic journals years before commercial online scientific publishing became viable:
In the late ‘80s and early ‘90s, a host of new journal titles launched on listservs and (later) the Web. Journals such as Postmodern Cultures, Surfaces, the Bryn Mawr Classical Review and the Public-Access Computer Systems Review were all managed by scholars and library workers rather than publishing professionals.[footnote “Moore 2020, p. 7”]
The first open-access repositories were individual or community initiatives as well. In August 1991, Paul Ginsparg created the first iteration of the arXiv project at the Los Alamos National Laboratory in response to recurring storage issues with academic mailboxes caused by the increasing sharing of scientific articles.[footnote “Feder, Toni (8 November 2021). Joanne Cohn and the email list that led to arXiv. Physics Today. doi:10.1063/PT.6.4.20211108a.”]
Building scientific infrastructures for the web (1995–2015)
The development of the World-Wide Web rendered numerous pre-existing scientific infrastructures obsolete. It also lifted numerous restrictions and obstacles to online contribution and network management, which made it possible to attempt more ambitious projects. By the end of the 1990s, the creation of public scientific computing infrastructure had become a major policy issue[footnote “Borgman 2007, p. 21.”]. The first wave of web-based scientific projects in the 1990s and the early 2000s revealed critical issues of sustainability. As funding was allocated for a specific time period, critical databases, online tools or publishing platforms could hardly be maintained[footnote “Dacos 2013.”] and project managers were faced with a valley of death “between grant funding and ongoing operational funding”.[footnote “Skinner 2019, p. 6.”]
Several competing terms appeared to fill this need. In the United States, the term cyberinfrastructure was used in a scientific context by a US National Science Foundation (NSF) blue-ribbon committee in 2003: “The newer term cyberinfrastructure refers to infrastructure based upon distributed computer, information and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.”[footnote “Atkins 2003, p. 5”] E-infrastructure or e-science were used with a similar meaning in the United Kingdom and European countries.
Thanks to “sizable investments”[footnote “Eccles et al. 2009”], major national and international infrastructures were launched between the initial policy discussions of the early 2000s and the economic crisis of 2007-2008, such as the Open Science Grid, BioGRID, the JISC, DARIAH or Project Bamboo.[footnote “Dacos 2013”][footnote “Role of e-Infrastructure 2010”] Specialized free software for scientific publishing like Open Journal Systems became available after 2000. This development entailed a significant expansion of non-commercial open access journals by facilitating the creation and the administration of journal websites and the digital conversion of existing journals.[footnote “OA Diamond Study 2021, p. 93”] Among the non-commercial journals registered in the Directory of Open Access Journals, the number of journals created annually went from 100 by the end of the 1990s to 800 around 2010, and has not evolved significantly since then.[footnote “OA Diamond Study 2021, p. 30”]
By 2010, infrastructures were “no longer in infancy” and yet “they are also not yet fully mature”.[footnote “Eccles et al. 2009”] While the development of the web solved a large range of technical issues regarding network management, building scientific infrastructure remained challenging. Governance, communication across all involved stakeholders, and strategic divergences were major factors of success or failure. One of the first major infrastructures for the humanities and the social sciences, Project Bamboo, was ultimately unable to achieve its ambitious aims: “From the early planning workshops to the Mellon Foundation’s rejection of the project’s final proposal attempt, Bamboo was dogged by its reluctance and/or inability to concretely define itself”.[footnote “Dombrowski 2014, p. 334”] This lack of clarity was further aggravated by recurring communication missteps between the project initiators and the community it aimed to serve: “The community had spoken and made it clear that continuing to emphasize Service-oriented architecture would alienate the very members of the community Bamboo was intended to benefit most: the scholars themselves”.[footnote “Dombrowski 2014, p. 329”] Budget cuts following the economic crisis of 2007-2008 underlined the fragility of ambitious infrastructure plans relying on significant recurring funds.[footnote “Dombrowski 2014, p. 331”]

Leading commercial ecosystems for scientific research
Leading commercial publishers were initially outpaced by the unexpected rise of the Web for academic publication: the executive board of Elsevier “had failed to grasp the significance of electronic publishing altogether, and therefore the deadly danger that it posed—the danger, namely, that scientists would be able to manage without the journal”.[footnote “Andriesse 2008, p. 257-258”] The persistence of high revenues from subscriptions and the consolidation of the sector made it possible to fund the conversion of the pre-existing online services to the web as well as the digitization of past collections. By the 2010s, leading publishers had been “moving from a content-provision to a data analytics business”[footnote “Aspesi et al. 2019, p. 5”] and had developed or acquired new key infrastructures for the management of scientific and pedagogic activities: “Elsevier has acquired and launched products that extend its influence and its ownership of the infrastructure to all stages of the academic knowledge production process”.[footnote “Posada & Chen 2018, p. 6”] As this vertical integration has expanded beyond publishing, privately owned infrastructures have become extensively integrated into daily research activities.
The privatised control of scholarly infrastructures is especially noticeable in the context of ‘vertical integration’ that publishers such as Elsevier and SpringerNature are seeking by controlling all aspects of the research lifecycle, from submission to publication and beyond. For example, this vertical integration is represented in a number of Elsevier’s business acquisitions, such as Mendeley (a reference manager), SSRN (a pre-print repository) and Bepress (a provider of repository and publishing software for universities).[footnote “Moore 2019, p. 156”]
Toward open science infrastructures (2015–…)
The consolidation and expansion of commercial scientific infrastructure has prompted renewed calls to secure “community-controlled infrastructure”[footnote “Joseph 2018, p. 1”]. The acquisition of the open repositories Digital Commons and SSRN by Elsevier has highlighted the lack of reliability of critical scientific infrastructure for open science.[footnote “Boston 2021”][footnote “Joseph 2018”][footnote “Brembs et al. 2021”] The SPARC report on European Infrastructures underlines that “a number of important infrastructures at risk and as a consequence, the products and services that comprise open infrastructure are increasingly being tempted by buyout offers from large commercial enterprises. This threat affects both not-for-profit open infrastructure as well as closed, and is evidenced by the buyout in recent years of commonly relied on tools and platforms such as SSRN, bepress, Mendeley, and Github.”[footnote “Ficarra et al. 2020, p. 7”]
In contrast with the consolidation of privately owned infrastructure, the open science movement “has tended to overlook the importance of social structures and systemic constraints in the design of new forms of knowledge infrastructures.”[footnote “Okune et al. 2018, p. 13”] It remained mostly focused on the content of scientific research, with little integration of technical tools and few large community initiatives: “common pool of resources is not governed or managed by the current scholarly commons initiative. There is no dedicated hard infrastructure and though there may be a nascent community, there is no formal membership.”[footnote “Bosman et al. 2018, p. 19”]
More precise concepts were needed to embed ethical principles of openness, community service and autonomous governance in the building of infrastructure and ensure the transformation of small localized scholarly networks into large, “community-wide” structures.[footnote “Neylon 2018, p. 7”] In 2013, Cameron Neylon underlined that the lack of common infrastructure was one of the main weaknesses of the open science ecosystem: “in a world where it can be cheaper to re-do an analysis than to store the data, we need to consider seriously the social, physical, and material infrastructure that might support the sharing of the material outputs of research”.[footnote “Neylon 2013”] Two years later, Neylon, Geoffrey Bilder and Jennifer Lin defined a series of Principles for Open Scholarly Infrastructure[footnote “Neylon et al. 2015”] that reacted primarily to the discrepancy between the increasing openness of scientific publications or datasets and the closed nature of the infrastructures that control their circulation.
Over the past decade, we have made real progress to further ensure the availability of data that supports research claims. This work is far from complete. We believe that data about the research process itself deserves exactly the same level of respect and care. The scholarly community does not own or control most of this information. For example, we could have built or taken on the infrastructure to collect bibliographic data and citations but that task was left to private enterprise.[footnote “Neylon et al. 2015”]
Since 2015, these principles have become the most influential definition of open science infrastructures. They have been endorsed by leading infrastructures such as Crossref[footnote “Crossref’s Board votes to adopt the Principles of Open Scholarly Infrastructure”], OpenCitations[footnote “OpenCitations’ compliance with the Principles of Open Scholarly Infrastructure”] or Data Dryad[footnote “Dryad’s Commitment to the Principles of Open Scholarly Infrastructure”] and have become a common basis for the institutional evaluation of existing open infrastructures[footnote “Ficarra et al. 2020, p. 21”]. The main focus of the Principles is to build “trustworthy institutions” with significant commitments in terms of governance, financial sustainability and technical efficiency so that they can be durably relied on by scientific communities.[footnote “Neylon 2018, p. 7”]
By 2021, public services and infrastructures for research had largely endorsed open science as an integral part of their activity and identity: “open science is the dominant discourse to which new online services for research refer.”[footnote “Fecher et al. 2021, p. 505”] According to the 2021 Roadmap of the European Strategy Forum on Research Infrastructures (ESFRI), major legacy infrastructures in Europe have embraced open science principles: “Most of the Research Infrastructures on the ESFRI Roadmap are at the forefront of Open Science movement and make important contributions to the digital transformation by transforming the whole research process according to the Open Science paradigm.”[footnote “ESFRI Roadmap 2021, p. 159”] Examples of extensive data sharing programs include the European Social Survey (in social science), ECRIN ERIC (for clinical data) and the Cherenkov Telescope Array (in astronomy).[footnote “ESFRI Roadmap 2021, p. 159”]
In agreement with the original intent of the Principles, open science infrastructures are “seen as an antidote to the increased market concentration observed in the scholarly communication space.”[footnote “Kraker 2021, p. 2”] In November 2021, the UNESCO Recommendation on Open Science acknowledged open science infrastructure as one of the four pillars of open science, along with open scientific knowledge, open engagement of societal actors and open dialogue with other knowledge systems, and called for sustained investment and funding: “open science infrastructures are often the result of community-building efforts, which are crucial for their longterm sustainability and therefore should be not-for-profit and guarantee permanent and unrestricted access to all public to the largest extent possible.”[footnote “UNESCO Recommendation on Open Science, 2021, CL/4363”]
The development of open scientific infrastructure has become a debated topic in discussions on the future of online scientific research. In January 2021, a collective of researchers called for a Plan I, or Plan Infrastructure, in reaction to perceived shortcomings of Plan S, the international open science initiative of cOAlition S.[footnote “Brembs et al. 2021”] In contrast with the focus of Plan S on scientific publication, Plan I aims to integrate all research outputs into large interoperable infrastructures: “research and scholarship are crucially dependent on an information infrastructure that treats all scholarly output, text, data and code, equally and that is based on open standards and open markets.”[footnote “Brembs et al. 2021, p. 4”]
Organization of open infrastructures
Most of the landscape reports on open infrastructure have been undertaken in Europe and, to a lesser extent, in Latin America. For Europe, the main sources include the SPARC report from 2020[footnote “Ficarra et al. 2020”], the OPERAS report on social science and humanities infrastructure[footnote “Future of Scholarly Communication 2021”] and the 2019 report by Katherine Skinner (which also extends to a few North American infrastructures). International studies include the European Commission's 2010 report on The Role of e-Infrastructure, which mostly received input from Europe, South America and North America[footnote “Role of e-Infrastructure 2010”].
These reports underline that important open science infrastructures may already exist and yet remain invisible to funders and scientific policies: “alternative practices and projects exist inside and outside Europe, but these projects are almost invisible to the eyes of the public authorities”.[footnote “Mounier 2018, p. 305”]
Types and roles
Open Access repositories are the most frequent form of open science infrastructure[footnote “Operas Landscape Study 2017, p. 15”], with 5,791 repositories in existence in December 2021 according to OpenDOAR.[footnote “OpenDOAR Statistics”]
Yet the roles and activities of open science infrastructures have diversified significantly, at least among the largest infrastructures. In the survey of European infrastructures conducted by SPARC Europe, 95% of the respondents reported that they provide services in at least three of the six stages of research production (Creation, Evaluation, Publishing, Hosting, Discovering and Archiving)[footnote “Ficarra et al. 2020, p. 13”]. Aggregation, hosting and indexing are especially central activities, common to most open science infrastructures regardless of their focus.
Specialization does happen at a higher level. A network analysis identifies “two main clusters of activities”:
- Publishing-focused infrastructures, which are associated with “publishing and hosting traditional text formats”[footnote “Ficarra et al. 2020, p. 13”]. Among them, “paper submission (41 out of 70) and review (30) were the most commonly reported activities”[footnote “Ficarra et al. 2020, p. 15”].
- Creation-focused infrastructures, which deal primarily with “processing and storing research outputs, particularly data”. These actors provide specific services in the field of “data gathering (47 out of 71), and data analysis (40)”[footnote “Ficarra et al. 2020, p. 15”]. Besides, “computation and machine learning (18) and Experimentation (15) were roughly half as common”[footnote “Ficarra et al. 2020, p. 15”].
Standards and technologies
Standardization is a major function of open science infrastructures, as they aim to ensure that the content they share and support is distributed consistently and can be easily reused.
Maintaining open standards is one of the main challenges identified by leading European open infrastructures, as it implies choosing among competing standards in some cases, as well as ensuring that the standards are correctly updated and accessible through APIs or other endpoints.[footnote “Ficarra et al. 2020, p. 23”] Two thirds of the respondents have undertaken an evaluation of their technological environment during the past year, to ensure that key components have not become obsolete.[footnote “Ficarra et al. 2020, p. 29”] As a consequence of these sustained efforts, most open infrastructures comply with the newly established standards of open science, such as FAIR data or Plan S.[footnote “Ficarra et al. 2020, p. 29”]
Open science infrastructures preferably integrate standards from other open science infrastructures. Among European infrastructures: “The most commonly cited systems – and thus essential infrastructure for many – are ORCID, Crossref, DOAJ, BASE, OpenAIRE, Altmetric, and Datacite, most of which are not-for-profit”.[footnote “Ficarra et al. 2020, p. 50”] Google Scholar is the most frequently mentioned commercial service, while Scopus, the leading proprietary academic search engine developed by Elsevier, is one of the least cited leading services.[footnote “Ficarra et al. 2020, p. 31”] Open science infrastructures are thus part of an emerging “truly interoperable Open Science commons” that holds the promise of “researcher-centric, low-cost, innovative, and interoperable tools for research, superior to the present, largely closed system.”[footnote “Ross-Hellauer et al. 2020, p. 13”]
Infrastructures are frequently dependent on choices made by external stakeholders, especially scientific publishers: they “do not themselves decide on the openness of content since they are dependent on the policies of content providers”.[footnote “Ficarra et al. 2020, p. 27”] This affects not only the content but also the “user data policies [that] are set by publishers which limits what can be made available”.[footnote “Ficarra et al. 2020, p. 24”]
Open science infrastructures have strong ties with the open source movement. 82% of the European infrastructures surveyed by SPARC report being at least partially built on open source software and 53% have their entire technological infrastructure in open source.[footnote “Ficarra et al. 2020, p. 29”]
Governance
Governance has been self-identified as a potential weakness by the European infrastructures surveyed by SPARC[footnote “Ficarra et al. 2020, p. 22”]. Fewer than half of the respondents consider that they are at a “mature” stage in this regard, and “good governance” is cited as the main challenge[footnote “Ficarra et al. 2020, p. 23”]. Interaction between the communities they aim to support and the other stakeholders and funders is especially complicated: “One specific challenge identified was the tension between serving the needs of the community of users versus prioritising the needs of clients that provide financial support to the OSI”[footnote “Ficarra et al. 2020, p. 23”].
The tension between centralization and diversity largely characterizes open science infrastructure. While historically defined as a “centralized [Open Access] project”, Redalyc aims to become a “community-based sustainable infrastructure in Latin America” (Berrecil). The leading European open infrastructures have reported “challenges around ensuring sufficient (and sufficiently diverse) representation” as well as difficulties in securing the involvement of some professional communities such as researchers and librarians[footnote “Ficarra et al. 2020, p. 23”].
Audience
Open science infrastructures “target and serve a wide range of stakeholders”[footnote “Ficarra et al. 2020, p. 18”]. Researchers remain the primary target, but libraries, teachers and learners are among the expected audience of more than half of the infrastructures surveyed by SPARC Europe.
A majority of European infrastructures “operate at a global scale”, with English being the primary language of 82% of the respondents[footnote “Ficarra et al. 2020, p. 20”]. These infrastructures are also frequently multilingual and integrate a specific national focus: they “provide access to a range of language content of local and international significance”[footnote “Ficarra et al. 2020, p. 20”].

Distribution of disciplines among the infrastructures surveyed by the SPARC report Scoping the Open Science Infrastructure Landscape in Europe
Open science infrastructures benefit diverse disciplines and scientific communities. In 2020, 72% of the European infrastructures surveyed by SPARC Europe claimed to support all disciplines. The social sciences and the humanities are the most frequently mentioned disciplines, which is partly attributed to the fact that the survey was “distributed widely by the OPERAS network”[footnote “Ficarra et al. 2020, p. 19”]. In 2010, infrastructures supporting the social sciences and the humanities were much less prevalent and most of the use cases came from “biosciences, High Energy Physics and other fields of physics, earth and environmental sciences, computer science, astronomy and astrophysics”.[footnote “Role of e-Infrastructure 2010, p. 106”]
Economics
Many open science infrastructures run “at a relatively low cost”, as small infrastructures are an important part of the open science ecosystem.[footnote “Ficarra et al. 2020, p. 35”] In 2020, 21 out of 53 surveyed European infrastructures “report spending less than €50,000”.[footnote “Ficarra et al. 2020, p. 35”] Consequently, more than 75% of surveyed European infrastructures are run by small teams of 5 FTEs or fewer.[footnote “Ficarra et al. 2020, p. 41”] The size of an infrastructure and the extent of its funding are not always proportional to how critical its service is: “some of the most heavily used services make ends meet with a tiny core team of two to five people.”[footnote “Kraker 2021, p. 3”] Volunteer contributions are significant as well, which is both “a strength and weakness to an OSI’s sustainability”.[footnote “Ficarra et al. 2020, p. 35”] The landscape of open science infrastructures is therefore rather close to the ideal of a “decentralised network of small projects” envisioned by theorists of the scholarly commons.[footnote “Moore 2019, p. 176”] A very large majority of open science infrastructures are non-commercial[footnote “Ficarra et al. 2020, p. 48”] and collaborations with or financial support from the private sector remain very limited.[footnote “Ficarra et al. 2020, p. 45”]
Overall, European infrastructures were financially sustainable in 2020[footnote “Ficarra et al. 2020, p. 51”], which contrasts with the situation ten years prior. In 2010, European infrastructures had much less certainty about their future: they usually lacked “a long-term perspective” and struggled “with securing the funding for more than 5 years”.[footnote “Role of e-Infrastructure 2010, p. 103”] In 2020, European infrastructures frequently relied on grants from national funds and from the European Commission.[footnote “Ficarra et al. 2020, p. 45”] Without these grants, most of these actors “could only remain viable for less than a year”.[footnote “Ficarra et al. 2020, p. 48”] Yet one quarter of surveyed European infrastructures were not supported by any grants or subsidies and relied either on alternative sources of income or on voluntary contributions.[footnote “Ficarra et al. 2020, p. 35”] As they can be “difficult to define adequately”, open science infrastructures can be overlooked by funding bodies, which “contributes to the challenge of securing funding”.[footnote “Neylon 2018, p. 1”]