Project Alexandria: Democratizing Scientific Knowledge with Structured Fact Extraction

Introduction

Scientific publishing has grown significantly in recent decades. However, access to vital research remains limited for many, especially researchers in developing countries, independent researchers, and small academic institutions. Rising journal subscription costs worsen the problem, restricting access to knowledge even at well-funded universities. Despite the push for Open Access (OA), barriers persist, as seen in the loss of access in Germany and the U.S. following disputes with publishers. These limits hamper scientific progress and have prompted researchers to seek alternative ways of making knowledge more accessible while adhering to copyright law.

Current Access Methods

Researchers primarily access scientific content through subscriptions, institutional access, or legally ambiguous repositories. These methods are often financially unsustainable or legally questionable. While OA publishing helps, it does not fully resolve the accessibility crisis. Large Language Models (LLMs) offer a new way to extract and summarize knowledge from scholarly texts, but they raise copyright concerns about how to separate factual content from creative expression.

Project Alexandria

The research team proposes Project Alexandria, which introduces Knowledge Units (KUs) to extract factual information while omitting stylistic elements. KUs encode key scientific insights (definitions, relationships, and methodologies) into a structured database, so that only non-copyrightable factual content is preserved. This framework aligns with the legal principle that facts cannot be copyrighted, only their specific phrasing and presentation.

Knowledge Unit Structure

KUs are generated through an LLM pipeline that processes scholarly texts in paragraph-sized segments, extracting core concepts and their relationships. Each KU includes:

  • Entities: Core scientific concepts identified in the text.
  • Relationships: Connections between entities, including causal or definitional links.
  • Attributes: Specific details related to entities.
  • Context Summary: A brief overview ensuring coherence across multiple KUs.
  • Sentence MinHash: A fingerprint to track the source text without storing the original phrasing.
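The five fields above can be pictured as a small record type. The following is a minimal sketch in Python, not the team's implementation: the class name `KnowledgeUnit`, the field types, the helper functions, the use of word 3-grams, and the choice of 32 hash functions are all assumptions made for illustration, and the aspirin sentence is a made-up example rather than text from the paper. The MinHash shown is the textbook variant (keep the minimum of each seeded hash over a sentence's word shingles), built on `hashlib.blake2b` from the standard library.

```python
import hashlib
import re
from dataclasses import dataclass, field


def word_shingles(sentence: str, k: int = 3) -> set[str]:
    """Return the set of lowercase word k-grams in a sentence."""
    words = re.findall(r"\w+", sentence.lower())
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def sentence_minhash(sentence: str, num_perm: int = 32) -> tuple[int, ...]:
    """Textbook MinHash: one minimum per seeded hash function over all shingles."""
    shingles = word_shingles(sentence)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        hashes = (
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        )
        signature.append(min(hashes, default=0))
    return tuple(signature)


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Share of matching signature positions, which estimates shingle-set Jaccard similarity."""
    if len(a) != len(b) or not a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


@dataclass
class KnowledgeUnit:
    """Hypothetical container mirroring the five KU fields listed above."""
    entities: list[str]
    relationships: list[tuple[str, str, str]]  # (subject, relation, object)
    attributes: dict[str, dict[str, str]]      # entity -> {attribute name: value}
    context_summary: str
    sentence_minhashes: list[tuple[int, ...]] = field(default_factory=list)


# Illustrative source sentence (not taken from the paper).
source = "Aspirin irreversibly inhibits the enzyme cyclooxygenase."
ku = KnowledgeUnit(
    entities=["aspirin", "cyclooxygenase"],
    relationships=[("aspirin", "inhibits", "cyclooxygenase")],
    attributes={"aspirin": {"inhibition_type": "irreversible"}},
    context_summary="Mechanism of action of aspirin.",
    sentence_minhashes=[sentence_minhash(source)],
)
```

Only the signature is stored, so the record can later be matched against a candidate source sentence with `estimated_jaccard` without keeping the original wording.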

Legal Compliance

This approach balances knowledge retention with legal defensibility. According to the team, the framework complies with both German and U.S. copyright law, each of which permits data mining under specific exemptions. The team argues that KUs meet these legal conditions because they exclude expressive elements while preserving factual content.

Effectiveness of KUs

The team conducted multiple-choice question tests using abstracts and full-text articles from various fields. Results indicate that LLMs using KUs achieve nearly the same accuracy as those using original texts, suggesting that most relevant information is retained despite the removal of expressive elements. Plagiarism detection tools confirm minimal overlap between KUs and original texts, reinforcing the method’s legal viability.
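The two measurements described above, answer accuracy and textual overlap, can be sketched as follows. This is an illustrative stand-in, not the team's evaluation code: the article does not name the plagiarism detection tools used, so a plain verbatim word n-gram overlap is substituted here, and the function names, the default `n = 5`, and the toy answer lists are all assumptions rather than reported results.

```python
import re


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def ngram_overlap(candidate: str, source: str, n: int = 5) -> float:
    """Fraction of the candidate's word n-grams that appear verbatim in the source."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    candidate_ngrams = ngrams(candidate)
    if not candidate_ngrams:
        return 0.0
    return len(candidate_ngrams & ngrams(source)) / len(candidate_ngrams)


# Toy data for illustration only; these are not results from the paper.
answer_key = ["B", "C", "A", "D"]
with_original_text = ["B", "C", "A", "D"]
with_knowledge_units = ["B", "C", "A", "A"]

print(accuracy(with_original_text, answer_key))
print(accuracy(with_knowledge_units, answer_key))
```

Comparing the two accuracy values shows how much answerable information survives extraction, while a low `ngram_overlap` between a serialized KU and its source indicates that little of the original wording was carried over.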

Limitations of Existing Methods

The research also addresses the limitations of current alternatives. Text embeddings fail to capture precise factual details, while direct paraphrasing methods risk copyright violations. In contrast, KUs offer a structured and legally sound approach.

Addressing Criticisms

The team acknowledges concerns about citation dilution and loss of nuance in scientific research. Traceable attribution systems can mitigate citation dilution, and most complex elements, such as mathematical proofs, are not copyrightable. The team also recommends hybrid human-AI validation systems to improve reliability.

Broader Impact

Freely accessible scientific knowledge benefits multiple sectors. Researchers can collaborate more effectively, healthcare professionals can access critical medical research, and educators can develop high-quality curricula without cost barriers. Open scientific knowledge also promotes public trust and transparency, which helps reduce misinformation.

Future Directions

The team identifies several research directions, including refining factual accuracy, developing educational applications for KUs, and establishing interoperability standards for knowledge graphs. They propose integrating KUs into a broader semantic web for scientific discovery, leveraging AI to automate and validate extracted knowledge at scale.

Conclusion

Project Alexandria offers a promising framework for making scientific knowledge more accessible while respecting copyright constraints. By extracting factual content from scholarly texts into Knowledge Units, this approach provides a legally viable and effective solution to the accessibility crisis in scientific publishing.

Get Involved

Check out the Paper and Project. All credit for this research goes to the researchers of this project.
