The article discusses the implementation of a cross-platform text summarization tool in Rust using TF-IDF and parallel computing with Rayon. It covers the Rust implementation, its use from C/C++, Android, and Python, and planned improvements and benchmarking. For full details, refer to the original article in the “Towards Data Science” publication on Medium.
Cross-Platform NLP in Rust
Optimization with Rayon with usage in C/C++, Android and Python
NLP tools and utilities have grown mostly within the Python ecosystem, enabling developers at all levels to build high-quality language apps at scale. Rust is a newer entrant to NLP, with organizations like Hugging Face adopting it to build packages for machine learning.
Hugging Face has written a new ML framework in Rust, now open-sourced!
In this blog, we’ll explore how to build a text summarizer using TF-IDF (term frequency-inverse document frequency). We’ll first build an intuition for how TF-IDF summarization works and why Rust could be a good language for implementing NLP pipelines, then see how the Rust code can be used on other platforms like C/C++, Android and Python. We also discuss how to speed up the summarization task with parallel computing using Rayon.
Here’s the GitHub project:
Motivation
I had built a text summarizer using the same technique back in 2019, with Kotlin, and called it Text2Summary. It was a side project designed primarily for Android apps and used Kotlin for all computations. Fast-forward to 2023: I now work with C, C++ and Rust codebases and have used modules built in these native languages in Android and Python.
I chose to re-implement Text2Summary in Rust, as it would be a great learning experience and would also yield a small, efficient and handy text summarizer that can handle large texts easily. Rust is a compiled language whose borrow checker enforces ownership and reference rules at compile time, which helps developers avoid many memory-related bugs. Code written in Rust can be integrated with Java codebases through JNI and exposed through C headers and libraries for use in C/C++ and Python.
Extractive and Abstractive Text Summarization
Text summarization is a long-studied problem in natural language processing (NLP). The core problem a text summarizer must solve is extracting the important information from a text and generating a summary of it. Solutions fall into two categories: extractive summarization and abstractive summarization.
Understanding Automatic Text Summarization-1: Extractive Methods
In extractive text summarization, phrases or sentences are taken directly from the text. We can rank sentences using a scoring function and pick the highest-scoring ones. Instead of generating new text, as in abstractive summarization, the summary is a collection of sentences selected from the text, which avoids problems that generative models exhibit.
The original wording is preserved in extractive summarization, but there is a high chance that some information is lost, because text can only be selected at the granularity of whole sentences. If a piece of information is spread across multiple sentences, the scoring function must account for the relation between those sentences.
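To make the ranking idea concrete, here is a minimal sketch of extractive summarization in Rust. It is not the Text2Summary code: the function names (`split_sentences`, `tokenize`, `score_sentences`, `summarize`) are hypothetical, the sentence splitter is deliberately naive, and it assumes each sentence is treated as one "document" when computing TF-IDF, so a sentence's score is the sum of the TF-IDF weights of its distinct words.

```rust
use std::collections::{HashMap, HashSet};

/// Split text into sentences on '.', '!' and '?' (a naive rule, assumed
/// here only for illustration).
fn split_sentences(text: &str) -> Vec<String> {
    text.split(|c: char| c == '.' || c == '!' || c == '?')
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}

/// Lowercase a sentence and split it into alphanumeric word tokens.
fn tokenize(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Score each sentence as the sum of TF-IDF weights of its distinct words,
/// treating every sentence as one "document" (an assumption of this sketch).
fn score_sentences(sentences: &[String]) -> Vec<f64> {
    let tokens: Vec<Vec<String>> = sentences.iter().map(|s| tokenize(s)).collect();
    let n = tokens.len() as f64;

    // Document frequency: the number of sentences each word occurs in.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for words in &tokens {
        let unique: HashSet<&str> = words.iter().map(|w| w.as_str()).collect();
        for w in unique {
            *df.entry(w).or_insert(0.0) += 1.0;
        }
    }

    tokens
        .iter()
        .map(|words| {
            // Raw counts of each word within this sentence.
            let mut counts: HashMap<&str, f64> = HashMap::new();
            for w in words {
                *counts.entry(w.as_str()).or_insert(0.0) += 1.0;
            }
            // tf = count / sentence length, idf = ln(N / df).
            counts
                .iter()
                .map(|(w, c)| (c / words.len() as f64) * (n / df[w]).ln())
                .sum::<f64>()
        })
        .collect()
}

/// Return the `k` highest-scoring sentences, kept in their original order.
fn summarize(text: &str, k: usize) -> Vec<String> {
    let sentences = split_sentences(text);
    let scores = score_sentences(&sentences);

    // Rank sentence indices by descending score.
    let mut ranked: Vec<usize> = (0..sentences.len()).collect();
    ranked.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));

    // Keep the top k, then restore document order.
    let mut chosen: Vec<usize> = ranked.into_iter().take(k).collect();
    chosen.sort_unstable();
    chosen.into_iter().map(|i| sentences[i].clone()).collect()
}
```

Calling `summarize(text, 2)` returns the two highest-scoring sentences in the order they appear in `text`. Because every sentence is scored independently, this sketch shows the limitation described above: information spread across several sentences is not captured by the score.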