
Speed up jekyll related posts functionality (`--lsi`, `classifier-reborn`, `gsl`, `nmatrix`, `narray`, `Numo::NArray`, `Numo::GSL`) #83

Open
0xdevalias opened this issue Jul 21, 2020 · 6 comments



0xdevalias commented Jul 21, 2020

(See also: #1)

Jekyll can "create an index for related posts" using the `--lsi` build option, which uses the `classifier-reborn` gem to populate the `site.related_posts` site variable:

More info on Jekyll's usage of LSI:

The `gsl` gem can make use of `nmatrix` and `narray`:

`narray` is in maintenance mode, and directs users to `numo-narray`:

`numo-gsl` provides a GSL interface for Ruby with `Numo::NArray`:

I'm unsure whether the numo gems can be used with `classifier-reborn`, or which of `nmatrix`/`narray` provides better speed, but I created an issue asking:

As noted in jekyll/classifier-reborn#193, I'm not sure whether `classifier-reborn` is actively updated/maintained.
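
For intuition, the related-posts problem these libraries accelerate boils down to ranking other documents by similarity to the current one. Below is a stdlib-only Ruby sketch using raw term-frequency vectors and cosine similarity; it is deliberately simpler than `classifier-reborn`'s LSI (which adds an SVD step), and every name in it is invented for the example.

```ruby
# Stdlib-only illustration of "related posts" ranking (NOT how
# classifier-reborn works internally; LSI adds an SVD step on top).

# Count word occurrences in a text (requires Ruby 2.7+ for #tally).
def term_freqs(text)
  text.downcase.scan(/[a-z]+/).tally
end

# Cosine similarity between two term-frequency hashes.
def cosine(a, b)
  dot = (a.keys & b.keys).sum { |t| a[t] * b[t] }
  norm = ->(v) { Math.sqrt(v.values.sum { |n| n * n }) }
  denom = norm.(a) * norm.(b)
  denom.zero? ? 0.0 : dot / denom
end

# Rank candidate posts by similarity to the target text.
def related_posts(target, posts, limit: 3)
  tv = term_freqs(target)
  posts.map { |p| [p, cosine(tv, term_freqs(p))] }
       .sort_by { |_, score| -score }
       .first(limit)
end

posts = [
  "jekyll build performance and lsi indexing",
  "cooking pasta with fresh tomatoes",
  "speeding up jekyll with the gsl gem",
]
related = related_posts("jekyll lsi build speed", posts, limit: 2)
related.each { |title, score| puts format("%.3f  %s", score, title) }
```

The gems above exist because doing this (and the SVD) across every pair of posts in pure Ruby gets slow as a site grows; that is the work the native GSL bindings offload.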


`nmatrix` was last updated in 2018, and at least one issue claims that Numo::NArray outperforms NMatrix:

> Several years have passed since the new version of NArray came out.
>
> It appeared that NMatrix was not being maintained well. And I think Numo::NArray now outperforms NMatrix in almost every way. (benchmark needed)
>
> Newcomers try NMatrix first. After a while, they notice that NArray is far better in performance. And they begin to make libraries dependent on NArray.


`rb-gsl` was last updated in 2017, and claims compatibility only with GSL versions up to v2.1:

> Ruby/GSL is compatible with GSL versions upto 2.1.

I've asked whether it is still maintained, but my guess is probably not:


My comment in reply to the following StackOverflow question:

The `--lsi` option comes from the [`classifier-reborn`][1] gem, which includes the following note about increasing speed under the [dependencies][2] heading:

> To speed up LSI classification by at least 10x consider installing
> following libraries.
> 
> [GSL - GNU Scientific Library][3]
>
> [Ruby/GSL Gem][4]
> 
> Note that LSI will work without these libraries, but as soon as they
> are installed, classifier will make use of them. No configuration
> changes are needed, we like to keep things ridiculously easy for you.
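
The "no configuration changes are needed" behaviour described above is Ruby's common optional-dependency pattern: try to `require` the native bindings and fall back if they are missing. A minimal sketch of that pattern follows (this is not `classifier-reborn`'s actual detection code, and the variable names are invented):

```ruby
# Optional-dependency pattern: use the fast native backend when the
# gem is installed, otherwise fall back to pure Ruby. This mirrors the
# behaviour described in the quote above.
begin
  require 'gsl'        # succeeds only if the gsl gem (and native GSL) is installed
  lsi_backend = :gsl
rescue LoadError
  lsi_backend = :pure_ruby
end

puts "LSI backend: #{lsi_backend}"
```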

The [`gsl` gem's installation docs][5] mention:

> the GSL libraries must already be installed before Ruby/GSL can be installed:
>
> - Debian/Ubuntu: `libgsl0-dev`
> - Fedora/SuSE: `gsl-devel`
> - Gentoo: `sci-libs/gsl`
> - OS X: `brew install gsl`

The [`gsl` gem can also make use of `nmatrix` or `narray`][6], which I believe may further increase the speed/efficiency:

> In order to use rb-gsl with NMatrix you must first set the NMATRIX
> environment variable and then install rb-gsl:
> - `gem install nmatrix`
> - `export NMATRIX=1`
> - `gem install rb-gsl`
> 
> This will compile rb-gsl with NMatrix specific functions.
> 
> For using rb-gsl with NArray:
> - `gem install narray`
> - `export NARRAY=1`
> - `gem install rb-gsl`
> 
> Note that setting both `NMATRIX` and `NARRAY` variables will lead to
> undefined behaviour. Only one can be used at a time.

I'm not sure whether `nmatrix` or `narray` is the better/faster choice, though I did open [jekyll/classifier-reborn#192](https://github.com/jekyll/classifier-reborn/issues/192) on the `classifier-reborn` repo.

I did notice that the old [narray GitHub repo][7] mentions that the package is no longer maintained, and instead links to a new version: [Ruby/Numo::NArray][8]

> Numo::NArray is a Numerical N-dimensional Array class for fast processing and easy manipulation of multi-dimensional numerical data, similar to numpy.ndarray. This project is the successor to Ruby/NArray.

Numo::NArray also links to [`numo-gsl`][9], which appears to be the related GSL bindings:

> GSL interface for Ruby/Numo::NArray

At this stage I'm not sure whether `classifier-reborn` can make use of any of these numo dependencies, but if it can, my guess is that they will be faster and more actively maintained.

  [1]: https://jekyll.github.io/classifier-reborn/
  [2]: https://jekyll.github.io/classifier-reborn/#dependencies
  [3]: http://www.gnu.org/software/gsl
  [4]: https://rubygems.org/gems/gsl
  [5]: https://github.com/SciRuby/rb-gsl#installation
  [6]: https://github.com/SciRuby/rb-gsl#nmatrix-and-narray-usage
  [7]: https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray
  [8]: https://github.com/ruby-numo/narray
  [9]: https://github.com/ruby-numo/numo-gsl

See Also

@0xdevalias

Found some benchmarks comparing various underlying options (though they seem rather outdated).

@0xdevalias

Looking at site build times in #86 (comment) with/without `--lsi`, `gsl`, `nmatrix`, etc., the choice seemed to have negligible impact regardless of which we used.

In light of that, this thread about optimisations may not even be relevant anymore.

@mkasberg

👋 Hi,

I stumbled onto this thread from jekyll/classifier-reborn#193.

A few notes that you might find helpful:

  • You're not noticing any difference in build times with the `--lsi` option because your site (as it is today in this repo) doesn't use related posts, so the `--lsi` option does nothing. To use LSI, you need to call `site.related_posts` somewhere in a Liquid template. For example, you might add something like the following to `_layouts/post.html`:
    {% for post in site.related_posts limit:3 %}
      <p>{{ post.title }}</p>
    {% endfor %}
  • When you call `site.related_posts` without the `--lsi` option, it just returns recent posts.
  • If you are using `site.related_posts` and you pass the `--lsi` option, you'll see `Populating LSI...` in your `jekyll build --lsi` output. The build will be slow unless you have the `gsl` gem and the native GSL library installed. I haven't experimented with `nmatrix` or `narray` at all, but simply using the `gsl` gem results in a ~500x speed increase for my use.
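
The first two bullets can be condensed into a small sketch (hypothetical plain Ruby, not Jekyll's actual source; `lsi_index` stands in for whatever `--lsi` precomputes):

```ruby
# Sketch of the gating behaviour described above: with an LSI index,
# related posts come from the index; without one, they are just the
# most recent posts. All names here are invented for illustration.
Post = Struct.new(:title, :date)

def related_posts(all_posts, current, lsi_index: nil, limit: 10)
  if lsi_index
    lsi_index.fetch(current.title, [])   # hypothetical precomputed index
  else
    (all_posts - [current]).sort_by(&:date).reverse.first(limit)
  end
end

posts = [
  Post.new("older post",  Time.utc(2020, 1, 1)),
  Post.new("middle post", Time.utc(2020, 2, 1)),
  Post.new("newest post", Time.utc(2020, 3, 1)),
]

# Without --lsi: just the most recent other posts.
fallback = related_posts(posts, posts.first, limit: 2)
```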

Hope that helps. I appreciated some of your comments on some of the libraries so I thought I'd share some notes with you!

@0xdevalias

@mkasberg Thanks for the notes and insights :) Much appreciated.

I'd have to look deeper at things (it has been a long time since I did), but if the `related_posts` part isn't there anymore, then I guess I must have removed it from my templates at some stage. I know I had it at one point. Maybe the speed issue was why I removed it.

If/when I get back to looking at this I’ll make sure to check that out first!

@0xdevalias

The referenced issue links from the Gemfile:

It seems `rb-gsl` is only compatible up to GSL 2.1 as well:

Originally posted by @0xdevalias in #20 (comment)

@0xdevalias

The following posts by @mkasberg are also worth reading/considering before going too deep with this:

Having ChatGPT explain the differences between using LSI and embeddings for this purpose:

Latent Semantic Indexing (LSI)

  • Method: Uses Singular Value Decomposition (SVD) on term-document matrices.
  • Representation: Lower-dimensional space capturing latent semantic structures.
  • Applications: Information retrieval, document clustering, text summarization.
  • Advantages: Handles synonymy, less computationally intensive.
  • Limitations: Limited in capturing complex linguistic phenomena, performance depends on the corpus.
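
For reference, the SVD step the first bullet mentions can be written out (this is the standard LSI formulation, not anything specific to `classifier-reborn`):

```latex
% Truncated (rank-k) SVD of the m x n term-document matrix A:
A \approx U_k \Sigma_k V_k^{\mathsf{T}}
% Document j is represented by the j-th row of V_k \Sigma_k, and two
% documents are "related" when their reduced vectors are close in cosine:
\mathrm{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}
```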

Embeddings (e.g., OpenAI embeddings)

  • Method: Uses deep learning models like transformers.
  • Representation: Dense vectors capturing semantic meaning, context, and relationships.
  • Applications: Sentiment analysis, text classification, named entity recognition, question answering.
  • Advantages: Captures complex linguistic phenomena, state-of-the-art performance, versatile.
  • Limitations: Computationally intensive, requires significant resources, may need fine-tuning.

Summary

  • LSI is simpler and effective for basic tasks but less nuanced.
  • Embeddings provide richer, context-aware representations and superior performance on a wide range of tasks but require more computational power.

Originally posted by @0xdevalias in #87 (comment)
