Wednesday, February 12, 2014

What is SageMathCloud?

The two main reasons for the existence of SageMathCloud (SMC) are:

Goal 1. Increase resources for Sage: Generate a different long-term revenue stream to support development of Sage, i.e., open source mathematical software. By "different", I mean different from government and foundation grants and donations, which are relatively limited for the development of primarily pure mathematics software, which is what Sage specializes in. Even in my wildest dreams, it is very unlikely that Sage will get more than a million dollars a year in funding (and in practice it gets a lot less); however, a successful commercial product with wide adoption has the potential to generate significantly more than a million dollars a year in revenue -- of course most would go back into the product... but when the product is partly Sage, that's fine. The National Science Foundation (and other donors) have played a major part during the last 8 years in funding Sage, but I think everybody would benefit from another funding source.

Goal 2. Increase the usage of Sage: The number of unique visitors per month to http://sagemath.org grew nicely from 2005 (when I started Sage) until Summer 2011, after which point it has remained fairly constant at about 70,000 unique visitors per month. There is essentially no growth at all: it was 70,332 in Jan 2011, and it was 70,449 last month (Jan 2014), both with a bounce rate of about 50%. A significant obstruction to growth is accessibility, which SMC helps to address for certain users (last month the SMC website had 17,700 unique visitors with a bounce rate of about 30%).
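
To make "no growth at all" concrete, here is a small sketch of the arithmetic behind those figures. The visitor counts and bounce rates are the ones quoted above; the helper names are just for illustration, and since the bounce rates are only approximate, the "stayed" counts are rough estimates.

```python
def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


def visitors_who_stayed(unique_visitors, bounce_rate):
    """Rough count of visitors who did not leave after a single page view."""
    return unique_visitors * (1.0 - bounce_rate)


# Figures quoted in the text (bounce rates are approximate).
sagemath_jan_2011 = 70_332
sagemath_jan_2014 = 70_449
smc_jan_2014 = 17_700

growth = percent_change(sagemath_jan_2011, sagemath_jan_2014)
print(f"sagemath.org growth, Jan 2011 to Jan 2014: {growth:.2f}%")
print(f"sagemath.org visitors who stayed: about {visitors_who_stayed(sagemath_jan_2014, 0.50):.0f}")
print(f"SMC visitors who stayed: about {visitors_who_stayed(smc_jan_2014, 0.30):.0f}")
```

In other words, three years of traffic to sagemath.org amount to a change of well under one percent.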

Here's an actual email I received from somebody literally as I was writing this, which I think illustrates how SMC addresses the second goal:

    Hey William,

    Today I stopped by cloud.sagemath.com because 
    I wanted to do some computation with sage, and 
    cloud is announced in a big way on sagemath.org

    This is after a lengthy hiatus from computing
    with sage ( maybe a year ).

    Using cloud.sagemath.com completely blew my 
    mind.  At first I did not really understand 
    why sagenb was ditched after all the work that 
    went into it.  But man, cloud is really a 
    pleasure to use !

    I just wanted to share the joy :)

    Thanks for all that you do !

Licensing and Reuse of the SageMathCloud Codebase

The design and coding of SageMathCloud (SMC) has been mostly supported by the University of Washington (UW). Due to goal 1 above, I have been working from the start (before a line of code was written) with the commercialization/tech transfer office of UW, who (because of goal 1) are not enthusiastic about simply open sourcing the whole SMC codebase, and this is a condition of their help with commercialization. Some of SMC is open sourced, mainly the code that runs on the VM's and some of the HTML5 client that runs in the browser. We also plan to make the HTML5 client and a mini server BSD licensed, and include them with Sage (say) as a new local graphical interface. Of course SMC builds on top of many standard open source libraries and tools (e.g., CodeMirror, Cassandra, ZFS, Node.js, etc.).

There is, however, a large amount of interesting backend code, which is really the "cloud" part of SMC, and which we do not intend to release as open source. We do intend to sell licenses (with support) for the complete package, when it is sufficiently polished, since many organizations want to run their own private SMC servers, mainly for confidentiality reasons.

Goal 2 above mainly impacts how we market SMC. However, it's easy to completely ignore Sage and still get a lot of value out of SMC. I just glanced at what people are doing as I write this, and the result seems pretty typical: latex'ing documents, some Sage worksheets, some IPython notebooks, editing a perl script.

It's important to understand how SMC differs from other approaches to cloud computing. It's designed to make certain things very easy, but they are quite different from the things that "traditional" cloud stacks like OpenStack are designed to make easy. SMC is supposed to make the following easy:

  • using Sage and IPython, both command line and notebook interfaces.
  • writing a paper using LaTeX (possibly with a specific private list of collaborators).
  • editing source code, e.g., developing Python/C/etc. libraries, again possibly with realtime collaboration.
  • creating collaborative "projects", which are really a Linux account on a machine, and provide isolation from other projects.
  • backups: all data is automatically snapshotted frequently.
  • high availability: failure of a machine (or even whole data center) results in at most a few minutes of lost time/work.
  • speed: files are stored on a compressed local filesystem, which is snapshotted and replicated out regularly; thus the filesystem feels fast and is scalable, as compared to a networked filesystem.
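
To illustrate the "snapshotted and replicated out regularly" idea from the last bullets, here is a minimal sketch of what a snapshot-then-replicate step could look like with ZFS (one of the tools mentioned above). This is not the actual SMC backend code; the dataset names, the host name and the timestamp naming scheme are all hypothetical, and the functions only build the command lines rather than run them.

```python
import shlex
from datetime import datetime, timezone


def snapshot_name(dataset, when):
    """Name a snapshot after a timestamp (naming scheme is an assumption)."""
    return f"{dataset}@{when.strftime('%Y-%m-%dT%H%M%S')}"


def snapshot_command(dataset, when):
    """Command that takes a point-in-time snapshot of a local dataset."""
    return ["zfs", "snapshot", snapshot_name(dataset, when)]


def replicate_command(previous, current, remote_host, remote_dataset):
    """Shell pipeline that sends only the changes between two snapshots
    to another machine, where they are applied to a replica dataset."""
    send = ["zfs", "send", "-i", previous, current]
    receive = ["ssh", remote_host, "zfs", "receive", remote_dataset]
    return " | ".join(shlex.join(part) for part in (send, receive))


if __name__ == "__main__":
    # Hypothetical project dataset and backup host, for illustration only.
    dataset = "tank/projects/demo"
    earlier = datetime(2014, 2, 12, 18, 0, 0, tzinfo=timezone.utc)
    later = datetime(2014, 2, 12, 18, 5, 0, tzinfo=timezone.utc)

    print(shlex.join(snapshot_command(dataset, later)))
    print(replicate_command(snapshot_name(dataset, earlier),
                            snapshot_name(dataset, later),
                            "backup.example.org", dataset))
```

The point of the incremental send is that day-to-day reads and writes hit the fast local filesystem, while only the differences between snapshots travel over the network.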

The above design goals are useful for certain target audiences, e.g., people doing Sage/Python/etc. development, teachers and students in courses that make use of Sage/Python/etc., and collaborative math research projects. SMC is designed so that a large number of people can make simultaneous small use of ever-expanding resources. SMC should also fully support the "social networks" that form in this context. At the same time, it's critical that SMC have excellent uptime and availability (and offsite backups, just in case), so that people can trust it. By trust, I don't mean so much in the sense of "trust it with proprietary info", but in the sense of "trust it to not just lose all my data and to be there when I'm giving a talk/teaching a class/need to do homework/etc.".

However, exactly the above design goals are at odds with some of the goals of large-scale scientific computing and supercomputing. The following are not design goals of SMC:

  • supercomputing -- have large data that many distributed processes operate on: exactly what people often do on supercomputers (or with Hadoop, etc.)
  • traditional "cloud computing" -- dynamically spin up many VM's, run computations on them; then destroy them. With SMC, things tend to get created but not destroyed (e.g., projects and files in them), and a full VM is much too heavy given the number of users and type of usage that we have already (and plan to have).

What happens in practice with SMC is that people run smaller-scale computations on SMC (say things that just take a few cores), and when they want to run something bigger, they ssh from SMC to other resources they have (e.g., a supercomputer account) and launch computations there. All project collaborators can see what anybody types in a terminal, which can be helpful when working with remote compute clusters.

Anyway, I hope this helps to clarify what SMC actually is.