The New Column
API
Today we at scicloj deployed
a new Column API (tablecloth.column.api
) into the data
processing library Tablecloth (available
as of version 7.029.1
). This new API adds a new primitive
to the Tablecloth system: the column
. Here’s how we use
it:
(require '[tablecloth.column.api :as tcc])
(tcc/column [1 2 3 4 5])
;; => #tech.v3.dataset.column<int64>[5]
null
[1, 2, 3, 4, 5]
The new column
is the same as the columns that comprise
a dataset
. It is a one-dimensional typed sequence of
values. Underneath the hood, the column is just the column
defined in tech.ml.dataset,
the library that backs Tablecloth.
The difference is that now when you are using Tablecloth you have the
option of interacting directly with a column
using an API
that provides a set of operations that always take and return a
column.
Basic Usage
Let’s go through a simple example. Let’s say we have some test scores that we need to analyze:
(def test-scores (tcc/column [85 92 78 88 95 83 80 90]))
test-scores
;; => #tech.v3.dataset.column<int64>[8]
null
[85, 92, 78, 88, 95, 83, 80, 90]
Now that we have these values in a column, we can easily perform operations on them:
(tcc/mean test-scores)
;; => 86.375
(tcc/standard-deviation test-scores)
;; => 5.926634795564849
There are a many operations that one can perform. At the moment, the
available operations are those that you would have previously accessed
by importing the tech.v3.datatype.functional
namespace from dtype-next.
To get a fuller picture of the Column API and how it works, please consult the Column API section in the Tablecloth documentation.
Easier Column Operations on the Dataset
The changes we’ve deployed also improve the expressive power of Tablecloth’s standard Dataset API. Previously, if you needed to do something simple like a group by and aggregation on a column in a dataset, the code could become unnecessarily verbose:
(defonce stocks
(tc/dataset "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv" {:key-fn keyword}))
(tc/column-names stocks)
(-> stocks
(tc/group-by [:symbol])
(tc/aggregate (fn [ds]
(-> ds
:price
tech.v3.datatype.functional/mean))))
;; => _unnamed [5 2]:
| :symbol | summary |
|---------|-------------:|
| MSFT | 24.73674797 |
| AMZN | 47.98707317 |
| IBM | 91.26121951 |
| GOOG | 415.87044118 |
| AAPL | 64.73048780 |
With the new column operations within for datasets, you can now simply write:
(-> stocks
(tc/group-by [:symbol])
(tc/mean [:price]))
The same set operations available to be run on the
column
can be called on columns in the datasest. However,
when operating a dataset, functions that would return a scalar value act
as aggregator functions, as seen above.
Functions that would return a new column allow the user to specify a target column to be added to the dataset, as in this example where we first use the method above to add a column with the mean back to stocks:
(def stocks-with-mean
(-> stocks
(tc/group-by [:symbol])
(tc/mean [:price])
(tc/rename-columns {"summary" :mean-price})
(tc/inner-join stocks :symbol)))
stocks-with-mean
;; => inner-join [560 4]:
;; | :symbol | :mean-price | :date | :price |
;; |---------|------------:|------------|-------:|
;; | MSFT | 24.73674797 | 2000-01-01 | 39.81 |
;; | MSFT | 24.73674797 | 2000-02-01 | 36.35 |
;; | MSFT | 24.73674797 | 2000-03-01 | 43.22 |
;; | MSFT | 24.73674797 | 2000-04-01 | 28.37 |
Then we use a dataset column operation that returns a column – column division, in this case – to add a new column holding the relative daily price of the stock:
(require '[tablecloth.api :as tc])
(-> stocks-with-mean
(tc// :relative-daily-price [:price :mean-price]))
;; => inner-join [560 5]:
;; | :symbol | :mean-price | :date | :price | :relative-daily-price |
;; |---------|------------:|------------|-------:|----------------------:|
;; | MSFT | 24.73674797 | 2000-01-01 | 39.81 | 1.60934655 |
;; | MSFT | 24.73674797 | 2000-02-01 | 36.35 | 1.46947368 |
;; | MSFT | 24.73674797 | 2000-03-01 | 43.22 | 1.74719814 |
;; | MSFT | 24.73674797 | 2000-04-01 | 28.37 | 1.14687670 |
For more information, on these operations, please consult the documentation here.
Thanks to Clojurist Together
This contribution to Tablecloth was supported by Clojurists Together through their Quarterly Fellowships for open source development.
I welcome feedback, so please write to me via email or twitter with any comments and/or questions.