{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false } }, "source": [ "This note is a simple data wrangling example worked using both the Python [data_algebra](https://github.com/WinVector/data_algebra) package and the [R](https://www.r-project.org) [cdata](https://github.com/WinVector/cdata) package. Both of these packages make data wrangling easy through he use of [coordinatized data](http://www.win-vector.com/blog/tag/coordinatized-data/) concepts (relying heavily on [Codd's \"rule of access\"](https://en.wikipedia.org/wiki/Codd%27s_12_rules)).\n", "\n", "The advantages of [data_algebra](https://github.com/WinVector/data_algebra) and [cdata](https://github.com/WinVector/cdata) are:\n", "\n", " * The user specifies their desired transform declaratively *by example* and *in data*. What one does is: work an example, and then write down what you want (we have a tutorial on this [here](https://winvector.github.io/cdata/articles/design.html)).\n", " \n", " * The transform systems can print what a transform is going to do. This makes reasoning about data transforms *much* easier.\n", " \n", " * The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).\n", " \n", "Let's re-work a small [R cdata example](https://github.com/WinVector/cdata/blob/master/vignettes/control_table_keys.Rmd), using the Python package [data_algebra](https://github.com/WinVector/data_algebra).\n", "\n", "## An Example\n", "\n", "First we import some modules and packages, and import some notional data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": " Sepal.Length Sepal.Width Petal.Length Petal.Width Species id\n0 5.1 3.5 1.4 0.2 setosa 0\n1 4.9 3.0 1.4 0.2 setosa 1\n2 4.7 3.2 1.3 0.2 setosa 2", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesid
05.13.51.40.2setosa0
14.93.01.40.2setosa1
24.73.21.30.2setosa2
\n
" }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# https://pandas.pydata.org\n", "import pandas\n", "\n", "# pip install data_algebra\n", "# data_algebra from https://github.com/WinVector/data_algebra/\n", "from data_algebra.cdata import *\n", "from data_algebra.data_ops import *\n", "import data_algebra.data_ops\n", "import data_algebra.SQLite\n", "\n", "\n", "# read our example data\n", "iris = pandas.read_csv('iris_small.csv')\n", "# print it out\n", "iris" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to move from this de-normalized or wide-form (or \"model matrix\"/\"data matrix\" form, where each record is exactly one row) into a tall form where records may span more than one row.\n", "\n", "Or, more concretely, we want our data to look like the following." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": " id Species Part Measure Value\n0 0 setosa Petal Length 1.4\n1 0 setosa Petal Width 0.2\n2 0 setosa Sepal Length 5.1\n3 0 setosa Sepal Width 3.5\n4 1 setosa Petal Length 1.4\n5 1 setosa Petal Width 0.2\n6 1 setosa Sepal Length 4.9\n7 1 setosa Sepal Width 3.0\n8 2 setosa Petal Length 1.3\n9 2 setosa Petal Width 0.2\n10 2 setosa Sepal Length 4.7\n11 2 setosa Sepal Width 3.2", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
idSpeciesPartMeasureValue
00setosaPetalLength1.4
10setosaPetalWidth0.2
20setosaSepalLength5.1
30setosaSepalWidth3.5
41setosaPetalLength1.4
51setosaPetalWidth0.2
61setosaSepalLength4.9
71setosaSepalWidth3.0
82setosaPetalLength1.3
92setosaPetalWidth0.2
102setosaSepalLength4.7
112setosaSepalWidth3.2
\n
" }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "answer = pandas.read_csv(\"answer.csv\")\n", "answer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice each row of the original data set is now four rows of the derived one. This \"tall form\" is often useful for plotting.\n", "\n", "This sort of conversion can be called an anti-pivot. In Python these sorts of transforms are specified with [pandas.DataFrame.pivot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html), [pandas.pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html), [pandas.melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html), \n", "and others. Many [R](https://www.r-project.org) packages supply related operators: [reshape]( https://CRAN.R-project.org/package=reshape), [reshape2](https://CRAN.R-project.org/package=reshape2), [data.table](https://CRAN.R-project.org/package=data.table), [tidyr](https://CRAN.R-project.org/package=tidyr), [cdata](https://CRAN.R-project.org/package=cdata), and more. We feel the [data_algebra](https://github.com/WinVector/data_algebra) and [cdata](https://CRAN.R-project.org/package=cdata) methods (which only perform data-reshaping, and not aggregation) offer a number of significant advantages (some of which we have already mentioned, and a few more of which we will see demonstrated in this note).\n", "\n", "Back to our example.\n", "\n", "To specify our desired transform, we build a structure describing what a data record looks like. We can crib this specification from the answer as follows." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": " Part Measure Value\n0 Petal Length Petal.Length\n1 Petal Width Petal.Width\n2 Sepal Length Sepal.Length\n3 Sepal Width Sepal.Width", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
PartMeasureValue
0PetalLengthPetal.Length
1PetalWidthPetal.Width
2SepalLengthSepal.Length
3SepalWidthSepal.Width
\n
" }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "control_table = answer.loc[answer.id == 0, ['Part', 'Measure']]\n", "control_table = control_table.reset_index(inplace=False, drop=True)\n", "control_table[\"Value\"] = [control_table['Part'][i] + '.' + control_table['Measure'][i] for \n", " i in range(control_table.shape[0])]\n", "control_table" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We can derive the control table from the answer, as we did here, or just type one in directly. The idea is: we can use any method we want to derive the prototype record shape, we are not limited to a sub-query language or methodology from any one package.\n", "\n", "For each record we take care to identify what keys identify records (the `record_keys`) and want parts identify rows within the record (the `control_table_keys`). We suppress the `record_key` when writing the control table, as these are exactly the values that do not vary inside each record.\n", "\n", "We combine all of these specification into a `RecordSpecification` as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": "data_algebra.cdata.RecordSpecification(\n record_keys=['id', 'Species'],\n control_table=pd.DataFrame({\n 'Part': ['Petal', 'Petal', 'Sepal', 'Sepal'],\n 'Measure': ['Length', 'Width', 'Length', 'Width'],\n 'Value': ['Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width'],\n }),\n control_table_keys=['Part', 'Measure'])" }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "record_spec = data_algebra.cdata.RecordSpecification(\n", " control_table,\n", " control_table_keys = ['Part', 'Measure'],\n", " record_keys = ['id', 'Species']\n", " )\n", "record_spec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notation any cells of the control table from columns that are not `control_table_keys` are \"value stand ins\". During data transfrom these cells will be replaced by values coming from the columns named by these cells.\n", "\n", "Now we can transform our original row-record oriented data into general block records. To do this we specify a `RecordMap` using our record specification to describe the outgoing record structure. The incoming record structure is implicitly assumed to be single-row records, unless we specify otherwise (using the `blocks_in` argument)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Transform row records of the form:\n", " record_keys: ['id', 'Species']\n", " ['Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']\n", "to block records of structure:\n", "RecordSpecification\n", " record_keys: ['id', 'Species']\n", " control_table_keys: ['Part', 'Measure']\n", " control_table:\n", " Part Measure Value\n", " 0 Petal Length Petal.Length\n", " 1 Petal Width Petal.Width\n", " 2 Sepal Length Sepal.Length\n", " 3 Sepal Width Sepal.Width\n", "\n" ] } ], "source": [ "mp_to_blocks = RecordMap(blocks_out=record_spec)\n", "print(str(mp_to_blocks))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we are ready to apply our specified transform." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": " id Species Part Measure Value\n0 0 setosa Petal Length 1.4\n1 0 setosa Petal Width 0.2\n2 0 setosa Sepal Length 5.1\n3 0 setosa Sepal Width 3.5\n4 1 setosa Petal Length 1.4\n5 1 setosa Petal Width 0.2\n6 1 setosa Sepal Length 4.9\n7 1 setosa Sepal Width 3.0\n8 2 setosa Petal Length 1.3\n9 2 setosa Petal Width 0.2\n10 2 setosa Sepal Length 4.7\n11 2 setosa Sepal Width 3.2", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
idSpeciesPartMeasureValue
00setosaPetalLength1.4
10setosaPetalWidth0.2
20setosaSepalLength5.1
30setosaSepalWidth3.5
41setosaPetalLength1.4
51setosaPetalWidth0.2
61setosaSepalLength4.9
71setosaSepalWidth3.0
82setosaPetalLength1.3
92setosaPetalWidth0.2
102setosaSepalLength4.7
112setosaSepalWidth3.2
\n
" }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arranged_blocks = mp_to_blocks.transform(iris)\n", "arranged_blocks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see the operation has been performed for us. Notice we specify the transform *declaratively* with data structures carrying deceptions of what we want, instead of having to build a sequence of verbs that realize the transformation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simplified interfaces\n", "\n", "We can also generate quick transforms for the most common cases: moving into and out of row records (what is usually handled by melt and so on).\n", "\n", "These functions are `pivot_row_recs_to_blocks()` and `pivot_blocks_to_rowrecs()`. They cover the cases where the control table has two columns. Examples can be found [here](pivot_unpivot.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inverting the transform\n", "\n", "An inverse transform is simply expressed by reversing the roles of the `blocks_out` and `blocks_in` arguments. In this case the output is row-records, as we didn't specify an outgoing block structure with `blocks_out`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Transform block records of structure:\n", "RecordSpecification\n", " record_keys: ['id', 'Species']\n", " control_table_keys: ['Part', 'Measure']\n", " control_table:\n", " Part Measure Value\n", " 0 Petal Length Petal.Length\n", " 1 Petal Width Petal.Width\n", " 2 Sepal Length Sepal.Length\n", " 3 Sepal Width Sepal.Width\n", "to row records of the form:\n", " record_keys: ['id', 'Species']\n", " ['Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']\n", "\n" ] } ], "source": [ "mp_to_rows = RecordMap(blocks_in=record_spec)\n", "print(str(mp_to_rows))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": " id Species Petal.Length Petal.Width Sepal.Length Sepal.Width\n0 0 setosa 1.4 0.2 5.1 3.5\n1 1 setosa 1.4 0.2 4.9 3.0\n2 2 setosa 1.3 0.2 4.7 3.2", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
idSpeciesPetal.LengthPetal.WidthSepal.LengthSepal.Width
00setosa1.40.25.13.5
11setosa1.40.24.93.0
22setosa1.30.24.73.2
\n
" }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arranged_rows = mp_to_rows.transform(arranged_blocks)\n", "arranged_rows" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Arbitrary transforms\n", "\n", "Arbitrary record to record transforms can be specified by setting both `blocks_in` (to describe incoming structure) and `blocks_out` (to describe outgoing structure) at the same time. \n", "\n", "## Transforms in databases\n", "\n", "`data_algebra` also implements all the transform steps in databases using [`SQL`](https://en.wikipedia.org/wiki/SQL).\n", "\n", "These queries can be seen below." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "db_model = data_algebra.SQLite.SQLiteModel()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-- data_algebra SQL https://github.com/WinVector/data_algebra\n", "-- dialect: SQLiteModel\n", "-- string quote: '\n", "-- identifier quote: \"\n", "SELECT\n", " a.\"id\" AS \"id\",\n", " a.\"Species\" AS \"Species\",\n", " b.\"Part\" AS \"Part\",\n", " b.\"Measure\" AS \"Measure\",\n", " CASE WHEN CAST(b.\"Value\" AS VARCHAR) = 'Petal.Length' THEN a.\"Petal.Length\" WHEN CAST(b.\"Value\" AS VARCHAR) = 'Petal.Width' THEN a.\"Petal.Width\" WHEN CAST(b.\"Value\" AS VARCHAR) = 'Sepal.Length' THEN a.\"Sepal.Length\" WHEN CAST(b.\"Value\" AS VARCHAR) = 'Sepal.Width' THEN a.\"Sepal.Width\" ELSE NULL END AS \"Value\"\n", " FROM ( SELECT * FROM \n", " \"iris\"\n", " ) a\n", " CROSS JOIN (\n", " SELECT\n", " *\n", "FROM (\n", " SELECT 'Petal' AS \"Part\", 'Length' AS \"Measure\", 'Petal.Length' AS \"Value\"\n", " UNION ALL SELECT 'Petal' AS \"Part\", 'Width' AS \"Measure\", 'Petal.Width' AS \"Value\"\n", " UNION ALL SELECT 'Sepal' AS \"Part\", 'Length' AS \"Measure\", 'Sepal.Length' AS \"Value\"\n", " UNION ALL SELECT 'Sepal' AS \"Part\", 'Width' AS \"Measure\", 'Sepal.Width' AS \"Value\"\n", ") \"table_values\"\n", "\n", " ) b\n", " ORDER BY\n", " a.\"id\", \n", " a.\"Species\", \n", " b.\"Part\", \n", " b.\"Measure\"\n", "\n" ] } ], "source": [ "print(db_model.to_sql(\n", " data_algebra.data_ops.describe_table(\n", " iris, 'iris').map_records(blocks_out=record_spec)\n", "))\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-- data_algebra SQL https://github.com/WinVector/data_algebra\n", "-- dialect: SQLiteModel\n", "-- string quote: '\n", "-- identifier quote: \"\n", "SELECT\n", " \"id\" AS \"id\",\n", " \"Species\" AS \"Species\",\n", " MAX(CASE WHEN ( CAST(\"Part\" AS VARCHAR) = 'Petal' ) AND ( CAST(\"Measure\" AS VARCHAR) = 'Length' ) THEN \"Value\" ELSE NULL END) AS \"Petal.Length\",\n", " MAX(CASE WHEN ( CAST(\"Part\" AS VARCHAR) = 'Petal' ) AND ( CAST(\"Measure\" AS VARCHAR) = 'Width' ) THEN \"Value\" ELSE NULL END) AS \"Petal.Width\",\n", " MAX(CASE WHEN ( CAST(\"Part\" AS VARCHAR) = 'Sepal' ) AND ( CAST(\"Measure\" AS VARCHAR) = 'Length' ) THEN \"Value\" ELSE NULL END) AS \"Sepal.Length\",\n", " MAX(CASE WHEN ( CAST(\"Part\" AS VARCHAR) = 'Sepal' ) AND ( CAST(\"Measure\" AS VARCHAR) = 'Width' ) THEN \"Value\" ELSE NULL END) AS \"Sepal.Width\"\n", " FROM ( SELECT * FROM \n", " \"arranged_blocks\"\n", " ) a\n", " GROUP BY\n", " \"id\",\n", " \"Species\"\n", " ORDER BY \n", " \"id\",\n", " \"Species\"\n", "\n" ] } ], "source": [ "print(db_model.to_sql(\n", " descr(arranged_blocks=arranged_blocks).map_records(blocks_in=record_spec)\n", "))\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The use case for `SQL` queries is: the queries can be used to with \"`CREATE TABLE table_name AS`\" to materialize transform results in a database, without round-tripping the data in and out of the database.\n", "\n", "As complicated as the queries look, they actually expose some deep truths:\n", "\n", " * Converting rowrecs to blocks is essentially a cross-join of the data to the record description. Each combination of data row and record description row builds a new result row.\n", "\n", " * Converting blocks to rowrecs is an aggregation. Each set of rows corresponding to a given data record is aggregated into a single result row.\n", " \n", " * Just about any arbitrary record shape to arbitrary record shape can be written as a transform from the first record shape to row-records (record sets that have exactly one row per record), followed by a transform from the row-records to the new format. This transform can preserve column types as in the intermediate form each different record entry has its own column. This is an advantage of using a \"thin\" intermediate form such as [RDF triples](https://en.wikipedia.org/wiki/Semantic_triple).\n", "\n", "This leads us to believe that transforming to and from single-row records are in fact fundemental operations, and not just implementation details.\n", "\n", "## The R `cdata` version\n", "\n", "The `data_algebra` had been previously implemented in \n", "[R](https://www.r-project.org) in the [`cdata`](https://github.com/WinVector/cdata), [`rquery`](https://github.com/WinVector/rquery), and [`rqdatatable`](https://github.com/WinVector/rqdatatable) packages.\n", "This packages support both in-memory operations and translation of the transforms to SQL.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "The [`cdata`](https://github.com/WinVector/cdata) and [`data_algebra`](https://github.com/WinVector/data_algebra) systems offer powerful implementations, and deep understanding of the nature of record transformations. They allow one to reshape data quickly and conveniently either in R. Python/[Pandas](https://pandas.pydata.org), or even SQL." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }