| Copyright | (c) 2025 |
|---|---|
| License | GPL-3.0 |
| Maintainer | mschavinda@gmail.com |
| Stability | experimental |
| Portability | POSIX |
| Safe Haskell | None |
| Language | Haskell2010 |
DataFrame
Description
Batteries-included entry point for the DataFrame library.
This module re-exports the most commonly used pieces of the dataframe library so you
can get productive fast in GHCi, IHaskell, or scripts.
Naming convention
- Use the D. ("DataFrame") prefix for core table operations.
- Use the F. ("Functions") prefix for the expression DSL (columns, math, aggregations).
Example session:
We provide a script that imports the core functionality and defines helpful macros for writing safe code.
$ cabal update
$ cabal install dataframe
$ dataframe
Configuring library for fake-package-0...
Warning: No exposed modules
GHCi, version 9.6.7: https://www.haskell.org/ghc/ :? for help
Loaded GHCi configuration from /tmp/cabal-repl.-242816/setcwd.ghci
========================================
📦Dataframe
========================================
✨ Modules were automatically imported.
💡 Use prefix D for core functionality.
● E.g. D.readCsv "/path/to/file"
💡 Use prefix F for expression functions.
● E.g. F.sum (F.col @Int "value")
✅ Ready.
Loaded GHCi configuration from ./dataframe.ghci
ghci>
Quick start
Load a CSV, select a few columns, filter, derive a column, then group + aggregate:
-- 1) Load data
ghci> df0 <- D.readCsv "data/housing.csv"
ghci> D.describeColumns df0
-------------------------------------------------------------------------------------------------------------
Column Name | # Non-null Values | # Null Values | # Partially parsed | # Unique Values | Type
--------------------|-------------------|---------------|--------------------|-----------------|-------------
Text | Int | Int | Int | Int | Text
--------------------|-------------------|---------------|--------------------|-----------------|-------------
ocean_proximity | 20640 | 0 | 0 | 5 | Text
median_house_value | 20640 | 0 | 0 | 3842 | Double
median_income | 20640 | 0 | 0 | 12928 | Double
households | 20640 | 0 | 0 | 1815 | Double
population | 20640 | 0 | 0 | 3888 | Double
total_bedrooms | 20640 | 0 | 0 | 1924 | Maybe Double
total_rooms | 20640 | 0 | 0 | 5926 | Double
housing_median_age | 20640 | 0 | 0 | 52 | Double
latitude | 20640 | 0 | 0 | 862 | Double
longitude | 20640 | 0 | 0 | 844 | Double
-- 2) Project & filter
ghci> :declareColumns df0
ghci> df1 = D.filterWhere (ocean_proximity .== "ISLAND") df0 D.|> D.select [F.name median_house_value, F.name median_income, F.name ocean_proximity]
-- 3) Add a derived column using the expression DSL
-- (the columns exposed in step 2 are already typed, so no TypeApplications are needed here)
ghci> df2 = D.derive "rooms_per_household" (total_rooms / households) df0
-- 4) Group + aggregate
ghci> let grouped = D.groupBy ["ocean_proximity"] df0
ghci> let summary =
D.aggregate
[ F.maximum median_house_value `F.as` "max_house_value"]
grouped
ghci> D.take 5 summary
----------------------------------
ocean_proximity | max_house_value
-----------------|----------------
Text | Double
-----------------|----------------
<1H OCEAN | 500001.0
INLAND | 500001.0
ISLAND | 450000.0
NEAR BAY | 500001.0
NEAR OCEAN | 500001.0
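The quick start chains steps with D.|>. Its type, a -> (a -> b) -> b, admits only one total implementation: apply the function on the right to the value on the left. The sketch below re-creates the operator locally and uses plain lists in place of dataframes so it runs with base alone. The fixity declaration and the pipeline name are assumptions made for this illustration, not taken from the library.

```haskell
module Main where

-- A local stand-in for D.|>, written from its published type.
-- The fixity is an assumption made for this sketch.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Plain lists stand in for dataframes so this runs with base alone.
pipeline :: [Int] -> [Int]
pipeline xs =
  xs
    |> filter even -- keep some "rows"
    |> map (* 10)  -- derive new values
    |> take 3      -- keep the first few results

main :: IO ()
main = print (pipeline [1 .. 10]) -- prints [20,40,60]
```

Reading top to bottom, each step receives the result of the previous one, which is the same shape as the filterWhere-then-select pipeline in step 2.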
Simple operations (cheat sheet)
Most users only need a handful of verbs:
I/O
D.readCsv :: FilePath -> IO DataFrame
D.readTsv :: FilePath -> IO DataFrame
D.writeCsv :: FilePath -> DataFrame -> IO ()
D.readParquet :: FilePath -> IO DataFrame
Exploration
D.take :: Int -> DataFrame -> DataFrame
D.takeLast :: Int -> DataFrame -> DataFrame
D.describeColumns :: DataFrame -> DataFrame
D.summarize :: DataFrame -> DataFrame
Row ops
D.filter :: Expr a -> (a -> Bool) -> DataFrame -> DataFrame
D.filterWhere :: Expr Bool -> DataFrame -> DataFrame
D.sortBy :: [SortOrder] -> DataFrame -> DataFrame
Column ops
D.select :: [Text] -> DataFrame -> DataFrame
D.exclude :: [Text] -> DataFrame -> DataFrame
D.rename :: Text -> Text -> DataFrame -> DataFrame
D.renameMany :: [(Text,Text)] -> DataFrame -> DataFrame
D.derive :: Text -> D.Expr a -> DataFrame -> DataFrame
Group & aggregate
D.groupBy :: [Text] -> DataFrame -> GroupedDataFrame
D.aggregate :: [NamedExpr] -> GroupedDataFrame -> DataFrame
Joins
D.innerJoin / D.leftJoin / D.rightJoin / D.fullOuterJoin
Expression DSL (F.*) at a glance
Columns (typed):
F.col @Text "ocean_proximity"
F.col @Double "total_rooms"
F.lit @Double 1.0
Math & comparisons (overloaded by type):
(+), (-), (*), (/), abs, log, exp, round
(F.eq), (F.gt), (F.geq), (F.lt), (F.leq)
(.==), (.>), (.>=), (.<), (.<=)
Aggregations (for D.aggregate):
F.count @a (F.col @a "c")
F.sum @Double (F.col @Double "x")
F.mean @Double (F.col @Double "x")
F.min @t (F.col @t "x")
F.max @t (F.col @t "x")
REPL power-tool: ':declareColumns'
Use :declareColumns df in GHCi/IHaskell to turn each column of a bound DataFrame
into a local binding with the same (mangled if needed) name, typed as an Expr over
the column's element type. This is great for quick ad-hoc analysis, plotting, or hand-rolled checks.
-- Suppose df has columns: "passengers" :: Int, "fare" :: Double, "payment" :: Text
ghci> :set -XTemplateHaskell
ghci> :declareColumns df
-- Now you have in scope:
ghci> :type passengers
passengers :: Expr Int
ghci> :type fare
fare :: Expr Double
ghci> :type payment
payment :: Expr Text
-- You can use them directly:
ghci> D.derive "fare_with_tip" (fare * 1.2) df
Notes:
- Name mangling: spaces and non-identifier characters are replaced (e.g. "trip id" -> trip_id).
- Optional/nullable columns are exposed as Expr (Maybe a).
Synopsis
- empty :: DataFrame
- null :: DataFrame -> Bool
- data DataFrame
- data GroupedDataFrame
- toMarkdownTable :: DataFrame -> Text
- fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column
- toList :: Columnable a => Column -> [a]
- data Column
- fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column
- fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column
- hasElemType :: Columnable a => Column -> Bool
- hasMissing :: Column -> Bool
- isNumeric :: Column -> Bool
- toVector :: forall a v. (Vector v a, Columnable a) => Column -> Either DataFrameException (v a)
- data Any
- type Row = Vector Any
- fromAny :: Columnable a => Any -> Maybe a
- rowValue :: Expr a -> [(Text, Any)] -> Maybe a
- toAny :: Columnable a => a -> Any
- toRowList :: DataFrame -> [[(Text, Any)]]
- toRowVector :: [Text] -> DataFrame -> Vector Row
- data Expr a
- prettyPrint :: Expr a -> String
- module DataFrame.Display
- insert :: (Columnable a, Foldable t) => Text -> t a -> DataFrame -> DataFrame
- fold :: (a -> DataFrame -> DataFrame) -> [a] -> DataFrame -> DataFrame
- rename :: Text -> Text -> DataFrame -> DataFrame
- columnAsDoubleVector :: Expr Double -> DataFrame -> Either DataFrameException (Vector Double)
- dimensions :: DataFrame -> (Int, Int)
- columnNames :: DataFrame -> [Text]
- nRows :: DataFrame -> Int
- nColumns :: DataFrame -> Int
- insertVector :: Columnable a => Text -> Vector a -> DataFrame -> DataFrame
- insertColumn :: Text -> Column -> DataFrame -> DataFrame
- insertVectorWithDefault :: Columnable a => a -> Text -> Vector a -> DataFrame -> DataFrame
- insertWithDefault :: (Columnable a, Foldable t) => a -> Text -> t a -> DataFrame -> DataFrame
- insertUnboxedVector :: (Columnable a, Unbox a) => Text -> Vector a -> DataFrame -> DataFrame
- cloneColumn :: Text -> Text -> DataFrame -> DataFrame
- renameMany :: [(Text, Text)] -> DataFrame -> DataFrame
- describeColumns :: DataFrame -> DataFrame
- fromNamedColumns :: [(Text, Column)] -> DataFrame
- fromUnnamedColumns :: [Column] -> DataFrame
- fromRows :: [Text] -> [[Any]] -> DataFrame
- valueCounts :: Columnable a => Expr a -> DataFrame -> [(a, Int)]
- columnAsVector :: Columnable a => Expr a -> DataFrame -> Either DataFrameException (Vector a)
- valueProportions :: Columnable a => Expr a -> DataFrame -> [(a, Double)]
- toFloatMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Float))
- toDoubleMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Double))
- toIntMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Int))
- columnAsIntVector :: Expr Int -> DataFrame -> Either DataFrameException (Vector Int)
- columnAsFloatVector :: Expr Float -> DataFrame -> Either DataFrameException (Vector Float)
- columnAsUnboxedVector :: (Columnable a, Unbox a) => Expr a -> DataFrame -> Either DataFrameException (Vector a)
- columnAsList :: Columnable a => Expr a -> DataFrame -> [a]
- schemaType :: Columnable a => SchemaType
- data HeaderSpec
  = NoHeader
  | UseFirstRow
  | ProvideNames [Text]
- data ReadOptions = ReadOptions {}
- data TypeSpec
- defaultReadOptions :: ReadOptions
- readCsv :: FilePath -> IO DataFrame
- readCsvWithOpts :: ReadOptions -> FilePath -> IO DataFrame
- readSeparated :: ReadOptions -> FilePath -> IO DataFrame
- readTsv :: FilePath -> IO DataFrame
- writeCsv :: FilePath -> DataFrame -> IO ()
- writeSeparated :: Char -> FilePath -> DataFrame -> IO ()
- module DataFrame.IO.Unstable.CSV
- readParquet :: FilePath -> IO DataFrame
- sample :: RandomGen g => g -> Double -> DataFrame -> DataFrame
- filter :: Columnable a => Expr a -> (a -> Bool) -> DataFrame -> DataFrame
- range :: (Int, Int) -> DataFrame -> DataFrame
- take :: Int -> DataFrame -> DataFrame
- drop :: Int -> DataFrame -> DataFrame
- select :: [Text] -> DataFrame -> DataFrame
- selectBy :: [SelectionCriteria] -> DataFrame -> DataFrame
- data SelectionCriteria
- byIndexRange :: (Int, Int) -> SelectionCriteria
- byName :: Text -> SelectionCriteria
- byNameProperty :: (Text -> Bool) -> SelectionCriteria
- byNameRange :: (Text, Text) -> SelectionCriteria
- byProperty :: (Column -> Bool) -> SelectionCriteria
- cube :: (Int, Int) -> DataFrame -> DataFrame
- dropLast :: Int -> DataFrame -> DataFrame
- exclude :: [Text] -> DataFrame -> DataFrame
- filterAllJust :: DataFrame -> DataFrame
- filterAllNothing :: DataFrame -> DataFrame
- filterBy :: Columnable a => (a -> Bool) -> Expr a -> DataFrame -> DataFrame
- filterJust :: Text -> DataFrame -> DataFrame
- filterNothing :: Text -> DataFrame -> DataFrame
- filterWhere :: Expr Bool -> DataFrame -> DataFrame
- kFolds :: RandomGen g => g -> Int -> DataFrame -> [DataFrame]
- randomSplit :: RandomGen g => g -> Double -> DataFrame -> (DataFrame, DataFrame)
- takeLast :: Int -> DataFrame -> DataFrame
- module DataFrame.Operations.Transformations
- groupBy :: [Text] -> DataFrame -> GroupedDataFrame
- aggregate :: [NamedExpr] -> GroupedDataFrame -> DataFrame
- distinct :: DataFrame -> DataFrame
- shuffle :: RandomGen g => g -> DataFrame -> DataFrame
- sortBy :: [SortOrder] -> DataFrame -> DataFrame
- data SortOrder
- module DataFrame.Operations.Merge
- module DataFrame.Operations.Join
- sum :: (Columnable a, Num a) => Expr a -> DataFrame -> a
- correlation :: Text -> Text -> DataFrame -> Maybe Double
- frequencies :: Text -> DataFrame -> DataFrame
- genericPercentile :: (Columnable a, Ord a) => Int -> Expr a -> DataFrame -> a
- imputeWith :: Columnable b => (Expr b -> Expr b) -> Expr (Maybe b) -> DataFrame -> DataFrame
- interQuartileRange :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double
- mean :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double
- meanMaybe :: (Columnable a, Real a) => Expr (Maybe a) -> DataFrame -> Double
- median :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double
- medianMaybe :: (Columnable a, Real a) => Expr (Maybe a) -> DataFrame -> Double
- percentile :: (Columnable a, Real a, Unbox a) => Int -> Expr a -> DataFrame -> Double
- skewness :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double
- standardDeviation :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double
- summarize :: DataFrame -> DataFrame
- variance :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double
- module DataFrame.Errors
- module DataFrame.Display.Terminal.Plot
- (|>) :: a -> (a -> b) -> b
Core data structures
data GroupedDataFrame Source #
A record that contains information about how and what
rows are grouped in the dataframe. This can only be used with
aggregate.
Instances
| Show GroupedDataFrame Source # | |
Defined in DataFrame.Internal.DataFrame Methods showsPrec :: Int -> GroupedDataFrame -> ShowS # show :: GroupedDataFrame -> String # showList :: [GroupedDataFrame] -> ShowS # | |
| Eq GroupedDataFrame Source # | |
Defined in DataFrame.Internal.DataFrame Methods (==) :: GroupedDataFrame -> GroupedDataFrame -> Bool # (/=) :: GroupedDataFrame -> GroupedDataFrame -> Bool # | |
toMarkdownTable :: DataFrame -> Text Source #
For showing the dataframe as markdown in notebooks.
fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column Source #
O(n) Convert a list to a column. Automatically picks the best representation of a vector to store the underlying data in.
Examples:
> fromList [(1 :: Int), 2, 3, 4]
[1,2,3,4]
toList :: Columnable a => Column -> [a] Source #
O(n) Converts a column to a list. Throws an exception if the wrong type is specified.
Examples:
> column = fromList [(1 :: Int), 2, 3, 4]
> toList @Int column
[1,2,3,4]
> toList @Double column
exception: ...
data Column Source #
Our representation of a column is a GADT that picks a storage representation based on the type of the underlying data.
This allows us to pattern match on data kinds and limit some operations to only some kinds of vectors. E.g. operations for missing data only happen in an OptionalColumn.
fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column Source #
O(n) Convert an unboxed vector to a column. This avoids the extra conversion if you already have the data in an unboxed vector.
Examples:
> import qualified Data.Vector.Unboxed as VU
> fromUnboxedVector (VU.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]
fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column Source #
O(n) Convert a vector to a column. Automatically picks the best representation of a vector to store the underlying data in.
Examples:
> import qualified Data.Vector as VB
> fromVector (VB.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]
hasElemType :: Columnable a => Column -> Bool Source #
Checks whether a column's elements are of the given type.
hasMissing :: Column -> Bool Source #
Checks if a column contains missing values.
toVector :: forall a v. (Vector v a, Columnable a) => Column -> Either DataFrameException (v a) Source #
Converts a column to a vector of a specific type.
This is a type-safe conversion that requires the column's element type to exactly match the requested type. You must specify the desired type via type applications.
Type Parameters
a- The element type to convert to.
v- The vector type of the result (e.g. VU.Vector or VB.Vector).
Examples
>>> toVector @Int @VU.Vector column
Right (unboxed vector of Ints)
>>> toVector @Text @VB.Vector column
Right (boxed vector of Text)
Returns
Right- The converted vector if the types match.
Left TypeMismatchException- If the column's type doesn't match the requested type.
See also
For numeric conversions with automatic type coercion, see toDoubleVector,
toFloatVector, and toIntVector.
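Because toVector reports a mismatch through Either rather than by throwing, callers can branch on the result. The following is a minimal sketch, assuming the dataframe and vector packages are installed and that Int and Double are Columnable (as the examples on this page suggest). It only uses fromList and toVector with the signatures listed in the synopsis, and it ignores the payload of the Left value.

```haskell
{-# LANGUAGE TypeApplications #-}
module Main where

import qualified Data.Vector.Unboxed as VU
import qualified DataFrame as D

main :: IO ()
main = do
  let column = D.fromList [1 :: Int, 2, 3, 4]
  -- Ask for the element type the column was built with.
  case D.toVector @Int @VU.Vector column of
    Right v -> print (VU.sum v)
    Left _ -> putStrLn "unexpected type mismatch"
  -- Ask for a different element type; the documentation says this is a Left.
  case D.toVector @Double @VU.Vector column of
    Right v -> print v
    Left _ -> putStrLn "type mismatch: the column does not hold Doubles"
```

The type applications follow the order of the forall in the signature: element type first, vector type second.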
rowValue :: Expr a -> [(Text, Any)] -> Maybe a Source #
Given a row, gets the value associated with a field.
Examples
>>> map (rowValue (F.col @Int "age")) (toRowList df)
[Just 25,Just 30, ...]
toAny :: Columnable a => a -> Any Source #
Wraps a value into an Any type. This helps us represent rows as heterogeneous lists.
toRowList :: DataFrame -> [[(Text, Any)]] Source #
Converts the entire dataframe to a list of rows.
Each row contains all columns in the dataframe, ordered by their column indices. The rows are returned in their natural order (from index 0 to n-1).
Examples
>>> toRowList df
[[("name", "Alice"), ("age", 25), ...], [("name", "Bob"), ("age", 30), ...], ...]
Performance note
This function materializes all rows into a list, which may be memory-intensive
for large dataframes. Consider using toRowVector if you need random access
or streaming operations.
toRowVector :: [Text] -> DataFrame -> Vector Row Source #
Converts the dataframe to a vector of rows with only the specified columns.
Each row will contain only the columns named in the names parameter.
This is useful when you only need a subset of columns or want to control
the column order in the resulting rows.
Parameters
names- List of column names to include in each row. The order of names determines the order of fields in the resulting rows.
df- The dataframe to convert.
Examples
>>> toRowVector ["name", "age"] df
Vector of rows with only name and age fields
>>> toRowVector [] df -- Empty column list
Vector of empty rows (one per dataframe row)
data Expr a Source #
Instances
| (IsString a, Columnable a) => IsString (Expr a) Source # | |
Defined in DataFrame.Internal.Expression Methods fromString :: String -> Expr a # | |
| (Floating a, Columnable a) => Floating (Expr a) Source # | |
| (Num a, Columnable a) => Num (Expr a) Source # | |
| (Fractional a, Columnable a) => Fractional (Expr a) Source # | |
| Show a => Show (Expr a) Source # | |
| (Eq a, Columnable a) => Eq (Expr a) Source # | |
| (Ord a, Columnable a) => Ord (Expr a) Source # | |
prettyPrint :: Expr a -> String Source #
Display operations
module DataFrame.Display
Core dataframe operations
insert Source #
Arguments
| :: (Columnable a, Foldable t) | |
| => Text | Column Name |
| -> t a | Sequence to add to dataframe |
| -> DataFrame | DataFrame to add column to |
| -> DataFrame |
Adds a foldable collection to the dataframe. If the collection has fewer elements than the
dataframe and the dataframe is not empty,
the collection is converted to type `Maybe a` and filled with Nothing to match the size of the dataframe. Similarly,
if the collection has more elements than what's currently in the dataframe, the other columns in the dataframe are
changed to `Maybe Type` and filled with Nothing.
Be careful not to insert infinite collections with this function, as that will crash the program.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> D.insert "numbers" [(1 :: Int)..10] D.empty
--------
numbers
--------
Int
--------
1
2
3
4
5
6
7
8
9
10
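The padding rules described for insert can be observed by inserting collections that are shorter and longer than the existing frame. This is a small sketch, assuming the dataframe package is installed; the column names "x", "y", and "z" are made up for illustration. It prints the resulting dimensions rather than asserting them, since the exact behaviour is defined by the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified DataFrame as D

main :: IO ()
main = do
  let base = D.insert "x" [(1 :: Int) .. 5] D.empty
      -- Shorter than the frame: per the description, "y" is padded with Nothing.
      padded = D.insert "y" [10 :: Int, 20] base
      -- Longer than the frame: per the description, "x" is padded with Nothing.
      grown = D.insert "z" [(1 :: Int) .. 8] base
  print (D.dimensions padded)
  print (D.dimensions grown)
```

If a default value is preferable to Maybe-typed columns, insertWithDefault (documented further down) pads with that value instead.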
fold :: (a -> DataFrame -> DataFrame) -> [a] -> DataFrame -> DataFrame Source #
A left fold for dataframes that takes the dataframe as the last argument. This makes it easier to chain operations.
Example
>>> df = D.fromNamedColumns [("x", D.fromList [1..100]), ("y", D.fromList [11..110])]
>>> D.fold D.dropLast [1..5] df
---------
x | y
----|----
Int | Int
----|----
1 | 11
2 | 12
3 | 13
4 | 14
5 | 15
6 | 16
7 | 17
8 | 18
9 | 19
10 | 20
11 | 21
12 | 22
13 | 23
14 | 24
15 | 25
16 | 26
17 | 27
18 | 28
19 | 29
20 | 30
Showing 20 rows out of 85
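The row count in this example (100 rows minus 1 + 2 + 3 + 4 + 5 = 85) can be reproduced with a base-only model. The sketch below is not the library's implementation: foldFrame and dropLastRows are hypothetical stand-ins that operate on plain lists, and it assumes D.fold behaves like a left fold whose step function takes the frame as its last argument.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical stand-in for D.fold: thread the "frame" through each step,
-- with the frame as the last argument of the step function.
foldFrame :: (a -> frame -> frame) -> [a] -> frame -> frame
foldFrame step xs frame = foldl' (flip step) frame xs

-- Hypothetical stand-in for D.dropLast on a list of rows.
dropLastRows :: Int -> [row] -> [row]
dropLastRows n rows = take (length rows - n) rows

main :: IO ()
main = print (length (foldFrame dropLastRows [1 .. 5] [1 .. 100 :: Int])) -- prints 85
```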
rename :: Text -> Text -> DataFrame -> DataFrame Source #
O(n) Renames a single column.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> df = D.insertVector "numbers" (V.fromList [1..10]) D.empty
>>> D.rename "numbers" "others" df
-------
others
-------
Int
-------
1
2
3
4
5
6
7
8
9
10
columnAsDoubleVector :: Expr Double -> DataFrame -> Either DataFrameException (Vector Double) Source #
Retrieves a column as an unboxed vector of Double values.
Returns Left with a DataFrameException if the column cannot be converted to doubles.
This may occur if the column contains non-numeric data.
dimensions :: DataFrame -> (Int, Int) Source #
O(1) Get DataFrame dimensions i.e. (rows, columns)
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.dimensions df
(100, 3)
columnNames :: DataFrame -> [Text] Source #
O(k) Get column names of the DataFrame in order of insertion.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.columnNames df
["a", "b", "c"]
nRows :: DataFrame -> Int Source #
O(1) Get number of rows in a dataframe.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.nRows df
100
nColumns :: DataFrame -> Int Source #
O(1) Get number of columns in a dataframe.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.nColumns df
3
insertVector Source #
Arguments
| :: Columnable a | |
| => Text | Column Name |
| -> Vector a | Vector to add to column |
| -> DataFrame | DataFrame to add column to |
| -> DataFrame |
Adds a vector to the dataframe. If the vector has fewer elements than the dataframe and the dataframe is not empty,
the vector is converted to type `Maybe a` and filled with Nothing to match the size of the dataframe. Similarly,
if the vector has more elements than what's currently in the dataframe, the other columns in the dataframe are
changed to `Maybe Type` and filled with Nothing.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> D.insertVector "numbers" (V.fromList [(1 :: Int)..10]) D.empty
--------
numbers
--------
Int
--------
1
2
3
4
5
6
7
8
9
10
insertColumn Source #
Arguments
| :: Text | Column Name |
| -> Column | Column to add |
| -> DataFrame | DataFrame to add the column to |
| -> DataFrame |
O(n) Add a column to the dataframe.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> D.insertColumn "numbers" (D.fromList [(1 :: Int)..10]) D.empty
--------
numbers
--------
Int
--------
1
2
3
4
5
6
7
8
9
10
insertVectorWithDefault Source #
Arguments
| :: Columnable a | |
| => a | Default Value |
| -> Text | Column name |
| -> Vector a | Data to add to column |
| -> DataFrame | DataFrame to add the column to |
| -> DataFrame |
Adds a vector to the dataframe and pads it with a default value if it has fewer elements than the number of rows.
Example
>>> :set -XOverloadedStrings
>>> import qualified Data.Vector as V
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("x", D.fromList [(1 :: Int)..10])]
>>> D.insertVectorWithDefault 0 "numbers" (V.fromList [(1 :: Int),2,3]) df
-------------
x | numbers
----|--------
Int | Int
----|--------
1 | 1
2 | 2
3 | 3
4 | 0
5 | 0
6 | 0
7 | 0
8 | 0
9 | 0
10 | 0
insertWithDefault Source #
Arguments
| :: (Columnable a, Foldable t) | |
| => a | Default Value |
| -> Text | Column name |
| -> t a | Data to add to column |
| -> DataFrame | DataFrame to add the column to |
| -> DataFrame |
Adds a foldable collection to the dataframe and pads it with a default value if it has fewer elements than the number of rows.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("x", D.fromList [(1 :: Int)..10])]
>>> D.insertWithDefault 0 "numbers" [(1 :: Int),2,3] df
-------------
x | numbers
----|--------
Int | Int
----|--------
1 | 1
2 | 2
3 | 3
4 | 0
5 | 0
6 | 0
7 | 0
8 | 0
9 | 0
10 | 0
insertUnboxedVector Source #
Arguments
| :: (Columnable a, Unbox a) | |
| => Text | Column Name |
| -> Vector a | Unboxed vector to add to column |
| -> DataFrame | DataFrame to add the column to |
| -> DataFrame |
O(n) Adds an unboxed vector to the dataframe.
Same as insertVector but takes an unboxed vector. A vector of numbers inserted through insertVector is converted into an unboxed vector anyway, so this function saves that extra conversion.
cloneColumn :: Text -> Text -> DataFrame -> DataFrame Source #
O(n) Clones a column and places it under a new name in the dataframe.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> df = D.insertVector "numbers" (V.fromList [1..10]) D.empty
>>> D.cloneColumn "numbers" "others" df
-----------------
numbers | others
---------|-------
Int | Int
---------|-------
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 8
9 | 9
10 | 10
renameMany :: [(Text, Text)] -> DataFrame -> DataFrame Source #
O(n) Renames many columns.
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)
>>> df
-----------------
numbers | others
---------|-------
Int | Int
---------|-------
1 | 11
2 | 12
3 | 13
4 | 14
5 | 15
6 | 16
7 | 17
8 | 18
9 | 19
10 | 20
>>> D.renameMany [("numbers", "first_10"), ("others", "next_10")] df
-------------------
first_10 | next_10
----------|--------
Int | Int
----------|--------
1 | 11
2 | 12
3 | 13
4 | 14
5 | 15
6 | 16
7 | 17
8 | 18
9 | 19
10 | 20
describeColumns :: DataFrame -> DataFrame Source #
O(n * k ^ 2) Returns the number of non-null and null values in each column of the dataframe, along with the type associated with each column.
Example
>>> import qualified Data.Vector as V
>>> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)
>>> D.describeColumns df
--------------------------------------------------------
Column Name | # Non-null Values | # Null Values | Type
-------------|-------------------|---------------|-----
Text | Int | Int | Text
-------------|-------------------|---------------|-----
others | 10 | 0 | Int
numbers | 10 | 0 | Int
fromNamedColumns :: [(Text, Column)] -> DataFrame Source #
Creates a dataframe from a list of tuples with name and column.
Example
>>> df = D.fromNamedColumns [("numbers", D.fromList [1..10]), ("others", D.fromList [11..20])]
>>> df
-----------------
numbers | others
---------|-------
Int | Int
---------|-------
1 | 11
2 | 12
3 | 13
4 | 14
5 | 15
6 | 16
7 | 17
8 | 18
9 | 19
10 | 20
fromUnnamedColumns :: [Column] -> DataFrame Source #
Create a dataframe from a list of columns. The column names are "0", "1"... etc. Useful for quick exploration but you should probably always rename the columns after or drop the ones you don't want.
Example
>>> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]
>>> df
-----------------
0 | 1
-----|----
Int | Int
-----|----
1 | 11
2 | 12
3 | 13
4 | 14
5 | 15
6 | 16
7 | 17
8 | 18
9 | 19
10 | 20
valueCounts :: Columnable a => Expr a -> DataFrame -> [(a, Int)] Source #
O(k * n) Counts the occurrences of each value in a given column.
Example
>>> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]
>>> D.valueCounts @Int "0" df
[(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1),(10,1)]
columnAsVector :: Columnable a => Expr a -> DataFrame -> Either DataFrameException (Vector a) Source #
Get a specific column as a vector.
You must specify the type via type applications.
Examples
>>> columnAsVector (F.col @Int "age") df
Right [25, 30, 35, ...]
>>> columnAsVector (F.col @Text "name") df
Right ["Alice", "Bob", "Charlie", ...]
valueProportions :: Columnable a => Expr a -> DataFrame -> [(a, Double)] Source #
O (k * n) Shows the proportions of each value in a given column.
Example
>>> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]
>>> D.valueProportions @Int "0" df
[(1,0.1),(2,0.1),(3,0.1),(4,0.1),(5,0.1),(6,0.1),(7,0.1),(8,0.1),(9,0.1),(10,0.1)]
toFloatMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Float)) Source #
Returns a dataframe as a two dimensional vector of floats.
Converts all columns in the dataframe to float vectors and transposes them into a row-major matrix representation.
This is useful for handing data over into ML systems.
Returns Left with an error if any column cannot be converted to floats.
toDoubleMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Double)) Source #
Returns a dataframe as a two dimensional vector of doubles.
Converts all columns in the dataframe to double vectors and transposes them into a row-major matrix representation.
This is useful for handing data over into ML systems.
Returns Left with an error if any column cannot be converted to doubles.
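A typical hand-off to numeric code pattern-matches on the Either and then walks the rows. The sketch below assumes the dataframe and vector packages are installed and that the outer Vector in the result is the boxed Data.Vector type; the column names "height" and "weight" are invented for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Vector as V
import qualified DataFrame as D

main :: IO ()
main = do
  let df =
        D.fromNamedColumns
          [ ("height", D.fromList [1.5 :: Double, 1.7, 1.8])
          , ("weight", D.fromList [60 :: Double, 72, 80])
          ]
  case D.toDoubleMatrix df of
    Left _ -> putStrLn "some column could not be converted to Double"
    Right rows -> do
      -- Row-major layout: the outer vector is indexed by row.
      print (V.length rows)
      V.mapM_ print rows
```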
toIntMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Int)) Source #
Returns a dataframe as a two dimensional vector of ints.
Converts all columns in the dataframe to int vectors and transposes them into a row-major matrix representation.
This is useful for handing data over into ML systems.
Returns Left with an error if any column cannot be converted to ints.
columnAsIntVector :: Expr Int -> DataFrame -> Either DataFrameException (Vector Int) Source #
Retrieves a column as an unboxed vector of Int values.
Returns Left with a DataFrameException if the column cannot be converted to ints.
This may occur if the column contains non-numeric data or values outside the Int range.
columnAsFloatVector :: Expr Float -> DataFrame -> Either DataFrameException (Vector Float) Source #
Retrieves a column as an unboxed vector of Float values.
Returns Left with a DataFrameException if the column cannot be converted to floats.
This may occur if the column contains non-numeric data.
columnAsUnboxedVector :: (Columnable a, Unbox a) => Expr a -> DataFrame -> Either DataFrameException (Vector a) Source #
columnAsList :: Columnable a => Expr a -> DataFrame -> [a] Source #
Get a specific column as a list.
You must specify the type via type applications.
Examples
>>> columnAsList @Int "age" df
[25, 30, 35, ...]
>>> columnAsList @Text "name" df
["Alice", "Bob", "Charlie", ...]
Throws
error- if the column type doesn't match the requested type
Types
schemaType :: Columnable a => SchemaType Source #
Construct a SchemaType for the given a.
Examples
>>> :set -XTypeApplications
>>> schemaType @T.Text == schemaType @T.Text
True
>>> show (schemaType @Double)
"Double"
I/O
data HeaderSpec Source #
STANDARD CONFIG TYPES
Constructors
| NoHeader | |
| UseFirstRow | |
| ProvideNames [Text] |
Instances
| Show HeaderSpec Source # | |
Defined in DataFrame.IO.CSV Methods showsPrec :: Int -> HeaderSpec -> ShowS # show :: HeaderSpec -> String # showList :: [HeaderSpec] -> ShowS # | |
| Eq HeaderSpec Source # | |
Defined in DataFrame.IO.CSV | |
data ReadOptions Source #
CSV read parameters.
Constructors
| ReadOptions | |
Fields
| |
data TypeSpec Source #
Constructors
| InferFromSample Int | |
| SpecifyTypes [SchemaType] | |
| NoInference |
readCsv :: FilePath -> IO DataFrame Source #
Read CSV file from path and load it into a dataframe.
Example
ghci> D.readCsv "./data/taxi.csv"
readCsvWithOpts :: ReadOptions -> FilePath -> IO DataFrame Source #
Read CSV file from path and load it into a dataframe, using the given read options.
Example
ghci> D.readCsvWithOpts (D.defaultReadOptions { dateFormat = "%d%-m%-Y" }) "./data/taxi.csv"
readSeparated :: ReadOptions -> FilePath -> IO DataFrame Source #
Read text file with specified delimiter into a dataframe.
Example
ghci> D.readSeparated (D.defaultReadOptions { columnSeparator = ';' }) "./data/taxi.txt"
readTsv :: FilePath -> IO DataFrame Source #
Read TSV (tab separated) file from path and load it into a dataframe.
Example
ghci> D.readTsv "./data/taxi.tsv"
module DataFrame.IO.Unstable.CSV
readParquet :: FilePath -> IO DataFrame Source #
Read a parquet file from path and load it into a dataframe.
Example
ghci> D.readParquet "./data/mtcars.parquet"
Operations
sample :: RandomGen g => g -> Double -> DataFrame -> DataFrame Source #
Sample a dataframe. The double parameter must be between 0 and 1 (inclusive).
Example
ghci> import System.Random
ghci> D.sample (mkStdGen 137) 0.1 df
filter Source #
Arguments
| :: Columnable a | |
| => Expr a | Column to filter by |
| -> (a -> Bool) | Filter condition |
| -> DataFrame | Dataframe to filter |
| -> DataFrame |
O(n * k) Filter rows by a given condition.
filter "x" even df
select :: [Text] -> DataFrame -> DataFrame Source #
O(n) Selects a number of columns in a given dataframe.
select ["name", "age"] df
selectBy :: [SelectionCriteria] -> DataFrame -> DataFrame Source #
O(n) Selects columns using the given selection criteria.
data SelectionCriteria Source #
byIndexRange :: (Int, Int) -> SelectionCriteria Source #
Criteria for selecting columns whose indices are in the given (inclusive) range.
selectBy [byIndexRange (0, 5)] df
byName :: Text -> SelectionCriteria Source #
Criteria for selecting a column by name.
selectBy [byName "Age"] df
equivalent to:
select ["Age"] df
byNameProperty :: (Text -> Bool) -> SelectionCriteria Source #
Criteria for selecting columns whose name satisfies given predicate.
selectBy [byNameProperty (T.isPrefixOf "weight")] df
byNameRange :: (Text, Text) -> SelectionCriteria Source #
Criteria for selecting columns whose names are in the given lexicographic range (inclusive).
selectBy [byNameRange ("a", "c")] df
byProperty :: (Column -> Bool) -> SelectionCriteria Source #
Criteria for selecting columns whose property satisfies given predicate.
selectBy [byProperty isNumeric] df
cube :: (Int, Int) -> DataFrame -> DataFrame Source #
O(k) cuts the dataframe in a cube of size (a, b) where a is the length and b is the width.
cube (10, 5) df
filterAllJust :: DataFrame -> DataFrame Source #
O(n * k) removes all rows with Nothing from the dataframe.
filterAllJust df
filterAllNothing :: DataFrame -> DataFrame Source #
O(n * k) keeps any row with a null value.
filterAllNothing df
filterBy :: Columnable a => (a -> Bool) -> Expr a -> DataFrame -> DataFrame Source #
O(k) a version of filter where the predicate comes first.
filterBy even "x" df
filterJust :: Text -> DataFrame -> DataFrame Source #
O(k) removes all rows with Nothing in a given column from the dataframe.
filterJust "col" df
filterNothing :: Text -> DataFrame -> DataFrame Source #
O(k) returns all rows with Nothing in a given column.
filterNothing "col" df
filterWhere :: Expr Bool -> DataFrame -> DataFrame Source #
O(k) filters the dataframe with a boolean expression.
filterWhere (F.col @Int "x" + F.col "y" F.> 5) df
kFolds :: RandomGen g => g -> Int -> DataFrame -> [DataFrame] Source #
Creates n folds of a dataframe.
Example
ghci> import System.Random
ghci> D.kFolds (mkStdGen 137) 5 df
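To make the idea of a fold concrete, here is a minimal sketch over a plain list. It is not the library's implementation: `roundRobinFolds` is a hypothetical helper, and it leaves out the random shuffling that the `RandomGen` argument of `kFolds` suggests, so its output is deterministic.

```haskell
module Main where

-- | Deal rows round-robin into k folds; fold sizes differ by at most one.
-- Shuffling is deliberately left out (assumption: the real kFolds
-- randomises row order using its generator argument).
roundRobinFolds :: Int -> [a] -> [[a]]
roundRobinFolds k rows
  | k <= 0 = []
  | otherwise =
      [ [row | (i, row) <- indexed, i `mod` k == fold]
      | fold <- [0 .. k - 1]
      ]
  where
    -- Pair every row with its position so it can be assigned to a fold.
    indexed = zip [0 :: Int ..] rows

main :: IO ()
main = print (roundRobinFolds 3 "abcdefg")
-- prints ["adg","be","cf"]
```

Every row lands in exactly one fold, which is the property cross-validation relies on.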
randomSplit :: RandomGen g => g -> Double -> DataFrame -> (DataFrame, DataFrame) Source #
Split a dataset into two. The first element of the tuple receives a sample containing a fraction p of the rows (0 <= p <= 1) and the second receives the remaining (1 - p). This is useful for creating train and test splits.
Example
ghci> import System.Random
ghci> D.randomSplit (mkStdGen 137) 0.9 df
groupBy :: [Text] -> DataFrame -> GroupedDataFrame Source #
O(k * n) groups the dataframe by the given columns, aggregating the remaining columns into vectors that should be reduced later.
aggregate :: [NamedExpr] -> GroupedDataFrame -> DataFrame Source #
Aggregate a grouped dataframe using the expressions given. All ungrouped columns will be dropped.
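The two-step flow of grouping and then reducing can be illustrated on plain key/value pairs. This is only a sketch of the concept using `base`; `groupAndReduce` is a hypothetical name and says nothing about how `GroupedDataFrame` is represented internally.

```haskell
module Main where

import Data.Function (on)
import Data.List (groupBy, sortOn)

-- | Group (key, value) rows by key, then reduce each group's values.
-- Rows are sorted by key first so that equal keys become adjacent.
groupAndReduce :: Ord k => ([v] -> r) -> [(k, v)] -> [(k, r)]
groupAndReduce reduce rows =
  [ (k, reduce (v : map snd rest))
  | ((k, v) : rest) <- groupBy ((==) `on` fst) (sortOn fst rows)
  ]

main :: IO ()
main =
  print (groupAndReduce sum [("INLAND", 2), ("ISLAND", 5), ("INLAND", 3 :: Int)])
-- prints [("INLAND",5),("ISLAND",5)]
```

The grouping step only collects values per key; the reducer (here `sum`) decides what each group collapses to, which mirrors the split between `groupBy` and `aggregate`.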
sortBy :: [SortOrder] -> DataFrame -> DataFrame Source #
O(k log n) Sorts the rows of the dataframe by the given columns.
sortBy Ascending ["Age"] df
data SortOrder Source #
Sort order taken as a parameter by the sortBy function.
module DataFrame.Operations.Merge
module DataFrame.Operations.Join
sum :: (Columnable a, Num a) => Expr a -> DataFrame -> a Source #
Calculates the sum of a given column as a standalone value.
correlation :: Text -> Text -> DataFrame -> Maybe Double Source #
Calculates Pearson's correlation coefficient between two given columns as a standalone value.
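For reference, Pearson's coefficient can be sketched over two plain lists as below. This is an illustration of the formula, not the library's code; `pearson` is a hypothetical helper, and the cases in which it returns `Nothing` are assumptions made for this sketch rather than documented behaviour of `correlation`.

```haskell
module Main where

-- | Pearson's r for two equally long samples.
-- Returns Nothing when the lengths differ, there are fewer than two
-- points, or either sample has zero variance (r is undefined then).
pearson :: [Double] -> [Double] -> Maybe Double
pearson xs ys
  | length xs /= length ys = Nothing
  | n < 2 = Nothing
  | sxx == 0 || syy == 0 = Nothing
  | otherwise = Just (sxy / sqrt (sxx * syy))
  where
    n = length xs
    mean zs = sum zs / fromIntegral (length zs)
    -- Deviations of each sample from its own mean.
    dxs = map (subtract (mean xs)) xs
    dys = map (subtract (mean ys)) ys
    sxy = sum (zipWith (*) dxs dys)
    sxx = sum (map (^ (2 :: Int)) dxs)
    syy = sum (map (^ (2 :: Int)) dys)

main :: IO ()
main = do
  print (pearson [1, 2, 3, 4] [2, 4, 6, 8]) -- prints Just 1.0
  print (pearson [1, 2, 3] [5, 5, 5])       -- prints Nothing
```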
frequencies :: Text -> DataFrame -> DataFrame Source #
Show a frequency table for a categorical feature.
Examples:
ghci> df <- D.readCsv "./data/housing.csv"
ghci> D.frequencies "ocean_proximity" df
---------------------------------------------------------------------
Statistic | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN
----------------|-----------|--------|--------|----------|-----------
Text | Any | Any | Any | Any | Any
----------------|-----------|--------|--------|----------|-----------
Count | 9136 | 6551 | 5 | 2290 | 2658
Percentage (%) | 44.26% | 31.74% | 0.02% | 11.09% | 12.88%
genericPercentile :: (Columnable a, Ord a) => Int -> Expr a -> DataFrame -> a Source #
Calculates the nth percentile of a given column as a standalone value.
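Because only `Ord` is required, a percentile of this kind has to pick an existing element rather than interpolate. One common definition that fits is the nearest-rank method, sketched below on a plain list. Whether `genericPercentile` uses exactly this rule is an assumption; `nearestRank` is a hypothetical helper.

```haskell
module Main where

import Data.List (sort)

-- | Nearest-rank percentile: the element at rank ceiling (p * n / 100)
-- of the sorted data. Nothing for empty input or p outside 0..100.
nearestRank :: Ord a => Int -> [a] -> Maybe a
nearestRank p xs
  | null xs || p < 0 || p > 100 = Nothing
  | otherwise = Just (sort xs !! (rank - 1))
  where
    n = length xs
    -- Integer ceiling of p * n / 100, clamped so that p = 0 maps to rank 1.
    rank = max 1 ((p * n + 99) `div` 100)

main :: IO ()
main = do
  print (nearestRank 50 [50, 15, 40, 20, 35 :: Int]) -- prints Just 35
  print (nearestRank 30 [15, 20, 35, 40, 50 :: Int]) -- prints Just 20
```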
imputeWith :: Columnable b => (Expr b -> Expr b) -> Expr (Maybe b) -> DataFrame -> DataFrame Source #
O(n) Impute missing values in a column using a derived scalar.
Given
- an expression f :: Expr b -> Expr b that, when interpreted over a non-nullable column, produces the same value in every row (for example a mean, median, or other aggregate), and
- a nullable column Expr (Maybe b)
this function:
- Drops all Nothing values from the target column.
- Interprets f on the remaining non-null values.
- Checks that the resulting column contains a single repeated value.
- Uses that value to impute all Nothings in the original column.
Throws
- DataFrameException if the column does not exist, is empty,
Example
>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified DataFrame.Functions as F
>>> let df =
... D.fromNamedColumns
... [ ("age", D.fromList [Just 10, Nothing, Just 20 :: Maybe Int]) ]
>>>
>>> -- Impute missing ages with the mean of the observed ages
>>> D.imputeWith F.mean "age" df
-- age
-- ----
-- 10
-- 15
-- 20
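The steps listed above can be approximated on a plain list as follows. This is a sketch under simplifying assumptions, not the library's implementation: `imputeWithList` and `meanOf` are hypothetical helpers, the summary is an ordinary list function rather than an `Expr`, and the "single repeated value" check is unnecessary here because the summary already returns a scalar.

```haskell
module Main where

import Data.Maybe (catMaybes, fromMaybe)

-- | Replace every Nothing with a scalar derived from the observed values.
-- Returns Nothing when there are no observed values to derive it from.
imputeWithList :: ([Double] -> Double) -> [Maybe Double] -> Maybe [Double]
imputeWithList summarize column =
  case catMaybes column of        -- step 1: drop the Nothing values
    [] -> Nothing
    observed ->
      let fill = summarize observed             -- step 2: derive the scalar
       in Just (map (fromMaybe fill) column)    -- step 4: fill the gaps

-- | Arithmetic mean of a non-empty list.
meanOf :: [Double] -> Double
meanOf xs = sum xs / fromIntegral (length xs)

main :: IO ()
main = print (imputeWithList meanOf [Just 10, Nothing, Just 20])
-- prints Just [10.0,15.0,20.0]
```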
interQuartileRange :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double Source #
Calculates the inter-quartile range of a given column as a standalone value.
mean :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double Source #
Calculates the mean of a given column as a standalone value.
median :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double Source #
Calculates the median of a given column as a standalone value.
medianMaybe :: (Columnable a, Real a) => Expr (Maybe a) -> DataFrame -> Double Source #
Calculates the median of a given column (containing optional values) as a standalone value.
percentile :: (Columnable a, Real a, Unbox a) => Int -> Expr a -> DataFrame -> Double Source #
Calculates the nth percentile of a given column as a standalone value.
skewness :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double Source #
Calculates the skewness of a given column as a standalone value.
standardDeviation :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double Source #
Calculates the standard deviation of a given column as a standalone value.
variance :: (Columnable a, Real a, Unbox a) => Expr a -> DataFrame -> Double Source #
Calculates the variance of a given column as a standalone value.
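As a reminder of what these summary statistics compute, here is a small sketch over a plain list. The documentation above does not say whether `variance` and `standardDeviation` divide by n (population) or n - 1 (sample), so both variants are shown; all names in the block are hypothetical helpers, not library functions.

```haskell
module Main where

-- | Arithmetic mean of a non-empty list.
meanOf :: [Double] -> Double
meanOf xs = sum xs / fromIntegral (length xs)

-- | Sum of squared deviations from the mean.
sumSqDev :: [Double] -> Double
sumSqDev xs = sum [(x - m) ^ (2 :: Int) | x <- xs]
  where
    m = meanOf xs

-- | Variance dividing by n (population) and by n - 1 (sample).
populationVariance, sampleVariance :: [Double] -> Double
populationVariance xs = sumSqDev xs / fromIntegral (length xs)
sampleVariance xs = sumSqDev xs / fromIntegral (length xs - 1)

main :: IO ()
main = do
  let xs = [2, 4, 4, 4, 5, 5, 7, 9]
  print (meanOf xs)                    -- prints 5.0
  print (populationVariance xs)        -- prints 4.0
  print (sqrt (populationVariance xs)) -- prints 2.0
```

The standard deviation is the square root of whichever variance is chosen, so the two conventions differ only by the divisor.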
Errors
module DataFrame.Errors