Useful helper functions from the fast.ai library for processing and visualising structured data.
draw_tree(t, df, size=10, ratio=0.6, precision=0)
Draws a representation of a decision tree t (for example, a single estimator taken from a random forest) in IPython.
get_sample(df, n)
Gets a random sample of n rows from df, without replacement.
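Sampling without replacement can be sketched with NumPy and pandas as below. This is an illustration and not the library's code: the name get_sample_sketch, the seed parameter, and the choice to keep the sampled rows in their original order are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def get_sample_sketch(df, n, seed=None):
    """Return n distinct rows of df (hypothetical stand-in for get_sample)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row position is drawn twice
    idxs = np.sort(rng.choice(len(df), size=n, replace=False))
    # copy() so later edits to the sample do not touch the original frame
    return df.iloc[idxs].copy()

df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})
sample = get_sample_sketch(df, 3, seed=0)
```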
add_datepart(df, fldnames, drop=True, time=False, errors='raise')
add_datepart converts a datetime64 column of df into several columns, each holding
one component of the date. The changes are applied in place.
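The idea of expanding one date column into several can be sketched with the pandas .dt accessor. The helper name add_datepart_sketch, the subset of date parts, and the {column}_{part} naming scheme are assumptions for this example; the library's own column names and list of parts may differ.

```python
import pandas as pd

def add_datepart_sketch(df, fldname, drop=True):
    """Expand one datetime64 column into several columns, in place (illustrative only)."""
    fld = df[fldname]
    # Each entry is an attribute of the pandas .dt accessor
    for part in ("year", "month", "day", "dayofweek", "is_month_end"):
        df[f"{fldname}_{part}"] = getattr(fld.dt, part)
    if drop:
        # Remove the original datetime column once its parts are extracted
        df.drop(columns=fldname, inplace=True)

df = pd.DataFrame({"sold": pd.to_datetime(["2019-12-31", "2020-02-29"])})
add_datepart_sketch(df, "sold")
```

After the call, df no longer contains sold and instead has sold_year, sold_month, sold_day, sold_dayofweek (Monday is 0) and sold_is_month_end.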
train_cats(df)
Changes any columns of strings in a pandas DataFrame to columns of
categorical values. The changes are applied in place.
apply_cats(df, trn)
Changes any columns of strings in df into categorical variables using trn as
a template for the category codes.
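The relationship between train_cats and apply_cats can be illustrated with plain pandas: categories are learned on one frame and then reused on another so that the integer codes agree. The two helpers below are hypothetical sketches, not the library's implementation, and they assume that values unseen in trn should become missing.

```python
import pandas as pd

def train_cats_sketch(df):
    """Turn every string column of df into an ordered categorical, in place."""
    for name in list(df.columns):
        if pd.api.types.is_string_dtype(df[name]):
            df[name] = df[name].astype("category").cat.as_ordered()

def apply_cats_sketch(df, trn):
    """Re-encode df's columns with the categories already learned on trn."""
    for name in list(df.columns):
        if name in trn.columns and isinstance(trn[name].dtype, pd.CategoricalDtype):
            # Values not present in trn's categories become NaN (code -1)
            df[name] = pd.Categorical(
                df[name], categories=trn[name].cat.categories, ordered=True
            )

trn = pd.DataFrame({"size": ["S", "M", "L", "M"]})
val = pd.DataFrame({"size": ["L", "S", "XL"]})
train_cats_sketch(trn)
apply_cats_sketch(val, trn)
```

Here "XL" never appears in trn, so it is treated as missing in val, while "L" and "S" receive the same codes in both frames.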
fix_missing(df, col, name, na_dict)
Fills missing data in a column of df with the median, and adds a {name}_na column
that flags which values were missing.
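A minimal sketch of median filling with an indicator column follows. It assumes that na_dict maps column names to the fill values used, so the same values could be reused on another dataset, and that non-numeric columns are left untouched; both points are assumptions of this example rather than statements about the library.

```python
import pandas as pd

def fix_missing_sketch(df, col, name, na_dict):
    """Fill NaNs in a numeric column with its median and record which rows were filled."""
    if pd.api.types.is_numeric_dtype(col) and (col.isna().any() or name in na_dict):
        df[name + "_na"] = col.isna()
        # Reuse a previously stored fill value if there is one (assumed behaviour)
        filler = na_dict.get(name, col.median())
        df[name] = col.fillna(filler)
        na_dict[name] = filler
    return na_dict

df = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["a", None, "c"]})
na_dict = fix_missing_sketch(df, df["age"], "age", {})
na_dict = fix_missing_sketch(df, df["city"], "city", na_dict)  # not numeric, so skipped
```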
numericalize(df, col, name, max_n_cat)
Changes the column col from a categorical type to its integer codes.
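Replacing a categorical column by its codes might look like the sketch below. Two details are assumptions of this example: the codes are shifted by one so that missing values map to 0 instead of -1, and max_n_cat is read as a threshold at or below which a column is left categorical.

```python
import pandas as pd

def numericalize_sketch(df, col, name, max_n_cat=None):
    """Replace a categorical column with integer codes (illustrative only)."""
    is_cat = isinstance(col.dtype, pd.CategoricalDtype)
    if is_cat and (max_n_cat is None or len(col.cat.categories) > max_n_cat):
        # pandas uses -1 for missing values; adding 1 moves missing to 0
        df[name] = col.cat.codes + 1

df = pd.DataFrame(
    {"size": pd.Categorical(["S", None, "L"], categories=["L", "M", "S"])}
)
numericalize_sketch(df, df["size"], "size")
```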
scale_vars(df, mapper)
Standardize numerical features by removing the mean and scaling to unit variance.
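Standardization itself is just (x - mean) / std per column. The sketch below does this in plain pandas and returns the statistics it used; that returned dictionary is a hypothetical stand-in for the role mapper appears to play, namely carrying the fitted scaling from one dataset to another. It does not guard against columns with zero variance.

```python
import pandas as pd

def scale_vars_sketch(df, stats=None):
    """Standardize numeric columns in place and return the (mean, std) pairs used."""
    if stats is None:
        # Population standard deviation (ddof=0), so scaled columns have unit variance
        stats = {
            name: (df[name].mean(), df[name].std(ddof=0))
            for name in df.columns
            if pd.api.types.is_numeric_dtype(df[name])
        }
    for name, (mean, std) in stats.items():
        df[name] = (df[name] - mean) / std
    return stats

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "label": ["a", "b", "c"]})
stats = scale_vars_sketch(df)
```

Passing the returned stats to a second call would apply the first dataset's means and standard deviations to another frame.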
proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None, preproc_fn=None, max_n_cat=None, subset=None, mapper=None)
proc_df takes a data frame df, splits off the response variable, and
changes df into an entirely numeric dataframe. For each column of df
that is in neither skip_flds nor ignore_flds, NA values are replaced by the
median value of the column.
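The overall flow described here (split off the response, fill missing numeric values, encode everything else as integers) can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions: it ignores skip_flds, ignore_flds, scaling, subsetting and the other options, and the name proc_df_sketch is hypothetical.

```python
import pandas as pd

def proc_df_sketch(df, y_fld):
    """Split off the response and make the remaining columns numeric (simplified)."""
    df = df.copy()                      # leave the caller's frame untouched
    y = df.pop(y_fld).to_numpy()        # response variable as an array
    for name in list(df.columns):
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            if col.isna().any():
                df[name + "_na"] = col.isna()
                df[name] = col.fillna(col.median())
        else:
            # Non-numeric columns become integer codes; 0 is reserved for missing
            df[name] = col.astype("category").cat.codes + 1
    return df, y

raw = pd.DataFrame(
    {"price": [10, 20, 30], "rooms": [1.0, None, 3.0], "city": ["b", "a", "b"]}
)
X, y = proc_df_sketch(raw, "price")
```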
rf_feat_importance(m, df)
Create a pandas.DataFrame of feature importances.
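A table of feature importances can be built from any fitted model that exposes a feature_importances_ attribute, as scikit-learn's random forests do. In the sketch below the column names cols and imp and the descending sort are assumptions, and a SimpleNamespace stands in for a trained model so the example runs without scikit-learn.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

def rf_feat_importance_sketch(m, df):
    """Pair each column of df with the model's importance score, highest first."""
    return pd.DataFrame(
        {"cols": df.columns, "imp": m.feature_importances_}
    ).sort_values("imp", ascending=False)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
# Stand-in for a fitted model; a real one would come from e.g. RandomForestRegressor.fit
fake_model = SimpleNamespace(feature_importances_=np.array([0.2, 0.7, 0.1]))
fi = rf_feat_importance_sketch(fake_model, df)
```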