Useful helper functions from the fast.ai library for processing and visualising structured data.

draw_tree

draw_tree(t, df, size=10, ratio=0.6, precision=0)

Draws a representation of the tree t (for example, a single estimator from a random forest) in IPython.
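A minimal sketch of how one tree from a forest might be turned into a drawable graph, assuming scikit-learn is available. It is not the library's own code: the toy data and variable names are made up for illustration, and only the DOT text is produced here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Toy data (hypothetical) so the example is self-contained.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y = 2 * X["x1"] + X["x2"]

forest = RandomForestRegressor(n_estimators=3, max_depth=2, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot = export_graphviz(
    forest.estimators_[0],          # one tree out of the forest
    out_file=None,
    feature_names=list(X.columns),
    filled=True,
    precision=2,
)
```

In a notebook, the separate graphviz package can render this string, for example with graphviz.Source(dot).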

get_sample

get_sample(df, n)

Gets a random sample of n rows from df, without replacement.
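Sampling without replacement can be sketched with a random permutation of row positions. The helper name sample_rows and its seed argument are hypothetical and are not part of the library.

```python
import numpy as np
import pandas as pd

def sample_rows(df: pd.DataFrame, n: int, seed=None) -> pd.DataFrame:
    """Return n distinct rows of df, chosen at random, kept in original order."""
    rng = np.random.default_rng(seed)
    # A permutation never repeats a position, so no row is drawn twice.
    idxs = np.sort(rng.permutation(len(df))[:n])
    return df.iloc[idxs].copy()

df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})
sample = sample_rows(df, 3, seed=0)
```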

add_datepart

add_datepart(df, fldnames, drop=True, time=False, errors='raise')

add_datepart converts a datetime64 column of df into many columns containing the information from the date. The changes are applied in place.
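The idea can be sketched with the pandas .dt accessor. The helper expand_date, the selection of date parts, and the column naming scheme below are assumptions for illustration; the library's own choices may differ.

```python
import pandas as pd

def expand_date(df: pd.DataFrame, col: str, drop: bool = True) -> None:
    """Add date-derived columns for `col` to df in place (hypothetical helper)."""
    dates = pd.to_datetime(df[col])
    for part in ("year", "month", "day", "dayofweek", "dayofyear"):
        df[f"{col}_{part}"] = getattr(dates.dt, part)
    df[f"{col}_is_month_end"] = dates.dt.is_month_end
    if drop:
        # Remove the original date column once its parts are extracted.
        df.drop(columns=col, inplace=True)

df = pd.DataFrame({"saledate": ["2019-01-31", "2020-02-15"]})
expand_date(df, "saledate")
```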

is_date

is_date(x)

train_cats

train_cats(df)

Changes any columns of strings in a pandas DataFrame to columns of categorical values. The changes are applied in place.
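A minimal sketch of this conversion in plain pandas. The helper name strings_to_categories is hypothetical, and making the categories ordered is an assumption rather than something the description states.

```python
import pandas as pd

def strings_to_categories(df: pd.DataFrame) -> None:
    """Convert every string column of df to an ordered categorical, in place."""
    for name in list(df.columns):
        if pd.api.types.is_string_dtype(df[name]):
            df[name] = df[name].astype("category").cat.as_ordered()

df = pd.DataFrame({"size": ["S", "L", "M", "S"], "price": [1, 2, 3, 4]})
strings_to_categories(df)
```

Numeric columns such as price are left untouched.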

apply_cats

apply_cats(df, trn)

Changes any columns of strings in df into categorical variables using trn as a template for the category codes.
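Reusing the training categories keeps the integer codes consistent between data sets. The sketch below assumes a hypothetical helper reuse_categories; values that never appeared in trn become missing.

```python
import pandas as pd

def reuse_categories(df: pd.DataFrame, trn: pd.DataFrame) -> None:
    """Give df's columns the same categories as the matching columns of trn."""
    for name in list(df.columns):
        if name in trn.columns and isinstance(trn[name].dtype, pd.CategoricalDtype):
            df[name] = pd.Categorical(
                df[name], categories=trn[name].cat.categories, ordered=True
            )

trn = pd.DataFrame(
    {"size": pd.Categorical(["S", "M", "L"], categories=["L", "M", "S"], ordered=True)}
)
valid = pd.DataFrame({"size": ["M", "XL", "S"]})
reuse_categories(valid, trn)
codes = valid["size"].cat.codes.tolist()
```

Here "XL" is not among the training categories, so it gets the code -1 that pandas uses for missing values.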

fix_missing

fix_missing(df, col, name, na_dict)

Fill missing data in a column of df with the median, and add a {name}_na column which specifies if the data was missing.
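A sketch of median filling with an indicator column. The helper fill_with_median takes only a column name, unlike the signature above, and recording the fill value in na_dict for later reuse is an assumption suggested by the parameter name.

```python
import numpy as np
import pandas as pd

def fill_with_median(df: pd.DataFrame, name: str, na_dict: dict) -> dict:
    """Fill NaNs in a numeric column with its median and flag the filled rows."""
    col = df[name]
    if pd.api.types.is_numeric_dtype(col) and (col.isna().any() or name in na_dict):
        df[name + "_na"] = col.isna()
        # Prefer a previously recorded fill value, else use this column's median.
        filler = na_dict.get(name, col.median())
        df[name] = col.fillna(filler)
        na_dict[name] = filler
    return na_dict

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 10.0]})
na_dict = fill_with_median(df, "x", {})
```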

numericalize

numericalize(df, col, name, max_n_cat)

Changes the column col from a categorical type to its integer codes.
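A sketch using the pandas .cat.codes accessor. The helper to_codes is hypothetical. Shifting the codes by one so that missing values map to 0, and skipping columns with at most max_n_cat categories, are both assumptions about the intended behaviour.

```python
import pandas as pd

def to_codes(df: pd.DataFrame, name: str, max_n_cat=None) -> None:
    """Replace a categorical column by its integer codes, in place."""
    col = df[name]
    if isinstance(col.dtype, pd.CategoricalDtype) and (
        max_n_cat is None or len(col.cat.categories) > max_n_cat
    ):
        # pandas codes missing values as -1; adding 1 maps them to 0.
        df[name] = col.cat.codes + 1

df = pd.DataFrame(
    {"size": pd.Categorical(["S", None, "L", "M"], categories=["L", "M", "S"])}
)
to_codes(df, "size")
```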

scale_vars

scale_vars(df, mapper)

Standardize numerical features by removing the mean and scaling to unit variance.
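Standardization can be sketched directly in pandas. The helper standardize and its stats dictionary are hypothetical. Here stats stands in for the role a fitted mapper presumably plays, which is to let the training-set means and standard deviations be reused on other data.

```python
import pandas as pd

def standardize(df: pd.DataFrame, stats=None) -> dict:
    """Scale numeric columns of df in place to zero mean and unit variance."""
    if stats is None:
        # Population standard deviation (ddof=0); constant columns are not handled.
        stats = {
            name: (df[name].mean(), df[name].std(ddof=0))
            for name in df.columns
            if pd.api.types.is_numeric_dtype(df[name])
        }
    for name, (mean, std) in stats.items():
        df[name] = (df[name] - mean) / std
    return stats

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
stats = standardize(train)

# Apply the training statistics to new data instead of recomputing them.
valid = pd.DataFrame({"x": [2.5]})
standardize(valid, stats)
```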

proc_df

proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None, preproc_fn=None, max_n_cat=None, subset=None, mapper=None)

proc_df takes a data frame df, splits off the response variable, and converts df into an entirely numeric dataframe. For each column of df that is in neither skip_flds nor ignore_flds, NA values are replaced by the median value of the column.
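The overall flow can be sketched as a single pass over the columns. The helper to_numeric_frame is hypothetical and covers only the response split, median filling, and category codes. The other options in the signature are left out, and returning an (X, y) pair is an assumption.

```python
import numpy as np
import pandas as pd

def to_numeric_frame(df: pd.DataFrame, y_fld: str):
    """Split off y_fld and return an all-numeric copy of the remaining columns."""
    df = df.copy()
    y = df.pop(y_fld).to_numpy()
    for name in list(df.columns):
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            if col.isna().any():
                df[name + "_na"] = col.isna()
                df[name] = col.fillna(col.median())
        else:
            # Non-numeric columns become integer codes; missing values map to 0.
            df[name] = pd.Categorical(col).codes + 1
    return df, y

raw = pd.DataFrame({
    "size": ["S", "L", None, "M"],
    "age": [1.0, np.nan, 3.0, 5.0],
    "price": [10, 20, 30, 40],
})
X, y = to_numeric_frame(raw, "price")
```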

rf_feat_importance

rf_feat_importance(m, df)

Create a pandas.DataFrame of feature importances.
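A minimal sketch built on the feature_importances_ attribute of a fitted scikit-learn forest. The helper name, the cols and imp column names, the descending sort, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def feature_importance_table(m, df: pd.DataFrame) -> pd.DataFrame:
    """Pair each column of df with the model's importance, largest first."""
    return pd.DataFrame(
        {"cols": df.columns, "imp": m.feature_importances_}
    ).sort_values("imp", ascending=False)

# Toy data (hypothetical): the target depends on "signal" but not on "noise".
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = 3 * X["signal"] + 0.1 * rng.normal(size=200)

m = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
fi = feature_importance_table(m, X)
```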