3 Python ML libraries to get you up and running faster


Background

When you’re starting out on an ML problem, I’ve always advocated getting your first model up and running as quickly as possible. That way you won’t instantly go down the rabbit hole of feature engineering (which, given any amount of time, will most likely consume all of it). Most importantly, you will get your first benchmark score (say, AUC ROC) and will instantly know where to move from there. Who knows, maybe the first stab at the problem is good enough!

Here are some of the tools and tricks I’ve discovered which you can start using in your workflow to get going on problems as quickly as possible.

Dataset

Given the popularity of mature, batteries-included web frameworks such as Ruby on Rails and Django, ORM mappers have become a given in a lot of people’s toolbelts. While certainly useful in a lot of cases, I’ve noticed that this can sometimes lead to people being scared of plain ol’ SQL. So, when faced with a new database of structured data, some will try to map it to an ORM before being able to use it. When trying to get up and running as quickly as possible, setting up an ORM schema can be overkill and a bit of a time waste. Even though SQLAlchemy provides a really simple way to automap your DB schema, sometimes you might just want to write SQL and get your data. Perhaps you already have queries pre-written, or perhaps you have not yet learned the ins and outs of SQLAlchemy. dataset is a library which allows you to do precisely that!

Let’s run through a simple use case. First, connect to your DB

import dataset
import pandas as pd
from stuf import stuf  # dict subclass with attribute access

# connect, return rows as objects with attributes
db = dataset.connect('postgresql://...', row_type=stuf)

rows = db.query('SELECT country, COUNT(*) AS c '
                'FROM user GROUP BY country')

# get data into pandas, that's where the fun begins!
rows_df = pd.DataFrame.from_records(rows)

And just like that, in 3 commands, you have data from your query available in pandas!

I’ve only scratched the surface of what this library allows you to do. Check out simple row creation from dicts (https://dataset.readthedocs.io/en/latest/quickstart.html#storing-data), support for atomic transactions via begin(), commit() and rollback() (https://dataset.readthedocs.io/en/latest/quickstart.html#using-transactions), and support for upserts!
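To give a flavour of those features, here is a minimal sketch (db is the connection from above; the user table’s columns are made up for illustration):

# write rows straight from dicts; dataset creates missing columns on the fly
table = db['user']
table.insert(dict(name='Ada', country='GB'))

# upsert: update the row matching 'name', or insert it if it doesn't exist yet
table.upsert(dict(name='Ada', country='UK'), ['name'])

# atomic transaction: commit both inserts, or roll back if anything fails
db.begin()
try:
    table.insert(dict(name='Grace', country='US'))
    table.insert(dict(name='Alan', country='GB'))
    db.commit()
except Exception:
    db.rollback()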

And, since it’s built on top of SQLAlchemy, it will seamlessly work with all major databases, such as SQLite, PostgreSQL and MySQL.
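Switching backends is just a different connection URL; for example, a quick sketch pointing the same code at a hypothetical local SQLite file:

# the same queries work against a local SQLite file, handy for quick experiments
db = dataset.connect('sqlite:///local_copy.db', row_type=stuf)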

Missing data

One highly useful thing to know about your dataset is how much missing data you have. In the previous section we got the data from a DB into pandas. Now that you have your data in a DataFrame, you can certainly call df['column_to_check'].isnull().value_counts() to see how many rows have missing values. However, what you really want to see is those stats for your whole dataset, and ideally visualised in a useful fashion.
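If you just want the raw numbers first, here is a plain-pandas sketch (assuming your data is in a DataFrame called training_set_df, the same name used below):

# fraction of missing values per column, worst offenders first
missing_share = training_set_df.isnull().mean().sort_values(ascending=False)
print(missing_share)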

missingno provides a dead-simple way to visualise a sample of your data. Here’s the simplest use case:

import missingno as msno

training_set_df = load_training_set()  # fake loading function
msno.matrix(training_set_df.sample(250))  # 250 is the sample size

This is what the result looks like: a nullity matrix where white gaps mark the missing values in each column.

This can be very useful to instantly see which columns (and thus potential features) have enough data. Perhaps if a column has less than 10% of its data, you will not be able to use it. However, columns that are more than 50% populated can arguably be imputed in a meaningful way to be usable! This will be covered in the next section.
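Before that, here is a rough sketch of what a basic imputation could look like in plain pandas (the salary column is hypothetical, matching the example in the next section):

# crude but quick: fill missing salaries with the column mean
training_set_df['salary'] = training_set_df['salary'].fillna(
    training_set_df['salary'].mean())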

Sklearn-pandas

Once you have your data in nice DataFrames, it is sometimes difficult to get it into scikit-learn (if you use it). When trying to get up to speed quickly, defining the special FunctionTransformers required to extract columns into 2D numpy arrays can be quite verbose.

sklearn-pandas is the glue between the two libraries, allowing you to define mappers that concisely describe how columns in your DataFrame map to features in your pipeline. Imagine you have a big DataFrame of user details, and among the columns you have their salary, and their subscription plan stored as an enumerated string (e.g. ‘free’ and ‘premium’). Now, when you checked the data set for missing values using missingno from the previous section, you noticed that about 25% of the salary values are missing! Furthermore, you realise that because salaries are stored as numbers with quite a large range, they need to be scaled to be nicely usable in the model. On top of that, the subscription column needs to be one-hot encoded, since it’s stored as enum strings, which again are difficult to use in most ML algorithms.

With sklearn-pandas you can very quickly and easily define a simple mapper to fix those issues.

mapper = DataFrameMapper([
    (['salary'], [preprocessing.Imputer(),
                  preprocessing.StandardScaler()]),
    (['subscription'], [preprocessing.OneHotEncoder()]),
    ...  # more features
])

Once we have the mapper defined, here is how it would fit into a simple pipeline. I omit the details of the other steps for brevity.

# pipeline to convert DataFrame to ML representation
pipeline = Pipeline([
    ('featurise', mapper),
    ('feature_selection', SelectKBest(...)),
    ('random_forest', RandomForestClassifier())])

Now you have a very simple way of mapping your datasets stored in pandas directly into scikit-learn pipelines for getting to your first benchmark results quickly!
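If you want to sanity-check the mapper on its own before wiring up the full pipeline, here is a quick sketch (users_df is a hypothetical DataFrame containing the salary and subscription columns from above):

# fit the mapper alone and peek at the resulting feature matrix
X = mapper.fit_transform(users_df.copy())
print(X.shape)  # rows x engineered features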

Putting it all together

Now let’s see how all of these libraries fit together to give you the starting boost!

import dataset
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper
from stuf import stuf

# get data
db = dataset.connect('postgresql://...', row_type=stuf)
users = db.query('SELECT salary, subscription, ...  FROM user')
users_df = pd.DataFrame.from_records(users)

# map data
mapper = DataFrameMapper([
    (['salary'], [preprocessing.Imputer(),
                  preprocessing.StandardScaler()]),
    (['subscription'], [preprocessing.OneHotEncoder()]),
    ...  # more features
])

# create pipeline
pipeline = Pipeline([
    ('featurise', mapper),
    ('feature_selection', SelectKBest(...)),
    ('random_forest', RandomForestClassifier())])


# find the best model with this data, and get an AUC ROC benchmark score
cv = GridSearchCV(pipeline, param_grid=dict(...), scoring='roc_auc')
cv.fit(users_df, labels)  # labels: your target vector, assumed to be loaded separately
# best model with the current features
best_model = cv.best_estimator_
# this is your benchmark AUC ROC score
roc_auc = cv.best_score_

With these libraries, I hope I managed to show a quick and dirty way of getting from raw data in your database (be it local or remote) straight to a working model in under 15 lines of code! I am a big believer in getting to a benchmark score as quickly as possible, and then using that as a heuristic for where you should start digging further and focusing your resources. Good luck in your data adventures! Gotta go fast!
