I’ve always advocated – when you’re starting out on an ML problem – getting your first model up and running as quickly as possible. This way you won’t instantly go down the rabbit hole of feature engineering (which, given any amount of time, will most likely consume all of it). Most importantly, you will get your first benchmark score (say, AUC ROC) and will instantly know where to move from there. Who knows, maybe the first stab at the problem is good enough!
Here are some of the tools and tricks I’ve discovered which you can add to your workflow to get started on problems as quickly as possible.
Given the popularity of mature, batteries-included web frameworks such as Ruby on Rails and Django, ORM mappers have become a given in many people’s toolbelts. While certainly useful in a lot of cases, I’ve noticed that this can sometimes lead to people being scared of plain ol’ SQL. So, when faced with a new database of structured data, some will try to map it to an ORM system before being able to use it. When trying to get up and running as quickly as possible, setting up an ORM schema can be overkill and a bit of a waste of time. Even though SQLAlchemy does provide a really simple way to automap your DB schema, sometimes you might just want to write SQL and get your data. Perhaps you already have queries pre-written, or perhaps you haven’t yet learned the ins and outs of SQLAlchemy.
dataset is a library which allows you to do precisely that!
Let’s run through a simple use case. First, connect to your DB:
```python
import dataset
import pandas as pd
from stuf import stuf  # dict subclass with attribute access

# connect, return rows as objects with attributes
db = dataset.connect('postgresql://...', row_type=stuf)

rows = db.query('SELECT country, COUNT(*) c FROM user GROUP BY country')

# get data into pandas, that's where the fun begins!
rows_df = pd.DataFrame.from_records(rows)
```
And just like that, in 3 commands, you have data from your query available in pandas!
I’ve only scratched the surface of what this library allows you to do – check out:

- simple row creation from dicts (https://dataset.readthedocs.io/en/latest/quickstart.html#storing-data)
- support for atomic transactions via rollback() (https://dataset.readthedocs.io/en/latest/quickstart.html#using-transactions)
- support for upserts!
And, since it’s built on top of SQLAlchemy, it will seamlessly work with all major databases, such as SQLite, PostgreSQL and MySQL.
One highly useful thing to know about your dataset is how much missing data you have. In the previous section we got the data from a DB into pandas. Now that you have your data in a DataFrame, you can certainly call df['column_to_check'].isnull().value_counts() to see how many rows have missing values. However, what you really want is those stats for your whole dataset, ideally visualised in a useful fashion.
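For the whole-dataset numbers, pandas alone already gets you a long way. A quick sketch on a made-up frame (the column names and values are illustrative):

```python
import pandas as pd

# toy stand-in for a training set
df = pd.DataFrame({
    'salary': [30000, None, 50000, None],
    'country': ['GB', 'US', None, 'US'],
    'user_id': [1, 2, 3, 4],
})

# fraction of missing values per column, worst offenders first
missing = df.isnull().mean().sort_values(ascending=False)
print(missing)
```

This gives you the raw percentages; the visual overview is where the next library shines.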
missingno provides a dead-simple way to visualise a sample of your data. Here’s the simplest use case:
```python
import missingno as msno

training_set_df = load_training_set()  # fake loading function
msno.matrix(training_set_df.sample(250))  # 250 is the sample size
```
This is what the result looks like:
This can be very useful to instantly see which columns (and thus potential features) have enough data. Perhaps if a column has less than 10% of its data, you will not be able to use it. However, columns with more than 50% can arguably be imputed in a meaningful way to be usable! This will be covered in the next section.
Once you have your data in nice DataFrames, it can sometimes be difficult to get it into scikit-learn (if you use it). When trying to get up to speed quickly, defining the special FunctionTransformers required to extract columns into 2D numpy arrays can be quite verbose.
sklearn-pandas is the glue between the two libraries, which allows you to define mappers, concisely describing how columns in your DataFrame are going to map to features in your pipeline. Imagine you have a big DataFrame of user details, and among the columns you have their salary, and their subscription plan stored as an enumerated string (e.g. ‘free’ and ‘premium’). Now, when you checked the data set for missing values using missingno from the previous section, you noticed that about 25% of the values are missing! Furthermore, you realise that because salaries are stored as numbers with quite a large range, they need to be scaled to be nicely usable in the model. On top of that, the subscription column needs to be one-hot encoded, since it’s stored as enum strings, which again are difficult to use in most ML algorithms. With sklearn-pandas you can very quickly and easily define a simple mapper to fix those issues.
```python
mapper = DataFrameMapper([
    (['salary'], [preprocessing.Imputer(), preprocessing.StandardScaler()]),
    (['subscription'], [preprocessing.OneHotEncoder()]),
    ...  # more features
])
```
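If it helps intuition, here is roughly the transformation that mapper describes, sketched in plain pandas on a tiny made-up frame (the values are illustrative, and I standardise with the population std to mirror what StandardScaler does):

```python
import pandas as pd

users_df = pd.DataFrame({
    'salary': [30000.0, None, 50000.0, 70000.0],
    'subscription': ['free', 'premium', 'free', 'premium'],
})

# impute missing salaries with the column mean, then standardise
salary = users_df['salary'].fillna(users_df['salary'].mean())
salary_scaled = (salary - salary.mean()) / salary.std(ddof=0)

# one-hot encode the subscription enum
subscription_dummies = pd.get_dummies(users_df['subscription'], prefix='subscription')

features = pd.concat([salary_scaled, subscription_dummies], axis=1)
print(features.columns.tolist())
```

The mapper packages exactly this kind of per-column logic into a reusable, pipeline-compatible object, so you never hand-roll it per experiment.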
Once we have the mapper defined, here is how it fits into a simple pipeline. I omit the details of the other steps for illustrative purposes.
```python
# pipeline to convert DataFrame to ML representation
pipeline = Pipeline([
    ('featurise', mapper),
    ('feature_selection', SelectKBest(...)),
    ('random_forest', RandomForestClassifier()),
])
```
Now you have a very simple way of mapping your datasets stored in pandas directly into scikit-learn pipelines for getting to your first benchmark results quickly!
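To turn such a pipeline into the benchmark score mentioned in the intro, cross-validated AUC ROC is one quick route. A minimal sketch on synthetic data (in practice you would pass your mapped DataFrame and label column instead of these generated arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for a mapped feature matrix and labels
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# cross-validated AUC ROC: your first benchmark number
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=3)
print(round(scores.mean(), 3))
```

One number out, and you immediately know whether your features carry any signal at all before investing in engineering more of them.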
Putting it all together
Now let’s see how all of these libraries fit together to give you the starting boost!
```python
import dataset
import pandas as pd
from stuf import stuf
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper

# get data
db = dataset.connect('postgresql://...', row_type=stuf)
users = db.query('SELECT salary, subscription, ... FROM user')
users_df = pd.DataFrame.from_records(users)

# map data
mapper = DataFrameMapper([
    (['salary'], [preprocessing.Imputer(), preprocessing.StandardScaler()]),
    (['subscription'], [preprocessing.OneHotEncoder()]),
    ...  # more features
])

# create pipeline
pipeline = Pipeline([
    ('featurise', mapper),
    ('feature_selection', SelectKBest(...)),
    ('random_forest', RandomForestClassifier()),
])

# find the best model with this data, and get AUC ROC benchmark score
cv = GridSearchCV(pipeline, param_grid=dict(...), scoring='roc_auc')
cv.fit(users_df, target)  # target: whatever holds your labels

# best model with the current features
best_model = cv.best_estimator_

# this is your benchmark AUC ROC score
roc_auc = cv.best_score_
```
With these libraries, I hope I managed to show a quick and dirty way of getting from raw data in your database (be it local or remote) straight to a working model in under 15 lines of code! I am a big believer in getting to a benchmark score as quickly as possible, and then using that as a heuristic for where you should dig further and focus your resources. Good luck in your data adventures! Gotta go fast!