David Bishop
Predictive analytics in the age of citizen data science

What is this article about?

Are you a business analyst who is just starting to dabble in the field of predictive analytics, and you are feeling overwhelmed by the sheer amount of marketing material plastered all over the internet for the ‘latest and greatest’ in data science?

Are you a more experienced data scientist looking for a broader understanding of the ‘on-rails’ platforms out there, and how they can make your job easier? Or are you skeptical that guided software tools could ever be a sufficient replacement for down-and-dirty programming?

This article aims to unpack some of the misconceptions about citizen data science. By the end of this post, you will understand the concept of citizen data science, and the type of role it plays in the modern analytics environment. You will learn why disagreement on platform choice commonly arises between people who are focussed on explaining their models to upper management, versus people who want to code. You will learn about the different types of data science tools, and the situations in which you might want to pick one over the other.

Okay, so what exactly is citizen data science?

“Citizen data science is a branch of data science that allows internal users to extract advanced insights from data without the formal training in advanced mathematics and statistics required to be a specialist data scientist.”

– Krensky, P, Linden, A & Hare, J, 2016

“A person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.”

– Morgan, L, 2015

To summarise, we can think of ‘citizen data science’ as predictive analytics which can be performed by business users or analysts without specialised knowledge in data science. Their job title and primary job function are not (or are only tangentially) related to advanced data analysis.

This persona differs markedly from what we will refer to as a ‘specialised data scientist’, who has a strong mathematics/statistics background, and formal training in understanding data science algorithms from first principles – they may even have a Master’s in Data Science or another advanced qualification in the field. They have likely been hired explicitly in the role of data scientist.

But why even make the distinction?

In industries where both of these roles exist, tension can arise between the two groups – specialist data scientists can see citizen data scientists as reckless, and citizen data scientists can see specialist data scientists as purists who get too distracted by building elaborate black-box models to focus on things like business context, model interpretability, and value for money.

It is important to recognise that these two roles need not be at odds with each other! Both the citizen and specialist data scientist roles exist to fulfil a specific need in the business. The need for one type of role over the other also depends on the context and the project. Considerations to be made include:

  • Who is funding the project? Who is the project for?
  • How accurate does the analysis need to be? Is the person/team/business unit funding the project happy to pay more for more resources and further analysis?
  • Do we need to throw a team of specialist data scientists at the problem? What resources do we realistically have at our disposal?
  • Does the solution need to be enterprise-grade, or is this a proof of concept to drive hunger for advanced analytics in the business?
  • What does the existing technology stack look like? Can we do data normalisation/transformation and analysis in separate layers of the stack, or do they need to exist in the same product?

Which solutions are good for specialist data scientists?

The following three programming languages represent the classic coding options for building predictive models. On this route, your choice should be focussed less on specific attributes of the language, since all three platforms should be able to solve the same problems in different ways. Instead, think more broadly about the skills of your team (especially if they are already familiar with one tool), existing infrastructure, and the end user’s ability to maintain the solution.

Python with Anaconda/Jupyter

A classic choice as a great ‘all-rounder’ programming language, Python has come on in leaps and bounds in the last few years in the field of machine learning. It is fully open-source and has an enormous community dedicated to building customised libraries. It is commonly used with tools like Anaconda and Jupyter, which present code and results alongside each other in a unified platform. Python often executes general-purpose code faster than R, although the gap depends heavily on the workload and the libraries used. In addition to predictive analytics, Python also excels at related ‘general purpose’ tasks like web scraping and creating interactive apps.

R with Anaconda/Jupyter

With a slightly steeper learning curve but an extensive suite of third-party statistics libraries to choose from, R is another excellent choice for a programming-based machine learning tool. It is completely open-source, with a large dedicated online community. R is also commonly used with Anaconda and Jupyter, or through a popular IDE like RStudio. R is less of a ‘general purpose programming language’ than Python, and is best suited to statistics problems and data science specifically.

MATLAB

Short for “Matrix Laboratory”, MATLAB is used widely in academia/research, and is common in certain fields of engineering like the automotive, aerospace, and medical device industries. MATLAB is great for solving complex mathematical problems, and offers strong features for visualising multidimensional data. GNU Octave, a free and open-source alternative whose syntax is largely compatible with MATLAB, is available to individual users (including through browser-based services), though MATLAB itself does not offer a free platform for unlimited personal use.

Which solutions are good for citizen data scientists?

These tools are targeted heavily towards users without direct programming experience, and offer a range of out-of-the-box tools for building machine learning models without complex prerequisite knowledge. However, they may not offer the ability to write code directly within the platform itself, which limits flexibility and the range of custom advanced techniques available to more technical users. That being said, some citizen data science tools do offer Python/R integration, typically for administering the platform, modifying resources, or running tasks through an API.

Azure Machine Learning

Microsoft is well known for seamlessly integrating their product offerings with each other, making Azure Machine Learning an attractive option for users who are already working in an existing Azure stack. 

Azure Machine Learning’s main offering is the ability to build predictive models in-browser using a point-and-click GUI. Though the ability to write code directly in the platform is not available, specialised data scientists will be excited by Microsoft’s Python integration. The Azure ML library for Python allows users to normalise and transform data in Python themselves using familiar syntax, and call Azure Machine Learning models as needed using loops. Not only this, but Azure Machine Learning also integrates with existing Python ML packages (including scikit-learn, TensorFlow and PyTorch). For users familiar with these tools, distributed cloud resources can be used to productionise results at scale, just like any other experiment. 

As of the writing of this article, Azure Machine Learning also offers an SDK for R in a public preview (i.e. non-productionisable) mode, which is expected to improve over time.

H2O Driverless AI

H2O Driverless AI is the main commercial enterprise offering of the company H2O.ai, offering automated AI with some pretty in-depth algorithms, including advanced features like natural language processing. A strong focus on model interpretability gives users multiple options for visualising algorithms in charts, decision trees, and flowcharts.

H2O.ai are already well-known in industry for their fully open-source ML platform H2O, which can be accessed as a package through existing languages like Python and R, or in notebook format. H2O Driverless AI and H2O currently exist as fairly separate products, though there is potential for these to be further integrated in the future. Partnerships with multiple cloud infrastructure providers (including AWS, Microsoft, Google Cloud and Snowflake) make H2O Driverless AI a product to watch in the coming years.

DataRobot

DataRobot offers a tool which is intended to empower business users to build predictive models through a streamlined point-and-click GUI. The tool focusses very heavily on model explainability, by generating flowcharts for data normalisation and automated visuals for assessing model outcomes. These out-of-the-box visuals include important exploratory charts like ROC curves, confusion matrices and feature impact charts. 
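
For readers who have not met these charts before, the sketch below shows what a confusion matrix and a ROC curve actually summarise. It is plain Python written for this article, not DataRobot code; the function names and the labels and scores are made up for illustration.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels at the given score threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(
    y_true: Sequence[int], y_score: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more rows are predicted positive.
    for threshold in sorted(set(y_score), reverse=True):
        tp, fp, _, _ = confusion_matrix(y_true, y_score, threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = event happened) and model scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

tp, fp, fn, tn = confusion_matrix(y_true, y_score)
area = auc(roc_points(y_true, y_score))
```

On this toy data the confusion matrix at a 0.5 threshold has 3 true positives, 1 false positive, 1 false negative and 3 true negatives, and the area under the ROC curve is 0.6875. A point-and-click platform produces the same kind of numbers, just drawn as charts.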

DataRobot’s end-to-end capabilities were significantly bolstered by the company’s acquisition of Paxata (a data preparation platform) in December 2019, which has since been integrated with the DataRobot predictive platform. The company also boasts some big name partnerships, including Qlik, Tableau, Looker, Snowflake, AWS, and Alteryx.

DataRobot does offer Python and R packages, which allow many of the service’s predictive features to be called through code, though the ability to directly write code in the DataRobot platform and collaborate with citizen data scientist users is not currently available (as of the writing of this article). DataRobot’s new MLOps service also provides the ability to deploy independent models written in Python/R (in addition to models developed in DataRobot), as part of a robust operations platform which includes deployment tests, integrated source control, and the ability to track model drift over time.

Which solutions are in the “Goldilocks Zone”?

Not too hot, not too cold, but just right – these are the platforms which achieve a mix between being loved by techies and non-techies alike. This middle ground offers a strong focus on citizen data science users and heavy integration with programming languages, allowing for flexibility and in-platform collaboration between people who can code, and people who can’t.

RapidMiner

RapidMiner Studio is a drag & drop GUI-based tool for building predictive analytics solutions, with a free version providing analysis of up to 10,000 rows. In-database querying and processing are available through the GUI, but programmers/analysts also have the option to query in SQL code. The ETL process is handled by Turbo Prep, which offers point & click data preparation (as well as a direct export to .qvx, for users who want to import results into Qlik).

The cool thing about RapidMiner is the integration with Python & R modules, available as supported extensions in the RapidMiner Marketplace, through which coders & non-coders can both collaborate on the same project. For coders working on a local Python instance, the RapidMiner library in Python also allows for administration of projects and resources of a RapidMiner instance. For cloud-based scaling of models, RapidMiner also allows containerisation using Docker and Kubernetes.

Alteryx

An existing big player in the ETL tool market, Alteryx is used to build data transformation workflows in a GUI, replacing the need to write SQL code. Alteryx has significantly stepped up its game in recent years with its integrated data science offering, allowing users to build predictive models using their drag-and-drop “no-code” approach. The ability to visualise and troubleshoot results at every step of the operation is a huge plus, and users familiar with SQL should transition easily to the logical flowchart style of the ETL, removing the need for complex nested scripts. 

Alteryx has a fantastic online community with plenty of resources, and direct integration with both Python and R through out-of-the-box tools. The Python tool includes common data science packages such as pandas, scikit-learn, matplotlib, numpy, and others which will be familiar to the Python enthusiasts of this world.

Dataiku

One quick look at the Dataiku website will make it immediately clear that this is a platform for everyone in the data space. Dataiku offers both a visual UI and a code-based platform for ML model development, along with a host of features that make Dataiku a highly sustainable platform in production.

Data scientists will be delighted with not only the Python & R integration, but also the flexibility to code either in the embedded code editor or in a favourite tool such as Jupyter notebooks or RStudio. Dataiku DSS (Data Science Studio) also exposes an HTTP REST API, allowing users to manage models, pipelines and automation externally.

Data analysts will be excited by the multitude of plugins available – including PowerBI, Looker, Qlik .qvx export, Dropbox, Excel, Google Sheets, Google Drive, Google Cloud, OneDrive, SharePoint, Confluence, and many more. Automatic feature engineering, generation and selection, in combination with the visual UI for model development, allows ML to sit firmly within the reach of these citizen data scientists.

Data engineers will be thrilled by the scalability (ability to containerise and manage job execution with Docker/Kubernetes) and the production pipeline, including Git integration and a fully staged deployment model (one-click production deployment). Model effectiveness is logged historically, along with automatic data validation policies, allowing model drift to be tracked over time.
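
To make ‘model drift’ a little more concrete, here is a minimal sketch of one common way to quantify it: the Population Stability Index (PSI), which compares how a value (such as a model score) was distributed at training time against how it is distributed now. This is an illustrative technique chosen for this article, not a description of how Dataiku or any other platform measures drift; the function names, bucket edges and scores are all assumptions.

```python
import math
from typing import List, Sequence


def bucket_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bucket defined by sorted interior edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        # The bucket index is the number of edges at or below the value.
        index = sum(1 for edge in edges if value >= edge)
        counts[index] += 1
    total = len(values)
    return [count / total for count in counts]


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    floor: float = 1e-4,
) -> float:
    """PSI between two samples; 0.0 means the bucket shares are identical."""
    base_shares = bucket_shares(baseline, edges)
    curr_shares = bucket_shares(current, edges)
    psi = 0.0
    for base, curr in zip(base_shares, curr_shares):
        # Floor empty buckets so the logarithm stays defined.
        base = max(base, floor)
        curr = max(curr, floor)
        psi += (curr - base) * math.log(curr / base)
    return psi


# Hypothetical model scores at training time and in production.
edges = [0.25, 0.5, 0.75]
training_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
production_scores = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.3]

no_drift = population_stability_index(training_scores, training_scores, edges)
drift = population_stability_index(training_scores, production_scores, edges)
```

Here `no_drift` is exactly zero, while `drift` is clearly positive because the production scores have shifted towards the top bucket. How large a value should trigger retraining is a judgment call for the team rather than something the formula decides.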

How long is all of this going to take?

One major challenge is timeboxing data science projects. It is much harder to place a ‘definition of done’ on the end product than it is in software development or business intelligence.

As an example, your end goal for a four-week app development project might be to release a working proof of concept to a user group. Your end goal for an eight-week business intelligence project might be to develop an executive dashboard, verify its effectiveness with end users, and release it to production.

Your end goal for a predictive analytics project will be more uncertain. How do you define when such a project is ‘done’? If you define your measure of success based on the accuracy of your predictive model, you could be setting yourself up for failure or severe scope creep from the start. 

Some of the questions that you should be asking yourself from the very start are:

  • Is the result even possible to predict?
  • If so, do you have the right data to make a prediction?
  • If so, do you have enough data to do this?
  • If so, is the data high enough quality to produce good results?

Sometimes, the answers to these questions don’t make themselves known until two weeks into data discovery. Sometimes, you don’t know the answers to these questions until six weeks later. This is the key reason why data science projects are more difficult to manage – because the outcome is fundamentally more unpredictable. 
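
Some of the data-related questions can at least be given a rough first answer on day one. The sketch below is one hypothetical way to do that in plain Python: it reports the row count, the share of missing values per column, and how balanced the target is. The column names and sample rows are invented, and the function does not decide what counts as ‘enough’ or ‘good enough’; that remains a conversation for the team.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def readiness_report(rows: List[Row], target: str) -> Dict[str, object]:
    """Summarise row count, missing-value rates and target class balance."""
    if not rows:
        raise ValueError("no rows to assess")
    n = len(rows)
    columns = sorted({key for row in rows for key in row})
    # A value counts as missing if the key is absent or set to None.
    missing_rate = {
        col: sum(1 for row in rows if row.get(col) is None) / n
        for col in columns
    }
    labels = [row[target] for row in rows if row.get(target) is not None]
    class_counts = Counter(labels)
    minority_share = min(class_counts.values()) / len(labels) if labels else 0.0
    return {
        "rows": n,
        "missing_rate": missing_rate,
        "class_counts": dict(class_counts),
        "minority_share": minority_share,
    }


# Hypothetical customer churn extract (1 = customer churned).
rows = [
    {"tenure_months": 12, "monthly_spend": 40.0, "churned": 0},
    {"tenure_months": 3, "monthly_spend": None, "churned": 1},
    {"tenure_months": None, "monthly_spend": 55.5, "churned": 0},
    {"tenure_months": 24, "monthly_spend": 70.0, "churned": 0},
]

report = readiness_report(rows, target="churned")
```

A report like this will not say whether the outcome is predictable, but a tiny row count, a column that is mostly empty, or a target with almost no positive cases are all early warnings worth raising before the timebox is agreed.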

This may be foundational knowledge for the specialist data scientists in your team who have had years of experience in the field, but it may not be immediately obvious to the project managers and citizen data scientists working on the same project. Regardless of the choice of platform, it is crucial that communication and expectation management remain central tenets in hybrid data science teams.

So, what’s the bottom line?

At the end of the day, most of these platforms are challenging to categorise, as they don’t fall neatly into one distinct “bucket” – the difficulty or ease of using a particular platform can often come down to how you use it. Understanding the pros and cons of each tool is not just about comparing raw features, but also taking into account the skills and preferences of the people around you. In an increasingly polarised world where perfect often becomes the enemy of good, a little brainstorming and collaboration with your team can go a long way!

Get in touch with David Bishop

Mathematician and (semi-professional) musician with an unashamedly niche sense of humour. Strengths are statistics, programming, and working with large datasets. Driven by constantly reading, learning, and finding novel ways to solve problems. Could literally talk for hours about footy if required. Loves the social and friendly work culture, and always up for banter.