Why every company needs citizen data scientists

Spreading the data knowledge


Since the internet became part of our daily lives, the amount of available data has grown enormously. At the same time, steady growth in computing power and storage has made it possible to collect and store this ever-increasing amount of data, opening up all kinds of possibilities for extracting information. It is in this context that data science has come to play a fundamental role.

Data science as we currently conceive it is a product of the present century and has evolved significantly throughout the years. It is a field closely related to statistics, data analysis, computer science, and many others—from machine learning and mathematics to domain expertise, communication, and data visualization. It’s all about processing and understanding huge amounts of data with the purpose of extracting information from it.

The volume of data created, captured, copied, and consumed worldwide increased from 2 zettabytes in 2010 to 79 zettabytes in 2021.

The ability to process that enormous quantity of data has led some to call data scientist the sexiest job of the twenty-first century. More recently, however, the shortage of data scientists has become a constraint: it’s no longer enough to rely solely on dedicated data scientists.

Enter the citizen data scientist (CDS), an individual who participates in advanced data analytics but whose primary role exists in a related business field—not in statistics or analytics. In this article, you’ll learn more about citizen data scientists and their role in the data science process. You’ll explore distinctions in their responsibilities versus those of dedicated data scientists and review both the benefits and drawbacks of involving CDSs in your organization. Finally, you’ll learn some of the best practices that can help CDSs flourish in your workplace.

What is a citizen data scientist?

The term “big data” has been used since the ’90s as a consequence of the data explosion. This massive amount of information being stored promises the ability to extract valuable insights with the use of several convenient tools. Artificial intelligence, computer science, statistics, and other fields provide the tools and methods, while data scientists provide the knowledge to use them.

Over time, big data and artificial intelligence became targets for commercialization. New and powerful tools appeared (MS Azure Machine Learning Studio, Apache Spark, Amazon SageMaker, Google Cloud AutoML, IBM Watson Studio, KNIME, and so on) to simplify and commercialize the methods and strategies for dealing with large amounts of data. These platforms provide interfaces that help users perform complex calculations to extract interesting information, visualize the results, and produce reports. They sell services that allow people without a deep knowledge of mathematics, statistics, and other sciences to apply artificial intelligence methods to their specific situations. You can also find courses and advisers on those platforms to help you learn and troubleshoot along the way.

According to Gartner, a citizen data scientist can be defined as “a person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.” CDSs are an important link between business users and data scientists.

Forbes has noted that the CDS role is often filled by engineers who have experience with statistics and modeling but whose jobs don’t require math expertise. They can extract relevant information from the analytics tools they are using and delve deeply into data, applying visualization tools to best analyze the data streams.

Businesses have realized they do not need data scientists for every single data role; other professionals with the right skill set and training are sufficient for specific tasks. Organizations can make better use of data and reduce costs by hundreds of thousands of dollars using CDSs to handle exploratory analysis, visualization, and the creation of actionable insights.

CDSs are more closely connected to the daily problems of businesses or industries than data scientists typically are, which allows them to more easily integrate the results obtained through the mathematical models to create valuable solutions.

Do dedicated data scientists still have a place?

Of course, dedicated data scientists (DSs) still have a place, as their skills are generally far more advanced than those of a CDS. CDSs serve a role between the business users and the DSs doing advanced analytics.

DSs focus on the complex and specialized analysis of data for which they have studied and prepared. They have the advanced mathematical expertise to manage the end-to-end process: exploratory data analysis, model programming and evaluation, result interpretation and communication, and, of course, deployment and maintenance.

Data scientists are still needed, though perhaps not as many as before. Nick Elprin, co-founder and CEO of Domino Data Lab, put it this way: “You don’t go to the lifeguard if you need surgery done.” The world needs both lifeguards and doctors.

How involved in the data science process are citizen data scientists?

According to Gartner’s definition, citizen data scientists can create or generate models that use advanced diagnostic analytics or predictive and prescriptive capabilities. However, because their primary job function is outside the field of statistics and analytics, CDSs typically focus on other things, such as exploratory analysis, visualization, and contributing business and industry expertise to the data science process.

Because the shortage of data scientists persists and their salaries are high, many companies try to use CDSs in their place wherever possible.

According to Ryohei Fujimaki, founder and CEO of DotData, “Model development, as well as model operationalization, can be significantly simplified by automation. New data science automation platforms will enable enterprises to deploy, operate, and maintain data science processes in production, with minimal efforts, helping companies maximize their AI and ML investments, and their current data team.” CDSs use such tools to accomplish the companies’ goals, taking advantage of online courses designed to provide the basic skills of a citizen data scientist, such as SQL, Tableau, Power BI, and others. Many organizations rely heavily on these automated tools to process big data and create models to gain additional insights.

What are the benefits of having citizen data scientists in your business?

As you’ve seen, citizen data scientists can be suitable substitutes for data scientists when it comes to certain tasks. There are many advantages to going with this option. First, CDSs do not require advanced mathematics and statistics knowledge, so they tend to be easier to recruit. Armed with powerful software, they solve specific business problems through a “point and click” interface, which can reduce data preparation costs by hundreds of thousands of dollars.

CDSs also have an in-depth understanding of the relevant business problems, which gives them a unique ability to discover insights. They usually work in a line of business such as sales, marketing, finance, or human resources. Their business experience and awareness of business priorities allow them to provide better business expertise than many data scientists. Furthermore, CDSs can take on repetitive tasks in the analytics workflow, freeing data scientists from that work and thus reducing costs.

What are the limitations and drawbacks of having citizen data scientists?

CDSs often have corporate or industry work experience as well as basic computer and math skills. However, because they typically don’t have a deep knowledge of the mathematical basis and programming of the models they use, they depend on automated machine learning tools. It’s important that they receive proper education and training on these tools to ensure that the tools are used correctly and that errors are avoided. It would be easy to configure a model incorrectly, leading to unexpected results that may be difficult to detect. Company management must enforce regulations and work procedures to help prevent such errors from occurring.

For cases where there are prepackaged software tools available to easily apply a technique without deep knowledge about it, CDSs should be able to apply the corresponding approach. When it comes to developing new strategies, though, there’s rarely a simple solution. In those cases, deeper domain knowledge is often necessary, making a dedicated data scientist a better option.

Nick Elprin advises, “But for any problem that’s going to be really competitively differentiating for a business or require deep domain expertise or inventing something new, I think that’s going to be hard for citizens to attack that problem... There’s a risk of people building models where they don’t have a deep understanding of the statistical fundamental for models that have risk associated with them.”

What are the best practices that can help citizen data scientists flourish in the workplace?

For CDSs to flourish in your workplace, they need to understand the company’s needs, think outside the box, have an analytical mind, and be able to draw meaningful conclusions.

Software developers and engineers are typically good candidates for this role. It is advantageous to work in an adjacent field like backend software development or engineering due to the skills required in those positions, such as math, computer science, and coding. Otherwise, CDSs can prepare themselves by taking courses to develop skills with Tableau, KNIME, SQL, Excel, or Python.

The company should also provide flexible and secure environments that allow CDSs to work and collaborate with data scientists and data engineers. Management should classify data sets according to security and data-protection criteria. At the same time, it should prepare some data sets specifically for CDS training, so that trainees cannot damage production data. CDSs should apply different tools to the same data and compare the results they obtain. They should be able to understand their results and insights, use graphical interfaces to present them, and participate in the overall analysis of the data.

CDSs should learn the basics of data science tools—especially those available in their workplace that simplify the tasks of applying machine learning to real-world problems, detecting anomalies in the output of machine learning algorithms, and following the regulations established by relevant governing bodies.

Finally, it is useful for CDSs to know techniques such as regression analysis and predictive analytics. They should be able to tell when a model is overfitting and have some coding ability, as well as experience using actual data and knowledge of Excel and SQL.
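To make the overfitting point concrete, here is a minimal sketch in Python of the kind of check a CDS could run. Everything in it is an illustrative assumption rather than part of any particular tool: the toy data, the helper names `accuracy` and `looks_overfit`, and the 0.10 gap threshold are all made up for this example. The idea is simply to score a model on the data it was trained on and on data it has never seen, and to treat a large gap between the two as a warning sign.

```python
from typing import Callable, List, Tuple

Row = Tuple[float, int]  # (feature value, label)

# Hypothetical toy data. The underlying rule is "label is 1 when x >= 5",
# but two training labels (x=3 and x=7) are deliberately noisy.
TRAIN: List[Row] = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
VALIDATION: List[Row] = [(0.9, 0), (2.9, 0), (3.2, 0), (4.1, 0),
                         (5.9, 1), (6.8, 1), (7.2, 1), (9.1, 1)]


def accuracy(predict: Callable[[float], int], rows: List[Row]) -> float:
    """Fraction of rows whose label the model predicts correctly."""
    correct = sum(1 for x, y in rows if predict(x) == y)
    return correct / len(rows)


def looks_overfit(train_score: float, validation_score: float,
                  max_gap: float = 0.10) -> bool:
    """Flag a model whose training score beats its validation score
    by more than max_gap (an arbitrary threshold chosen for this sketch)."""
    return (train_score - validation_score) > max_gap


def memorizer(x: float) -> int:
    """Copies the label of the closest training point, noise included."""
    nearest = min(TRAIN, key=lambda row: abs(row[0] - x))
    return nearest[1]


def threshold_rule(x: float) -> int:
    """A much simpler model: predict 1 when x is at least 5."""
    return int(x >= 5)


for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    train_score = accuracy(model, TRAIN)
    validation_score = accuracy(model, VALIDATION)
    print(f"{name}: train={train_score:.2f}, validation={validation_score:.2f}, "
          f"overfit={looks_overfit(train_score, validation_score)}")
```

On this toy data, the memorizer is perfect on the training set but gets only half of the validation set right, so it is flagged. The simpler threshold rule scores lower on the noisy training data but generalizes better. Automated tools usually report both scores; the habit worth building is to compare them before trusting a model.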

Conclusion

In this article, you learned how the internet and the resulting data explosion have created many opportunities to gather valuable information. However, specialists (i.e., data scientists) are needed to extract this information from the data. As there are not enough data scientists to cover the demand, a new role has appeared: the citizen data scientist (CDS). CDSs do not have the same expertise as a dedicated data scientist, but they are equipped to use automated machine learning tools and offer additional value through their analytical abilities and their unique understanding of the company’s needs. They do not require high salaries or even a master’s degree or Ph.D. Rather, they can be reskilled or upskilled by taking courses in relevant areas as needed.

While using CDSs offers many advantages, there are limitations related to the depth of their understanding of mathematics, statistics, and modeling. They should be prepared to use prepackaged software tools to deal with commoditized use cases, but they are not typically equipped to attack significant new problems.

For best results, organizations using CDSs should provide appropriate environments for CDSs to grow in, build teams that allow CDSs to collaborate with traditional data scientists, and offer courses to train them in relevant skills.


Andres Soto Villaverde

Computer Science professor at different universities for many years. Developer and researcher for commercial companies for 10 years. Ph.D. in Computer Science (Artificial Intelligence). Graduated in Science (Mathematics, Numerical Analysis).
