While we may yet be a long way from the threat of truly sentient computers such as Hal 9000 and ARIIA, the current capability of AI technologies still offers plenty of power for those who wish to harness it for malicious purposes. These types of crimes go beyond the grey ethical areas that we discussed in our last newsletter, and dive into existing applications that are consciously misused, or new applications that are specifically designed with nefarious activity in mind.

(For an even more extensive analysis on the future risks from malicious AI applications, check out this detailed 2018 report authored by contributors from Oxford, Cambridge, and the Centre for the Study of Existential Risk – I just learned that is a thing!)

Though not necessarily dangerous on an existential level, the examples of data-related criminal activity below shed some light on the dark side of AI and analytics:

Data hijacking:
These days, the right insights from a company’s data can turn into massive profits. Enter hackers, who with the commonly used SQL Injection Attack can steal, corrupt, delete, or manipulate the data a company is counting on for their analytics.
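To make the SQL injection idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and sample rows are made up for illustration and do not come from any real incident. It shows how pasting user input straight into a query lets an attacker rewrite the query, and how a parameterized query treats the same input as plain data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (name, email) VALUES (?, ?)",
        [("Alice", "alice@example.com"), ("Bob", "bob@example.com")],
    )
    return conn


def find_customer_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the user's input becomes part of the SQL text itself.
    query = f"SELECT name, email FROM customers WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_customer_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: the value is passed separately, so it is never parsed as SQL.
    return conn.execute(
        "SELECT name, email FROM customers WHERE name = ?", (name,)
    ).fetchall()


conn = build_demo_db()
attack = "nobody' OR '1'='1"  # input crafted to make the WHERE clause always true

leaked = find_customer_unsafe(conn, attack)   # returns every row in the table
blocked = find_customer_safe(conn, attack)    # returns nothing: no such name

print(f"unsafe query returned {len(leaked)} rows, safe query returned {len(blocked)}")
```

Parameterized queries are only one layer of defense, but they close the specific hole shown here.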

The rise of the botnets:
Not just one bad bot, but an entire nefarious network set up to do the bidding of the “bot herder.” The bot mitigation/fraud detection company White Ops offers a fascinating blog post highlighting some of the most notable botnets in internet history as well as a more detailed report all about 3ve – including details on the inner workings of this massive ad fraud operation, and how it was shut down.

The rise of the narco-drones:
This time rising both literally and figuratively, unmanned aerial vehicles (UAVs, aka drones) are now being integrated with object recognition applications and AI data processing tools to speed up surveys and analytics across a variety of industries. However, these same tools are also being used by drug cartels to scan for the locations and movements of law enforcement officers and to keep their smuggling operations under the radar (again, both literally and figuratively).

Cheaters never prosper:
Except sometimes when they make improper use of analytics to help them win the World Series.

Unfortunately, these examples don’t even begin to cover the multitude of threats that cybersecurity experts and ethical hackers are fighting against daily. Given our current dependence on technology and the rapid exchange of information, awareness of the ways in which our data might be vulnerable is crucial to supporting informed decision making at an individual, corporate, national, and global level.

May you always use your coding skills for the greater good!

Where is my book?!

At the beginning of the pandemic lockdown, news shows started interviewing people from their home offices. Almost everyone had a bookcase in the background (except Claire McCaskill, who did her interviews from her kitchen, always with some new, delicious-looking baked good in the background under a cake dome). It’s still going on and a number of people even hawk their own latest book by placing it with the cover, rather than the spine, facing the camera.

What I’ve noticed through all of this, though (and I look closely), is a complete dearth of the books Oracle8 Data Warehousing, Oracle 8i Data Warehousing, and SQL Server 7 Data Warehousing. Not necessarily great books but, well, books I worked on in the early 2000s. Perhaps I’m asking a lot given that even my own parents didn’t buy a copy, but it would be nice to see that somebody has them.

So, if you happen to see a copy of any of those books behind an interviewee on TV, please send me a screenshot. 20 years on, I still receive statements from the publisher telling me that I owe them money against the advance they paid us. I need to show my wife that it was all worth it.

News on the Golden Record Front
We continue to make great progress with our record matching technology, Golden Record. For those that don’t remember, Golden Record takes disparate sets of data and finds the common records between them.

Someone provided me with another great use case the other day: improving data lakes. Data lakes are databases where technically-savvy data analysts store temporary data sets for analysis (about six years ago I wrote a blog piece about this concept but called them sandboxes, not data lakes; same concept, different marketing spin). But, as you throw new data into your data lake, you need an easy way to tie it all together. If I have a set of prospects from my CRM system and a list of people who’ve purchased a new car in the past 24 months, how do I tie these two sets together? And the problem multiplies as I add new data sets.

This is a problem that Golden Record is able to attack.
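To give a feel for the problem (and only the problem), here is a toy sketch of matching two data sets that share no common ID. This is not how Golden Record works; it is a simple illustration that assumes fuzzy name comparison with Python's standard difflib module, plus a ZIP code check to cut down on false matches. The records, field names, and the 0.85 threshold are all invented for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop periods and commas, and collapse repeated whitespace."""
    return " ".join(text.lower().replace(".", "").replace(",", "").split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(left: list, right: list, threshold: float = 0.85) -> list:
    """For each left record, keep its best-scoring right record above the threshold."""
    matches = []
    for l_rec in left:
        best, best_score = None, 0.0
        for r_rec in right:
            # Cheap "blocking" rule: only compare records in the same ZIP code.
            if l_rec["zip"] != r_rec["zip"]:
                continue
            score = similarity(l_rec["name"], r_rec["name"])
            if score > best_score:
                best, best_score = r_rec, score
        if best is not None and best_score >= threshold:
            matches.append((l_rec["id"], best["id"], round(best_score, 2)))
    return matches


# Hypothetical sample data: CRM prospects and recent car buyers.
crm_prospects = [
    {"id": "crm-1", "name": "John A. Smith", "zip": "48104"},
    {"id": "crm-2", "name": "Maria Lopez", "zip": "48201"},
]
car_buyers = [
    {"id": "buy-7", "name": "john a smith", "zip": "48104"},
    {"id": "buy-8", "name": "John Smith", "zip": "60601"},
    {"id": "buy-9", "name": "Mario Lopresti", "zip": "48201"},
]

matches = match_records(crm_prospects, car_buyers)
for crm_id, buyer_id, score in matches:
    print(crm_id, "matches", buyer_id, "with score", score)
```

Note that the nested loop compares every pair of records, which is exactly why the problem multiplies as data sets are added. A real matching tool needs smarter blocking and scoring than this.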

So, if you’re interested in record matching, would like a demo, or want to discuss a use case, let me know. I’d love to hear your thoughts.

Thanks, wear a mask, and don’t let anyone sneeze on you!


What do you think? How can we make this blog more useful to you? What topics would you like to see more of? Want to contribute an article? Just want to catch up and chat?

I’d love to hear from you! Email me at benjamin.taub@dataspace.com.

Thanks for reading!

Ben Taub
CEO, Dataspace


Dataspace provides expertly-vetted data and analytics consultants to organizations across the United States. Our customers are usually managers whose needs are not being met by their traditional, general purpose staffing providers. Are you looking for expert help? Let’s talk! Reach out to us at 734-761-5962 or info@dataspace.com if you would like more information!

Greetings from Dataspace!

The last edition of our newsletter focused on some very creative applications of data science tools and methods. Most of these were lighthearted, merely reflecting the interest of the creators in their subject matter – like true scientists, they applied their skills to the questions that interested them.

In contrast, today’s newsletter delves into some of the data science “grey areas” – uses and applications that raise ethical and legal questions that need to be carefully considered.

To start off in the most general sense, we need to be aware of the variety of spaces where the questions of ethical usage of data science tools have come into play. Similarly, it is important to recognize that we may hold some problematic preconceptions that can present roadblocks to truly productive discussions of ethics and technology.

The good news is that these questions of ethics are, in fact, being raised. While certainly not all-inclusive, the below list highlights some of the most salient areas of data science usage that are currently under scrutiny:

Facial Recognition:  First and foremost, this technology raises fears of government surveillance. Beyond all of the ethical concerns inherent in that topic, it is also important to urge caution in the application of facial recognition tools based on their current limitations as well as the propensity of AI systems to misclassify certain groups.

Taking this topic a few shades of grey further, the analytics company Faception claims to have developed a computer vision/machine learning tool to provide personality prediction analytics. They hope to see it used to identify terrorists or other criminals; however, there are many ethical issues and potential misuses inherent in this model.

Decision Making:  The implication of the term “data driven decision making” is that this process relies entirely on “neutral” data and is free from the biases that a human decision maker brings to the table. However, given the evidence that societal biases can be built into these algorithms, we need to be cautious not only with outsourcing decisions to computers but also with leaning heavily on the outputs of ML during the decision making process – particularly in fields that could have a major impact on the rights and freedoms of entire populations.

Creativity and Intellectual Property issues:  As the ability of AI to produce “new” material increases, new questions arise: Can an algorithm be credited as an “inventor”? Can it be said to be violating Intellectual Property Rights if it is trained off of others’ creations? These debates are likely to continue for quite some time.

While the above topics raise questions without definitive answers at this point in time, stay tuned for our next newsletter installment where we look at some more obviously nefarious uses of data science technologies.

Until next time, happy coding!

Ben’s Take

Greetings, once again, from the home office (i.e. my grown son’s bedroom – yes, those are bunk beds behind me when we’re video chatting).

So, what’s going on in the analytics staffing space? Of course a number of companies have paused or eliminated projects as the economy has slowed. This, in turn, has led to reductions in analytic staff, both contract and permanent. While no one knows for sure, my best guess is that analytics hiring will start to pick up again in the July – August timeframe. I suspect that companies will be reluctant to hire permanent employees as they emerge from lockdown so they’ll begin by adding contractors. However, by late in the year, the need for both permanent and contract analytics staff will be growing again.

When you start looking for analytics contractors, think of us. Our role is to help out when your other vendors aren’t providing the quality you need to complete your critical projects. Our clients comment that our folks are consistently stronger than those of our competitors, who are usually large, general-purpose contracting firms. Why? Because we started as an analytics consulting firm, not a staffing firm. Thus, we developed the technical skill to determine if a resource really knows their stuff and the experience to know whether we’d like that resource on our team. If we wouldn’t want them on our team, we won’t ask you to put them on yours.

We lay out our core beliefs on our about us page. Compare them to the services you receive today. I think you’ll notice the difference.

This Week in Golden Record

We continue chugging along with our Golden Record matching and deduplication technology and are on track for an end of June initial release. If you’re facing a need to identify and track records that match across databases and files, please do contact me at Benjamin.Taub@Dataspace.com. I’d love to discuss how we might help.

Until next time, thanks for reading!




Greetings from Dataspace!

In this issue of the Dataspace Newsletter we’re taking a break from the more classical and businessy (aka boring) applications for machine learning and data science. Instead, we invite you to open the door to explore all of the ways that you really want to be using your growing data science skills, especially in a time of quarantine when we have all had to focus on what is really important to us.

  • Priority numero uno: Cats. Have you adopted too many cats to keep you company during quarantine? Are you running out of ideas for names for said cats? A neural network can take care of that for you! Details (and some amazing cat name suggestions) here!
  • Second priority: Staying safe and avoiding all of the things that you didn’t know were dangerous. This set of easily consumable data visualizations provides some insights into things you never knew you should be worried about. The connection between chicken and crude oil may surprise you!
  • Third Priority: Keeping our favorite TV shows alive forever!! While the current situation dictates that we can’t actually film any new episodes of Friends or Game of Thrones, you can train a neural net to generate endless scripts for all of your favorite shows! These machine generated episodes might be slightly incoherent, but…. many of us felt that way about Season 8 of GOT anyway, right?
  • Priority 4: Improving my jokes. Yes, there is an algorithm for that! Check out this article for details on the process of leveraging various machine learning techniques to home in on predicting funniness levels!
  • But perhaps most important of all, in these unprecedented times we’ve been called upon to ponder the existential questions that define our humanity. Questions such as: Where can I get the best burrito? Is there really such a thing as a ‘best’ burrito?? Check out this unique data science project exploring the possible methods to quantify the qualities that define a great burrito, and get some suggestions on burritos that you may want to try. For science!

Until next time, happy coding!

Ben’s Take


Katie, our lead recruiter, has written our last few newsletters and this time she’s really outdone herself. How can I possibly top tools to help name kittens? Where is she coming up with this stuff?!

Our Latest Project: Ensuring Data Privacy Compliance

Yes, a number of us are still very hard at work on Golden Record, our cross-database record matching tool. In fact, this week I wrote a blog post about using a record matching technology, like Golden Record, to help organizations comply with GDPR, CCPA, and other data privacy regulations. There is a growing list of jurisdictions that require you to know all the places that personal information is stored. But this is very hard to do if all the systems that hold personal data don’t share common identifiers, like a customer ID number. Golden Record provides the ability to recognize all the places where a person’s data exists so you can confidently respond to privacy-related requests. Noncompliance can lead, and has led, to multimillion-dollar fines. Get more details from my blog post.

OK, now back to automating feline naming! Thanks for reading!






Greetings from Dataspace!

Another week, another newsletter! Before we get into any new news, a quick update on a piece of old news:

It looks like our new website was experiencing some problems of its own when we sent out our last newsletter with the launch announcement. Hopefully all technical issues have been resolved now, so if you weren’t able to see our new look before – give our site another try!

Time for Homeschool: Practicing with real data!

Speaking of recording problems in spreadsheets… this week’s online learning resource section includes links to some interesting datasets that will give you the chance to practice the data science techniques you’ve been working on and hone your problem solving skills.

Happy Learning!

  • I’ve got 99 problems… how do I pick ONE (data science technique)? If you’re at a bit of a loss for where to start, this article provides some insight into picking the right kind of data depending on which technique you are hoping to practice. Similarly, you can find some suggestions for some structured data science projects, along with the links to the appropriate data sets, here.
  • My main problem right now is being tired of watching re-runs of my favorite reality TV shows… is there a data science technique for that? Not exactly. However, the VLOG Dataset curated by researchers at the University of Michigan (link not data science related, just nostalgic for Michigan football games) catalogues massive amounts of data gathered from Lifestyle Video Blogs, and also provides some resources discussing the best ways to tag, organize, and analyze this kind of data.
  • I’d rather do data science on other people’s problems, not mine. What is the right technique for me? Never fear, this tutorial will walk you through the process of working with streaming data (specifically, the Twitter API), and how to collect and analyze the information published by others online.
  • Enough tutorials, just show me the data! This free data repository at Harvard University provides access to massive amounts of research data across a wide variety of fields – from Astronomy to Law to Military History, etc. Play as you please!

Stay tuned for more learning at home resources – next time highlighting some more unorthodox and creative applications for data science techniques.


Ben’s Take


Have you thought about what comes after Covid-19 (pattern-wise, Covid-20, I guess)?

Yes, almost everyone has cut back in a really big way. Things are tight now. But, have you thought about what comes next? Sadly, when the current crisis ends, a number of us will be looking for new jobs. Others, however, will have to figure out where to go next. For those in analytics, that means answering some very important questions, like:


  • What projects are most important and need to continue?
  • Should we hire new staff to tackle our hot projects or does the risk of reoccurrence make that dangerous?
  • Has this bout of working from home made the concept of remote resources more, or perhaps less, appealing?

Frankly, we’re trying to figure out how companies are going to answer questions like these. As you may know, we provide both temporary contractors and permanent employees in analytics and data engineering. I, personally, think that companies are going to forgo hiring for a while and lean on contractors to meet immediate needs until the situation calms down and stabilizes. What do you think? Do you have a plan for what comes next? I’d love to hear what you’re thinking and, of course, I’m here as a sounding board if you need one.

More on my Data Matching project, aka Golden Record

So, it turns out that funky characters from alphabets other than traditional US English can throw off a database load routine. Who knew? I suspect it’s a problem that has followed Mr. Ziębo all his life, however. In other words, I was busy on Sunday.

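I won't go into exactly what broke in our load routine, but here is a small sketch of two common ways a name like Ziębo can trip up a loader. Both are assumptions chosen for illustration: a character encoding that can't represent the letter ę, and two visually identical spellings that differ at the Unicode level. It uses only Python's standard library.

```python
import unicodedata
from typing import Optional

name = "Zi\u0119bo"  # "Ziębo": \u0119 is the letter e with ogonek


def try_encode(text: str, encoding: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Pitfall 1: legacy encodings simply have no slot for this character.
ascii_bytes = try_encode(name, "ascii")     # None
latin1_bytes = try_encode(name, "latin-1")  # None: Latin-1 stops at U+00FF
utf8_bytes = try_encode(name, "utf-8")      # works: the letter becomes two bytes

# Pitfall 2: the same visible name can be stored as different code points.
decomposed = "Zie\u0328bo"  # plain "e" followed by a combining ogonek
looks_equal_but_is_not = decomposed == name
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == name

print("ascii:", ascii_bytes, "| latin-1:", latin1_bytes, "| utf-8:", utf8_bytes)
print("raw comparison:", looks_equal_but_is_not, "| after NFC:", equal_after_nfc)
```

For matching work, the second pitfall matters as much as the first: if two systems store the same name in different Unicode forms, a plain equality test will miss the match unless both sides are normalized first.
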
In the broader picture, we are making great progress in our effort to develop a system that matches people and other things across data sets and we remain on track for a late June POC release. Interesting accomplishments in the past few weeks include:


  • We now have a web page that describes Golden Record (in perhaps too much detail).
  • We’ve tested our basic matching algorithms and, thank goodness, they work!
  • A number of folks have stepped forward to provide input on their needs and to describe situations where they face matching problems. (Thank you, thank you, thank you!)

I am very eager to hear about other situations where matching records between datasets could be useful. If you know of a potential need and are willing to talk, please do reach out to me at Benjamin.Taub@Dataspace.com. I promise that I won’t try to sell you anything (at least not until after June 30). I just want to hear about your needs. And, if you’re comfortable sharing some sample data sets, that would be heaven! In any case, don’t hesitate to reach out if you have any input or questions for me. Thanks!

That’s all for now, thanks for reading. Until next time, please don’t let anyone sneeze within six feet of you.






Greetings from Dataspace!

Here’s hoping that this week’s newsletter finds you well, and that those of you who observe enjoyed happy Passover and Easter celebrations, even if the family was only able to come together virtually.

Much like the above painting of the Last Supper received a recent renovation to fit in with modern times, the Dataspace team has been working on some updates of our own this past week – including a redesign of our website! We’re pretty happy with our new look, so check it out when you have a minute and let us know what you think!

Time for Homeschool: Data Science Learning Resources

This week’s online learning resources focus more on introductions to some specific analytics related tools, and ways that you can beef up your hands-on skills to tackle all of those pesky data science projects and stay up to date on the latest techniques.

Happy Learning!

  • Want to get over a fear of snakes? This intro course takes the scary out of learning Python for data science.
  • But how do I learn to be a real snake charmer? Once you’ve passed the Python basics, expand your capacity to use this tool by introducing yourself to some of the most useful Python libraries for data scientists.
  • I’m more of a super nerd than a snake charmer… where should I start? Good news! Hadley Wickham – well known statistician and developer of numerous R packages – has many books and open source code to get you up and running with R programming. In particular, check out his intro books R for Data Science, and R Packages (how to develop your own!).
  • My office isn’t ready for R and Python level analytics, are there some skills I can work on? This free course on data manipulation and analytics techniques for Google Sheets will help you become a formula master in 30 days! This site also has paid courses available if you are interested in delving deeper into this tool.

Stay tuned next week for more learning at home resources – next time linking you to some interesting data sets you can try out your new skills on.


Ben’s Take


Howdy from my coronavirus cave (actually guest room) here in Ann Arbor!

I’m continuing work on the data matching project I mentioned last week – it’s nice to be able to focus on something purely technical for a while. More on this project as it evolves over the next few weeks and months. Please do reach out if you have a need that entails matching up records from disparate databases and spreadsheets.

New website: Our team has built a new dataspace.com website from scratch. It’s a refreshing change! Check it out if you get a moment and let me know what you think.

Bad marketing swag idea #1: Dataspace branded N95 masks. I think that’s enough said about that.

A cool application of image recognition and data visualization: Finally, it turns out that I’m not the only one in Ann Arbor who’s working through this crisis. A local company called Voxel51 has developed technology that monitors activity in real time. Using video feeds from around the world, they’re calculating a Physical Distancing Index (PDI) for each location to determine how well these cities are managing social distancing. They then plot the PDI versus the number of Covid-19 cases and deaths in each of these cities. It’s a fascinating, timely application of data science techniques to today’s biggest problem. Check it out here: https://pdi.voxel51.com/newyork. Hover over the graph to see an image from the related point in time and how Voxel51’s technology identifies the items in that image. Very cool!

Stay safe and, as always, feel free to reach out. I’d love to hear from you!






Greetings from Dataspace!

First of all, we here at Dataspace hope that you and all of your loved ones are safe and well in this time of uncertainty, and that we’ll all soon be back on our feet.

As the nation’s workers struggle with unemployment or adjust to working from home, it has become even more essential for each of us to pay attention to our habits (or lack thereof) of self-care.

Just like it is important to build in exercise to keep our bodies healthy, we also need to make an effort to keep our brains fit and agile… and one of the best ways to do this is through learning new things!

Time for Homeschool: Data Science Learning Resources

In our upcoming newsletters, we’re hoping to provide you with some resources to help you beef up your analytics skills. While both are important, there’s a difference between a data scientist and a programmer who knows data science toolkits. A data scientist generally knows technology but also understands his/her business and the statistics and techniques that will improve it.

So, rather than technology, today we focus on the basic statistics and concepts that underlie modern analytics. Sharpen your pencil and let’s get started!

  • Why yes, I did study at Harvard (remotely, for free). Didn’t everyone? The edX website has a ton of free data science material produced by some top universities. This introduction to probability course, for example, is from Harvard. We dare you to view the intro video and not want to jump right in.
  • When you’re feeling like nothing is “normal” anymore… Check out this five-part series on data science concepts, the first of which dives heavily into statistics and distributions (normal or otherwise).
  • Just because you’re quarantined doesn’t mean you can’t go for a walk through a (random) forest. One “crowd favorite” data science technique is called random forest classification. It’s a way to create predictive models. This blog post provides a great introduction to what it’s all about.
  • What should I do if the machines take over the office while I’m gone? Keep learning about machine learning! This free lecture series from a real Caltech course covers the theories and practices behind learning from data – both for humans and machines. There are even “homework” assignments and a final exam available if you really want to feel like you’re back in school!
  • But can my computer keep me company while I’m stuck at home? If you’re interested in learning more about how machines process and understand human language, check out this Introduction to Natural Language Processing – what it is, how it works, and some common techniques.

Stay tuned next week for more learning at home resources – next time focusing on a few specific analytics tools.

Ben’s Take

My Coronavirus Project: Have any input for me?

Greetings (from six feet away, of course)!

One thing I’m doing to make lemonade out of this stuck-at-home crisis is to work on a piece of software I’ve been thinking about for a long time. It’s a cloud-based tool for finding matches across data sets. For example, it can tell that the John Smith in your CRM system is the same person as the John Smith in your sales system but different from the John Smith in your warranty system (although it works for any data, not just persons). It can return the results in bulk or keep the data synchronized over time, serving as a master data management (MDM) solution.

Yes, I do realize that there are already matching and MDM products on the market. I’m hoping that this one, tentatively called Golden Record, will be different in a few ways:

  • It will be lightweight / cloud-based
  • It will have both an API and a browser-based web user interface
  • It will provide both one-time matching and long term, persistent MDM / integration
  • It will be less expensive than existing solutions, which can cost into the hundreds of thousands of dollars.

I’ve heard from a few folks in the software industry about their needs for something like this but I could really use your input, too. In particular, if you have a few free minutes (and who doesn’t right now?) could you please let me know…

  • If you’ve addressed a need like I’m targeting, how’d you do it?
  • If you have, or are anticipating, a need for something like this?
  • If you know of industries and use cases where Golden Record might be a good fit?

I’d love to talk if you’re up for it. Just email me at benjamin.taub@dataspace.com.

And, above all else, stay safe! Thanks for reading.




Dataspace is thrilled to announce that we will be sponsoring the 2019 Indy Big Data Conference in Indianapolis, Indiana on September 19th!   We would love to meet any and all big data, data science and other analytics professionals in the area.  So if you’re in town, please stop by our table for a free […]


Dataspace’s Ben Taub was featured in a Dice post offering some clever suggestions for hiring excellent data science talent.  We encourage you to take a peek and absorb some of the wisdom within!


Dataspace is excited to announce that we will be sponsoring the 2019 INFORMS Conference on Business Analytics & Operations Research in Austin, Texas from April 14-16!


We encourage all local data science and business analytics professionals to please stop by our booth for a free gift and an opportunity to learn more about Dataspace and our services.  We’re looking forward to meeting you all!

For the second year in a row, Dataspace is pleased to announce that we will be sponsoring the 6th Annual Big Data & Business Analytics Summit at Wayne State University in Detroit, MI from March 21st to 22nd, 2019!

If you are a local practitioner of data science and analytics, please visit our booth for a free gift and the chance to learn more about what Dataspace can do for companies working to build world-class analytics organizations.  We look forward to meeting you all in person and building lasting relationships!