In the article, we’ll cover the various types of tools that constitute a modern data stack, and we’ll make recommendations based on your stage. We’ll answer the following questions: What infrastructure do we need to help all employees to easily access key data and insights? Which tools should we use? How to get started?
If you’re reading this, you may be a founder or exec at an early stage startup. You know you need to make the most of your data to make sure you grow your business as fast as possible.
You might be at the typical stage where business users ask developers to run SQL queries on your production database to get some insights. Your developers are probably tired of this and your business users are tired of having to wait for developers for the results. Things are not going to get better:
- Business users will need more and more such insights
- The number of data sources will increase, they’ll need to get insights from your CRM, payment software, etc
Soon, you’ll have nightmarish spreadsheets with csv extract from all over. Reports won’t be automated, teams will spend a lot of time maintaining and updating reports and will have multiple sources of truth.
If this is the case and you’re not sure where to start to make things better, this article is for you.
We’ll cover the various types of tools that constitute a modern data stack, and we’ll make recommendations based on your stage. We’ll answer the following questions:
- What infrastructure do we need to help all employees to easily access key data and insights?
- Which tools should we use?
- How to get started?
1. What are the different parts of a modern data stack
Where your data is generated
This is where the core raw data of your business is generated and stored. If you operate an e-commerce company for example, you’ll store in your database:
- All your customers
- All the products listed on your website
- All the transactions (when a customer buys a product)
This usually will be in either of these two types of databases:
- Relational databases: typically MySQL, PostgreSQL or Microsoft SQL Server
- Non relational databases: MongoDB, Firebase, and others
If you’re reading this, you probably already have your database setup.If you don’t, any of the relational databases will work well. If you’re thinking about using a non relational database, we recommend you make sure there’s a really good reason for that. If not, go for a relational one, it will save you lots of headaches when you start to want to make sense of your data. We’ve talked to many teams who chose a MongoDB database because it was kind of cool and they quickly regretted that.
Looking into your database’s data will help you answer questions such as:
- How many transactions have we had per day?
- What’s the weekly revenue
It typically won’t answers questions like the following though:
- How many times did a user click on this specific product page
- How many times did a user hover over that specific picture
This type of information is typically not recorded in your production database. However you can use events tracking tools like Segment.
To analyse your events data, you have two main choices
- You can send it to your a data warehouse and then use a BI tool (more to come on this below)
- You can directly connect it to a tool like Amplitude or Mixpanel.
Segment is the go to tool for this. Unless you know what you’re doing and have a specific reason not to, you should use Segment.
Then of course, there’s the data that’s generated and stored by all the softwares you’re using to run your business:
- CRM: Hubspot, Salesforce, Pipedrive
- Customer support software: Zendesk, Intercom, Front
- Billing: Stripe
- Marketing: Google Analytics, Google Ads, Facebook Ads
Where your data should be gathered
Even if you’re a very early stage startup, you probably have at least data in your production database, in Segment, and 1 SaaS tool (at least an all in one type of solution like Airtable or Notion).
If you’re more advanced you’ll have data in multiple other SaaS.
Most analysis or data exploration you’ll want to do will have to use data from several of these sources.
That’s why the best thing you can do is gather all your data in one place. That place is called a data warehouse.
- Main data warehouses you can use: BigQuery (Google), Redshift (Amazon), Snowflake
Note that some early stage companies simply choose to use a MySQL or PostgreSQL as a data warehouse. That can work too. You’ll want BigQuery or Redshift mainly to have better performance. But if your volume of data is small, a simple SQL database might do the work.
Snowflake will typically be for much bigger companies. Both BigQuery and Redshift will work fine. If you’re early stage, you’ll probably want to decide depending on what cloud provider you’re using (Google or Amazon). If you’re later stage and have bigger volumes of data, you’ll want to compare performance and cost for each of these two.
How to get your data in your data warehouse
If you’ve ever heard of ETL, that’s what it is. Extract Transform Load:
- Extract data from a data source
- Transform (optional): make the data easier to digest for a business analyst type of person
- Load in the warehouse
One option is to build your own data pipelines. You don’t want to do that.
What you want to do is use a tool that will do the heavy lifting for you.
- Most popular options: Fivetran, Stitch, Airbyte, and Segment does that too
This requires limited to no engineering resources, you just have to connect your data warehouse and the data sources.
We’ve observed that many startups try to build their data pipelines themselves. We think it’s a mistake. Building them might seem not so hard and cheaper on the surface, but maintaining them (solving bugs when they occur, updating them when the API of the data source changes, adding a new data source, etc) will take your team precious time that you’d rather be spent on your core product.
How to analyze your data
Once you’ve done the setup, we finally arrive at the final results we want, which is to let business users answer key questions about the business to improve their own performance and the company’s performance.
You have multiple types of tools you can connect to your data warehouse
SQL clients & Dashboarding tools
I’m putting those two categories together although they’re a bit different and may address different types of users and use cases.
Roughly, these tools let you connect your data warehouse or database, write SQL and get an output: typically a data table, or a graph / chart.
Popular SQL clients: DBVisualizer, PopSQL, Datagrip, and many more
Popular business intelligence (BI) tools: Metabase, Google Data Studio, Chart.io (being shut down after being bought by Atlassian), Mode analytics, and others
Tableau and Looker to some extent belong to the BI tools category but have two differences:
- They let you also mode and transform the data (see below)
- They tend to be for much bigger companies
Excel / Google Sheets: a popular demand from business users to data teams is to be able to use data from the data warehouse in Excel or Google Sheets - tools they’re very familiar with. Several ways to do that
- You can let users download a csv dump of a some tables through a web interface
- You can also write an app script or a macro to pull data from the data warehouse at regular intervals to Google Sheets or Excel
Actiondesk (yes, us 👋):
Actiondesk is a new type of business intelligence tool.
- Like other tools, you can connect your data warehouse or database
- Unlike other tools, you won’t have to write SQL (although you can if you want to)
- Actiondesk works like a spreadsheet, you can view your raw data from your data warehouse, and then use formulas and functionalities (pivot tables, graphs) you’re familiar with to make sense of your data.
How to model and transform your data - more advanced
Proper data modelling will typically be for later stages companies, but in practice even earlier stage companies will do it. If you’re only getting started, you can skip this, although it’s always good to have this step in mind.
A simple way of looking at modelling is to set universal definitions of business concepts.
For example, if you’re a software business, one important concept is Daily Active User. The number of daily active users and its evolution are metrics you’ll probably follow closely.
Depending on what your business is, you may want to define a daily active user in different ways
- For some businesses, a user will be considered as active if they log in
- For others, a user will have to do more than that (eg completing a task) to be considered as active.
If as an organization, you were to just give the raw data (sessions for each user) to different teams in the company, they’ll take this raw data, apply their own logic and might come up with different numbers for the number of daily active users. And then you’ll have your teams pulling each other’s hair to understand why it’s different instead of focusing on providing values to your customers.
One way of mitigating that is to set the definition in the code. You can have engineers creating a materialized view in your data warehouse that sets the definition of a daily active user, and then business users can access views such as “Daily active users per day” and everyone will be on the same page.
What I just described is a formal way of doing it (and a clean way). Lots of organizations will do something similar by just sharing around a SQL query with the definition of a Daily Active User.
If you’re at the stage where you have nothing, we recommend not to worry too much about modelling for now.
2. What you should do depending on your stage
Pre Product Market Fit
If you’re still at the stage where you have none or few users / customers, you’ll probably have most of your data in your production database.
- In that case, you may want to just connect your production database to a Business Intelligence tool (Metabase, Actiondesk, Mode analytics, etc) or a SQL client. That may be enough to get most of the insights you need at this stage.
- We do recommend to install Segment as early as possible too. Even if you don’t start using the data right away, when you do start, you’ll be happy to have historical data.
- You can easily set up a destination in Segment to send all your events data to a database (PostgreSQL for example) or a data warehouse (BigQuery, Redshift). It takes literally only a few clicks.
Post Product Market Fit
At this point, you have many customers, you start to have a proper marketing / sales process, you have precious data in your billing system too.
You definitely should make the effort of setting up a data warehouse and data pipelines now.
You have two options:
You can set up the whole modern data stack yourself. This is time consuming, and will require at least 1 data engineer dedicated to this. Finding this data engineer might be challenging in itself, they're hard to find and there's way more demand for such profiles that supply.
You can use our "Data team as a service" offer, we’ll do all the hardcore engineering work to
- Centralize all of your data in the same place
- Transforming it to make it easy for anyone in your team to analyze your data
- Build all the dashboards you need
- All of this for the fraction of the cost of hiring your own team
Building a data infrastructure used to take weeks if not months of engineering. Now you can do it in a few hours with modern tools. We hope to have made the topic less daunting.
Our overall recommendation is to start building this as early as possible. The best performing startups we’ve seen always have a clean (or at least not too dirty 😛) data infrastructure letting business users easily access data and insights.
Of course, focus first on building a product people want and getting your first users. If you have any questions, we’d be happy to help. Reach out on twitter to me. Same if you believe there are any inaccuracies in this article.