Migrating data from PlanetScale

As previously posted here and here, PlanetScale are pulling their free tier.

I've already written about the process of changing the codebase to work with a different provider, but that still left a question: how to get my existing data out of PlanetScale and into Supabase?

Dumping PlanetScale

The only "easy" way to get data out of PlanetScale that I could find was using their CLI tool pscale to create a MySQL dump. There are other (generic) ways to get data out of a MySQL database that may be better, but none that I could get to work with PlanetScale.

This means having their CLI tooling installed on your local machine (or a virtual machine that you have control over) and then running:

pscale database dump <database name> <branch name>

That gives you a folder in your current working directory that contains two SQL files for each table in your database — one containing the schema, and one with all the rows set out as an INSERT instruction.
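As an illustration, the dump folder for a database with users and bookings tables might look something like this (the file names here are illustrative; the numbered files hold the row data and the schema files hold the table definitions):

```
pscale_dump_lsersa_booking_main/
├── lsersa_booking.users-schema.sql
├── lsersa_booking.users.00001.sql
├── lsersa_booking.bookings-schema.sql
└── lsersa_booking.bookings.00001.sql
```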

Recreating on Supabase

I'd already set up my tables in Supabase as part of changing the codebase (I use dbdiagram.io to set up schema when using SQL databases, so I was able to grab a Postgres version of the schema from there). So "all" I needed was to get my data in.

Whilst I might have been able to use the second SQL file in the dump folder (copying the whole INSERT query and pasting it into the Supabase web console), reading through the Supabase docs it seemed a much safer bet to convert them into CSVs and then use Supabase's "import from CSV" function.

Reading a file in, converting it, and outputting in a different format — feels like a task for Python to me.

Not using Pandas

Some quick googling suggested that using Pandas to read in an SQL file and output a CSV would be "pretty easy", but there were some potential hiccups with the PlanetScale dump which meant I felt more confident going "old-skool" — hand-writing a conversion function using RegEx and Python's CSV module.

My concept was:

Read the file into memory.

For each line:

  • remove the trailing closing bracket and any extraneous characters
  • remove the first open bracket and any preceding characters
  • strip any quote marks (WARNING!)

Write the data out as a CSV.
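The steps above can be sketched on a single (made-up) INSERT line like so:

```python
import re

# Hypothetical dump line - the table name and values are illustrative only
line = 'INSERT INTO `users` VALUES (1,"Alice","12 High St");'

# 1. remove the trailing closing bracket and any extraneous characters
temp = re.split(r"\),?;?$", line)[0]
# 2. remove the first open bracket and any preceding characters
row = temp[temp.find("(") + 1:]
# 3. strip bounding quote marks (single, double or backtick) from each cell
cells = [re.sub(r'^["\'`]|["\'`]$', "", c) for c in row.split(",")]
print(cells)  # ['1', 'Alice', '12 High St']
```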

When used from the command line the function will check that it's been given an SQL file, or a folder, and warn if neither is true. If the input is a folder it'll cycle through every non-schema SQL file within that folder.

Usage example for a single file:

python3 sql_dump_to_csv.py ./lsersa_booking.users.00001.sql

And for a folder:

python3 sql_dump_to_csv.py ./pscale_dump_lsersa_booking_main/

The output does need checking, particularly if there are any text fields in the database that may actually contain commas (like an address). In that instance there'll be an extra column in the CSV (which doesn't have a header). I simply opened the CSV in a spreadsheet and eyeballed it. The only table this affected was my users table (which had some commas in address fields), and it was visually extremely obvious where columns had gotten out of sync.
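To see why a comma inside a text field throws the columns out, consider a made-up three-column row (id, address, name) where the address contains one:

```python
# A naive comma split produces four cells for a three-column row,
# because the address field itself contains a comma
row = '1,"12 High St, Springfield","Alice"'
cells = row.split(",")
print(len(cells))  # 4 - one more cell than the table has columns
```

This is exactly the misalignment that stands out when you eyeball the CSV in a spreadsheet.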

Full code:

# sql_dump_to_csv.py
import re
import csv

def remove_bounding_quotes(text):
    return re.sub(r'^["\'`]|["\'`]$', "", text)

def convert_to_csv(filepath):
    # open the file at the path
    sql_file = open(filepath, 'r')
    # read it in
    lines = sql_file.readlines()
    # create an empty list
    csv_lines = []
    # read lines in - we only want content between brackets on each line
    for line in lines:
        # get rid of final closing bracket and any other detritus
        temp = re.split(r"\)( VALUES)?,?;?$", line)[0]
        # find the first opening bracket
        bracket_pos = temp.find("(")
        # commit
        csv_lines.append(temp[bracket_pos + 1 :])
    # break down into elements
    csv_lines = list(map(lambda line: line.split(","), csv_lines))
    # remove opening and closer quotes - single, double and `
    csv_lines = list(map(lambda line: list(map(lambda cell: remove_bounding_quotes(cell), line)), csv_lines))
    # in the first row, we need to make sure we get rid of expiresAt for expires_at
    csv_lines[0] = list(map(lambda cell: 'expires_at' if cell == 'expiresAt' else cell, csv_lines[0]))
    # save to same path but finishing .csv
    csv_file = open(re.sub(r'\.sql$', '.csv', filepath), 'w', newline='')
    csv_writer = csv.writer(csv_file)
    csv_writer.writerows(csv_lines)
    csv_file.close()
    sql_file.close()

def is_sql_dump_file(filepath):
    # we want sql files that don't have schema (naming convention is digits or 'schema')
    if re.search(r'\.sql$', filepath):
        if re.search(r'schema\.sql$', filepath):
            return False
        else:
            return True
    else:
        return False

def final_message():
    print("Complete. Please check output for escaped special characters")

if __name__ == "__main__":
    import sys
    import os

    filepath = sys.argv[1]
    sql_filepath = re.search(r'\.sql$', filepath)
    dir_filepath = os.path.isdir(filepath)

    if sql_filepath:
        convert_to_csv(filepath)
        final_message()
    elif dir_filepath:
        # need to try finding sql files that aren't schemas
        all_files = [f for f in os.listdir(filepath) if os.path.isfile(os.path.join(filepath, f))]
        sql_files = list(filter(lambda f: is_sql_dump_file(f), all_files))
        # run a conversion for each
        for sql_file in sql_files:
            convert_to_csv(os.path.join(filepath, sql_file))
        final_message()
    else:
        # not an sql file and not a directory, stop
        print("Not an SQL file or a directory")

Once I'd generated my CSVs for each table, actually importing the data into Supabase was very easy. I'd been unsure whether Supabase would cope with PlanetScale's boolean treatment (storing as a 1/0 integer), or if date formats would be an issue, but both were accepted just fine. The only thing I had to be aware of was which tables had foreign key constraints, and therefore the order in which I should import the tables.