Python and CSV; know your limits

Right, I’ve got this honking great MySQL dump file, and I’d like to use the data in it without needing a MySQL db server; so I thought I’d turn it into a SQLite db, as python has native sqlite3 support. Various suggestions are on offer around the web; SQLiteBrowser, for example, claims to import and export from various SQL flavours and CSV files. Nice; but it chokes on the file.

So I saved a copy of the thing as a plain text file with the mysql tags trimmed off and tried a few options; a posting on the UK Python list reminded me that the csv module in Python can take arbitrary characters as delimiters, not just commas, which sounded useful. After all, I couldn’t just split it at the commas because the contents are basically a lot of tuples, like this: (data, data, data),(data, data, data) And I need them in groups.
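Here's the idea in miniature (the sample line is invented): swap the multi-character separator for a single character, since csv.reader will accept any one character as a delimiter:

```python
import csv

# One line of the dump with the SQL wrapper trimmed off: tuples
# separated by '),(' rather than by any single character.
line = "(1,'LHR','London'),(2,'JFK','New York')"

# csv.reader only takes a one-character delimiter, so trim the outer
# parens and swap the multi-character separator for a tab.
line = line.strip("()").replace("),(", "\t")

for row in csv.reader([line], delimiter='\t'):
    print(row)   # ["1,'LHR','London'", "2,'JFK','New York'"]
```

Each field that comes back is still a comma-joined group, but at least the groups are intact.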

I thought I was being clever when I did a global find/replace, taking out the ),( because the csv module doesn’t support multiple characters as delimiters, and replacing it with \t; then I wrote this script:

#!/usr/bin/env python

import sqlite3
import csv

f = open('/home/yorksranter/Documents/Geekery/airports.txt')
csv.field_size_limit(100000) #see below!
input = csv.reader(f, delimiter='\t')
conn = sqlite3.connect('/home/yorksranter/Desktop/airport.sql')
curse = conn.cursor()

curse.execute('''CREATE TABLE airports ('id', 'country', 'latitude', 'longitude', 'name', 'timezone', 'shortname')''')

for item in input:
        curse.execute('INSERT INTO airports VALUES (%s)' % item)

Each item should be a tuple of seven values, and they should be in the same order they were in the original db, so this ought to recreate the data in an SQLite 3 file.

Then my problems began; I got the following error message:

_csv.Error: field larger than field limit

Google found me this and this; as far as I understand it, individual fields in the dump are bigger than the csv module's default limit; and there does seem to be a way of altering that limit, going by the module source code.

Thoughts? Update: There is; csv.field_size_limit(), and I altered it until the thing ran properly; but there’s still no data in the db!


  1. Duane

    I’m no pythonista and the docs I find with a quick google look incomplete, but a few thoughts nonetheless:

    1) You should probably be using SQL parameters instead of a formatted string. It will be more robust in the presence of dodgy data. Exactly what does item.str() return anyway? Is it properly quoted and escaped?

    2) Does execute return an error in case of failure? If so, check it. If not, perhaps check whether the INSERT had any effect by looking at conn.total_changes or something.

    3) From the commit at the end (and a quick look at the docs) it appears those INSERTs are all happening in a single transaction. Perhaps that is getting too large and hitting limits somewhere. Try using autocommit mode or committing every n INSERTs.
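    For what it's worth, a sketch of that third suggestion (table and data invented for the example): commit every n INSERTs, then check conn.total_changes afterwards, as per point 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE airports (id, name)")

rows = [(i, "airport %d" % i) for i in range(10)]

# Commit every n INSERTs so no single transaction grows too large,
# then once more at the end for the final partial batch.
n = 4
for count, row in enumerate(rows, 1):
    curs.execute("INSERT INTO airports VALUES (?, ?)", row)
    if count % n == 0:
        conn.commit()
conn.commit()

print(conn.total_changes)  # 10
```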

  2. David

    Another possibility: the string you're passing is of the form "INSERT INTO a VALUES (['1', '2', '3', '4', '5', '6', '7'])", and the brackets might be confusing sqlite.

    curse.execute('INSERT INTO airports VALUES (?,?,?,?,?,?,?)', item)
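    To make the contrast concrete, a quick sketch (the sample row is invented) of what each form actually hands to sqlite:

```python
import sqlite3

item = ['1', 'UK', '51.5', '-0.5', 'Heathrow', 'Europe/London', 'LHR']

# String formatting interpolates the list's repr, brackets and all, so
# sqlite sees one bracketed Python blob, not seven SQL values:
print('INSERT INTO airports VALUES (%s)' % item)

conn = sqlite3.connect(':memory:')
curse = conn.cursor()
curse.execute('CREATE TABLE airports (id, country, latitude, '
              'longitude, name, timezone, shortname)')

# Placeholders hand each value over separately, quoted and escaped:
curse.execute('INSERT INTO airports VALUES (?,?,?,?,?,?,?)', item)
conn.commit()  # and without this, nothing reaches the file on disk
```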

  3. hannes

    Does %s match % item? What's in the values?

    check the code from here:
    def csv2sqllite(infile, DB, table, delim, createstring='#', colnumber="0", location=":memory:"):
        conn = sqlite3.connect(location + DB)
        curs = conn.cursor()
        csvfile = open(infile, 'rb')
        dr = csv.DictReader(csvfile, delimiter=delim)
        curs.execute("DROP TABLE IF EXISTS " + str(table) + ";")  # if table exists, delete
        if createstring == '#':  # case if we do not have text, followed by values
            createstring = "CREATE TABLE " + table + " (" + dr.fieldnames[0] + " TEXT, "
            for x in dr.fieldnames[1:]:
                createstring = createstring + str(x) + " FLOAT, "
            createstring = createstring[0:len(createstring) - 2] + ");"
        curs.execute(createstring)  # create the table before inserting
        insertstring = "INSERT INTO " + table + " VALUES ("
        for x in dr.fieldnames:
            insertstring = insertstring + "?,"
        insertstring = insertstring[0:len(insertstring) - 1] + ")"
        creader = csv.reader(csvfile, delimiter=delim, quotechar='|')
        for t in creader:
            curs.execute(insertstring, t)
        conn.commit()  # flush the inserts to the database
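    Not hannes's function verbatim, but a cut-down, self-contained sketch of the same approach (the sample data is invented): let the CSV header row drive both the CREATE TABLE and the INSERT.

```python
import csv
import io
import sqlite3

# Two tab-separated data rows under a header row.
data = io.StringIO("id\tname\n1\tHeathrow\n2\tGatwick\n")
dr = csv.DictReader(data, delimiter='\t')

conn = sqlite3.connect(':memory:')
curs = conn.cursor()

# Column names come straight from the header.
curs.execute('CREATE TABLE airports (%s)' % ', '.join(dr.fieldnames))

# One ? placeholder per column.
insert = 'INSERT INTO airports VALUES (%s)' % ','.join('?' * len(dr.fieldnames))
for row in dr:
    curs.execute(insert, [row[f] for f in dr.fieldnames])
conn.commit()
```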

  4. Procedural programs are great, but there's still a lot to be offered by the legacy Unix/Linux command-line tools. sed and awk can parse incredibly large text files into any kind of output, with very flexible context-sensitive filters.

    As serial processors they don't try to read the whole file at once, just sections at a time, serially, adding content to the output stream based on rules.

    You may not even need awk if each line is consistent and you don't need to track the context of where sed is in the serial stream.

    If awk is needed, this could probably be done in fewer than half a dozen lines. If just sed, one.

    Sed primarily relies on a syntax based on regular expressions, and, along with awk, can "reset" its count of input fields if you have interspersed delimiters but some consistency and structure in your data.
