Skip to content
Advertisement

How do I preserve utf-8 JSON values and write them correctly to a utf-8 txt file in Python 3.8.2?

I recently wrote a python script to extract some data from a JSON file and use it to generate some SQL Insert values for the following statement:

INSERT INTO `card`(`artist`,`class_pid`,`collectible`,`cost`, `dbfid`, `api_db_id`, `name`, `rarity`, `cardset_pid`, `cardtype`, `attack`, `health`, `race`, `durability`, `armor`,`multiclassgroup`, `text`) VALUES ("generated entry goes here")

The names of some of the attributes are different in my SQL table but the same values are used (example cardClass in the JSON file/Python script is referred to as class_pid in the SQL table). The values generated from the script are valid SQL and can successfully be inserted into the database, however I noticed that in the resulting export.txt file some of the values changed from what they originally were. For example the following JSON entries from a utf-8 encoded JSON file:

[{"artist":"Arthur Bozonnet","attack":3,"cardClass":8,"collectible":1,"cost":2,"dbfId":2545,"flavor":"And he can't get up.","health":2,"id":"AT_003","mechanics":["HEROPOWER_DAMAGE"],"name":"Fallen Hero","rarity":"RARE","set":1,"text":"Your Hero Power deals 1 extra damage.","type":"MINION"},{"artist":"Ivan Fomin","attack":2,"cardClass":11,"collectible":1,"cost":2,"dbfId":54256,"flavor":"Were you expectorating another bad pun?","health":4,"id":"ULD_182","mechanics":["TRIGGER_VISUAL"],"name":"Spitting Camel","race":"BEAST","rarity":"COMMON","set":22,"text":"[x]At the end of your turn,n  deal 1 damage to another  nrandom friendly minion.","type":"MINION"}]

produce this output:

('Arthur Bozonnet',8,1,2,'2545','AT_003','Fallen Hero','RARE',1,'MINION',3,2,'NULL',0,0,'NULL','Your Hero Power deals 1xa0extra damage.'),('Ivan Fomin',11,1,2,'54256','ULD_182','Spitting Camel','COMMON',22,'MINION',2,4,'BEAST',0,0,'NULL','[x]At the end of your turn,nxa0xa0deal 1 damage to anotherxa0xa0nrandom friendly minion.')

As you can see, some of the values from the JSON entries have been altered somehow as if the text encoding was changed somewhere, even though in my script I made sure that the JSON file was opened with utf-8 encoding and the resulting text file was also opened and written to in utf-8 to match the JSON file. My aim is to preserve the values exactly as they are in the JSON file and transfer those values to the generated SQL value entries exactly as they are in the JSON. As an example, in the generated SQL I want the “text” value of the second entry to be:

"[x]At the end of your turn,n  deal 1 damage to another  nrandom friendly minion."

instead of:

"[x]At the end of your turn,nxa0xa0deal 1 damage to anotherxa0xa0nrandom friendly minion."

I tried using functions such as unicodedata.normalize() but unfortunately it did not seem to change the output in any way. This is the script that I wrote to generate the SQL values:

import json
import io

chosen_keys = ['artist','cardClass','collectible','cost',
'dbfId','id','name','rarity','set','type','attack','health',
'race','durability','armor',
'multiClassGroup','text']

defaults = ['NULL','0','0','0',
'NULL','NULL','NULL','NULL','0','NULL','0','0',
'NULL','0','0',
'NULL','NULL']

def saveChangesString(dataList, filename):
  with io.open(filename, 'w', encoding='utf-8') as f:
    f.write(dataList)
    f.close()

def generateSQL(json_dict):
    count = 0
    endCount = 1
    records = ""
    finalState = ""
    print('n'+str(len(json_dict))+' records will be processedn')
    for i in json_dict:
        entry = "("
        jcount = 0
        for j in chosen_keys:
            if j in i.keys():
                if str(i.get(j)).isdigit() and j != 'dbfId':
                    entry = entry + str(i.get(j))
                else:
                    entry = entry + repr(str(i.get(j)))
            else:
                if str(defaults[jcount]).isdigit() and j != 'dbfId':
                    entry = entry + str(defaults[jcount])
                else:
                    entry = entry + repr(str(defaults[jcount]))
            if jcount != len(chosen_keys)-1:
                entry = entry+","
            jcount = jcount + 1
        entry = entry + ")"
        if count != len(json_dict)-1:
                entry = entry+","
        count = count + 1
        if endCount % 100 == 0 and endCount >= 100 and endCount < len(json_dict):
            print('processed records '+str(endCount - 99)+' - '+str(endCount))
            if endCount + 100 > len(json_dict):
                finalState = 'processed records '+str(endCount+1)+' - '+str(len(json_dict))
        if endCount == len(json_dict):
            print(finalState)
        records = records + entry
        endCount = endCount + 1
    saveChangesString(records,'export.txt')
    print('done')
                
with io.open('cards.collectible.sample.example.json', 'r', encoding='utf-8') as f:
    json_to_dict = json.load(f)
    f.close()

generateSQL(json_to_dict)

Any help would be greatly appreciated as the JSON file I am actually using contains over 2000 entries so I would prefer to avoid having to edit things manually. Thank you.

Also the SQL table structure code is:

-- phpMyAdmin SQL Dump
CREATE TABLE `card` (
  `pid` int(10) NOT NULL,
  `api_db_id` varchar(50) NOT NULL,
  `dbfid` varchar(50) NOT NULL,
  `name` varchar(50) NOT NULL,
  `cardset_pid` int(10) NOT NULL,
  `cardtype` varchar(50) NOT NULL,
  `rarity` varchar(20) NOT NULL,
  `cost` int(3) NOT NULL,
  `attack` int(10) NOT NULL,
  `health` int(10) NOT NULL,
  `artist` varchar(50) NOT NULL,
  `collectible` tinyint(1) NOT NULL,
  `class_pid` int(10) NOT NULL,
  `race` varchar(50) NOT NULL,
  `durability` int(10) NOT NULL,
  `armor` int(10) NOT NULL,
  `multiclassgroup` varchar(50) NOT NULL,
  `text` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

ALTER TABLE `card`
  ADD PRIMARY KEY (`pid`);

ALTER TABLE `card`
  MODIFY `pid` int(10) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
COMMIT;

Advertisement

Answer

xa0 is a variant on space. Is it coming from Word?

But, more relevant, it is not utf8; it is latin1 or other non-utf8 encoding. You need to go back to where it came from and change that to utf8.

Or, if your next step is just to put it into a MySQL table, then you need to tell the truth about the client — namely that it is encoded in latin1 (not utf8). Once you have done that, MySQL will take care of the conversion for you during the INSERT.

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement