Python - Append CSVs

Hi all,

I've managed to use python to append all CSVs in a folder from a script nabbed online, but am struggling to pull the below together. I wonder if anyone here is able to advise?

I have some standard template CSVs and some bespoke CSVs relating to individuals.

Below is an example of the folder structure, though this can be changed if necessary. There are hundreds of refs. All files are fairly small.

I'm using python because we use some software that can run python scripts as part of its routine, to carry out actions outside of the system.

SERVER PATH\TEMPLATE
Template1.csv
Template2.csv
Template3.csv
Template5.csv

SERVER PATH\TEMPLATE4
Template4 REF1.csv
Template4 REF2.csv
Template4 REF3.csv
Template4 REF4.csv

SERVER PATH\TEMPLATE6
Template6 REF1.csv
Template6 REF2.csv
Template6 REF3.csv
Template6 REF4.csv


I'm trying to append the standard templates in every case (i.e. 1, 2, 3 and 5) and slot in the specific templates where relevant.

i.e. append
Template1.csv
Template2.csv
Template3.csv
Template4 REF1.csv
Template5.csv
Template6 REF1.csv
Output - Appended REF1.csv

Template1.csv
Template2.csv
Template3.csv
Template4 REF2.csv
Template5.csv
Template6 REF2.csv
Output - Appended REF2.csv

Any ideas?

Thanks for any help :)
 
I'm trying to understand what you're trying to do. I think what you want is to end up with 4 output files:

OUTPUT FILE 1 consists of:
Template1.csv
Template2.csv
Template3.csv
Template5.csv
then Template4 REF1.csv and Template6 REF1.csv


OUTPUT FILE 2 consists of:
Template1.csv
Template2.csv
Template3.csv
Template5.csv
then Template4 REF2.csv and Template6 REF2.csv


OUTPUT FILE 3 consists of:
Template1.csv
Template2.csv
Template3.csv
Template5.csv
then Template4 REF3.csv and Template6 REF3.csv


OUTPUT FILE 4 consists of:
Template1.csv
Template2.csv
Template3.csv
Template5.csv
then Template4 REF4.csv and Template6 REF4.csv



Is that right?

1) Are TEMPLATE, TEMPLATE4 and TEMPLATE6 the only folders you have, or is there also a TEMPLATE5 or TEMPLATE7, for example?

2) What are the filenames? Are they actually Template*.csv and Template* REF*.csv just as you have written?

3) If one file is missing (e.g. Template4 REF4.csv) then what do you want to happen?

4) Where should the output files be placed?
 
Hi,

You are correct in what I'm trying to achieve, but the Template4 csv goes between Template3 and Template5, so they are in order.

I can rejig the folder structure however makes things work more efficiently/simply.

Thanks for the reply. If it makes it easier to understand, I'll try to set out what I'm hoping to achieve;

I have a dataset, which is processed through a system, formatted, and output as a single CSV of transactions processed within a month. This has various columns, including name and reference number. The name and reference are concatenated to create a new column called "identifier". This is because separately there may be duplicate names or references, but not in combination. The transactions may differ each month, so new usernames/references may appear, and some usernames/references may not be used within the period.
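To illustrate, the concatenation step might look something like this in pandas (the column names here are made up, since the real headers are not shown):

```python
import pandas as pd

# Hypothetical column names; the real export uses its own headers
df = pd.DataFrame({'NAME': ['A Smith', 'B Jones'],
                   'REFERENCE': ['1234', '5678']})

# Combine name and reference so duplicates in either column alone don't collide
df['IDENTIFIER'] = df['NAME'] + ' ' + df['REFERENCE']
# df['IDENTIFIER'] is now 'A Smith 1234', 'B Jones 5678'
```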

This will be a monthly exercise.

I have used python to split this csv, so that there is a csv per "identifier" as below. This creates a file eg "Template6 A Smith 1234.csv", which is all transactions processed by A Smith, with reference 1234.

Code:
import pandas as pd

data = pd.read_csv(r'\\SERVER\PATH\Transactions.csv')

# One output CSV per unique identifier
for value in data['IDENTIFIER'].unique():
    data[data['IDENTIFIER'] == value].to_csv(
        r'\\SERVER\PATH\Template6\Template6 ' + str(value) + '.csv',
        index=False, na_rep='N/A')

Template4 is essentially the same, just with different data, and there will always be a corresponding "Template4 A Smith 1234.csv" for every "Template6 A Smith 1234.csv".

Template 2 is summarised global data, so the same for every username, but will change monthly.

Template 1, 3 and 5 are fixed data, used to set the data out in a proforma layout ... headers, standard wording etc.

So I'm hoping to end up with an append of

Template1.csv
Template2.csv
Template3.csv
Template4 A Smith 1234.csv
Template5.csv
Template6 A Smith 1234.csv

In that order. These are the actual filenames. Obviously the names and references are made up, but the format is correct (initial surname nnnn).

Template 6 is always the last file. I can set each template in separate folders if it is of some benefit.

If in a given month, there are 500 "identifiers", I want to end up with 500 files appended as above. This number will vary.

If required for some sort of lookup, I could easily generate a list .csv of the expected "identifiers" for that month.

I hope that all makes sense.

Happy to go backover if it's easier to have a single large file with all (in this example) 500 identifiers for Template4.csv and Template6.csv and extract the data from that.
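As a rough sketch of that single-file alternative, pandas groupby can split one big CSV back into per-identifier pieces (the IDENTIFIER column name is borrowed from the split script above; the data is a stand-in):

```python
import pandas as pd
from io import StringIO

# Stand-in for one large Template6.csv holding all identifiers
raw = StringIO('IDENTIFIER,AMOUNT\nA Smith 1234,10.00\nB Jones 5678,20.00\n')
df = pd.read_csv(raw, dtype=str)   # dtype=str keeps values exactly as written

# One group per identifier; each group could be written out or appended directly
pieces = {ident: grp for ident, grp in df.groupby('IDENTIFIER')}
```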
 
Please try this and let me know if it works and does what you want. I'm sure a python expert could probably write it more concisely but I think it does what you want. There isn't any error handling so if a file is missing from a directory then no error is thrown and it simply continues. Let me know if there is anything specific you would prefer it to do. It also runs silently but I could add any logging etc you might want. Obviously you'll need to change the directory names at the top. It is also quite flexible so you can add as many directories in the directoryList as you want if your requirements change in the future.

The way it works is to build a list of file names in a dictionary for each reference. It always includes the files listed in masterTemplateDir and then looks through each of the directories in the directoryList for files with similar names. Once it has that list it sorts them and then merges them into a file in the outputDirectory.

e.g
Template1.csv
Template2.csv
Template3.csv
Template4 A Smith 1234.csv
Template5.csv
Template6 A Smith 1234.csv

...will result in a merged file called MergedTemplate A Smith 1234.csv in the output directory.


Code:
from os import listdir
import operator


# Only change this line if the master template location changes
masterTemplateDir = 'c:\\temp\\TEMPLATE'

# Add or remove all the other template directories here
directoryList = [
                    'c:\\temp\\TEMPLATE4',
                    'c:\\temp\\TEMPLATE6'
                    ]

# Only change this line if the output directory location changes
outputDirectory = 'c:\\temp'



#######################################################################################################################
# DO NOT CHANGE ANYTHING BELOW THIS LINE
#######################################################################################################################
#
# Get a list of all the files in the master template directory and add it to the master dictionary.
#
masterTemplateDict = dict()
otherTemplateDict  = dict()

masterTemplateList = listdir(masterTemplateDir)
for filename in masterTemplateList:
    masterTemplateDict[filename] = masterTemplateDir + '\\' + filename

#
# Now build a dictionary for every other directory.
# Add these individual dictionaries to a larger dictionary.
# This is done so that we can easily add or remove directories in the future
#
count = 0
for directoryName in directoryList:
    directoryDict = dict()
    fileList = listdir(directoryName)
    for filename in fileList:
        directoryDict[filename] = directoryName + '\\' + filename
    otherTemplateDict[count] = directoryDict
    count+=1

#
# Use the first directory to drive the list of files needed.
# Ignore any part of the name to the left.
#
for searchFilename in otherTemplateDict[0]:
    referenceName = searchFilename[searchFilename.find(" ")+1:]

    #
    # Build a new dictionary with each of the template names in it.
    # This includes the master template names and also the templates with similar reference names in them.
    # These do not need to be in order yet. They will be sorted later.
    #
    outputDict = masterTemplateDict.copy()
    for templateDictKey in otherTemplateDict:                      # Iterating around the dict of dicts
        for filename in otherTemplateDict[templateDictKey]:        # Iterating around the inner dict
            #
            # If the reference name matches the one we are looking for then add it to the output dictionary
            #
            if filename.endswith(referenceName):
                outputDict[filename] = otherTemplateDict[templateDictKey][filename]

    sortedOutputList = sorted(outputDict.items(), key=operator.itemgetter(0))

    #
    # We now have a sorted list of (filename, full path) pairs.
    # Use the list to concatenate all of the files.
    #
    outputFilename = outputDirectory + "\\MergedTemplate " + referenceName
    with open(outputFilename, 'w') as fileOut:
        for shortName, filenameLong in sortedOutputList:
            with open(filenameLong, 'r') as fileIn:
                for record in fileIn:
                    fileOut.write(record)
                fileOut.write("\n")
 
Thanks very much, that works :)

It's running quite slowly for me, probably due to a) working from home through a VPN and b) the fileserver being really slow generally. The paths are all on a server, on a slow connection, designed more for serving word documents than any sort of heavy workload. Work are looking at setting up some sort of remote desktop solution to connect directly to the server our software sits on, so hopefully things will run faster then.

From checking the first file produced, it looks like it gets me where I wanted though :)
 
/Columbo ... Just one more thing!

I have two scripts which I have used previously, which work in this case, but for a single file only. Is there a simple way to adapt these to work on every csv file in a folder, rather than a specific file?

Template4 folder

Code:
import csv

input_file = '\\\\SERVER\\PATH\\A.csv'
output_file = '\\\\SERVER\\PATH\\B.csv'
cols_to_remove = [0] # Column indexes to be removed (starts at 0)

cols_to_remove = sorted(cols_to_remove, reverse=True) # Reverse so we remove from the end first
row_count = 0 # Current amount of rows processed

with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='') # Overwrite the same line with rows processed ('\r' is a carriage return)
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)

Template6 folder

Code:
# Constants
PATH = "\\\\SERVER\\PATH"
FILE = "X.csv"
OUTPUTFILE = "Y.csv"

# Read input file (the with block closes it automatically)
with open(PATH + "\\" + FILE, "r") as f:
    data = f.read()

# Replace runs of N/A with empty fields
data = data.replace(',N/A,N/A,N/A,N/A', ',,,,')

# Write output file
with open(PATH + "\\" + OUTPUTFILE, "w") as f:
    f.write(data)
 
Glad it worked. When you say it is slow, how long does it take? Also how many files are there?

I'll take a look at those two other scripts when I have time. But if I understand correctly the first one simply removes a column from a CSV and the second seems to remove the N/A string? You want them changed to loop around all files in a directory instead of working on a single file?

Also... just one last thing (Columbo stylee :)); you are doing a lot of manipulation of files here. It is processing the files after they have been produced. Would a better way be to amend the script/programme which produces them? It would probably be cleaner and more efficient. But I am assuming that's not possible for a reason?
 
Glad it worked. When you say it is slow, how long does it take? Also how many files are there?

201 in the example month, I stopped it after a few minutes and it had only done single digit appends, but that's probably our systems/network, rather than anything else. I'll wait until our IT looks at their remote solution, as I anticipate that will improve things. It's not just this that's slow! I could run it locally, which would definitely be far faster, but this isn't possible as others may need to run the same process.

I'll take a look at those two other scripts when I have time. But if I understand correctly the first one simply removes a column from a CSV and the second seems to remove the N/A string? You want them changed to loop around all files in a directory instead of working on a single file?

That's correct.

Also... just one last thing (Columbo stylee :)); you are doing a lot of manipulation of files here. It is processing the files after they have been produced. Would a better way be to amend the script/programme which produces them? It would probably be cleaner and more efficient. But I am assuming that's not possible for a reason?

That was my starting point ideally. The system can run scripts in its own language, but there are limitations; thankfully, it can also call python scripts. Unfortunately, I don't really know python :( I'm a bit more familiar with SQL and VBA, but only from playing around with and adapting existing scripts. That's usually enough to get me started, so hopefully I'll pick up some basic Python the same way.
 
Have you considered calling VB from a simple python script? Then you're back to using a language you know (although personally I far prefer Python).

201 in the example month, I stopped it after a few minutes and it had only done single digit appends, but that's probably our systems/network, rather than anything else. I'll wait until our IT looks at their remote solution, as I anticipate that will improve things. It's not just this that's slow! I could run it locally, which would definitely be far faster, but this isn't possible as others may need to run the same process.

I'll provide a version with some logging enabled so you can check whether it's actually doing anything or not. It's possible that it hit an unexpected condition and isn't working. I'll provide that shortly.


Also here are your two other scripts:

Note:
1) I commented out the print of row count because it was behaving unexpectedly when looping around all files (it just kept printing more and more "\r" characters). But if you let me know what you are trying to achieve with that part of the script then I can update it.
2) I always recommend saving the output into a different folder, otherwise if you run it a second time then it also processes the output files again. So the below uses two different directories.

Code:
import csv
from os import listdir

inputPath = 'c:\\temp\\testdata'
outputPath = 'c:\\temp'
cols_to_remove = [0] # Column indexes to be removed (starts at 0)

cols_to_remove = sorted(cols_to_remove, reverse=True) # Reverse so we remove from the end first
row_count = 0 # Current amount of rows processed

for filename in listdir(inputPath):
    with open(inputPath + '\\' + filename, "r") as source:
        reader = csv.reader(source)
        with open(outputPath + '\\output_' + filename, "w", newline='') as result:
            writer = csv.writer(result)
            for row in reader:
                #row_count += 1
                #print('\\r{0}'.format(row_count), end='') # Print rows processed
                for col_index in cols_to_remove:
                    del row[col_index]
                writer.writerow(row)




Code:
from os import listdir

# Constants
inputPath  = "c:\\temp\\testdata"
outputPath = "c:\\temp"


for filename in listdir(inputPath):

    # Read input file (the with block closes it automatically)
    with open(inputPath + "\\" + filename, "r") as f:
        data = f.read()

    # Replace runs of N/A with empty fields
    data = data.replace(',N/A,N/A,N/A,N/A', ',,,,')

    # Write output file
    with open(outputPath + "\\output_" + filename, "w") as f:
        f.write(data)
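One extra thought, in case the input folder ever holds non-CSV files: a pathlib glob restricts the loop to just the CSVs. A sketch of the same N/A replacement, using temp directories as stand-ins for the real server paths:

```python
from pathlib import Path
import tempfile

# Demo directories (stand-ins for the real server paths)
inputPath = Path(tempfile.mkdtemp())
outputPath = Path(tempfile.mkdtemp())
(inputPath / 'Template6 A Smith 1234.csv').write_text('DATA,N/A,N/A,N/A,N/A\n')
(inputPath / 'notes.txt').write_text('ignored')   # non-CSV, skipped by the glob

for csvFile in inputPath.glob('*.csv'):           # only .csv files
    data = csvFile.read_text()
    data = data.replace(',N/A,N/A,N/A,N/A', ',,,,')
    (outputPath / ('output_' + csvFile.name)).write_text(data)
```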
 
Below is a version of the original script with logging added. Let me know if it's getting stuck anywhere. It also has timestamps on the output so you should be able to see where it is taking time. You can turn logging on and off by setting verbose = True or False (capital T and capital F).

Code:
from os import listdir
import operator
from datetime import datetime


# Only change this line if the master template location changes
masterTemplateDir = 'c:\\temp\\TEMPLATE'

# Add or remove all the other template directories here
directoryList = [
                    'c:\\temp\\TEMPLATE4',
                    'c:\\temp\\TEMPLATE6'
                    ]

# Only change this line if the output directory location changes
outputDirectory = 'c:\\temp'

# Only change this line to turn log output on (True) or off (False)
verbose = True



#######################################################################################################################
# DO NOT CHANGE ANYTHING BELOW THIS LINE
#######################################################################################################################
#
# Function to allow logging
#
def log(message):
    if verbose:
        print(datetime.now().strftime("%H:%M:%S"), message)

#
# Get a list of all the files in the master template directory and add it to the master dictionary.
#
log("--- STARTED ---")
log("Using master templates from directory: " + masterTemplateDir)
log("Using custom templates from directories: " + str(directoryList))

masterTemplateDict = dict()
otherTemplateDict  = dict()

masterTemplateList = listdir(masterTemplateDir)
for filename in masterTemplateList:
    masterTemplateDict[filename] = masterTemplateDir + '\\' + filename
log("Master template directory dictionary built")

#
# Now build a dictionary for every other directory.
# Add these individual dictionaries to a larger dictionary.
# This is done so that we can easily add or remove directories in the future
#
count = 0
for directoryName in directoryList:
    directoryDict = dict()
    fileList = listdir(directoryName)
    for filename in fileList:
        directoryDict[filename] = directoryName + '\\' + filename
    otherTemplateDict[count] = directoryDict
    count+=1
log("Custom template directory dictionary built")

#
# Use the first directory to drive the list of files needed.
# Ignore any part of the name to the left.
#
for searchFilename in otherTemplateDict[0]:
    referenceName = searchFilename[searchFilename.find(" ")+1:]
    log("Searching for files with reference: " + referenceName)

    #
    # Build a new dictionary with each of the template names in it.
    # This includes the master template names and also the templates with similar reference names in them.
    # These do not need to be in order yet. They will be sorted later.
    #
    log("Creating output dictionary")
    outputDict = masterTemplateDict.copy()
    log("  Created and master templates added")
    for templateDictKey in otherTemplateDict:                      # Iterating around the dict of dicts
        for filename in otherTemplateDict[templateDictKey]:        # Iterating around the inner dict
            #
            # If the reference name matches the one we are looking for then add it to the output dictionary
            #
            if filename.endswith(referenceName):
                outputDict[filename] = otherTemplateDict[templateDictKey][filename]

    sortedOutputList = sorted(outputDict.items(), key=operator.itemgetter(0))

    #
    # We now have a sorted list of (filename, full path) pairs.
    # Use the list to concatenate all of the files.
    #
    outputFilename = outputDirectory + "\\MergedTemplate " + referenceName
    log("Creating output file " + outputFilename)
    with open(outputFilename, 'w') as fileOut:
        for shortName, filenameLong in sortedOutputList:
            log("  Merging " + filenameLong)
            with open(filenameLong, 'r') as fileIn:
                for record in fileIn:
                    fileOut.write(record)
                fileOut.write("\n")

log("--- ENDED ---")
 
Thanks for those. All sorted now.

I can't seem to find the log file, so not sure where, or whether, it is being output.

My testing suggests that the bottleneck is the network. It runs in seconds on my C drive, so I'll wait and see how IT manage to progress things.
 
The log just prints to standard output, so if running from a command prompt it will write to the terminal. But presumably the calling application is trapping that output.

If you do still need it then I could amend it to write to an actual file of your choosing instead.
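In case it's useful later, a minimal sketch of a file-based log() replacement (the log filename is a placeholder, point it anywhere writable):

```python
from datetime import datetime

verbose = True
logFilename = 'merge.log'   # placeholder; change to wherever you want the log

def log(message):
    # Append each message with a timestamp instead of printing to stdout
    if verbose:
        with open(logFilename, 'a') as logFile:
            logFile.write(datetime.now().strftime("%H:%M:%S") + " " + str(message) + "\n")

log("--- STARTED ---")
```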
 
The log just prints to standard output, so if running from a command prompt it will write to the terminal. But presumably the calling application is trapping that output.

If you do still need it then I could amend it to write to an actual file of your choosing instead.

Ah, I presumed the log would dump to a text file somewhere. Everything's working for now, so I'll revisit the log in future if necessary. Thanks again for the help :)
 
Hi @Hades

Thanks again for all your help on this. Just wondered whether you had any insights into the below :)

I have some numeric values output with 2 decimal places, and some with 0 decimal places, in the initial files. I've output these essentially as text in the source data, but somewhere in the process the values are being converted to numeric and losing the decimal places.

Template 4

REDACTED DATA ... ,"145.09"
REDACTED DATA ... ,"2"

Becomes

REDACTED DATA ... ,145.09
REDACTED DATA ... ,2.0

As the first value is an amount, I'd like to keep that with 2 decimal places, but the second is a count, so would prefer that to be 0 decimal places.

Template 6

REDACTED DATA ... ,"52.00","10.40","62.40"

Becomes

REDACTED DATA ... ,52.0,10.4,62.4

These are all amounts, so again, I'd like to keep them with 2 decimal places.
 
Sorry, I don't quite understand. I do understand you're having a problem with some data not having the decimal places you expect. But what I'm not following is the end-to-end process which is producing the files. So let's take the first example of Template 4:

You start with:
REDACTED DATA ... ,"145.09"
REDACTED DATA ... ,"2"

Becomes:
REDACTED DATA ... ,145.09
REDACTED DATA ... ,2.0

But what sits in between those two? I assume there is a process which runs? What is it? Is it one of my Python scripts above or is it something else?
If there are several processes between them then can you check what the output is from each step to see which step changes it?
I assume the "becomes" value is in a file?
 
@Hades

Sorry for the poor explanation.

It's the script in post 12 which first affects this. It's the first python script run in my process, and the data then remains the same throughout the other scripts.

eg
2 becomes 2.0
152.00 becomes 152.0
145.09 remains 145.09
145.999999999999 remains 145.999999999999

It appears as though the numbers are formatted to a minimum of 1 dp.

The source data is either 0dp or 2dp.

I want the 2 to stay as 2 and the 152.00 to stay as 152.00.
 
There is nothing I can see in the post #12 script that would cause this numerical issue.
It takes the lines as strings and copies them verbatim.

What are you using to view the resulting CSVs? Maybe that program is trying to be too smart and assigns datatype to columns automatically?
 
Yes as @alec says, my script simply copies each file without changing any data. It doesn't even know it's a CSV file. It just takes each character and copies it. To be absolutely sure I just tried some test data on my machine and it copied it exactly without changing the decimal places.

It is likely that whatever tool you are using to view the file (excel?) is presenting the data in a different way. You can check this by opening the file in Windows Notepad. Notepad will not apply any formatting to the data. Does it look normal when using Notepad?
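One more possibility worth ruling out, if a pandas step (like the split script earlier in the thread) sits anywhere upstream of these files: read_csv infers column types by default, which can turn "2" into a float. Forcing dtype=str keeps values exactly as written. A small sketch:

```python
import pandas as pd
from io import StringIO

raw = 'NAME,AMOUNT,COUNT\nA Smith,145.09,2\n'

# dtype=str keeps every field exactly as it appears in the file,
# so "145.09" and "2" survive the read/write round trip unchanged
df = pd.read_csv(StringIO(raw), dtype=str)
roundtrip = df.to_csv(index=False)
```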
 