Python - Append CSVs

Hi all,

I've managed to use Python to append all the CSVs in a folder, using a script nabbed online, but I'm struggling to pull the below together. I wonder if anyone here is able to advise?

I have some standard template CSVs and some bespoke CSVs relating to individuals.

Below is an example of the folder structure, though this can be changed if necessary. There are hundreds of refs. All files are fairly small.

I'm using Python because the software we use can run Python scripts as part of its routine, to carry out actions outside the system.

SERVER PATH\TEMPLATE
Template1.csv
Template2.csv
Template3.csv
Template5.csv

SERVER PATH\TEMPLATE4
Template4 REF1.csv
Template4 REF2.csv
Template4 REF3.csv
Template4 REF4.csv

SERVER PATH\TEMPLATE6
Template6 REF1.csv
Template6 REF2.csv
Template6 REF3.csv
Template6 REF4.csv


I'm trying to append the standard templates in every case (i.e. 1, 2, 3 and 5) and slot in the specific templates where relevant.

i.e. append
Template1.csv
Template2.csv
Template3.csv
Template4 REF1.csv
Template5.csv
Template6 REF1.csv
Output - Appended REF1.csv

Template1.csv
Template2.csv
Template3.csv
Template4 REF2.csv
Template5.csv
Template6 REF2.csv
Output - Appended REF2.csv

Any ideas?

Thanks for any help :)
 
Hi,

You are correct in what I'm trying to achieve, but the Template4 CSV goes between Template3 and Template5, so they are in order.

I can rejig the folder structure however makes things work more efficiently/simply.

Thanks for the reply. If it makes it easier to understand, I'll try to set out what I'm hoping to achieve:

I have a dataset which is processed through a system, formatted, and output as a single CSV of the transactions processed within a month. This has various columns, including name and reference number. The name and reference are concatenated to create a new column called "identifier", because there may be duplicate names or duplicate references separately, but not in combination. The transactions may differ each month, so new names/references may appear, and some may not be used within the period.

This will be a monthly exercise.

I have used Python to split this CSV so that there is a CSV per "identifier", as below. This creates files such as "Template6 A Smith 1234.csv", containing all transactions processed by A Smith with reference 1234.

Code:
import pandas as pd

# Read the monthly transactions export from the server.
data = pd.read_csv(r'\\SERVER\PATH\Transactions.csv')

# Write one CSV per unique identifier (name + reference).
for value in data['IDENTIFIER'].unique():
    data[data['IDENTIFIER'] == value].to_csv(
        r'\\SERVER\PATH\Template6\Template6 ' + str(value) + '.csv',
        index=False, na_rep='N/A')

Template4 is essentially the same but with different data, and there will always be a corresponding "Template4 A Smith 1234.csv" for every "Template6 A Smith 1234.csv".

Template 2 is summarised global data, so it is the same for every username, but it will change monthly.

Template 1, 3 and 5 are fixed data, used to set the data out in a proforma layout: headers, standard wording, etc.

So I'm hoping to end up with an append of

Template1.csv
Template2.csv
Template3.csv
Template4 A Smith 1234.csv
Template5.csv
Template6 A Smith 1234.csv

In that order. These are the actual filenames. Obviously the names and references are made up, but the format is correct (initial surname nnnn).

Template 6 is always the last file. I can set each template in separate folders if it is of some benefit.

If in a given month, there are 500 "identifiers", I want to end up with 500 files appended as above. This number will vary.

If required for some sort of lookup, I could easily generate a list .csv of the expected "identifiers" for that month.

I hope that all makes sense.

Happy to go back over it if it's easier to have a single large file with all (in this example) 500 identifiers for Template4.csv and Template6.csv and extract the data from that.
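
For illustration, here is a minimal sketch of the kind of per-identifier loop described above; pandas isn't required, and the Output folder and exact paths here are illustrative:

Code:
import glob
import os

BASE = r'\\SERVER\PATH'  # illustrative root; adjust to the real share

# Fixed templates, in output order, around the two per-identifier files.
FIXED_BEFORE = ['Template1.csv', 'Template2.csv', 'Template3.csv']
FIXED_MIDDLE = ['Template5.csv']

# Every Template6 file has a matching Template4 file, so drive the loop from Template6.
for t6_path in glob.glob(os.path.join(BASE, 'TEMPLATE6', 'Template6 *.csv')):
    identifier = os.path.basename(t6_path)[len('Template6 '):-len('.csv')]

    parts = [os.path.join(BASE, 'TEMPLATE', name) for name in FIXED_BEFORE]
    parts.append(os.path.join(BASE, 'TEMPLATE4', 'Template4 ' + identifier + '.csv'))
    parts += [os.path.join(BASE, 'TEMPLATE', name) for name in FIXED_MIDDLE]
    parts.append(t6_path)

    # Concatenate as plain text so every value passes through exactly as written
    # (assumes each template file ends with a newline).
    out_path = os.path.join(BASE, 'Output', 'Appended ' + identifier + '.csv')
    with open(out_path, 'w') as out:
        for part in parts:
            with open(part) as src:
                out.write(src.read())

Because nothing is parsed, numbers and dates come out exactly as they went in.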
 
Thanks very much, that works :)

It's running quite slowly for me, probably due to a) working from home through a VPN and b) the file server being really slow generally. The paths are all on a server on a slow connection, designed more for serving Word documents than any sort of heavy workload. Work are looking at setting up some sort of remote desktop solution to connect directly to the server our software sits on, so hopefully things will run faster then.

From checking the first file produced, it looks like it gets me where I wanted though :)
 
/Columbo ... Just one more thing!

I have two scripts which I have used previously, which work in this case, but for a single file only. Is there a simple way to adapt these to work on every csv file in a folder, rather than a specific file?

Template4 folder

Code:
import csv

input_file = r'\\SERVER\PATH\A.csv'
output_file = r'\\SERVER\PATH\B.csv'
cols_to_remove = [0]  # column indexes to be removed (starts at 0)

cols_to_remove = sorted(cols_to_remove, reverse=True)  # reverse so we remove from the end first
row_count = 0  # current number of rows processed

with open(input_file, 'r', newline='') as source:
    reader = csv.reader(source)
    with open(output_file, 'w', newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')  # overwrite the progress count in place
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)

Template6 folder

Code:
# Constants
PATH = r'\\SERVER\PATH'
FILE = 'X.csv'
OUTPUTFILE = 'Y.csv'

# Read input file
with open(PATH + '\\' + FILE, 'r') as f:
    data = f.read()

# Replace runs of N/A placeholders with empty fields
data = data.replace(',N/A,N/A,N/A,N/A', ',,,,')

# Write output file (the with-block closes the file; no explicit close needed)
with open(PATH + '\\' + OUTPUTFILE, 'w') as f:
    f.write(data)
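
If it helps as a starting point, here's a sketch of the first script adapted to loop over every CSV in a folder with glob; the folder paths and output naming are illustrative:

Code:
import csv
import glob
import os

in_folder = r'\\SERVER\PATH\Template4'       # hypothetical input folder
out_folder = r'\\SERVER\PATH\Template4 Out'  # hypothetical output folder
cols_to_remove = sorted([0], reverse=True)   # remove from the end first

# Process every CSV in the folder, writing a cleaned copy to the output folder.
for input_file in glob.glob(os.path.join(in_folder, '*.csv')):
    output_file = os.path.join(out_folder, os.path.basename(input_file))
    with open(input_file, 'r', newline='') as source, \
            open(output_file, 'w', newline='') as result:
        writer = csv.writer(result)
        for row in csv.reader(source):
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)

The same glob loop wraps around the second script just as easily.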
 
Glad it worked. When you say it is slow, how long does it take? Also how many files are there?

201 in the example month. I stopped it after a few minutes and it had only done single-digit appends, but that's probably our systems/network rather than anything else. I'll wait until our IT looks at their remote solution, as I anticipate that will improve things; it's not just this that's slow! I could run it locally, which would definitely be far faster, but that isn't possible as others may need to run the same process.

I'll take a look at those two other scripts when I have time. But if I understand correctly, the first one simply removes a column from a CSV and the second removes the N/A strings? You want them changed to loop around all the files in a directory instead of working on a single file?

That's correct.

Also... just one last thing (Columbo stylee :)): you are doing a lot of manipulation of files here, processing them after they have been produced. Would a better way be to amend the script/programme which produces them? It would probably be cleaner and more efficient. But I am assuming that's not possible for a reason?

That was ideally my starting point. The system can run scripts in its own language, but there are limitations; thankfully, it can also call Python scripts. Unfortunately, I don't really know Python :( I'm a bit more familiar with SQL and VBA, though only from playing around with and manipulating existing scripts, but that's generally enough to get me started. Hopefully I'll pick up a bit of basic Python the same way.
 
Thanks for those. All sorted now.

I can't seem to find the log file, so I'm not sure where, or whether, it is being output.

My testing suggests that the bottleneck is the network. It runs in seconds on my C drive, so I'll wait and see how IT manage to progress things.
 
The log just prints to standard output, so if run from a command prompt it will just write to the terminal. But presumably the calling application is trapping that output.

If you do still need it then I could amend it to write to an actual file of your choosing instead.

Ah, I presumed the log would dump to a text file somewhere. Everything's working for now, so I'll revisit the log in future if necessary. Thanks again for the help :)
 
Hi @Hades

Thanks again for all your help on this. Just wondered whether you had any insights into the below :)

I have some numeric values output with 2 decimal places and some with 0 decimal places in the initial files. I've output these essentially as text in the source data, but somewhere in the process the values are being changed to numeric and losing the decimal places.

Template 4

REDACTED DATA ... ,"145.09"
REDACTED DATA ... ,"2"

Becomes

REDACTED DATA ... ,145.09
REDACTED DATA ... ,2.0

As the first value is an amount, I'd like to keep that with 2 decimal places, but the second is a count, so would prefer that to be 0 decimal places.

Template 6

REDACTED DATA ... ,"52.00","10.40","62.40"

Becomes

REDACTED DATA ... ,52.0,10.4,62.4

These are all amounts, so again, I'd like to keep them with 2 decimal places.
 
@Hades

Sorry for the poor explanation.

It's the script in post 12 which first affects this - the first Python script run in my process. The values then remain the same throughout the other scripts.

eg
2 becomes 2.0
152.00 becomes 152.0
145.09 remains 145.09
145.999999999999 remains 145.999999999999

It appears as though the numbers are formatted to a minimum of 1 dp.

The source data is either 0dp or 2dp.

I want the 2 to stay as 2 and the 152.00 to stay as 152.00.
 
Hmmm. I couldn't see anything obvious, but as I'm not particularly proficient, I thought there might be something hidden in there.

I've been using Notepad to view them, as I'm already aware Excel does its own thing and formats as it sees fit.

Could the below have anything to do with it?

Code:
import pandas as pd

data = pd.read_csv(r'SOURCE\Template4 Master.csv')

# Write one CSV per unique identifier.
for value in data['IDENTIFIER'].unique():
    data[data['IDENTIFIER'] == value].to_csv(
        r'SOURCE\Template4 ' + str(value) + '.csv',
        index=False, na_rep='N/A')
 
Sorry. I had another look, and the issue actually occurs earlier in the process, in the pandas script I nabbed from elsewhere.

This script takes a couple of CSVs and splits them out by the identifier reference.

I then rejoin these using the identifier field, so I have a combined file per reference.

I already had this part working before I started the OP, but hadn't realised the decimal issue.

All your scripts worked fine! This issue seems to be all my doing :(
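
If the split script is the culprit, a likely mechanism (an assumption, but consistent with the symptoms) is that pandas read_csv infers column dtypes: "152.00" is parsed as the float 152.0, and an otherwise whole-number column containing any missing value is promoted to float, so 2 is written back as 2.0. Reading every column as text sidesteps the conversion; a minimal sketch of the split script amended that way:

Code:
import pandas as pd

# dtype=str keeps every field exactly as it appears in the source CSV;
# keep_default_na=False stops pandas turning literal 'N/A' text into NaN.
data = pd.read_csv(r'SOURCE\Template4 Master.csv',
                   dtype=str, keep_default_na=False)

for value in data['IDENTIFIER'].unique():
    data[data['IDENTIFIER'] == value].to_csv(
        r'SOURCE\Template4 ' + str(value) + '.csv', index=False)

With everything kept as text, empty fields stay empty rather than becoming NaN, which may also remove the need for the N/A-stripping script above.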
 
Back again! Thought I might as well post here, but I can move to Windows Software if needs be. I'm hoping some peeps here are as well versed in VBA as in Python!

I have pulled together a macro which does everything I want, except it seems to somehow mess dates up of its own accord. I've stripped it back to the bare bones below, which doesn't seem to do anything regarding values/dates etc., and I can't understand what is going wrong. Google doesn't seem to help.

Based on code found here - https://stackoverflow.com/a/52619462

I basically want it to open the CSV, format the file and save as XLSX (I've stripped the formatting lines out and still have the same issue, so it's the open/save part that's the concern).

Code:
Sub CSVtoXLSB2()
Dim wb As Workbook
Dim CSVPath As String
Dim XLSXPath As String
Dim sProcessFile As String

CSVPath = "C:\CSV\"
XLSXPath = "C:\XLSX\"
sProcessFile = Dir(CSVPath & "*.csv")
Do Until sProcessFile = ""   ' Loop until no file found.
    Set wb = Application.Workbooks.Open(CSVPath & sProcessFile)
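    ' Possible cause of the date flips (an assumption, not verified here): when a
    ' CSV is opened from VBA, Workbooks.Open parses the text with US-English
    ' settings unless Local:=True is passed, whereas opening the file manually in
    ' Excel uses the Windows regional settings. e.g.:
    ' Set wb = Application.Workbooks.Open(CSVPath & sProcessFile, Local:=True)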
    wb.SaveAs XLSXPath & Split(wb.Name, ".")(0) & ".xlsx", _
        FileFormat:=51   ' 51 = xlOpenXMLWorkbook (.xlsx)
    wb.Close
    sProcessFile = Dir()   ' Get next entry.
Loop
Set wb = Nothing
End Sub


Opening the CSV in Excel manually gives

Code:
      A                                                  B
1    Monthly Transactions
2
3    Period From                          01/04/2020
4    Period To                              29/04/2020

But for whatever reason, the above macro gives me a file showing

Code:
      A                                                  B
1    Monthly Transactions
2
3    Period From                          04/01/2020
4    Period To                              29/04/2020

:confused:

It appears the date values have changed, as formatting etc has no bearing.

The same occurs further down in the data, but then sometimes corrects itself after a few instances, e.g. (two different examples below).

If it means anything, the dates which are messed up are displayed as ######## until the column width is adjusted, but the dates below them display correctly.

Code:
Transaction Date
04/06/2020
04/06/2020
04/06/2020
04/06/2020
04/07/2020
04/07/2020
04/08/2020
04/10/2020
04/10/2020
04/12/2020
13/04/2020
13/04/2020
13/04/2020
15/04/2020
15/04/2020
15/04/2020
17/04/2020
17/04/2020
20/04/2020
21/04/2020
22/04/2020
22/04/2020
23/04/2020
23/04/2020
26/04/2020
28/04/2020
28/04/2020
29/04/2020
29/04/2020

Code:
Transaction Date
04/03/2020
04/06/2020
04/08/2020
04/08/2020

Anything obvious at play here?

EDIT - Just noticed: the dates being changed are all those with a day of 12 or fewer, i.e. a value that could pass as a month number. I presume these are being read as American dates, with the others left as text because they don't fit the American date format.

Why, though? This doesn't happen when opening manually, and my Windows region is set to UK.

I'm no further forward on the cause of this, but the "American dates" are being read as number values and the "UK dates" as text values.

EDIT 2 - Changed the export from the source system to dd-mmm-yyyy, then added a cell format change to the Excel macro, which has resolved the issue. But I'm still keen to understand why this occurs, if anyone knows :)
 
Thanks all for the help and feedback. I'm sure there is a better way of doing things, and I've made this more drawn out than it needed to be.

I'm not a programmer... but the problem is, no one else involved, including the IT bods, is a programmer either, and the actual programmers don't have the resource to work on this, so it's a case of trying to sort it out myself, or it doesn't get done!

I'm at the point where it works and have resolved the issues in a fashion, so I think I'll leave it be. I've learned a few things along the way and it gets the end result I'm after, so it's not been a complete disaster!
 