Python - Append CSVs

jellybeard999 · 21 May 2020 at 11:56

Hmmm. I couldn't see anything obvious, but as I'm not particularly proficient, thought there might be something hidden in there.

I've been using notepad to view, as I'm already aware excel does it's own thing and formats as it sees fit.

Could the below have anything to do with it?

Code:

import pandas as pd

data = pd.read_csv('SOURCE\Template4 Master.csv')

data_category_range = data['IDENTIFIER'].unique()
data_category_range = data_category_range.tolist()

for i,value in enumerate(data_category_range):
    data[data['IDENTIFIER'] == value].to_csv(r'SOURCE\Template4 '+str(value)+r'.csv',index = False, na_rep = 'N/A')

Hades · 23 May 2020 at 00:53

That snippet of code was not from post 12 so I'm a little confused at which stage the decimal places are behaving unexpectedly.

I'm not a pandas expert (never used them to be honest) so it's possible the issue could be there. I'll have a bit of a think when I have time.

jellybeard999 · 23 May 2020 at 21:14

Sorry. I had another look and it was earlier in the process that the issue occurs, with the pandas script I nabbed from elsewhere.

This script takes a couple of CSVs and splits them out by the identifier reference.

I then rejoin these using the identifier field, so I have a combined file per reference.

I already had this part working before I started the OP, but hadn't realised the decimal issue.

All your scripts worked fine! This issue seems to be all my doing

Hades · 24 May 2020 at 16:54

I've copied your pandas script and replicated the same issue myself. The reason it is doing this is because pandas is treating the data as numeric and then adjusting the decimal precision. The solution is to get it to treat all of the columns as strings instead. Then it shouldn't manipulate the data. After a bit of searching it seems you can specify the datatype of one or more columns when you read the master csv (using ',dtype=str'). I've tried this myself and it seems to fix it for me.

New script below. Does this also fix it for you?

Code:

import pandas

import pandas as pd

data = pd.read_csv('SOURCE\Template4 Master.csv',dtype=str)

data_category_range = data['IDENTIFIER'].unique()
data_category_range = data_category_range.tolist()

for i,value in enumerate(data_category_range):
    data[data['IDENTIFIER'] == value].to_csv(r'SOURCE\Template4 '+str(value)+r'.csv',index = False, na_rep = 'N/A')

ingrowzelqui · 27 May 2020 at 10:08

Thanks for the code, guys. I will try it on my own.

jellybeard999 · 29 May 2020 at 12:47

Thanks for all your help @Hades

All working correctly as expected. Now onto excel and Outlook macros for the next stage!

Hades · 29 May 2020 at 13:08

Glad it's working

jellybeard999 · 29 May 2020 at 14:00

Back again! Thought i might as well post here, but I can move to Windows Software if needs be. I'm hoping some peeps here are as well versed with VBA as Python!

I have pulled together a macro which does everything I want, except it seems to somehow mess dates up of it's own accord. I've stripped back to the bare bones of a macro below, which doesn't seem to have anything in re values/dates etc, and can't understand what is going wrong. Google doesn't seem to help.

Based on code found here - https://stackoverflow.com/a/52619462

I basically want it to open the CSV, format the file (stripped those lines out and still have the same issue, so it's this part that's the concern) and save as XLSX.

Code:

Sub CSVtoXLSB2()
Dim wb As Workbook
Dim CSVPath As String
Dim XLSXPath As String
Dim sProcessFile As String

CSVPath = "C:\CSV\"
XLSXPath = "C:\XLSX\"
sProcessFile = Dir(CSVPath & "*.csv")
Do Until sProcessFile = ""   ' Loop until no file found.
    Set wb = Application.Workbooks.Open(CSVPath & sProcessFile)
    wb.SaveAs XLSXPath & Split(wb.Name, ".")(0) & ".xlsx", FileFormat _
        :=51
    wb.Close
    sProcessFile = Dir()   ' Get next entry.
Loop
Set wb = Nothing
End Sub

Opening the CSV in excel manually gives

Code:

      A                                                  B
1    Monthly Transactions
2
3    Period From                          01/04/2020
4    Period To                              29/04/2020

But for whatever reason, the the above macro gives me a file showing

Code:

      A                                                  B
1    Monthly Transactions
2
3    Period From                          04/01/2020
4    Period To                              29/04/2020

It appears the date values have changed, as formatting etc has no bearing.

The same occurs further down in the data, but then fixes itself after a few instances sometimes. eg (different examples).

If it means anything, those which are messed up are displayed as ######## until the column width is adjusted, but the dates below display correctly.

Code:

Transaction Date
04/06/2020
04/06/2020
04/06/2020
04/06/2020
04/07/2020
04/07/2020
04/08/2020
04/10/2020
04/10/2020
04/12/2020
13/04/2020
13/04/2020
13/04/2020
15/04/2020
15/04/2020
15/04/2020
17/04/2020
17/04/2020
20/04/2020
21/04/2020
22/04/2020
22/04/2020
23/04/2020
23/04/2020
26/04/2020
28/04/2020
28/04/2020
29/04/2020
29/04/2020

Code:

Transaction Date
04/03/2020
04/06/2020
04/08/2020
04/08/2020

Anything obvious at play here?

EDIT - Just noticed, are the dates that are being changed all those 12 or fewer days - ie the number of months. I presume these are being read as american dates, but the other left as text, as it's not compliant with the american date format.

Why though? This doesn't happen opening manually! My windows region is set to UK.

No further forward on the cause of this, but the "american dates" are being read as number values and the "UK dates" as text values.

EDIT 2 - Changed the export from the source system to dd-mmm-yyyy, then added a cell format change to the excel macro which has resolved the issue, but still keen to understand why this occurs if anyone knows

gpuerrilla · 29 May 2020 at 21:42

What version of Office are you using?

I have had it where the

Code:

FileFormat _         :=51

did something weird on me once. I would also drop the _ newlines if you can as that line is short enough and readable.

Hades · 1 Jun 2020 at 10:15

I'm afraid I don't have excel at home. I do have it at work but I won't have time there to look into it and learn VBA. It is very likely the cause of the issue is the way that VBA or excel treats dates and defaults to a specific format. I would suggest looking into how excel and VBA handles date processing and whether you can add some code to specify a format instead of using the default.

But you now have several steps all processing data in different ways and probably using several different languages. If this is one end to end workflow it feels quite disjointed and prone to unexpected behaviour in one of the steps. It might be worth taking a step back and looking at whether there is a better overall approach to achieving what you are trying to do?

gpuerrilla · 1 Jun 2020 at 15:23

Yeah I agree, fewer languages and steps is the way to go here.

jellybeard999 · 4 Jun 2020 at 15:09

Thank all for the help and feedback. I'm sure there is a better way of doing things and I've made this more drawn out than it needs to be.

I'm not a programmer.... But the problem is, no one else involved, including the IT bods, is a programmer and the actual programmers don't have the resource to work on this, so it's a case of try to sort it out, or it doesn't get done!

I'm at the point where it works and have resolved the issues in a fashion, so I think I'll leave it be. I've learned a few things along the way and it gets the end result I'm after, so it's not been a complete disaster!

gpuerrilla · 4 Jun 2020 at 20:11

Glad you sorted it for them @jellybeard999 you earned a well deserved cold one from them!

Hades · 5 Jun 2020 at 16:50

Yes, well done for persevering. I personally think learning to program is valuable even for people who don'#t use it as their main job. It can make your job much easier in some situations.