Splitting text into files?

I have a .txt that is full of individual messages. The messages are clearly demarked with a unique start and end - #*Conversation*# and #*End of Conversation*#.

What I want to do is to copy out each message between that start and end text and save it to individual text files.

What's the best way of going about that please?
 
There are a few options here, but you're going to need a basic grasp of regex and access to Linux tools (GnuWin32 provides those on Windows).
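To give a rough idea of the regex route, here is a minimal sketch in Python rather than the GnuWin32 tools (that swap is an assumption on my part, as is the made-up function name `extract_messages`). It assumes the whole file fits in memory and that the markers appear exactly as written in the question.

```python
import re

START = "#*Conversation*#"
END = "#*End of Conversation*#"

def extract_messages(text, start=START, end=END):
    """Return the text found between each start/end marker pair."""
    # re.escape stops '*' and '#' being read as regex syntax.
    # (.*?) is non-greedy, so each match stops at the nearest end marker;
    # re.DOTALL lets '.' match across line breaks.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    # Trim the line breaks that sit right next to the markers.
    return [body.strip("\r\n") for body in pattern.findall(text)]

sample = (
    "#*Conversation*#\nHello\nWorld\n#*End of Conversation*#\n"
    "junk between messages\n"
    "#*Conversation*#\nSecond message\n#*End of Conversation*#\n"
)
messages = extract_messages(sample)
print(messages)
```

Each item in `messages` could then be written to its own file. For very large inputs a line-by-line scan is probably the safer choice, since this version reads everything in one go.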

Alternatively could you post the file? Pastebin and mark as private or something.
 
Thanks for those examples. I shall have a play about with those and see if I can get them to work. I'm amazed there isn't a Windows tool that does this. Provide the source, detail the start and end delimiters and specify the output type and off it goes...
 
You could probably do it with PowerShell or Python if you're willing to install it (takes about 20 seconds).

Failing that, give me a sample of the exact format (it doesn't have to be real data, just the right format) and I'll knock something up.
 
Thanks Pho.

So the data is resident in a .txt and the delimiters are:

#*Conversation*#
#*End of Conversation*#

I just need everything between each instance of those copied into a new .txt until it reaches the bottom of the original and stops. That's it. Thanks :)
 
Right here you go. Try this :)..
PHP:
Usage: TextFileSplitter.exe "<InputFile>" "<SectionStartString>" "<SectionEndString>" "<OutputFolder>"
E.g.,:
PHP:
TextFileSplitter.exe "input.txt" "#*Conversation*#" "#*End of Conversation*#" "."

This will scan input.txt (here it sits in the same folder as the exe, but it could be anywhere), treating each #*Conversation*# line as the start of a section and each #*End of Conversation*# line as its end. Each section is written to "." (the current folder, which is the exe's folder if you run it from there) as input-X.txt, where X is the running count of output files so far. The marker lines themselves are not copied into the output.
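For anyone who would rather go the Python route mentioned earlier in the thread, the same scan can be sketched roughly like this. It is not the exe's actual code: `split_file` is a hypothetical name, and unlike the C# source it compares the markers case-sensitively. It does assume the same input-X.txt naming described above.

```python
import os
import tempfile

def split_file(input_path, start, end, out_dir):
    """Write each start/end delimited section to <name>-<n><ext>; return the paths."""
    base, ext = os.path.splitext(os.path.basename(input_path))
    written = []
    out = None  # currently open output file, or None when outside a section
    with open(input_path, encoding="utf-8") as src:
        for line in src:
            marker = line.strip()
            if marker == start:
                if out is not None:  # previous section was never closed
                    out.close()
                path = os.path.join(out_dir, f"{base}-{len(written) + 1}{ext}")
                out = open(path, "w", encoding="utf-8")
                written.append(path)
            elif marker == end and out is not None:
                out.close()
                out = None
            elif out is not None:
                out.write(line)  # ordinary line inside a section
    if out is not None:  # last section had no end marker
        out.close()
    return written

# Small self-contained demo using a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "input.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.write("#*Conversation*#\nHi\n#*End of Conversation*#\n"
                "ignored\n"
                "#*Conversation*#\nBye\n#*End of Conversation*#\n")
    paths = split_file(source, "#*Conversation*#", "#*End of Conversation*#", tmp)
    names = [os.path.basename(p) for p in paths]
    contents = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            contents.append(f.read())
print(names, contents)
```

Reading one line at a time keeps memory use flat however big the input is, which is the same reason the C# version streams the file.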

Obviously make sure you have a backup of these files beforehand. It took about 12 seconds on my PC to process a file containing 112,996 lines, which split out into ~5,100 text files :D.

I've pretty much been coding non-stop since 2pm on Sunday. Yes, that's nearly 30 hours straight bar the odd drink/food break, so forgive my sloppiness :).

Download: https://www.dropbox.com/s/3t7y212wcmuh9pf/TextFileSplitter.zip?dl=0. Let me know if it works.

Source:
PHP:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace TextFileSplitter
{
    class Program
    {
        static void Main(string[] args)
        {
            var exe = Path.GetFileName(System.Reflection.Assembly.GetEntryAssembly().Location);

            if (args.Length < 4)
            {
                Console.WriteLine("Usage: {0} <InputFile> <SectionStartString> <SectionEndString> <OutputFolder>", exe);
                return;
            }

            // Inputs
            var inputFile = args[0].Trim();
            var sectionStartString = args[1].Trim();
            var sectionEndString = args[2].Trim();
            var outputFolder = args[3].Trim();

            var inputFileFilename = Path.GetFileNameWithoutExtension(inputFile);
            var inputFilenameExtension = Path.GetExtension(inputFile);

            // Quick and dirty validation
            var validation = validateInput(inputFile, sectionStartString, sectionEndString, outputFolder);
            if (validation.Any())
            {
                foreach (var validationError in validation)
                    Console.WriteLine("Error: " + validationError);
                
                return;
            }

            // Go go go
            using (FileStream fsInput = File.Open(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (BufferedStream bsInput = new BufferedStream(fsInput))
            using (StreamReader srInput = new StreamReader(bsInput, Encoding.UTF8))
            {
                var outputting = false;
                var outputFileCount = 0;
                String outputFilename = string.Empty;

                // Output streams
                FileStream fsOutput = null;
                BufferedStream bsOutput = null;
                StreamWriter swOutput = null;

                try
                {
                    string line;
                    int lineCount = 0;
                    while ((line = srInput.ReadLine()) != null)
                    {
                        // Count each line once; report progress every 10 lines
                        if (++lineCount % 10 == 0)
                            Console.WriteLine("Processing line {0}..", lineCount);

                        var lineIsSectionStartString = line.Trim().Equals(sectionStartString, StringComparison.InvariantCultureIgnoreCase);
                        if (lineIsSectionStartString)
                        {
                            outputting = true;

                            // Initialise output streams
                            outputFilename = Path.Combine(outputFolder, string.Format("{0}-{1}{2}", inputFileFilename, ++outputFileCount, inputFilenameExtension));
                            fsOutput = File.Open(outputFilename, FileMode.Create, FileAccess.Write, FileShare.ReadWrite);
                            bsOutput = new BufferedStream(fsOutput);
                            swOutput = new StreamWriter(bsOutput, Encoding.UTF8);
                            swOutput.AutoFlush = false;

                            Console.WriteLine("Outputting to {0}", outputFilename);
                        }
                        else if (outputting && line.Trim().Equals(sectionEndString, StringComparison.InvariantCultureIgnoreCase))
                        {
                            // Flush
                            swOutput.Flush();
                            bsOutput.Flush();
                            fsOutput.Flush();

                            Console.WriteLine("Closing {0}", outputFilename);

                            // Close output streams
                            outputting = false;
                            swOutput.Close();
                            bsOutput.Close();
                            fsOutput.Close();
                        }

                        if (outputting && !lineIsSectionStartString)
                            swOutput.WriteLine(line);
                    }
                    Console.WriteLine("Finished processing {0} lines", lineCount);
                }
                finally
                {
                    try { swOutput.Close(); } catch (Exception) {}
                    try { bsOutput.Close(); } catch (Exception) {}
                    try { fsOutput.Close(); } catch (Exception) {}
                }
            }
        }

        private static IEnumerable<string> validateInput(string inputFile, string sectionStartString, string sectionEndString, string outputFolder)
        {
            if (!File.Exists(inputFile))
                yield return string.Format("{0} doesn't exist.", inputFile);
            if (string.IsNullOrEmpty(sectionStartString))
                yield return string.Format("section start string empty.", inputFile);
            if (string.IsNullOrEmpty(sectionEndString))
                yield return string.Format("section end string empty.", inputFile);
            if (string.IsNullOrEmpty(outputFolder))
                yield return string.Format("output folder missing.", inputFile);
        }
    }
}
 
Right, tried it and not sure if I am doing something wrong, but running the executable causes a command prompt window to appear for a second then disappear. The file input.txt is in the same directory as the executable.
 
OK, so I took a quick look at the posted program/ source code :)
First, a small bugfix:
PHP:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace TextFileSplitter
{
    class Program
    {
        static void Main(string[] args)
        {
            var exe = Path.GetFileName(System.Reflection.Assembly.GetEntryAssembly().Location);

            if (args.Length < 4)
            {
                Console.WriteLine("Usage: {0} <InputFile> <SectionStartString> <SectionEndString> <OutputFolder>", exe);
                return;
            }

            // Inputs
            var inputFile = args[0].Trim();
            var sectionStartString = args[1].Trim();
            var sectionEndString = args[2].Trim();
            var outputFolder = args[3].Trim();

            var inputFileFilename = Path.GetFileNameWithoutExtension(inputFile);
            var inputFilenameExtension = Path.GetExtension(inputFile);

            // Quick and dirty validation
            var validation = validateInput(inputFile, sectionStartString, sectionEndString, outputFolder);
            if (validation.Any())
            {
                foreach (var validationError in validation)
                    Console.WriteLine("Error: " + validationError);

                return;
            }

            // Go go go
            using (FileStream fsInput = File.Open(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (BufferedStream bsInput = new BufferedStream(fsInput))
            using (StreamReader srInput = new StreamReader(bsInput, Encoding.UTF8))
            {
                var outputting = false;
                var outputFileCount = 0;
                String outputFilename = string.Empty;

                // Output streams
                FileStream fsOutput = null;
                BufferedStream bsOutput = null;
                StreamWriter swOutput = null;

                try
                {
                    string line;
                    int lineCount = 0;
                    while ((line = srInput.ReadLine()) != null)
                    {
                        // Count each line once; report progress every 10 lines
                        if (++lineCount % 10 == 0)
                            Console.WriteLine("Processing line {0}..", lineCount);

                        var lineIsSectionStartString = line.Trim().Equals(sectionStartString, StringComparison.InvariantCultureIgnoreCase);
                        if (lineIsSectionStartString)
                        {
                            outputting = true;

                            // Initialise output streams
                            outputFilename = Path.Combine(outputFolder, string.Format("{0}-{1}{2}", inputFileFilename, ++outputFileCount, inputFilenameExtension));
                            fsOutput = File.Open(outputFilename, FileMode.Create, FileAccess.Write, FileShare.ReadWrite);
                            bsOutput = new BufferedStream(fsOutput);
                            swOutput = new StreamWriter(bsOutput, Encoding.UTF8);
                            swOutput.AutoFlush = false;

                            Console.WriteLine("Outputting to {0}", outputFilename);
                        }
                        else if (outputting && line.Trim().Equals(sectionEndString, StringComparison.InvariantCultureIgnoreCase))
                        {
                            // Flush
                            swOutput.Flush();
                            bsOutput.Flush();
                            fsOutput.Flush();

                            Console.WriteLine("Closing {0}", outputFilename);

                            // Close output streams
                            outputting = false;
                            swOutput.Close();
                            bsOutput.Close();
                            fsOutput.Close();
                        }

                        if (outputting && !lineIsSectionStartString)
                            swOutput.WriteLine(line);
                    }
                    Console.WriteLine("Finished processing {0} lines", lineCount);
                }
                finally
                {
                    try { swOutput.Close(); }
                    catch (Exception) { }
                    try { bsOutput.Close(); }
                    catch (Exception) { }
                    try { fsOutput.Close(); }
                    catch (Exception) { }
                }
            }
        }

        private static IEnumerable<string> validateInput(string inputFile, string sectionStartString, string sectionEndString, string outputFolder)
        {
            if (!File.Exists(inputFile))
                yield return string.Format("{0} doesn't exist.", inputFile);
            if (string.IsNullOrEmpty(sectionStartString))
                yield return string.Format("section start string empty.", inputFile);
            if (string.IsNullOrEmpty(sectionEndString))
                yield return string.Format("section end string empty.", inputFile);
            if (string.IsNullOrEmpty(outputFolder))
                yield return string.Format("output folder missing.", inputFile);
            else if (!Directory.Exists(outputFolder))
                yield return string.Format("output folder {0} doesn't exist.", outputFolder);
        }
    }
}

Checking for the existence of the output folder before trying to write to it is a decent sanity check.
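For comparison, the same set of checks could look something like this in Python. This is only a sketch: `validate_input` is a hypothetical name that loosely mirrors the C# `validateInput`, and the message wording is mine.

```python
import os
import tempfile

def validate_input(input_file, start, end, output_folder):
    """Yield one message per problem found, so the caller can list them all."""
    if not os.path.isfile(input_file):
        yield f"{input_file} doesn't exist."
    if not start:
        yield "section start string empty."
    if not end:
        yield "section end string empty."
    if not output_folder:
        yield "output folder missing."
    elif not os.path.isdir(output_folder):
        # The extra sanity check: the folder was named but is not there.
        yield f"output folder {output_folder} doesn't exist."

# Demo: missing input file, empty end marker, non-existent output folder.
with tempfile.TemporaryDirectory() as tmp:
    missing_dir = os.path.join(tmp, "nope")
    errors = list(validate_input(os.path.join(tmp, "input.txt"),
                                 "#*Conversation*#", "", missing_dir))
for error in errors:
    print("Error:", error)
```

Collecting every problem before bailing out means the user sees all the mistakes in one run instead of fixing them one at a time.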

Using it, a fuller set of instructions:
Open a command prompt (Start, Run, then type cmd)
Navigate to where the exe and text files are stored, e.g. cd C:\SplitMe
Remember to enclose your path in quotes if there's a space in it, e.g. cd "C:\Split Me"
Now, enter this example command:
textfilesplitter SplitMe.txt #*Conversation*# "#*End of Conversation*#" C:\Output
You'll need to change SplitMe.txt to the name of your input file, and C:\Output to an output folder that already exists :)

Notice that the #*End of Conversation*# is enclosed in quotes, as there are spaces in it :)
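If you end up generating these commands from a script, the quoting rule above can be sketched in a few lines of Python. `quote_for_prompt` is a made-up helper, and it deliberately ignores awkward cases such as arguments that contain double quotes or end in a backslash.

```python
def quote_for_prompt(arg):
    """Wrap an argument in double quotes if it is empty or contains whitespace."""
    if arg == "" or any(ch.isspace() for ch in arg):
        return f'"{arg}"'
    return arg

args = ["textfilesplitter", "SplitMe.txt", "#*Conversation*#",
        "#*End of Conversation*#", r"C:\Output"]
command = " ".join(quote_for_prompt(a) for a in args)
print(command)
```

Only the end marker picks up quotes here, because it is the only argument with spaces in it.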

(I'm dead certain there are more bugs in there, that's the one that just caught my eye from a quick code skim)

-Leezer-
 
Right, tried it and not sure if I am doing something wrong, but running the executable causes a command prompt window to appear for a second then disappear. The file input.txt is in the same directory as the executable.

Sorry should have mentioned you need to run it in a command prompt. Open command prompt, cd to wherever the folder is and then run it from within the command prompt :).


OK, so I took a quick look at the posted program/ source code :)
First, a small bugfix:
....
(I'm dead certain there are more bugs in there, that's the one that just caught my eye from a quick code skim)

-Leezer-

Awesome, cheers :). It was only a quick 20min thing so I'm sure there are loads more bugs lurking in there somewhere :D.

So, I was bored :p

I've tacked on a GUI to the code posted above-
http://www.bvecornwall.co.uk/downloads/external/Splitter.zip

No quotes needed, just select a file and output directory and enter your delimiters :)

-Leezer-

Even better, appreciate the effort :).
 