Convert .doc to simple text using Java

I'm currently developing an application in Java. Part of the program requires reading the content of Word files and storing it as plain text. Can someone give me some pointers on how to implement this or, better yet, point me to an example somewhere?
 
Microsoft Word files are not easy to convert to text. The format is a proprietary binary format, although according to Wikipedia, Microsoft will provide the specification upon request.
 
Provided you have access to the Word Object Library, one way would be to let MS Word do the conversion for you through its SaveAs routine. Office automation could be used to programmatically create an instance of Word, open the required Word document, and then call the ActiveDocument.SaveAs method, specifying text as the file format.

All of which requires you to interface with some ActiveX stuff at some point, something I've looked at briefly but never got that far with. You might be able to use a Java-ActiveX bridge of some kind. Or you could write a routine in C++ to use the MSWord objects and get Java to talk to that.
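As a rough illustration of the SaveAs idea, here is a minimal sketch that sidesteps a Java-ActiveX bridge altogether: Java writes a small VBScript and runs it with Windows Script Host, and the script does the Word automation. This is a swapped-in technique rather than either of the two options above. It assumes a Windows machine with Word installed and cscript on the PATH; the class and method names (WordSaveAsText, buildScript, convert) are made up for the example. The value 2 is wdFormatText from Word's WdSaveFormat enumeration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: drives Word through a generated VBScript (Windows only).
final class WordSaveAsText {
    // wdFormatText in Word's WdSaveFormat enumeration.
    static final int WD_FORMAT_TEXT = 2;

    /** Builds a VBScript that opens argument 0 in Word and saves it as text to argument 1. */
    static String buildScript() {
        return String.join("\r\n",
            "Set word = CreateObject(\"Word.Application\")",
            "word.Visible = False",
            "Set doc = word.Documents.Open(WScript.Arguments(0))",
            "doc.SaveAs WScript.Arguments(1), " + WD_FORMAT_TEXT,
            "doc.Close False",
            "word.Quit");
    }

    /** Writes the script to a temp file and runs it with cscript; assumes Word is installed. */
    static void convert(Path doc, Path txt) throws IOException, InterruptedException {
        Path script = Files.createTempFile("doc2txt", ".vbs");
        try {
            Files.write(script, buildScript().getBytes(StandardCharsets.US_ASCII));
            Process p = new ProcessBuilder("cscript", "//NoLogo", script.toString(),
                    doc.toAbsolutePath().toString(), txt.toAbsolutePath().toString())
                    .inheritIO()
                    .start();
            if (p.waitFor() != 0) {
                throw new IOException("Word conversion script failed");
            }
        } finally {
            Files.deleteIfExists(script);
        }
    }
}
```

Calling WordSaveAsText.convert(Paths.get("report.doc"), Paths.get("report.txt")) would then leave the text next to the original. Absolute paths are passed because Word resolves relative paths against its own working directory, not the Java process's.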

You might also want to take a look at the Jakarta POI project, which seems to cover Excel; I'm not sure about Word.

Hope that helps.

Jim
 
You may want to take a look at open office source code and see how they implement it.
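If reading the source turns out to be more than you need, a related shortcut is to run the office suite itself as a converter. The sketch below is only that, a sketch: it assumes LibreOffice (a descendant of OpenOffice) is installed with its soffice launcher on the PATH, and that its --headless and --convert-to options are available in your version. The class and method names (OfficeTextExport, buildCommand, convert) are invented for the example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: shells out to a locally installed LibreOffice.
final class OfficeTextExport {
    /** Builds the soffice command line that exports a document as plain text into outDir. */
    static List<String> buildCommand(Path doc, Path outDir) {
        return Arrays.asList(
            "soffice", "--headless",
            "--convert-to", "txt:Text",
            "--outdir", outDir.toString(),
            doc.toString());
    }

    /** Runs the conversion and returns the path where the .txt file is expected. */
    static Path convert(Path doc, Path outDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(doc, outDir)).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("soffice exited with an error");
        }
        // soffice keeps the base name and swaps the extension for .txt.
        String base = doc.getFileName().toString().replaceFirst("\\.[^.]+$", "");
        return outDir.resolve(base + ".txt");
    }
}
```

The text encoding of the output depends on the export filter and platform, so check it before reading the file back into Java.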
 