With Office 2007, Microsoft decided to change default application formats from the old, proprietary, closed formats (DOC, XLS, and PPT) to new, open, and standardized XML formats (DOCX, XLSX, and PPTX). The new formats share some similarities with the old Office XML formats (WordML, SpreadsheetML) and with the competing OpenOffice.org OpenDocument formats, but there are many differences. Because the new formats will be the default in Office 2007, and Microsoft Office is the dominant office suite, these formats are destined to be popular, and you will probably have to deal with them sooner or later.
This article will explain the basics of the Open XML file format and specifically the XLSX format, the new format for Excel 2007. Presented is a demo application that reads and writes tabular data to and from XLSX files. The application is written in C# using Visual Studio 2005. The XLSX files it creates can be opened using Excel 2007.
Every Open XML file is essentially a ZIP archive containing many other files. Office-specific data is stored in multiple XML files inside that archive. This is in direct contrast with the old WordML and SpreadsheetML formats, which were single, uncompressed XML files. Although it is more complex, the new approach offers several benefits:
- You don't need to process the entire file to extract specific data.
- Images and multimedia are now encoded in native format, not as text streams.
- Files are smaller as a result of compression and native multimedia storage.
In Microsoft's terminology, an Open XML ZIP file is called a package, and the files inside that package are called parts. It is important to know that every part has a defined content type; no type is presumed from the file extension alone. A content type can describe anything: application XML, user XML, images, sounds, video, or any other binary objects. Every part must be connected to some other part using a relationship. Inside the package are special XML files with a ".rels" extension that define the relationships between parts. There is also a start part (sometimes called the "root," which is a bit misleading because the graph containing all parts doesn't have to be a tree), so the entire structure looks like Figure 1.
Figure 1: Parts and relations inside an XLSX file
To make a long story short, to read the data from an Open XML file you need to:
- Open the package as a ZIP archive: Any standard ZIP library will do.
- Find the parts that contain data you want to read: You can navigate through the relationship graph (more complex), or you can presume that certain parts have a defined name and path (Microsoft can change that in the future).
- Read the parts you are interested in: Use the standard XML library (if they are XML) or some other method (if they are images, sounds, or some other type).
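Navigating the relationship graph can be sketched as follows. This is a minimal illustration, not the demo application's code: the `RelationshipDemo` class, the `FindTarget` helper, and the sample ".rels" content are hypothetical, while the namespace and relationship type URIs are the ones used by the Open Packaging Conventions. It looks up the target of the package-level "officeDocument" relationship, which points to the start part.

```csharp
using System;
using System.Xml;

static class RelationshipDemo
{
    // Namespace of ".rels" files, as defined by the Open Packaging Conventions.
    public const string RelNs =
        "http://schemas.openxmlformats.org/package/2006/relationships";

    // Relationship type that points from the package to its start part.
    public const string OfficeDocumentType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Illustrative content of a package-level "_rels/.rels" part.
    public const string SampleRels =
        "<Relationships xmlns=\"" + RelNs + "\">" +
        "<Relationship Id=\"rId1\" Type=\"" + OfficeDocumentType + "\" Target=\"xl/workbook.xml\"/>" +
        "</Relationships>";

    // Returns the Target of the first relationship with the given Type, or null.
    public static string FindTarget(string relsXml, string type)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(relsXml);

        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("r", RelNs);

        foreach (XmlElement rel in doc.SelectNodes("/r:Relationships/r:Relationship", ns))
        {
            if (rel.GetAttribute("Type") == type)
                return rel.GetAttribute("Target");
        }
        return null;
    }

    static void Main()
    {
        // Prints "xl/workbook.xml"
        Console.WriteLine(FindTarget(SampleRels, OfficeDocumentType));
    }
}
```

The same helper could then be applied to the ".rels" file that belongs to the workbook part to locate the worksheets, instead of presuming fixed names and paths.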
On the other hand, if you want to create a new Open XML file, you need to:
- Create/get all necessary parts: Use a standard XML library (if they are XML), copy them, or use some other method.
- Create all relationships: Create ".rels" files.
- Create content types: Create a "[Content_Types].xml" file.
- Package everything into a ZIP file with the appropriate extension (DOCX, XLSX, or PPTX): Any standard ZIP library will do.
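The four steps above can be sketched in one small program. It is only an illustration under stated assumptions: it uses the `ZipArchive` class from `System.IO.Compression` (available in .NET Framework 4.5 and later, so not the SharpZipLib library the demo relies on), the `PackageWriterDemo` class and its helpers are hypothetical, and the set of parts is the smallest one believed to form a workbook with a single cell. It has not been checked here against every Excel version.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class PackageWriterDemo
{
    const string ContentTypes =
@"<Types xmlns=""http://schemas.openxmlformats.org/package/2006/content-types"">
  <Default Extension=""rels"" ContentType=""application/vnd.openxmlformats-package.relationships+xml""/>
  <Default Extension=""xml"" ContentType=""application/xml""/>
  <Override PartName=""/xl/workbook.xml"" ContentType=""application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml""/>
  <Override PartName=""/xl/worksheets/sheet1.xml"" ContentType=""application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml""/>
</Types>";

    const string PackageRels =
@"<Relationships xmlns=""http://schemas.openxmlformats.org/package/2006/relationships"">
  <Relationship Id=""rId1"" Type=""http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"" Target=""xl/workbook.xml""/>
</Relationships>";

    const string Workbook =
@"<workbook xmlns=""http://schemas.openxmlformats.org/spreadsheetml/2006/main"" xmlns:r=""http://schemas.openxmlformats.org/officeDocument/2006/relationships"">
  <sheets><sheet name=""Sheet1"" sheetId=""1"" r:id=""rId1""/></sheets>
</workbook>";

    const string WorkbookRels =
@"<Relationships xmlns=""http://schemas.openxmlformats.org/package/2006/relationships"">
  <Relationship Id=""rId1"" Type=""http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"" Target=""worksheets/sheet1.xml""/>
</Relationships>";

    const string Sheet1 =
@"<worksheet xmlns=""http://schemas.openxmlformats.org/spreadsheetml/2006/main"">
  <sheetData><row r=""1""><c r=""A1""><v>42</v></c></row></sheetData>
</worksheet>";

    // Writes one part into the archive as UTF-8 text without a byte order mark.
    static void AddPart(ZipArchive zip, string name, string xml)
    {
        ZipArchiveEntry entry = zip.CreateEntry(name);
        using (StreamWriter writer = new StreamWriter(entry.Open(), new UTF8Encoding(false)))
        {
            writer.Write(xml);
        }
    }

    // Builds the whole package in memory and returns its bytes.
    public static byte[] BuildPackage()
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            // The archive must be disposed before the bytes are read,
            // because the ZIP central directory is written on dispose.
            using (ZipArchive zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                AddPart(zip, "[Content_Types].xml", ContentTypes);
                AddPart(zip, "_rels/.rels", PackageRels);
                AddPart(zip, "xl/workbook.xml", Workbook);
                AddPart(zip, "xl/_rels/workbook.xml.rels", WorkbookRels);
                AddPart(zip, "xl/worksheets/sheet1.xml", Sheet1);
            }
            return buffer.ToArray();
        }
    }

    static void Main()
    {
        byte[] package = BuildPackage();
        Console.WriteLine("Package size in bytes: " + package.Length);
    }
}
```

To inspect the result, the returned bytes could be saved with `File.WriteAllBytes` under a name ending in ".xlsx".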
The whole story about packages, parts, content types, and relationships is the same for all Open XML documents, regardless of the application they originate from. Microsoft refers to this as the Open Packaging Conventions.
Excel 2007 Open XML Specifics
Excel 2007 builds on the Open Packaging Conventions by adding its own application-specific XML types. Reference schemas for all XML files used in Office can be downloaded from MSDN, but note that some things are still open to change until the final Excel 2007 release.
You just want to read/write worksheet data, so you need to look in the "\xl\worksheets" folder inside the XLSX file; this is where all the worksheets are located. For every worksheet, there is a separate XML file: "sheet1.xml," "sheet2.xml," and so on. When you open such a file, you will notice that all of the sheet data is inside a <sheetData> element. For every row, there is a <row> element; for every cell, there is a <c> element. Finally, the value of the cell is stored in a <v> element.
However, real-world XML is never as simple as schoolbook XML. Numbers are encoded as numbers inside the <v> element; a cell holding the number 1, for instance, could appear as:
<c r="A1"><v>1</v></c>
However, a string value (like "John") also gets encoded as a number:
<c r="B1" t="s"><v>0</v></c>
That is because Excel keeps an internal table of unique strings (for performance reasons). The zero is the index of the string in that table, and the attribute t="s" tells you that the underlying type is a string, not a number. So, where is the table of unique strings located? It is in the "\xl\sharedStrings.xml" file, which contains all the strings used in the entire workbook, not just those of a specific worksheet.
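The lookup described above can be sketched as follows. The `CellReaderDemo` class, its two helpers, and the sample XML (including the value "Smith") are hypothetical stand-ins for "sharedStrings.xml" and "sheet1.xml", not the demo application's helper methods. The sketch handles only the simple case in which each shared string is plain text.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

static class CellReaderDemo
{
    const string MainNs = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Illustrative content of "xl/sharedStrings.xml".
    public const string SharedStringsXml =
        "<sst xmlns=\"" + MainNs + "\"><si><t>John</t></si><si><t>Smith</t></si></sst>";

    // Illustrative content of "xl/worksheets/sheet1.xml".
    public const string SheetXml =
        "<worksheet xmlns=\"" + MainNs + "\"><sheetData><row r=\"1\">" +
        "<c r=\"A1\"><v>1</v></c>" +
        "<c r=\"B1\" t=\"s\"><v>0</v></c>" +
        "<c r=\"C1\" t=\"s\"><v>1</v></c>" +
        "</row></sheetData></worksheet>";

    static XmlNamespaceManager Namespaces(XmlDocument doc)
    {
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("m", MainNs);
        return ns;
    }

    // Returns the shared strings in index order.
    public static List<string> ReadSharedStrings(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        List<string> strings = new List<string>();
        foreach (XmlNode si in doc.SelectNodes("/m:sst/m:si", Namespaces(doc)))
        {
            // InnerText joins all text below <si>; enough for plain strings.
            strings.Add(si.InnerText);
        }
        return strings;
    }

    // Maps each cell reference (such as "B1") to its resolved text value.
    public static Dictionary<string, string> ReadCells(string sheetXml, List<string> shared)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(sheetXml);
        XmlNamespaceManager ns = Namespaces(doc);
        Dictionary<string, string> cells = new Dictionary<string, string>();
        foreach (XmlElement c in doc.SelectNodes("//m:sheetData/m:row/m:c", ns))
        {
            XmlNode v = c.SelectSingleNode("m:v", ns);
            if (v == null) continue; // empty cell

            string value = v.InnerText;
            if (c.GetAttribute("t") == "s")
            {
                // The stored number is an index into the shared string table.
                value = shared[int.Parse(value, CultureInfo.InvariantCulture)];
            }
            cells[c.GetAttribute("r")] = value;
        }
        return cells;
    }

    static void Main()
    {
        List<string> shared = ReadSharedStrings(SharedStringsXml);
        Dictionary<string, string> cells = ReadCells(SheetXml, shared);
        Console.WriteLine("A1 = " + cells["A1"]); // A1 = 1
        Console.WriteLine("B1 = " + cells["B1"]); // B1 = John
        Console.WriteLine("C1 = " + cells["C1"]); // C1 = Smith
    }
}
```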
This approach is used for many other things: cell styles, borders, charts, number formats, and so forth. In fact, that becomes the major programming problem when working with XLSX files: updating and maintaining the various tables of unique Excel objects. In this article, you will just read/write data values, but if you require complex formatting, you should probably use a commercial component that does all the tedious work for you.
The demo is a Windows Forms application (see Figure 2), written in C# using Visual Studio 2005. Because the .NET Framework 2.0 has no support for ZIP archives (only for the underlying Deflate and GZip compression streams), the demo uses an open-source ZIP library called SharpZipLib. For demonstration purposes, you will extract entire ZIP files to a TEMP folder, so you can examine the contents of that folder and its files while debugging the demo application. In a real-world application, you may want to avoid extracting to a temporary folder and just read from/write to the ZIP file directly.
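Reading a part straight from the archive, without a temporary folder, might look like the sketch below. It assumes the `ZipArchive` class from `System.IO.Compression` (.NET Framework 4.5 or later) in place of SharpZipLib; the `DirectZipReadDemo` class, its helpers, and the tiny sample archive are hypothetical and exist only so that the example runs without any file on disk.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class DirectZipReadDemo
{
    // Reads one part as text straight from the package stream.
    // Returns null when the package has no part with that name.
    public static string ReadPart(Stream package, string partName)
    {
        using (ZipArchive zip = new ZipArchive(package, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry(partName);
            if (entry == null) return null;

            using (StreamReader reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }

    // Builds a one-entry archive in memory to stand in for a real XLSX file.
    public static MemoryStream BuildSample()
    {
        MemoryStream buffer = new MemoryStream();
        using (ZipArchive zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
        {
            ZipArchiveEntry entry = zip.CreateEntry("xl/sharedStrings.xml");
            using (StreamWriter writer = new StreamWriter(entry.Open()))
            {
                writer.Write("<sst/>");
            }
        }
        buffer.Position = 0; // rewind so the archive can be read back
        return buffer;
    }

    static void Main()
    {
        using (MemoryStream sample = BuildSample())
        {
            // Prints "<sst/>"
            Console.WriteLine(ReadPart(sample, "xl/sharedStrings.xml"));
        }
    }
}
```

With a real file, the stream passed to `ReadPart` could come from `File.OpenRead`, and the returned text (or the entry stream itself) could be handed to an XML reader.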
For XML processing, the choice is simple. To read XML files, you use the XmlTextReader class; to write, you use the XmlTextWriter class. Both come with the .NET Framework, but you also can use any other XML processing library.
Figure 2: Demo application in action
You want to read a simple "In.xlsx" file (in the "Input" folder) and copy its contents to the DataTable. That file contains a list of people with their first and last names (text values) and their IDs (number values). When the "Read input .xlsx file" button is clicked, the following code is executed:
Nothing unusual happens here. The XLSX file is unzipped to the TEMP folder and the necessary XML parts (now files) are processed. The "sharedStrings.xml" file contains a global table of unique strings whereas the "sheet1.xml" file contains data for the first sheet. Helper methods are pretty straightforward XML reading code—you can download demo application code to examine them in more detail.
If everything is okay, after the button is clicked, all data will show up in the DataGridView.
Now, you want to write data from a DataTable to the "Out.xlsx" file in the "Output" folder. You can change some data or add some new rows in the DataGridView. When the "Write output .xlsx file" button is clicked, the following code is executed:
This time, the code is a bit more complicated. To avoid generating every part that an XLSX file needs, the demo uses a template file. You extract the template file to the temporary folder and then change only the XML parts containing the shared string table and the worksheet data. All other parts, relationships, and content types stay the same, so you don't need to generate any of that. Note that you use two string tables: a lookup Hashtable for fast searching and an ordinary ArrayList in which items are ordered by their index. You could manage with the ArrayList alone, but then you would need to search the entire ArrayList every time you added a new string (to check whether it is already there). The CreateStringTables() helper method builds both string tables, the WriteStringTable() helper method writes the string table XML, and the WriteWorksheet() helper method writes the worksheet data XML.
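The two-table idea can be sketched as follows. This is not the demo's CreateStringTables() method: the `StringTable` class and its members are hypothetical, and the generic `Dictionary` and `List` types stand in for the Hashtable and ArrayList mentioned above. The dictionary answers "is this string already present, and at which index?" without scanning, while the list preserves index order for writing the table out.

```csharp
using System;
using System.Collections.Generic;

sealed class StringTable
{
    // Fast lookup from string to its index in the shared string table.
    private readonly Dictionary<string, int> lookup = new Dictionary<string, int>();

    // The same strings in index order, ready to be written out as XML.
    private readonly List<string> ordered = new List<string>();

    // Returns the index of the value, adding it first if it is new.
    public int GetOrAdd(string value)
    {
        int index;
        if (!lookup.TryGetValue(value, out index))
        {
            index = ordered.Count;
            ordered.Add(value);
            lookup.Add(value, index);
        }
        return index;
    }

    public IList<string> Items
    {
        get { return ordered; }
    }
}

static class StringTableDemo
{
    static void Main()
    {
        StringTable table = new StringTable();
        string[] cells = { "John", "Smith", "John" };

        foreach (string cell in cells)
        {
            // Prints "John -> 0", "Smith -> 1", "John -> 0"
            Console.WriteLine(cell + " -> " + table.GetOrAdd(cell));
        }

        // Prints "Unique strings: 2"
        Console.WriteLine("Unique strings: " + table.Items.Count);
    }
}
```

The index returned by `GetOrAdd` is the number that would go into a cell's <v> element when the cell is marked with t="s".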