Avoiding data problems when importing text-based features

It is common to open a customer's spreadsheet and find unusual coding practices for representing the levels of categorical factors as text-based features. For example, when category levels are combinations of characters, such as AB, AC, AD, and CD, I have seen many variants, such as 'A=B', 'A,B', and 'A;B', as well as columns that are all zeroes with an occasional A.

The stepwise list of solutions below addresses two symptoms seen after importing an Excel spreadsheet: missing text-based feature values, and entire text-based features whose values are invariant (all the same value).

1. A solution that can remedy 50-75% of the data issues related to importing Excel spreadsheets is to turn off the removal of special characters in Explorer's default settings, and then re-import the Excel spreadsheet.

First, open Explorer, and select the General Default popup window:

In the default settings window, uncheck the checkbox called "Remove spaces and special symbols on input", shown as follows:

Now try re-importing the Excel spreadsheet and see whether the data issues have been resolved.

2. If the above step did not resolve the data import issue(s), save the Excel file as a tab-delimited text file. Open the text file with the free text editor Notepad++ so you can visually inspect the positions of the ASCII characters for tab, line feed, and carriage return. If line feeds or carriage returns are used inconsistently from one line to the next, that will cause problems, so edit the file until every line is terminated the same way.
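The visual check in step 2 can be supplemented with a short script. The following is a minimal Python sketch, not part of Explorer; the function names are hypothetical, and it assumes no cell contains an embedded tab or line break. It counts each kind of line terminator and lists the rows whose number of tabs differs from the header row.

```python
import re
from collections import Counter


def line_ending_report(raw: bytes) -> Counter:
    """Count each kind of line terminator found in the raw file bytes."""
    # CRLF is listed first so it is not counted as a separate CR and LF.
    return Counter(m.group() for m in re.finditer(rb"\r\n|\r|\n", raw))


def rows_with_wrong_tab_count(raw: bytes) -> list[int]:
    """Return 1-based numbers of rows whose tab count differs from row 1."""
    rows = raw.splitlines()  # splits on CRLF, CR, and LF alike
    if not rows:
        return []
    expected = rows[0].count(b"\t")
    return [n for n, row in enumerate(rows, start=1)
            if row.count(b"\t") != expected]


# Example with an in-memory sample; for a real file, read it in binary
# mode, e.g. raw = open("data.txt", "rb").read()  (hypothetical filename).
sample = b"id\tcombo\r\n1\tAB\r\n2\tAC\n3\r\n"
print(line_ending_report(sample))
print(rows_with_wrong_tab_count(sample))
```

If the report shows more than one kind of terminator, or any row numbers are listed, those are the places to inspect in Notepad++.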

3. At the bottom of the text file, look for empty rows or results of calculations, such as column sums, column averages, or standard deviations. Data analysts (users) often assume the spreadsheet provider removed any comments or summary statistics they may have added at the bottom of the spreadsheet. That assumption is often wrong, and the leftover rows will cause problems after import. Delete any empty records, comments, or summary statistics found at the bottom of the text file. If you deleted anything, save the file, and then import it as a tab-delimited text (*.txt) file by ensuring the following file-type choice is selected upon import:

If the import of the text file worked after the above step, then continue to use (import) the text file, and not the original Excel file. You could save the text file (once opened in Excel) as an Excel file or a comma-delimited file (.csv) if you wish, but there is no guarantee that converting back will work, as there can be a range of issues with Excel files.

Now that you have saved the Excel spreadsheet as a tab-delimited text file, you can edit the text file directly in Excel by right-clicking on its filename, selecting Open with, and then selecting Excel. This will allow you to see the columns and rows and edit the contents, and after you save the file, the changes will be saved in the text file (.txt).

(Use Excel to edit and save the text file for the steps below.)
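The bottom-of-file check in step 3 can also be sketched in Python. This is an illustration only: the function names and the list of summary labels in SUMMARY_HINTS are assumptions, and a real data row whose first cell happens to start with one of those labels would be flagged too, so review the flagged rows before deleting anything.

```python
import csv
import io

# Assumed labels that a provider might put in the first cell of a summary row.
SUMMARY_HINTS = ("sum", "total", "average", "mean", "std", "stdev")


def is_blank(row: list[str]) -> bool:
    """True when every cell in the row is empty or whitespace."""
    return not any(cell.strip() for cell in row)


def looks_like_summary(row: list[str]) -> bool:
    """True when the first cell starts with one of the assumed labels."""
    return bool(row) and row[0].strip().lower().startswith(SUMMARY_HINTS)


def split_trailing_junk(text: str):
    """Split tab-delimited text into (rows to keep, trailing rows to review)."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    first_junk = len(rows)
    # Walk upward from the last row, never past the header row (index 0).
    while first_junk > 1 and (is_blank(rows[first_junk - 1])
                              or looks_like_summary(rows[first_junk - 1])):
        first_junk -= 1
    return rows[:first_junk], rows[first_junk:]


sample = "id\tvalue\n1\t10\n2\t20\n\nMean\t15\n\t\n"
kept, junk = split_trailing_junk(sample)
print(kept)  # header plus the two data rows
print(junk)  # the blank rows and the "Mean" row
```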

4. Entire text-based feature empty after import. If an entire text-based categorical feature appears empty in the Explorer data sheet after import, it is likely due to special characters in the feature values. To minimize the occurrence of this (empty feature after import), edit the text file using Excel and ensure that no special characters such as a semicolon ";", a comma ",", or an equal sign "=" are used in any of the text-based feature values. If any of these characters are found, replace them with a slash "/" or hyphen "-". Now try importing the text file to see whether the text-based feature values are present for this particular feature.
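For files with many rows, the replacement described in step 4 can be scripted instead of done by hand in Excel. The sketch below is an assumption-laden illustration, not an Explorer feature: the function name is hypothetical, the column names are supplied by the caller, and only the three characters named in step 4 are replaced (with a hyphen). Restricting the change to named columns avoids altering numeric columns that use a comma as a decimal or thousands separator.

```python
import csv
import io

# Map the three characters named in step 4 to a hyphen.
SPECIAL = str.maketrans({";": "-", ",": "-", "=": "-"})


def replace_special_characters(text: str, columns: list[str]) -> str:
    """Replace ';', ',' and '=' with '-' in the named columns only."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    header = rows[0]
    targets = [header.index(name) for name in columns]
    for row in rows[1:]:
        for i in targets:
            if i < len(row):  # skip short or blank rows
                row[i] = row[i].translate(SPECIAL)
    out = io.StringIO(newline="")
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


sample = "id\tcombo\n1\tA=B\n2\tA,B\n3\tA;B\n"
print(replace_special_characters(sample, ["combo"]))
```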

5. Entire text-based feature determined "Invariant" after import. This can occur when text values are sparse and the remaining records (rows) contain, for example, a zero. If the majority of records contain a zero (0) and there is an occasional or infrequent "yes", "Y", or "true", then Explorer will assume (by default) that the feature is numeric because of the majority of zeroes, and will automatically remove (clean out) the rarely observed text values. To remedy this situation, edit the text file using Excel and delete all of the zeroes for this feature, while leaving the original text values. Now import the text file to see whether the original text-based feature values are intact.

The above procedures should resolve the data issues described here that arise after importing an Excel file with Explorer.
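The zero-deletion in step 5 can likewise be sketched in Python. The function names are hypothetical, and the set of zero spellings in ZERO_PLACEHOLDERS is an assumption to adjust for the file at hand. As a safeguard, the sketch blanks the zeroes only when the column also contains at least one non-numeric value, so a genuinely numeric column is left alone.

```python
import csv
import io

ZERO_PLACEHOLDERS = {"0", "0.0"}  # assumed spellings of the filler zero


def is_number(value: str) -> bool:
    """True when the cell text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def blank_zero_placeholders(text: str, column: str) -> str:
    """Empty the zero cells of a column that mixes zeroes with text values."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    index = rows[0].index(column)
    values = [row[index].strip() for row in rows[1:] if index < len(row)]
    # Only treat zeroes as filler when real text values are present.
    has_text = any(v and not is_number(v) for v in values)
    if has_text:
        for row in rows[1:]:
            if index < len(row) and row[index].strip() in ZERO_PLACEHOLDERS:
                row[index] = ""
    out = io.StringIO(newline="")
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


sample = "id\tflag\n1\t0\n2\tyes\n3\t0\n"
print(blank_zero_placeholders(sample, "flag"))
```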

Data Import Tips (large files)

For very large files, save the Excel file as a comma-delimited (.csv) file, and then import it as .csv. Explorer loads .csv files much faster than Excel or text files.
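If the data already exists as a cleaned tab-delimited text file, it can be converted to .csv without going back through Excel. The following is a minimal Python sketch with a hypothetical function name. Python's csv writer wraps any cell that contains a comma in double quotes; whether Explorer accepts quoted cells is not confirmed here, so removing commas from text values first (as in step 4) is the safer route.

```python
import csv
import io


def tab_text_to_csv(text: str) -> str:
    """Convert tab-delimited text to comma-delimited text."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    # Cells containing a comma are quoted automatically by the writer.
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()


sample = "id\tcombo\n1\tA-B\n2\tA-C\n"
print(tab_text_to_csv(sample))
```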