Malformed XML File Cannot Be Parsed

Part II of my story: How iterative workflow improvements helped me keep my sanity

Saturday, April 9, 2016

In Part I of this essay, I discussed the challenges that a really bad process can pose for your sanity -- or at least for mine. When I first started a new job at a non-profit, I discovered that the process that was in place was really a nightmare for an internal creative department: Late requests, confusing processes, incomplete creative briefs and, honestly, not enough time. So I set out to identify what those problems were and figure out how to come up with more robust processes.

This is the ongoing story of how I solved my problems, from the technical side. I am a former editorial systems editor at the Orlando Sentinel, and a real geek when it comes to problem-solving, so this felt like it was tailor-made for me. I didn't have to hire a consultant or pay thousands of dollars a year in software.

Right now, our cost is $17 per month.

So, as Jacques Pepin always starts his shows: Here's how I did it.

Chapter I: In the Beginning, There Was ...

Eric S. Raymond, in his magisterial essay "The Cathedral and the Bazaar," had a quote about software development that I think is really memorable, and also apropos: "Every good work of software starts by scratching a developer's personal itch."

From my perspective, I would extend this to say that every new process starts with something itching you.

So, in Part I, I wrote briefly about our original and new setups. Some of you have probably been wanting to read in more detail just what we did. I'll start by recapping the original system:

  1. Artwork request form is an Excel file.
  2. Users have the choice of e-mail, fax or inter-office mail to deliver requests.
  3. I put a sticky note on the front of the request with forward-looking deadlines: Proof out, proof approved, artwork delivered and artwork approved.
  4. I work on the artwork, more or less immediately.
  5. I e-mail the proof to the clients. They e-mail back revisions. Once they approve, I deliver the file.

There are some core weaknesses that you have probably already identified:

First, whenever a form can be delivered through multiple channels, you don't know where the next request will come from. That means, in effect, that someone has to be continuously monitoring all three channels. That's bad. What's worse, two of the three channels have no redundancy at all, and the third (e-mail) has minimal redundancy because it's a needle-in-haystack problem.

Secondly, the form is an Excel form, which means that it has a bunch of informational-design compromises that I don't like. A great example is that it doesn't have anywhere to mark actual deadlines and approvals, because it's assumed that as soon as it comes in, it has to be worked. It also assumes a pretty minimal workflow, and even at this point, four months into the job, I know that we're going to be expanding the team sooner or later.

Thirdly, the Excel format is not exactly ideal, although it is convenient, because the data it's capturing is unstructured. The work required to make it capture structured data probably isn't worth it, because given the limitations of our IT department (disabled macros, no VBA access, etc.), it's never going to do what I want it to.

There are certainly more issues than what I've mentioned. That's just a starting point, and they were the three that bothered me most.

So it was time for me to do some research.

Lots of off-the-shelf solutions exist to solve the sorts of problems that I have, but they come with a lot of compromises. Even when they solved the problem precisely, and I encountered a handful, often the compromise was an insane cost: A minimum charge of $500 a month or $5,000 a year, packages that required at least 4 or 7 or 15 seat licenses at $10-$50 a month, etc. As I have mentioned before, I work for a non-profit, so cost is really important. Even if I had the budget, I probably couldn't justify it: My department is very much outside of my organization's core competencies, so the incremental improvements to our processes couldn't outweigh the costs.

That meant that I was pretty much settled on something I was comfortable with: Fillable PDF forms. I wanted to distribute an interactive PDF, created in InDesign (solving the form design problem) and marked up with fillable fields in Acrobat.

So now I had a plan! Sort of.

Chapter II: A Man, A Plan, A Canal ...

Fillable PDF forms are really cool. Or, well, they were. I guess they're kind of passé now.

But from a technological standpoint, fillable forms sat athwart a really interesting transitional period. They originated from the period when HTML, and HTML forms especially, were both difficult to design and somewhat unreliable in implementation, and when a lot of firms were looking to digitize forms and looking to use Acrobat in the process.

I'm going to show my age here, but fillable PDFs were still relatively new technology in 2001, when I applied to college. I have a vague recollection that the very first fillable PDF I'd ever seen was the Common Application.

That technology had been adopted rather inconsistently even then. My alma mater, Northwestern University, supplied PDFs of its application packet on its Web site, which was great because I had somehow managed to botch tearing the perforated sheets out of the original booklet supplied. But the PDFs weren't fillable. So, because even at 17 my handwriting had crossed the line from "unreadable" into "chicken scratch," I wrote the application out carefully on one copy and then typed it onto a clean copy using my mom's IBM Wheelwriter.

Anyway, to support this functionality, at some point (I don't know exactly when, and it isn't really germane to the task at hand), Adobe added really basic functionality to transmit form data from submittable PDFs via HTTP.

(This functionality is so basic that it's not any standard HTTP form submission: If you're using PHP, you have to read the data out with file_get_contents('php://input'). Ugh.)

Today, Acrobat can even export this data in two formats: FDF (Forms Data Format) and XFDF (XML Forms Data Format). It can also submit the PDF itself via e-mail (yech) and can even submit application/x-www-form-urlencoded, although that has its own flaws.
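
For readers who haven't seen one, an XFDF file is a small XML document listing field names and their values. The sketch below reads the fields out of a made-up sample. It is written in Python rather than the PHP used for the backend described in this article, and the file name and field names (artwork-request.pdf, requester, due_date) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# XFDF elements live in Adobe's XFDF namespace.
XFDF_NS = {"x": "http://ns.adobe.com/xfdf/"}

# Hypothetical sample; the real form's field names are not given in the article.
SAMPLE_XFDF = """<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <f href="artwork-request.pdf"/>
  <fields>
    <field name="requester"><value>Jane Doe</value></field>
    <field name="due_date"><value>2016-04-15</value></field>
  </fields>
</xfdf>"""


def read_xfdf_fields(xfdf_text):
    """Return a dict mapping each top-level field name to its value text."""
    root = ET.fromstring(xfdf_text.encode("utf-8"))
    fields = {}
    for field in root.findall("x:fields/x:field", XFDF_NS):
        value = field.find("x:value", XFDF_NS)
        has_text = value is not None and value.text
        fields[field.get("name")] = value.text if has_text else ""
    return fields


if __name__ == "__main__":
    print(read_xfdf_fields(SAMPLE_XFDF))
```

The f element's href attribute names the PDF that the data belongs to, which is how a viewer knows which form to fill in when the XFDF file is opened.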

When I worked at the Orlando Sentinel, I developed a bunch of assorted pieces of software in CodeIgniter. It was a great choice for the era (2010-2011) because it was lightweight, could run under PHP4 on our old sandbox server with no software configuration changes, and abstracted away a bunch of stuff that I hated writing, like database CRUD and HTTP input.

So CodeIgniter came with a few major benefits to me. I would develop a minimalist CodeIgniter backend to catch the submitted form data, which would be in $_POST. (Although, it turned out later, XFDF submissions have to be extracted from php://input directly, because something is wrong with the headers.) Then, the application would do a little massaging to clean up the users' input, add redundancy by storing the submitted requests in a database, and generate an e-mail notification to myself with an XFDF file attached that I could open and print.

I loved this idea because it meant that instead of exchanging hundreds of kilobytes with the form, which was around 600KB in size, I could exchange like 15KB of XML data. Better still, all of our sales and marketing staff were capable of filling out PDF files.

So my implementation plan ended up looking something like this:

  1. PDF with fillable fields handled by end users in Acrobat.
  2. Acrobat submits form data to CodeIgniter.
  3. CodeIgniter massages data a little to my specifications, stores it in the database, and e-mails me the XFDF file as an attachment.
  4. I open the XFDF file. Acrobat automatically imports the fillable PDF on my end and fills in the form fields. I print the file out.
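
Step 3 of the plan above can be sketched roughly as follows. This is not the CodeIgniter code from the actual system: it is a Python illustration that uses an in-memory SQLite database in place of MySQL, and the table layout, addresses and file names are all hypothetical. It builds the notification e-mail but deliberately stops short of sending it.

```python
import sqlite3
from email.message import EmailMessage


def store_request(conn, fields, raw_xfdf):
    """Keep a redundant copy of the submission and return its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "id INTEGER PRIMARY KEY, requester TEXT, raw_xfdf TEXT)"
    )
    # Parameter binding (the ? placeholders) guards against SQL injection.
    cur = conn.execute(
        "INSERT INTO requests (requester, raw_xfdf) VALUES (?, ?)",
        (fields.get("requester", ""), raw_xfdf),
    )
    conn.commit()
    return cur.lastrowid


def build_notification(request_id, fields, raw_xfdf, to_addr):
    """Build (but do not send) an e-mail with the XFDF file attached."""
    requester = fields.get("requester", "unknown")
    msg = EmailMessage()
    msg["Subject"] = f"Artwork request #{request_id} from {requester}"
    msg["From"] = "forms@example.org"  # hypothetical sender address
    msg["To"] = to_addr
    msg.set_content("A new artwork request is attached as an XFDF file.")
    msg.add_attachment(
        raw_xfdf.encode("utf-8"),
        maintype="application",
        subtype="vnd.adobe.xfdf",
        filename=f"request-{request_id}.xfdf",
    )
    return msg


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    fields = {"requester": "Jane Doe"}
    request_id = store_request(conn, fields, "<xfdf/>")
    print(build_notification(request_id, fields, "<xfdf/>", "me@example.org")["Subject"])
```

Storing the raw XFDF alongside the parsed fields means a request can be re-sent or re-opened later even if the e-mail goes missing.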

The key to making this work was that I would keep a blank file on my own workstation, with all the same form fields as the original fillable PDF.

This actually didn't take me very long to implement at all in a CodeIgniter backend. Everything except the XFDF-related code is essentially stock CodeIgniter functionality, and XFDF files are pretty much just relatively unstructured XML files. The major hiccup was the insanity involving $_POST not being filled, as I've already mentioned; I spent hours trying to figure out what was wrong. Eventually I discovered, thanks to Dave Merchant's answer on the Adobe Forums, that because Adobe isn't submitting the data as application/x-www-form-urlencoded or multipart/form-data, I had to pull it out of php://input directly. After all, why would you implement something standard when you're Adobe and you want people to buy LiveCycle instead?
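
The underlying rule is that PHP only fills $_POST when the request's Content-Type is one of those two form encodings; anything else has to be read from the raw request body. A rough Python illustration of the same dispatch is below. The function name is made up, and the XFDF content type shown in the usage example (application/vnd.adobe.xfdf) is an assumption about what Acrobat sends rather than something confirmed in this article.

```python
from urllib.parse import parse_qs


def classify_submission(content_type, body):
    """Decide how to read a request body based on its Content-Type header.

    Returns ("form", dict) for URL-encoded form posts and ("raw", bytes)
    for everything else, which the caller must parse itself.
    """
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/x-www-form-urlencoded":
        parsed = parse_qs(body.decode("utf-8"), keep_blank_values=True)
        # Keep only the first value of each field, for simplicity.
        return "form", {name: values[0] for name, values in parsed.items()}
    return "raw", body


if __name__ == "__main__":
    print(classify_submission("application/x-www-form-urlencoded", b"a=1&b="))
    print(classify_submission("application/vnd.adobe.xfdf", b"<xfdf/>"))
```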

Anyway, as I said, in the end I got the CodeIgniter backend implemented, and tested, and we were able to roll out the fillable file.

For a while, this was nirvana. I got what I wanted and the clients got a structured form to fill out. But I think you're sensing a theme.

Chapter III: Things Fall Apart

Unfortunately for me, this was a somewhat brittle solution even in the best of times.

Just to take one example, I hadn't implemented any sanitization routines on the data that was input, beyond basic SQL-injection prevention. After all, I couldn't predict what people would enter or how it would get there (straight entry or copy-and-paste from various applications). Worse, it wasn't actually clear to me that I could even predict what would cause Acrobat to gag, because the specifications were so bad.

Every now and again, someone would enter something (probably via un-sanitized copy-and-paste from a Web page) that caused an error when we opened the XFDF file:

"The malformed XML file cannot be parsed."

To fix this, I would have to edit the raw XML file and find the error. Usually it was something stupid, like something from Microsoft's miserable XML specification, but every now and again I'd spend a while trying to sort it out.
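
One way to guard against this kind of failure is to clean each value and check the result before the file ever reaches Acrobat. The Python sketch below assumes the culprits are characters that XML 1.0 forbids (most control characters) and unescaped markup such as a bare ampersand. That is a guess: the exact input that broke these files isn't recorded here, and the function names are made up.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Anything outside the character ranges that XML 1.0 permits.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_field_value(text):
    """Drop characters that can never appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def field_xml(name, value):
    """Serialize one field, escaping &, < and > in the cleaned value."""
    return "<field name=%s><value>%s</value></field>" % (
        quoteattr(name),
        escape(clean_field_value(value)),
    )


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    pasted = "Tom & Jerry <b>\x0b"
    print(is_well_formed("<value>" + pasted + "</value>"))
    print(is_well_formed(field_xml("notes", pasted)))
```

Running the well-formedness check on the server, before the e-mail goes out, turns an opaque Acrobat error into something that can be logged and fixed at the source.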

This was, obviously, bad. But it wasn't even the worst of it. There were essentially two incidents that got me thinking that it was time to reconsider this approach -- and they happened fairly close together.

The first was that Acrobat's debugging functionality for submitting web-based forms is quite poor. I've already mentioned that we had a lot of trouble building the system itself. But whenever something went wrong, we encountered a completely different set of problems: Utterly opaque failure messages.

In order to even remotely see what was going on under the hood, I would have to simultaneously dive into the MySQL database, the server logs, and any debugging I could squeeze out of CodeIgniter. In the end, I actually wrote end points for my pseudo-API that would, separately, capture and submit the data, so I could take a look at what was being POST'ed, make changes, and submit manually.
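
The capture-and-resubmit idea can be sketched like this. It is a Python illustration under assumed names, not the PHP endpoints described above: the capture file layout, the function names and the localhost URL are all hypothetical. The replay half only builds the request; passing it to urllib.request.urlopen would actually send it.

```python
import base64
import json
import urllib.request
from pathlib import Path


def capture(directory, content_type, body):
    """Save one submission (header plus raw bytes) to a numbered JSON file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    count = len(list(directory.glob("capture-*.json")))
    path = directory / f"capture-{count + 1:04d}.json"
    record = {
        "content_type": content_type,
        # Base64 keeps arbitrary bytes intact inside JSON.
        "body_b64": base64.b64encode(body).decode("ascii"),
    }
    path.write_text(json.dumps(record))
    return path


def build_replay(path, url):
    """Rebuild a captured submission as a POST request, ready to send."""
    record = json.loads(Path(path).read_text())
    return urllib.request.Request(
        url,
        data=base64.b64decode(record["body_b64"]),
        headers={"Content-Type": record["content_type"]},
        method="POST",
    )


if __name__ == "__main__":
    import tempfile

    saved = capture(tempfile.mkdtemp(), "application/vnd.adobe.xfdf", b"<xfdf/>")
    request = build_replay(saved, "http://localhost:8000/submit")
    print(request.get_method(), request.full_url, request.data)
```

Because the capture file is plain JSON, the body can be decoded, edited by hand and re-encoded before replaying, which matches the look-change-resubmit loop described above.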

But even that didn't come close to the pain of the second major point of failure.

You see, we've already established that Adobe's requests don't really conform to a standard HTML form submission. (Why? you might be asking. Look, if I knew, this series of articles would be many thousands of words shorter.)

One day, our IT department rolled out a new firewall system across the entire company. This caused the fillable forms to simply fail... with only a "Your request has failed." error. And of course, because that's Murphy's Law, it failed on the day that requests were due to be submitted.

So I called IT. We verified that my website (that would be wesmeltzer.com) was whitelisted through the firewall. We verified that the traffic was traveling over Port 80, which was permitted for whitelisted sites. After listening to the guys in IT debug a lot of stuff that I didn't understand, eventually, they tracked down some sort of security feature that prevented outbound requests that looked like they were coming from bots.

That is the status of non-standard requests in 2016, then: Anything sufficiently unorthodox looks like it might be a security compromise.

So now, I was completely dead in the water. I asked my clients to start printing and faxing the requests to us, so at least we could get something working.

It was time for a radical new solution.

Read more in Part III.