I feel surprised by how succinct, easy-to-understand, and sensible the policy (M-23-22) is:
> Default to HTML: HyperText Markup Language (HTML) is the standard for publishing documents designed to be displayed in a web browser. HTML provides numerous advantages (e.g., easier to make accessible, friendlier to assistive technology, more dynamic and responsive, easier to maintain). When developing information for the web, agencies should default to creating and publishing content in an HTML format in lieu of publishing content in other electronic document formats that are designed for printing or preserving and protecting the content and layout of the document (e.g., PDF and DOCX formats). An agency should develop online content in a non-HTML format only if necessitated by a specific user need.
Hmmm ... accessibility is essential, but PDF is far better for static documents: There's no straightforward, standard way to read an HTML document on another platform. Also, the HTML document may not be readable in 10+ years (unlike most PDFs), and updates are too fluid and hard to track.
I think the general problem is that the end-user doesn't control an html document, e.g., for annotation, as a local record, etc.
...What are you talking about? HTML files are readable on basically every platform, even more so because they are fundamentally text files (unlike PDFs, which are binaries). PDFs need special software; HTML can be read on the command line. Likewise, HTML is dead simple to edit and annotate.
Seriously, name a single device that has PDF support that doesn't allow you to view HTML.
I think you're conflating "html" and "things stored on a server", because all of your objections apply to PDFs stored on a server. The ability to save and annotate PDFs is not an inherent feature of the file format; those tools exist because the format is such a PITA to interact with that specialized programs have to be written. HTML can be saved just as easily, and usually is (on archive.org).
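To make the "fundamentally text files" point concrete, here is a minimal sketch that pulls the readable text out of an HTML string using only Python's standard library. The names `TextExtractor` and `html_to_text` are made up for this illustration, and it assumes reasonably well-formed markup; it is not a substitute for a real text-mode browser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup: str) -> str:
    """Return the text content of `markup`, one text run per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `html_to_text(open("policy.html", encoding="utf-8").read())` on a saved page (the filename is hypothetical) would print its text without any special viewer.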
1. Saving as "Webpage, Single File" (.mhtml): Neither Firefox nor Chrome even showed up in the list of available apps to open it.
2. Saving as "Webpage, Complete": Opened in Chrome but images were broken. Also very difficult to open with the default file browser because it uses a flat folder view and the sidecar folder pollutes the file list.
I was hoping this would work, perhaps you will have different findings. I agree that HTML is the superior format in theory but usability in practice is often lacking. I'm resigned to using both depending on context.
Yes, that's the kind of issue I was talking about. I wish it were otherwise. As a nearby comment pointed out, epub is a potential solution (and I wish arXiv embraced it - without my knowing their other requirements or epub's accessibility features). It's essentially packaged html.
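For anyone wondering what "packaged html" means in practice: an epub is a ZIP archive with a `mimetype` entry, a `META-INF/container.xml` that points at the package document, and the (X)HTML content files. A rough sketch of peeking inside one with Python's standard library follows; `describe_epub` is a name invented for this example, and it assumes an unencrypted, conventionally laid out file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by META-INF/container.xml in the EPUB container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def describe_epub(data: bytes) -> dict:
    """Summarize an epub given its raw bytes (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        mimetype = archive.read("mimetype").decode("ascii").strip()
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        html_files = sorted(
            name for name in archive.namelist()
            if name.endswith((".xhtml", ".html"))
        )
        return {
            "mimetype": mimetype,
            "package_document": rootfile.get("full-path") if rootfile is not None else None,
            "html_files": html_files,
        }
```

The content files it lists are ordinary XHTML, which is why existing web tooling carries over to the format.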
How do I save an HTML document locally, and annotate it, in an easily sharable form, and in a form that is stable - i.e., in a way that will be readable and useable in 20-50 years?
Basically any HTML document from 20-30 years ago (can't go any further because it didn't exist 50 years ago) will be completely readable and usable. The only issue is people creating content (not styling) in formats besides HTML.
As far as annotations go, you can use the native <ruby>[1] tag, or strikethrough, but if you mean "literally drawing on the text" then, yeah, you're looking for an image format at that point (which is fundamentally what PDF is), but we shouldn't default to storing text in image formats just because of one specific use case. (Also, as I said above, the only reason tools exist to easily do that in PDFs is that everyone insists on using a format that's hard to edit.)
Also, note that the context I was responding to was US legal documents, not something more presentation-heavy.
You say it as if pdf is somehow better. To begin with it's a proprietary format. If Adobe goes bankrupt or obscure tomorrow, pdf will go out of use as a failed technology.
I mean, how do I save it locally on one platform and read it on any platform? Or share it with someone else to read (without them downloading software)? I.e., we don't have a standard, local, single-file html format.
We could have such a format if browser and os vendors were interested in supporting such a use case. Unfortunately, they aren't.
On the browser side, supporting all-in-one HTML files can be as simple as reading a single multipart-encoded page. Heck, if they supported automatically serializing all external resources as data URIs when saving pages, then most browsers would be able to open them without any modification.
On the OS side, operating systems can treat HTML files as first-class citizens: execute them in an offline sandbox (most operating systems have embedded webviews), then extract the icon, title, description, and other metadata to present to the user. An icon that consists of a blank page with a small browser icon in the corner doesn't tell me anything about what the page is about. This needs to change.
In short, HTML can easily be made nicer to deal with locally, thanks to all the parts already being in place. The problem is that no one (tech giants, OS vendors) is interested in doing this.
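The data-URI idea mentioned above is easy to prototype. This is only a sketch under some loud assumptions: it handles `<img src="...">` attributes written with double quotes, only inlines files that exist under a local `base_dir`, and uses a regex rather than a real HTML parser. `inline_images` and `to_data_uri` are names made up for the example, not anything a browser ships.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of <img> tags (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data: URI."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(markup: str, base_dir: Path) -> str:
    """Replace local <img> sources with data URIs; leave everything else alone."""

    def replace(match):
        src = match.group(2)
        # Remote or already-inlined sources are left untouched.
        if src.startswith(("data:", "http:", "https:", "//")):
            return match.group(0)
        candidate = base_dir / src
        if not candidate.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(candidate) + match.group(3)

    return IMG_SRC.sub(replace, markup)
```

The output is a single `.html` file that opens in any browser with its images intact, at the cost of roughly a third more bytes per image from the base64 encoding.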
.mhtml (or .mhtm) is that format. It's an archive containing an HTML file along with all the resources it references (JavaScript, CSS, and images). These browsers support it: Internet Explorer, Edge, Opera, Chrome, Yandex, and Vivaldi. Create one by saving the web page and choosing the .mhtml format. Safari supports another format called webarchive.
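Under the hood an .mhtml file is a MIME multipart message, so it can be inspected with an ordinary email parser. A minimal sketch using Python's standard `email` package follows; `list_mhtml_parts` is a hypothetical helper name, and it assumes each part carries the standard `Content-Location` header (parts without one are reported with an empty location).

```python
from email import message_from_bytes


def list_mhtml_parts(raw: bytes):
    """Return (content_type, location, decoded_size) for each leaf part."""
    message = message_from_bytes(raw)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart container itself
        location = part.get("Content-Location", "")
        # decode=True undoes base64 / quoted-printable transfer encoding.
        body = part.get_payload(decode=True) or b""
        parts.append((part.get_content_type(), location, len(body)))
    return parts
```

Running it on `Path("page.mhtml").read_bytes()` (a hypothetical saved page) lists the HTML document plus every bundled stylesheet, script, and image.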
> I mean, how do I save it locally on one platform and read it on any platform?
Ctrl/Meta/Cmd + S should do the trick, or "File > Save page", and you get an HTML file you can open in any browser. If there are images, they'll most likely be loaded remotely, or in the worst case not load at all. But the rest of the structure is there.
A web page is much more than one file. Also, I'm looking for something with end-user control, where they can save the current document statically and long-term.
Despite all our advances, we lack an editable, local, multimedia, platform (and form-factor) independent, self-contained file - essentially a word-processing file for the 21st century (and I mean it's almost a quarter-century overdue). epub has that potential as a format, and being based on web standards it has capability, a universe of supporting tools and technology, and easy adoption to different applications.
But I haven't heard anyone else express that particular interest, and as of a few years ago epub didn't allow annotations and wasn't stable (i.e., I don't know that today's epub file will be readable in 20 or 50 years) - two essential requirements for serious local content, imho.
And even if it meets those specifications, we need epub editors that are the equivalent of word processors for non-technical users.
https://www.whitehouse.gov/omb/management/ofcio/delivering-a...