File downloads in real browsers

Because file downloads are a tricky topic (at least with HtmlUnit), let's start from the beginning with some background on file downloads in real browsers.
When surfing the internet, you usually open URLs pointing to content (files) containing (X)HTML. Your browser consumes this content and renders it on your screen. Browsers can also handle some other file types, such as CSS and JavaScript, and of course all kinds of image, sound, and video files. All of this happens behind the scenes, so you simply see what the page has to offer.
Only in three cases does your browser hand control over the content handling to you:

  • the file type (MIME type) is not supported by your browser (e.g. an Excel or Word file),
  • the Content-Disposition header of the response flags the response as an attachment, or
  • an anchor that has the 'download' attribute set is clicked. This last case is triggered from the client (browser) side.

In all cases, the content is saved as a file on your local disk. Depending on user settings, this is done automatically or by presenting a file dialog that lets you choose where to save the file.

File downloads with HtmlUnit

Because HtmlUnit is a headless browser, there is no real rendering. But like real browsers, HtmlUnit 'understands' the pages delivered by the server. If the response type is supported, the content is available as an HtmlPage (or XHtmlPage), including access to all the embedded resources such as JavaScript, CSS, and images. Like real browsers, HtmlUnit also supports text-only content (TextPage) and plain XML content (XmlPage).

But HtmlUnit can't offer a file dialog because its main use cases are automated testing and scraping. For this reason, the current implementation offers two ways to handle downloads:

UnexpectedPage (default)

HtmlUnit handles unknown content in the same way as known content: the content is wrapped in a page object and the page is placed inside the window. For all unknown content HtmlUnit uses the UnexpectedPage. In most cases the UnexpectedPage replaces the current (Html)Page in the current window, but in some cases an additional window is opened (e.g. when clicking an anchor that has the download attribute set).

You can then access the plain content stream from the enclosed UnexpectedPage.

try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
    HtmlPage page = webClient.getPage(uri);
    WebWindow window = page.getEnclosingWindow();

    .... // click some anchor/button that forces a file download

    // the window now holds the download wrapped in an UnexpectedPage
    UnexpectedPage downloadPage = (UnexpectedPage) window.getEnclosedPage();

    try (InputStream downloadedContent = downloadPage.getInputStream()) {

        // e.g. save the input to a local file
        ....
    }
}
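
The "save the input to a local file" step can be done with the JDK alone. The following is a minimal sketch and not part of HtmlUnit: the class name DownloadSaver and the method saveTo are made up for illustration, and a ByteArrayInputStream stands in for the stream returned by downloadPage.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not an HtmlUnit class.
public class DownloadSaver {

    /**
     * Copies the complete stream to the target file and returns the
     * number of bytes written. An existing target file is replaced.
     */
    public static long saveTo(final InputStream content, final Path target) throws IOException {
        return Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(final String[] args) throws IOException {
        // stand-in for the content of downloadPage.getInputStream()
        final byte[] data = "example content".getBytes(StandardCharsets.UTF_8);
        final Path target = Files.createTempFile("download", ".bin");

        try (InputStream in = new ByteArrayInputStream(data)) {
            final long written = saveTo(in, target);
            System.out.println(written + " bytes written to " + target);
        }
    }
}
```

In the HtmlUnit example above, you would call saveTo(downloadedContent, target) inside the inner try block, with a target path of your choice.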

If the download is placed inside a new window, you can do something like this:

List<WebWindow> windows = webClient.getWebWindows();
WebWindow newWindow = windows.get(windows.size() - 1);

AttachmentHandler

If you don't like the default behavior, you can implement your own way of processing. This is done by registering your own implementation of the AttachmentHandler interface in the WebClient.

The AttachmentHandler-based download support works in addition to the default UnexpectedPage-based support. This means that if your AttachmentHandler does not handle the content, the WebClient falls back to the default behavior and places an UnexpectedPage inside the window. You can override the method isAttachment() in your AttachmentHandler to handle only specific responses; the default implementation only detects responses that have a Content-Disposition header of type 'attachment'.
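
To illustrate what a Content-Disposition header of type 'attachment' means, here is a small stand-alone sketch of such a check. It is an assumption about how one might write the test and not HtmlUnit's actual implementation; the class ContentDispositionCheck and its method are hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not an HtmlUnit class.
public class ContentDispositionCheck {

    /** Returns true if the header value has the disposition type 'attachment'. */
    public static boolean isAttachment(final String contentDisposition) {
        if (contentDisposition == null) {
            return false;
        }
        // the disposition type is the part before the first ';'
        final int semicolon = contentDisposition.indexOf(';');
        final String type = semicolon < 0
                ? contentDisposition
                : contentDisposition.substring(0, semicolon);
        // the disposition type is compared case-insensitively
        return "attachment".equals(type.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(final String[] args) {
        System.out.println(isAttachment("attachment; filename=\"report.xlsx\""));
        System.out.println(isAttachment("inline"));
    }
}
```

A custom isAttachment() could combine a check like this with other criteria, such as the content type of the response.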

The method

boolean handleAttachment(final WebResponse response)
is called if the response was detected as an attachment (see isAttachment() above). You can process the attachment in your implementation (e.g. by saving it to a file) or simply return false. Depending on the return value, the AttachmentHandler supports two modes of operation.
  1. Returning true signals to the rest of the code that the response has been handled by your code; the current page is NOT replaced with an UnexpectedPage, and the method
    void handleAttachment(Page page)
    from your AttachmentHandler is NOT called.
  2. Returning false causes the response to be processed further like this:
    • first, a new window is created,
    • next, the UnexpectedPage is built,
    • then the method
      void handleAttachment(Page page)
      from your AttachmentHandler is called, and
    • finally, the UnexpectedPage is placed inside the new window.
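
The order of steps in the two modes can be replayed with a toy model. This sketch does not use HtmlUnit at all: the class AttachmentFlowModel is hypothetical, and it only records the steps listed above as strings so that the difference between returning true and returning false becomes visible.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration; it does not call any HtmlUnit code.
public class AttachmentFlowModel {

    /**
     * Replays the steps described above and returns them in order.
     * The parameter is the value a handler would return from
     * handleAttachment(WebResponse).
     */
    public static List<String> process(final boolean handledByResponseCallback) {
        final List<String> steps = new ArrayList<>();

        steps.add("handleAttachment(WebResponse) called");
        if (handledByResponseCallback) {
            // mode 1: nothing else happens, the current page stays in place
            return steps;
        }

        // mode 2: continue with the page-based processing
        steps.add("new window created");
        steps.add("UnexpectedPage built");
        steps.add("handleAttachment(Page) called");
        steps.add("UnexpectedPage placed in new window");
        return steps;
    }

    public static void main(final String[] args) {
        System.out.println(process(true));
        System.out.println(process(false));
    }
}
```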

The following example code collects the attachment responses in a list without replacing the content of the current window.

final List<WebResponse> attachments = new ArrayList<>();

try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {

    webClient.setAttachmentHandler(new AttachmentHandler() {
        @Override
        public boolean handleAttachment(final WebResponse response) {
            // collect the response and signal that it has been handled
            attachments.add(response);
            return true;
        }

        @Override
        public void handleAttachment(final Page page) {
            // not expected to be called because the method above returns true
            throw new IllegalAccessError("handleAttachment(Page) called");
        }
    });

    // start browsing
    HtmlPage page = webClient.getPage(uri);

    ....
}

For more details check the AttachmentHandler javadoc.

There is one special case: if you click an anchor with the download attribute set, the processing of the request is passed to the AttachmentHandler (if there is one) without calling the isAttachment() method. The response is always treated as an attachment in this case.