Python Pandas - Working with HTML Data



The Pandas library provides extensive functionalities for handling data from various formats. One such format is HTML (HyperText Markup Language), which is a commonly used format for structuring web content. The HTML files may contain tabular data, which can be extracted and analyzed using the Pandas library.

An HTML table is a structured format used to represent tabular data in rows and columns within a webpage. Extracting this tabular data from an HTML is possible by using the pandas.read_html() function. Writing the Pandas DataFrame back to an HTML table is also possible using the DataFrame.to_html() method.

In this tutorial, we will learn about how to work with HTML data using Pandas, including reading HTML tables and writing the Pandas DataFrames to HTML tables.

Reading HTML Tables from a URL

The pandas.read_html() function is used for reading tables from HTML files, strings, or URLs. It automatically parses <table> elements in HTML and returns a list of pandas.DataFrame objects.

Example

Here is the basic example of reading the data from a URL using the pandas.read_html() function.

import pandas as pd

# Read HTML table from a URL
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url)

# Access the first table from the URL
df = tables[0]

# Display the resultant DataFrame
print('Output First DataFrame:', df.head())

Following is the output of the above code −

Output First DataFrame:
ID NAME AGE ADDRESS SALARY
0 1 Ramesh 32 Ahmedabad 2000.0
1 2 Khilan 25 Delhi 1500.0
2 3 Kaushik 23 Kota 2000.0
3 4 Chaitali 25 Mumbai 6500.0
4 5 Hardik 27 Bhopal 8500.0

Reading HTML Data from a String

Reading HTML data directly from a string can be possible by using the Python's io.StringIO module.

Example

The following example demonstrates how to read the HTML string using StringIO without saving to a file.

import pandas as pd
from io import StringIO

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])

Following is the output of the above code −


C1 C2 C3
0 a b c
1 x y z

Example

This is an alternative way of reading the HTML string with out using the io.StringIO module. Here we will save the HTML string into a temporary file and read it using the pandas.read_html() function.

import pandas as pd

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Save to a temporary file and read
with open("temp.html", "w") as f:
    f.write(html_str)

df = pd.read_html("temp.html")[0]
print(df)

Following is the output of the above code −


C1 C2 C3
0 a b c
1 x y z

Handling Multiple Tables from an HTML file

While reading an HTML file of containing multiple tables, we can handle it by using the match parameter of the pandas.read_html() function to read a table that has specific text.

Example

The following example reads a table that has a specific text from the HTML file of having multiple tables using the match parameter.

import pandas as pd

# Read tables from a SQL tutorial
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')

# Access the table
df = tables[0]
print(df.head())

Following is the output of the above code −


Field Type Null Key Default Extra
1 ID int(11) NO PRI NaN NaN
2 NAME varchar(20) NO NaN NaN NaN
3 AGE int(11) NO NaN NaN NaN
4 ADDRESS char(25) YES NaN NaN NaN
5 SALARY decimal(18,2) YES NaN NaN NaN

Writing DataFrames to HTML

Pandas DataFrame objects can be converted to HTML tables using the DataFrame.to_html() method. This method returns a string if the parameter buf is set to None.

Example

The following example demonstrates how to write a Pandas DataFrame to an HTML Table using the DataFrame.to_html() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to HTML table
html = df.to_html()

# Display the HTML string
print(html)

Following is the output of the above code −

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1</th>
      <td>3</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
Advertisements
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy