DOC: min_itemsize for HDFStore append for encoded strings #14601

@johanneshk

Description

I'm confused about how to preset min_itemsize when appending to an HDFStore. Say DataFrames a and b in the MWE below are user-provided, so they can contain any character and the encoding is unknown. Appending a works, but appending b fails, even though:

In [4]: len('香')
Out[4]: 1

So far I have simply used str.len().max() on the string columns to get the numbers for min_itemsize, but this does not work in the example here. This MWE is of course simplified, but I guess I'm wondering:

  • How does PyTables come up with the string length?
  • How should I determine the string length, given that the encoding is unknown? Does PyTables assume some encoding, or does it convert the strings to some other object?

In this toy example I could encode the string as utf-8 to get the correct length, but this isn't a general approach:

In [5]: len('香'.encode('utf-8'))
Out[5]: 3
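
To make the mismatch concrete, here is a minimal sketch that compares character counts with encoded byte counts for the columns of b. It assumes the store encodes strings as UTF-8, which is consistent with the len [3] in the error but is not confirmed anywhere in this report. The helper max_encoded_len is a hypothetical name, not part of pandas or PyTables.

```python
def max_encoded_len(values, encoding="utf-8"):
    """Largest byte length of any string in values once encoded.

    Hypothetical helper; the default encoding is an assumption.
    """
    return max((len(str(v).encode(encoding)) for v in values), default=0)


# Columns of DataFrame b from the MWE, written as plain lists.
columns = {"A": ["香"], "B": ["b"]}

# Character counts, which is what str.len().max() reports.
char_sizes = {name: max(len(v) for v in vals) for name, vals in columns.items()}

# Byte counts after encoding, which is what a fixed-width byte column has to hold.
byte_sizes = {name: max_encoded_len(vals) for name, vals in columns.items()}

print(char_sizes)  # {'A': 1, 'B': 1}
print(byte_sizes)  # {'A': 3, 'B': 1}
```

A pandas Series is iterable over its values, so max_encoded_len(b['A']) should work the same way. The result depends on the encoding: the same character takes 2 bytes in UTF-16-LE and 4 bytes in UTF-32-LE, which is why the question of which encoding is applied matters.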

MWE:

import pandas as pd

a = pd.DataFrame([['a', 'b']], columns=['A', 'B'])
b = pd.DataFrame([['香', 'b']], columns=['A', 'B'])

store = pd.HDFStore('/tmp/tmpstore')

store.append('df', a, min_itemsizes={'A': 1, 'B': 1})
store.append('df', b, min_itemsizes={'A': 1, 'B': 1}) # fails

Expected Output

ValueError: Trying to store a string with len [3] in [values_block_0] column but
this column has a limit of [1]!
Consider using min_itemsize to preset the sizes on these columns
Closing remaining open files:/tmp/tmpstore...done

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.28-2-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_DE.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.23.5
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None
