Spark: auto-infer column types when every column is a string

2024-07-29 spark

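Suppose a table has many columns and every one of them is stored as a string. The problem is how to infer the real types. One approach is to give Spark a sample JSON document and let schema_of_json derive the schema from it.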

from pyspark.sql.functions import col, from_json, regexp_replace, schema_of_json, to_timestamp

sample_json = '''
    {
        "a": "123",
        "b": true,
        "c": "2022-01-01T18:00:00+08:00[Asia/Shanghai]"
    }
'''

# Provide a sample JSON document and let Spark infer the schema from it.
# schema_of_json returns a Column that holds the schema as a DDL string.
schemas = schema_of_json(sample_json)

spark = ...
df = ...
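# "payload_json" is assumed to be the name of the string column holding the raw JSON.
parsed = (
    df.select(from_json(col("payload_json"), schemas).alias("data")).select("data.*")
    # Strip fractional seconds and the "[Zone/Id]" suffix from "c", then parse it as a timestamp.
    .withColumn("created_time", to_timestamp(regexp_replace(col("c"), r'\.\d+|\[.*\]', ''), "yyyy-MM-dd'T'HH:mm:ssXXX"))
    # ...
)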
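
To see what the inferred schema depends on, here is a rough pure-Python approximation of inferring types from one sample document. It is only a sketch and not Spark's implementation: the helper infer_flat_schema and the SPARK_TYPE_NAMES mapping are hypothetical names made up for illustration, and it handles only a flat object of scalar values. The point is that inference follows the JSON value types, so a quoted "123" comes out as a string while an unquoted 123 comes out as a number.

```python
import json

# Assumed, simplified mapping from JSON scalar types to Spark SQL type names.
SPARK_TYPE_NAMES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def infer_flat_schema(sample: str) -> dict:
    """Return {field: type name} for a flat JSON object of scalar values."""
    return {key: SPARK_TYPE_NAMES[type(value)] for key, value in json.loads(sample).items()}


# Quoted values are strings, whatever they look like.
inferred = infer_flat_schema('{"a": "123", "b": true, "c": "2022-01-01T18:00:00+08:00"}')

# Writing the sample value in its real type changes the inferred type.
retyped = infer_flat_schema('{"a": 123, "b": true, "c": "2022-01-01T18:00:00+08:00"}')
```

Under these assumptions the timestamp-like field is still a string after inference, which is why a separate conversion such as to_timestamp is still needed for it.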

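The timestamp cleanup can be checked outside Spark. The sketch below applies the same regular expression in plain Python and parses the result with the standard library. The function name normalize_timestamp is hypothetical, and Python's strptime stands in for Spark's to_timestamp, so the format string differs from the Spark pattern.

```python
import re
from datetime import datetime, timedelta


def normalize_timestamp(raw: str) -> datetime:
    """Drop fractional seconds and a trailing [Zone/Id], then parse the offset form."""
    cleaned = re.sub(r'\.\d+|\[.*\]', '', raw)
    # %z accepts offsets written with a colon, such as +08:00 (Python 3.7+).
    return datetime.strptime(cleaned, "%Y-%m-%dT%H:%M:%S%z")


parsed_ts = normalize_timestamp("2022-01-01T18:00:00.123+08:00[Asia/Shanghai]")
```

Note that the offset needs a two-digit hour (+08:00) for this parse to succeed.
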
Permalink: https://blog.tianshiming.com/2024/07/spark-auto-infer-col-type/

License: CC BY 4.0 CN

Mingtianshiming5@outlook.com
