Tepetzintla Zacatlan Nahuatl Endangered Language
Summary: Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi)
License: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
Downloads (use a mirror closer to you):
info.pdf [94K] (Document with information about this corpus) Mirrors: [US] [EU] [CN]
Tepetzintla.zip [38G] (Speech data of Tepetzintla Zacatlan Nahuatl, recorded at 48 kHz, 16-bit) Mirrors: [US] [EU] [CN]
Tepetzintla-Zacatlan-Nahuatl_Collaborators.txt [6.1K] (List of all native speaker collaborators for this corpus) Mirrors: [US] [EU] [CN]
Tepetzintla-Zacatlan-Nahuatl_File-list.txt [61K] (List of all filenames with duration) Mirrors: [US] [EU] [CN]
Plant-collections_Tepetzintla.csv [19K] (List of all plant observations with observation number, family, scientific name, date collected, and name of person who identified the plant) Mirrors: [US] [EU] [CN]
Plant-Labels_Tepetzintla-Zacatlan-ethnobotanical-field-trips_2023-10-22.pdf [717K] (Labels for the 81 plant observations whose audio is included in this corpus) Mirrors: [US] [EU] [CN]
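Because the speech data is described as 48 kHz, 16-bit, a quick check after unpacking can confirm that a given file matches. The following is a minimal sketch only: it assumes the archive contains uncompressed WAV files (the listing does not state the container format), and the function name is hypothetical.

```python
import wave

# Values taken from the download description: 48 kHz sampling rate, 16-bit samples.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def matches_stated_format(path):
    """Return True if the WAV file at `path` is 48 kHz with 16-bit samples.

    Assumption: the file is an uncompressed WAV that Python's standard
    `wave` module can read; other containers would need a different reader.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
```

Calling `matches_stated_format` on each unpacked file and collecting the paths that return False would flag recordings that differ from the stated format.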
About this resource:
The (ethno)botanical labels for all 242 plant collections in the municipality of Tepetzintla are included as reference (see the PDF file named Plant-Labels_Tepetzintla-Zacatlan-ethnobotanical-field-trips_2023-10-22.pdf). A reduced comma-delimited CSV file contains the metadata for collection number, family, scientific name, date collected, and name of person who identified the plant. As plants continue to be identified with their scientific names from the field photos taken, these files will be updated. Note that only collections from 84091 to 84239 are accompanied by field recordings.
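To pick out the CSV rows that fall in the range accompanied by field recordings (84091 to 84239), something like the following sketch may help. The exact column headers of Plant-collections_Tepetzintla.csv are not given here, so the default `number_column` value is a hypothetical placeholder and should be replaced with the actual header; the function name is likewise illustrative.

```python
import csv

# Range of collection numbers described as accompanied by field recordings.
FIRST_RECORDED = 84091
LAST_RECORDED = 84239


def collections_with_audio(csv_lines, number_column="collection_number"):
    """Yield CSV rows whose collection number lies in the recorded range.

    `csv_lines` is any iterable of text lines, such as an open file.
    `number_column` is an assumed header name; check the real CSV header.
    """
    reader = csv.DictReader(csv_lines)
    for row in reader:
        raw = (row.get(number_column) or "").strip()
        if not raw.isdigit():
            continue  # skip rows with a missing or non-numeric number
        if FIRST_RECORDED <= int(raw) <= LAST_RECORDED:
            yield row


# Possible usage (the file encoding is an assumption):
# with open("Plant-collections_Tepetzintla.csv", newline="", encoding="utf-8") as f:
#     recorded = list(collections_with_audio(f, number_column="<actual header>"))
```

The column name is passed as a parameter so the sketch does not depend on a guessed header.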
Please note that this initial OpenSLR deposit focuses on the audio corpus. Five future enhancements to this resource are currently envisioned: (1) Completed metadata, particularly a description of the content of each of the 578 recordings; (2) 10 hours of manual transcription in ELAN, material that will provide the initial basis for transfer ASR; (3) A final deposit of the results of ASR transcriptions; (4) Corrections to the ASR transcriptions carried out by Amith and native speakers; (5) Reference to the ASR end2end recipe (GitHub) used to generate the ASR transcriptions.
The fieldwork for developing this corpus was supported by NSF Dynamic Language Infrastructure grant #2123578 entitled “Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora” (Jonathan D. Amith, PI). The speech processing facet of this research (Award #2123624) will be carried out by Shinji Watanabe (PI) and his team at Carnegie Mellon University.
All material is made available under the Creative Commons license CC BY-SA (Attribution-ShareAlike). Please cite any material used as follows (the corresponding author is Jonathan D. Amith, firstname.lastname@example.org):
Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, Ángeles Márquez Hernández, and Osbel López-Francisco, 2023, Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi). Accessed [date] at https://www.openslr.org/148.